Processors

Processors transform tree sequences from simulators into tensor representations suitable for neural network training. They serve as the bridge between population genetic simulations and machine learning models.

Overview

The processor pipeline:

Tree Sequence Input: Receives tree sequences from simulators
Feature Extraction: Converts genetic data into numerical tensors
Preprocessing: Applies filtering, normalization, and formatting
Output: Returns tensors ready for embedding networks

Each processor is designed to work with specific embedding network architectures, ensuring compatible tensor shapes and data representations.

Processor Types

BaseProcessor

class workflow.scripts.ts_processors.BaseProcessor(config: dict, default: dict)

Bases: object

Genotype-based Processors

These processors extract genotype matrices and related features directly from tree sequences. The autodoc sections below summarize constructor arguments and methods; the bullet points highlight typical behaviour and outputs.

genotypes_and_distances

Simple processor that extracts genotype matrices with inter-SNP distances.

Filters SNPs by allele frequency
Optionally phases/unphases genotypes
Adds scaled positional information
Output shape: (n_snps, n_individuals + 1)

class workflow.scripts.ts_processors.genotypes_and_distances(config: dict)

Bases: BaseProcessor

Genotype matrix and distance to next SNP

default_config = {'max_freq': 1.0, 'max_snps': 2000, 'min_freq': 0.0, 'phased': False, 'polarised': True, 'position_scaling': 1000.0}
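The bullet points above can be illustrated with a minimal NumPy sketch. It is not the processor's actual implementation: the function name, the placement of the distance column first, the truncation to the first max_snps sites, and dividing distances by position_scaling are all assumptions made for illustration.

```python
import numpy as np


def genotypes_and_distances_sketch(
    genotypes: np.ndarray,   # (n_snps, n_individuals), derived-allele counts
    positions: np.ndarray,   # (n_snps,), sorted site positions
    ploidy: int = 2,
    min_freq: float = 0.0,
    max_freq: float = 1.0,
    max_snps: int = 2000,
    position_scaling: float = 1e3,
) -> np.ndarray:
    # Derived-allele frequency of each SNP
    freq = genotypes.sum(axis=1) / (ploidy * genotypes.shape[1])
    keep = (freq >= min_freq) & (freq <= max_freq)
    genotypes, positions = genotypes[keep], positions[keep]

    # Assumed size limit: keep the first max_snps sites
    genotypes, positions = genotypes[:max_snps], positions[:max_snps]

    # Distance to the next SNP; the last SNP has no successor, so it gets 0
    dist = np.zeros(len(positions))
    dist[:-1] = np.diff(positions)
    dist /= position_scaling  # assumed meaning of position_scaling

    # One extra column for the distances: (n_snps, n_individuals + 1)
    return np.column_stack([dist, genotypes]).astype(np.float32)


G = np.array([[0, 1, 2], [2, 2, 2], [1, 0, 0], [0, 0, 1]])
pos = np.array([100.0, 1100.0, 3100.0, 3600.0])
out = genotypes_and_distances_sketch(G, pos, min_freq=0.1, max_freq=0.9)
print(out.shape)  # (3, 4): the fixed site (frequency 1.0) is filtered out
```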

cnn_extract

Processor that uses dinf’s HaplotypeMatrix feature extractor to build CNN-compatible features.

Handles single and multiple populations
Creates position and genotype channels
Pads populations to equal sizes when needed
Output shape varies by population structure:
Single pop: (2, n_individuals, n_snps)
Multiple pops: (n_pops, 2, n_individuals, n_snps)

class workflow.scripts.ts_processors.cnn_extract(config: dict)

Bases: BaseProcessor

Extract genotype matrices from tree sequences using dinf’s feature extractor. Handles both single and multiple population cases automatically.

default_config = {'maf_thresh': 0.05, 'n_snps': 500, 'phased': False, 'polarised': False}
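To make the multi-population output shape concrete, here is a small sketch of padding populations to a common size and stacking them. The helper name stack_populations and the pad value of -1 are assumptions for illustration; the per-population arrays stand in for whatever the dinf extractor returns.

```python
import numpy as np


def stack_populations(pop_features: list, pad_value: int = -1) -> np.ndarray:
    """Pad arrays of shape (2, n_individuals, n_snps) along the individual
    axis, then stack them into (n_pops, 2, max_individuals, n_snps)."""
    max_ind = max(f.shape[1] for f in pop_features)
    padded = []
    for f in pop_features:
        extra = max_ind - f.shape[1]
        # Pad only the individual axis, at the end
        padded.append(
            np.pad(f, ((0, 0), (0, extra), (0, 0)), constant_values=pad_value)
        )
    return np.stack(padded)


pop_a = np.zeros((2, 3, 5), dtype=np.int8)  # 3 individuals
pop_b = np.ones((2, 5, 5), dtype=np.int8)   # 5 individuals
stacked = stack_populations([pop_a, pop_b])
print(stacked.shape)  # (2, 2, 5, 5)
```

Padding with a value that cannot occur in real data (here -1) lets a downstream network or mask distinguish padded rows from genuine individuals.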

Summary Statistics Processors

These processors compute population genetic summary statistics.

tskit_sfs

Computes site frequency spectra (SFS) for single or joint populations.

Supports normalized and unnormalized SFS
Optional log transformation
Handles both folded and unfolded spectra
Output: 1D array (single pop) or multi-dimensional (joint SFS)

class workflow.scripts.ts_processors.tskit_sfs(config: dict)

Bases: BaseProcessor

Site frequency spectrum processor that handles both single and multiple populations. For a single population it returns the normalized SFS; for multiple populations it returns the normalized joint SFS.

default_config = {'log1p': False, 'mode': 'site', 'normalised': True, 'polarised': False, 'sample_sets': None, 'span_normalise': False, 'windows': None}
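The effect of the polarised, normalised, and log1p options on a single-population spectrum can be sketched with plain NumPy. This is an illustration computed from a 0/1 haplotype matrix, not the processor's tskit-based code; the function name and the order in which normalisation and the log transform are applied are assumptions.

```python
import numpy as np


def sfs_sketch(haplotypes, polarised=False, normalised=True, log1p=False):
    """haplotypes: integer array of shape (n_snps, n_samples) with 0/1 entries."""
    n = haplotypes.shape[1]
    counts = haplotypes.sum(axis=1)
    if not polarised:
        # Folded spectrum: count the minor allele instead of the derived one
        counts = np.minimum(counts, n - counts)
    sfs = np.bincount(counts, minlength=n + 1).astype(float)
    if normalised and sfs.sum() > 0:
        sfs /= sfs.sum()
    if log1p:
        sfs = np.log1p(sfs)
    return sfs


H = np.array([[1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]])
unfolded = sfs_sketch(H, polarised=True, normalised=False)
folded = sfs_sketch(H, polarised=False, normalised=True)
```

In this toy matrix the site carried by three of four samples counts as a tripleton in the unfolded spectrum but as a singleton once folded.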

tskit_windowed_sfs_plus_ld

Advanced processor combining SFS with linkage disequilibrium (LD) statistics.

Computes mean r² across distance bins
Calculates windowed SFS
Aggregates statistics across genomic windows
Currently supports single population only

class workflow.scripts.ts_processors.tskit_windowed_sfs_plus_ld(config: dict)

Bases: BaseProcessor

Summary statistics processor that returns a vector containing the mean r² across distance bins and the mean allele frequency spectrum (AFS), where each mean is taken over genomic windows.

Currently implemented for the single-population case only.

default_config = {'mode': 'site', 'polarised': True, 'sample_sets': None, 'span_normalise': False, 'window_size': 1000000}
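As a rough illustration of "mean r² across distance bins", the sketch below computes pairwise r² from a haplotype matrix and averages it per distance bin. It is a NumPy stand-in, not the processor's implementation; the function name, the bin_edges argument, and the use of NaN for empty bins are assumptions.

```python
import numpy as np


def mean_r2_by_distance(haplotypes, positions, bin_edges):
    """haplotypes: (n_snps, n_samples) 0/1 array in which every row is
    polymorphic (constant rows would give undefined correlations).
    Returns one mean r^2 per distance bin; empty bins are NaN."""
    r2 = np.corrcoef(haplotypes) ** 2               # (n_snps, n_snps)
    i, j = np.triu_indices(len(positions), k=1)     # each SNP pair once
    dist = np.abs(positions[j] - positions[i])
    pair_r2 = r2[i, j]
    which = np.digitize(dist, bin_edges) - 1        # bin index per pair
    out = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        mask = which == b
        if mask.any():
            out[b] = pair_r2[mask].mean()
    return out


H = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]])
pos = np.array([0.0, 100.0, 1000.0])
ld = mean_r2_by_distance(H, pos, bin_edges=np.array([0.0, 500.0, 2000.0]))
```

Here the two nearby SNPs are in perfect LD, while both are uncorrelated with the distant third SNP, so the short-distance bin averages higher than the long-distance one.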

Network-specific Processors

These processors format data for specific embedding network architectures.

SPIDNA_processor

Formats data specifically for SPIDNA embedding networks.

Creates position channel and SNP channels
Applies MAF filtering and pads the SNP axis to n_snps (default: 400)
Output shape: (n_samples + 1, n_snps); the sample dimension reflects the number of individuals generated by the simulator, not a fixed constant

class workflow.scripts.ts_processors.SPIDNA_processor(config: dict)

Bases: BaseProcessor

default_config = {'maf': 0.05, 'n_snps': 400, 'phased': True, 'polarised': True, 'relative_position': True}
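A hedged sketch of the layout described above: one position row followed by one row per sample, with the SNP axis padded to n_snps. The function name, the zero padding, and the interpretation of relative_position as the distance to the previous retained SNP are assumptions, not confirmed details of SPIDNA_processor.

```python
import numpy as np


def spidna_sketch(haplotypes, positions, n_snps=400, maf=0.05,
                  relative_position=True):
    """haplotypes: (n_samples, n_snps_raw) 0/1 array; positions: (n_snps_raw,)."""
    freq = haplotypes.mean(axis=0)
    keep = np.minimum(freq, 1 - freq) >= maf        # minor-allele-frequency filter
    hap = haplotypes[:, keep][:, :n_snps]
    pos = positions[keep][:n_snps].astype(float)
    if relative_position:
        # Assumed convention: distance to the previous retained SNP
        pos = np.diff(pos, prepend=0.0)
    k = hap.shape[1]
    out = np.zeros((hap.shape[0] + 1, n_snps), dtype=np.float32)
    out[0, :k] = pos                                # position channel
    out[1:, :k] = hap                               # SNP channels
    return out


H = np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 0, 1, 0]])
pos = np.array([10, 20, 50, 90])
x = spidna_sketch(H, pos, n_snps=5)
print(x.shape)  # (4, 5): 3 samples + 1 position row, padded to 5 SNPs
```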

ReLERNN_processor

Formats data for ReLERNN architecture.

Requires phased genotypes
Recodes alleles to -1/1
Normalizes positions to [0,1]
Pads to fixed SNP count
Output shape: (n_snps, n_samples + 1)

class workflow.scripts.ts_processors.ReLERNN_processor(config: dict)

Bases: BaseProcessor

default_config = {'max_freq': 1.0, 'min_freq': 0.0, 'n_snps': 2000, 'phased': True, 'polarised': True}
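The recoding, position normalisation, and padding steps listed above might look roughly like this in NumPy. The function name, the seq_len argument, the position column coming first, and zero padding are assumptions made for the sketch.

```python
import numpy as np


def relernn_sketch(haplotypes, positions, seq_len, n_snps=2000):
    """haplotypes: (n_snps_raw, n_samples) phased 0/1 array."""
    # Recode alleles: 0 -> -1, 1 -> 1
    hap = np.where(haplotypes[:n_snps] == 1, 1.0, -1.0)
    # Normalise positions to [0, 1] by the sequence length
    pos = np.asarray(positions[:n_snps], dtype=float) / seq_len
    k = hap.shape[0]
    # Zero padding is distinguishable from the -1/1 allele codes
    out = np.zeros((n_snps, hap.shape[1] + 1), dtype=np.float32)
    out[:k, 0] = pos
    out[:k, 1:] = hap
    return out


H = np.array([[0, 1], [1, 1], [1, 0]])
pos = np.array([100, 500, 900])
x = relernn_sketch(H, pos, seq_len=1000, n_snps=4)
print(x.shape)  # (4, 3): padded to 4 SNPs, 2 samples + 1 position column
```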

Configuration

Processors are configured in the workflow YAML files:

processor:
  class_name: cnn_extract
  n_snps: 500
  phased: False
  maf_thresh: 0.05

Processor configuration values live directly under the processor block. Any key other than class_name must appear in the corresponding default_config inside workflow/scripts/ts_processors.py; unsupported keys cause a configuration error.
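The rule above amounts to merging user options over the defaults and rejecting unknown keys. The following is a minimal sketch of that behaviour; resolve_config is a hypothetical helper, and the exception type is an assumption rather than what BaseProcessor actually raises.

```python
def resolve_config(block: dict, default_config: dict) -> dict:
    """Merge a processor block over its defaults, rejecting unknown keys."""
    # class_name selects the processor and is not itself an option
    user = {k: v for k, v in block.items() if k != "class_name"}
    unknown = sorted(set(user) - set(default_config))
    if unknown:
        raise ValueError(f"Unsupported processor option(s): {unknown}")
    return {**default_config, **user}


cnn_defaults = {"maf_thresh": 0.05, "n_snps": 500, "phased": False, "polarised": False}
block = {"class_name": "cnn_extract", "n_snps": 1000, "phased": True}
resolved = resolve_config(block, cnn_defaults)
# resolved keeps maf_thresh and polarised from the defaults,
# and takes n_snps and phased from the block
```

Passing a key such as max_snps to cnn_extract would fail under this scheme, because that option belongs to genotypes_and_distances.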

Common Parameters

Genotype processors (genotypes_and_distances, cnn_extract):

max_snps / n_snps: Upper bound on SNPs retained from each simulation
phased: Toggle between phased (per-haplotype) and unphased (per-individual diploid) encodings. Note that cnn_extract must stay unpolarised.
min_freq / max_freq / maf_thresh: Allele-frequency filters before downsampling
position_scaling: Scale factor for inter-SNP distances (genotypes_and_distances only)

Summary statistics processors (tskit_sfs, tskit_windowed_sfs_plus_ld):

normalised / span_normalise: Normalisation flags for frequency spectra
polarised: Whether to use ancestral-state information
mode: "site" or "branch" tallying for SFS
window_size: Genomic window used when aggregating LD + SFS statistics

Processor-Network Compatibility

Each processor is designed to work with specific embedding networks:

Processor	Compatible Networks
genotypes_and_distances	RNN, generic MLPs
cnn_extract	ExchangeableCNN
tskit_sfs	SummaryStatisticsEmbedding
tskit_windowed_sfs_plus_ld	SummaryStatisticsEmbedding
SPIDNA_processor	SPIDNA_embedding_network
ReLERNN_processor	ReLERNN

Usage Examples

Single Population CNN

processor:
  class_name: cnn_extract
  n_snps: 1000
  phased: True
  maf_thresh: 0.01

Multiple Population CNN

processor:
  class_name: cnn_extract
  n_snps: 500
  phased: False
  maf_thresh: 0.05

Summary Statistics

processor:
  class_name: tskit_sfs
  normalised: True
  polarised: False
  log1p: True  # Log transform

Custom Processors

To create a custom processor:

Inherit from BaseProcessor
Define default_config with parameters
Implement __call__(self, ts) to return a tensor

Example:

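import numpy as np
import tskit

from workflow.scripts.ts_processors import BaseProcessor


class MyProcessor(BaseProcessor):
    default_config = {
        "my_param": 42,
        "filter_singletons": True,
    }

    def __init__(self, config: dict):
        # Config keys are assumed to become attributes (e.g. self.filter_singletons)
        super().__init__(config, self.default_config)

    def __call__(self, ts: tskit.TreeSequence) -> np.ndarray:
        # Extract features from the tree sequence:
        # here, the (n_sites, n_samples) genotype matrix
        features = ts.genotype_matrix()

        # Apply preprocessing: drop sites whose derived allele occurs exactly
        # once (assumes biallelic sites coded 0/1)
        if self.filter_singletons:
            features = features[features.sum(axis=1) != 1]

        return features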

Technical Notes

All processors return numpy arrays or torch tensors
Output shapes must be consistent for batching
Variable-length outputs are padded with -1 or 0
Processors handle both haploid and diploid data
Population structure is preserved in multi-population processors
MAF filtering is applied before size limits
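The last note matters because the two orderings can retain different SNPs. A small sketch with hypothetical helper names shows the difference on toy allele frequencies:

```python
import numpy as np


def filter_then_limit(freq, maf, n_snps):
    """Order described above: MAF filter first, then the size limit."""
    idx = np.flatnonzero(np.minimum(freq, 1 - freq) >= maf)
    return idx[:n_snps]


def limit_then_filter(freq, maf, n_snps):
    """Reverse order, shown only for contrast."""
    idx = np.arange(len(freq))[:n_snps]
    return idx[np.minimum(freq[idx], 1 - freq[idx]) >= maf]


freq = np.array([0.01, 0.02, 0.30, 0.45, 0.10])
kept = filter_then_limit(freq, maf=0.05, n_snps=2)     # indices 2 and 3
contrast = limit_then_filter(freq, maf=0.05, n_snps=2)  # empty: both early SNPs are rare
```

Filtering first means the size limit is spent only on SNPs that pass the frequency threshold, so fewer output positions end up as padding.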