Processors

Processors transform tree sequences from simulators into tensor representations suitable for neural network training. They serve as the bridge between population genetic simulations and machine learning models.

Overview

The processor pipeline:

  1. Tree Sequence Input: Receives tree sequences from simulators

  2. Feature Extraction: Converts genetic data into numerical tensors

  3. Preprocessing: Applies filtering, normalization, and formatting

  4. Output: Returns tensors ready for embedding networks

Each processor is designed to work with specific embedding network architectures, ensuring compatible tensor shapes and data representations.

Processor Types

BaseProcessor

class workflow.scripts.ts_processors.BaseProcessor(config: dict, default: dict)

Bases: object

Genotype-based Processors

These processors extract genotype matrices and related features directly from tree sequences. The autodoc sections below summarize constructor arguments and methods; the bullet points highlight typical behaviour and outputs.

genotypes_and_distances

Simple processor that extracts genotype matrices with inter-SNP distances.

  • Filters SNPs by allele frequency

  • Optionally phases/unphases genotypes

  • Adds scaled positional information

  • Output shape: (n_snps, n_individuals + 1)
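The steps listed above can be sketched with plain NumPy. This is a minimal illustration, not the processor's actual implementation: the function name is hypothetical, the input is assumed to be a 0/1 matrix with one row per SNP (for example the result of tskit's `ts.genotype_matrix()`), the distance column is assumed to come last, and the SNP limit is assumed to keep the first `max_snps` rows.

```python
import numpy as np


def genotypes_and_distances_sketch(genotypes, positions, min_freq=0.0,
                                   max_freq=1.0, max_snps=2000,
                                   position_scaling=1000.0):
    """Hypothetical sketch: (n_snps, n_individuals) -> (n_snps, n_individuals + 1)."""
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions, dtype=float)

    # Filter SNPs by derived-allele frequency before applying the size limit.
    freq = genotypes.mean(axis=1)
    keep = (freq >= min_freq) & (freq <= max_freq)
    genotypes, positions = genotypes[keep], positions[keep]

    # Assumption: truncate to the first max_snps SNPs.
    genotypes, positions = genotypes[:max_snps], positions[:max_snps]

    # Scaled distance to the next SNP; the last SNP gets a distance of 0.
    distances = np.zeros_like(positions)
    distances[:-1] = np.diff(positions)
    distances /= position_scaling

    # Assumption: the distance column is appended after the genotype columns.
    return np.column_stack([genotypes, distances])
```

With four samples and three retained SNPs, the result has shape (3, 5): four genotype columns plus one distance column.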

class workflow.scripts.ts_processors.genotypes_and_distances(config: dict)

Bases: BaseProcessor

Genotype matrix and distance to next SNP

default_config = {'max_freq': 1.0, 'max_snps': 2000, 'min_freq': 0.0, 'phased': False, 'polarised': True, 'position_scaling': 1000.0}

cnn_extract

Sophisticated processor using dinf’s HaplotypeMatrix for CNN-compatible features.

  • Handles single and multiple populations

  • Creates position and genotype channels

  • Pads populations to equal sizes when needed

  • Output shape varies by population structure:

      - Single pop: (2, n_individuals, n_snps)

      - Multiple pops: (n_pops, 2, n_individuals, n_snps)
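To make the multi-population shape concrete, the following sketch pads per-population arrays along the individual axis and stacks them. It is an illustration only: the helper name and the pad value of -1 are assumptions, and the real processor delegates feature extraction to dinf rather than to code like this.

```python
import numpy as np


def stack_populations(pop_features, pad_value=-1.0):
    """Hypothetical helper: list of (2, n_individuals_i, n_snps) arrays
    -> one (n_pops, 2, max_individuals, n_snps) array."""
    n_max = max(f.shape[1] for f in pop_features)
    padded = []
    for f in pop_features:
        n_missing = n_max - f.shape[1]
        # Pad only the individual axis; channels and SNPs are left untouched.
        padded.append(
            np.pad(f, ((0, 0), (0, n_missing), (0, 0)),
                   constant_values=pad_value)
        )
    return np.stack(padded, axis=0)
```

Padding to the largest population keeps every simulation's output the same shape, which is what allows batching downstream.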

class workflow.scripts.ts_processors.cnn_extract(config: dict)

Bases: BaseProcessor

Extract genotype matrices from tree sequences using dinf’s feature extractor. Handles both single and multiple population cases automatically.

default_config = {'maf_thresh': 0.05, 'n_snps': 500, 'phased': False, 'polarised': False}

Summary Statistics Processors

These processors compute population genetic summary statistics.

tskit_sfs

Computes site frequency spectra (SFS) for single or joint populations.

  • Supports normalized and unnormalized SFS

  • Optional log transformation

  • Handles both folded and unfolded spectra

  • Output: 1D array (single pop) or multi-dimensional (joint SFS)
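The options above can be illustrated on a small genotype matrix. This is a NumPy sketch of a single-population spectrum under stated assumptions, not the processor's code: the function name is hypothetical, folding is implemented as the minor-allele count, and normalisation is assumed to mean dividing by the total count. In the workflow itself the spectrum would come from tskit's `allele_frequency_spectrum` method.

```python
import numpy as np


def sfs_sketch(genotypes, polarised=False, normalised=True, log1p=False):
    """Hypothetical sketch: (n_sites, n_samples) 0/1 matrix -> SFS of length n_samples + 1."""
    genotypes = np.asarray(genotypes).astype(int)
    n_samples = genotypes.shape[1]

    # Number of derived alleles at each site.
    counts = genotypes.sum(axis=1)
    if not polarised:
        # Folded spectrum: count the minor allele instead of the derived one.
        counts = np.minimum(counts, n_samples - counts)

    sfs = np.bincount(counts, minlength=n_samples + 1).astype(float)
    if normalised and sfs.sum() > 0:
        # Assumption: normalise so that the entries sum to one.
        sfs /= sfs.sum()
    if log1p:
        sfs = np.log1p(sfs)
    return sfs
```

For four sites with derived-allele counts 1, 2, 3 and 1 in four samples, the polarised normalised spectrum is [0, 0.5, 0.25, 0.25, 0], while the folded one is [0, 0.75, 0.25, 0, 0].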

class workflow.scripts.ts_processors.tskit_sfs(config: dict)

Bases: BaseProcessor

Site frequency spectrum processor that handles both single and multiple populations. For a single population it returns the normalized SFS; for multiple populations it returns the normalized joint SFS.

default_config = {'log1p': False, 'mode': 'site', 'normalised': True, 'polarised': False, 'sample_sets': None, 'span_normalise': False, 'windows': None}

tskit_windowed_sfs_plus_ld

Advanced processor combining SFS with linkage disequilibrium (LD) statistics.

  • Computes mean r² across distance bins

  • Calculates windowed SFS

  • Aggregates statistics across genomic windows

  • Currently supports single population only
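The LD part of this processor, mean r² within distance bins, can be sketched as follows. This is an illustration under assumptions rather than the processor's implementation: the function name and the `bin_edges` argument are hypothetical, r² is taken as the squared Pearson correlation between site genotype vectors, and every site is assumed to be polymorphic so that the correlation is defined.

```python
import numpy as np


def mean_r2_by_distance(genotypes, positions, bin_edges):
    """Hypothetical sketch: mean r^2 between site pairs, grouped by distance bin."""
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)

    # Squared Pearson correlation between every pair of sites (rows).
    r2 = np.corrcoef(genotypes) ** 2

    # Each unordered pair of sites once, with its physical distance.
    i, j = np.triu_indices(len(positions), k=1)
    pair_r2 = r2[i, j]
    pair_dist = np.abs(positions[j] - positions[i])

    # Bin b covers the half-open interval [bin_edges[b], bin_edges[b + 1]).
    which_bin = np.digitize(pair_dist, bin_edges) - 1
    out = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(out)):
        in_bin = pair_r2[which_bin == b]
        if in_bin.size:
            out[b] = in_bin.mean()
    return out
```

Bins that contain no site pairs are left as NaN in this sketch; a real pipeline would need a fixed convention for such bins so that the output vector stays usable for training.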

class workflow.scripts.ts_processors.tskit_windowed_sfs_plus_ld(config: dict)

Bases: BaseProcessor

Summary statistics processor that returns a vector containing the mean r² across distance bins and the mean allele frequency spectrum (AFS), where each mean is taken over genomic windows.

The mean is currently implemented only for the single-population case.

default_config = {'mode': 'site', 'polarised': True, 'sample_sets': None, 'span_normalise': False, 'window_size': 1000000}

Network-specific Processors

These processors format data for specific embedding network architectures.

SPIDNA_processor

Formats data specifically for SPIDNA embedding networks.

  • Creates position channel and SNP channels

  • Applies MAF filtering and pads the SNP axis to n_snps (default: 400)

  • Output shape: (n_samples + 1, n_snps); the sample dimension reflects the number of individuals generated by the simulator, not a fixed constant

class workflow.scripts.ts_processors.SPIDNA_processor(config: dict)

Bases: BaseProcessor

default_config = {'maf': 0.05, 'n_snps': 400, 'phased': True, 'polarised': True, 'relative_position': True}

ReLERNN_processor

Formats data for ReLERNN architecture.

  • Requires phased genotypes

  • Recodes alleles to -1/1

  • Normalizes positions to [0,1]

  • Pads to fixed SNP count

  • Output shape: (n_snps, n_samples + 1)
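The formatting steps above can be sketched in NumPy. This is a hedged illustration rather than the processor's code: the function name and the `sequence_length` argument are hypothetical, the position column is assumed to come first, and padding with 0 is an assumption (chosen here because 0 cannot be confused with the recoded alleles -1 and 1).

```python
import numpy as np


def relernn_format_sketch(haplotypes, positions, sequence_length,
                          n_snps=2000, pad_value=0.0):
    """Hypothetical sketch: (n_sites, n_samples) phased 0/1 -> (n_snps, n_samples + 1).

    Assumes at least one site is present.
    """
    hap = np.asarray(haplotypes, dtype=float)[:n_snps]
    pos = np.asarray(positions, dtype=float)[:n_snps]

    # Recode alleles: 0 -> -1 and 1 -> 1.
    hap = 2.0 * hap - 1.0

    # Normalise positions to [0, 1] by the sequence length.
    pos = pos / sequence_length

    # Assumption: positions form the first column, followed by the samples.
    block = np.column_stack([pos, hap])

    # Pad the SNP axis up to the fixed size n_snps.
    out = np.full((n_snps, block.shape[1]), pad_value)
    out[: block.shape[0]] = block
    return out
```

With three sites, two samples and `n_snps=5`, the result has shape (5, 3) and its last two rows consist entirely of the pad value.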

class workflow.scripts.ts_processors.ReLERNN_processor(config: dict)

Bases: BaseProcessor

default_config = {'max_freq': 1.0, 'min_freq': 0.0, 'n_snps': 2000, 'phased': True, 'polarised': True}

Configuration

Processors are configured in the workflow YAML files:

processor:
  class_name: cnn_extract
  n_snps: 500
  phased: False
  maf_thresh: 0.05

Processor configuration values live directly under the processor block. Any key other than class_name must appear in the corresponding default_config inside workflow/scripts/ts_processors.py; unsupported keys cause a configuration error.
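The rule for unsupported keys can be sketched as a small merge-and-validate function. This only illustrates the behaviour described here: the function name, the exception type and the error message are assumptions, and the actual check is performed inside the workflow code.

```python
def resolve_config(config: dict, default_config: dict) -> dict:
    """Hypothetical sketch: overlay user options on defaults, rejecting unknown keys."""
    # class_name selects the processor and is not itself a processor option.
    user = {key: value for key, value in config.items() if key != "class_name"}

    unknown = sorted(set(user) - set(default_config))
    if unknown:
        # Assumption: the real workflow may raise a different exception type.
        raise ValueError(f"Unsupported processor option(s): {unknown}")

    # User-supplied values override the defaults.
    return {**default_config, **user}


cnn_defaults = {"maf_thresh": 0.05, "n_snps": 500, "phased": False, "polarised": False}
resolved = resolve_config({"class_name": "cnn_extract", "n_snps": 1000}, cnn_defaults)
```

Here `resolved` keeps the default `maf_thresh` of 0.05 and overrides `n_snps` with 1000. Passing a key such as `max_snps`, which belongs to a different processor, would raise the error.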

Common Parameters

Genotype processors (genotypes_and_distances, cnn_extract):

  • max_snps / n_snps: Upper bound on SNPs retained from each simulation

  • phased: Toggle between phased (per-haplotype) and unphased (per-individual diploid) encodings. Separately, cnn_extract must keep polarised set to False

  • min_freq / max_freq / maf_thresh: Allele-frequency filters before downsampling

  • position_scaling: Scale factor for inter-SNP distances (genotypes_and_distances only)

Summary statistics processors (tskit_sfs, tskit_windowed_sfs_plus_ld):

  • normalised / span_normalise: Normalisation flags for frequency spectra

  • polarised: Whether to use ancestral-state information

  • mode: "site" or "branch" tallying for SFS

  • window_size: Genomic window used when aggregating LD + SFS statistics

Processor-Network Compatibility

Each processor is designed to work with specific embedding networks:

Processor                     Compatible Networks
--------------------------    --------------------------
genotypes_and_distances       RNN, generic MLPs
cnn_extract                   ExchangeableCNN
tskit_sfs                     SummaryStatisticsEmbedding
tskit_windowed_sfs_plus_ld    SummaryStatisticsEmbedding
SPIDNA_processor              SPIDNA_embedding_network
ReLERNN_processor             ReLERNN

Usage Examples

Single Population CNN

processor:
  class_name: cnn_extract
  n_snps: 1000
  phased: True
  maf_thresh: 0.01

Multiple Population CNN

processor:
  class_name: cnn_extract
  n_snps: 500
  phased: False
  maf_thresh: 0.05

Summary Statistics

processor:
  class_name: tskit_sfs
  normalised: True
  polarised: False
  log1p: True  # Log transform

Custom Processors

To create a custom processor:

  1. Inherit from BaseProcessor

  2. Define default_config with parameters

  3. Implement __call__(self, ts) to return a tensor

Example:

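import numpy as np
import tskit

# Adjust the import path to match how the workflow scripts are loaded.
from workflow.scripts.ts_processors import BaseProcessor


class MyProcessor(BaseProcessor):
    default_config = {
        "my_param": 42,
        "filter_singletons": True,
    }

    def __init__(self, config: dict):
        # The base class is expected to expose each config value as an
        # attribute (self.filter_singletons is read below).
        super().__init__(config, self.default_config)

    def __call__(self, ts: tskit.TreeSequence) -> np.ndarray:
        # Extract features from the tree sequence; here, as an illustration,
        # the (num_sites, num_samples) genotype matrix.
        features = ts.genotype_matrix()

        # Apply preprocessing: drop sites whose non-ancestral allele
        # is carried by exactly one sample.
        if self.filter_singletons:
            features = features[(features > 0).sum(axis=1) != 1]

        return features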

Technical Notes

  • All processors return numpy arrays or torch tensors

  • Output shapes must be consistent for batching

  • Variable-length outputs are padded with -1 or 0

  • Processors handle both haploid and diploid data

  • Population structure is preserved in multi-population processors

  • MAF filtering is applied before size limits