Processors
Processors transform tree sequences from simulators into tensor representations suitable for neural network training. They serve as the bridge between population genetic simulations and machine learning models.
Overview
The processor pipeline:
Tree Sequence Input: Receives tree sequences from simulators
Feature Extraction: Converts genetic data into numerical tensors
Preprocessing: Applies filtering, normalization, and formatting
Output: Returns tensors ready for embedding networks
Each processor is designed to work with specific embedding network architectures, ensuring compatible tensor shapes and data representations.
Processor Types
BaseProcessor
Genotype-based Processors
These processors extract genotype matrices and related features directly from tree sequences. The autodoc sections below summarize constructor arguments and methods; the bullet points highlight typical behaviour and outputs.
- genotypes_and_distances
Simple processor that extracts genotype matrices with inter-SNP distances.
Filters SNPs by allele frequency
Optionally phases/unphases genotypes
Adds scaled positional information
Output shape: (n_snps, n_individuals + 1)
- class workflow.scripts.ts_processors.genotypes_and_distances(config: dict)[source]
Bases: BaseProcessor
Genotype matrix and distance to next SNP
- default_config = {'max_freq': 1.0, 'max_snps': 2000, 'min_freq': 0.0, 'phased': False, 'polarised': True, 'position_scaling': 1000.0}
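The steps listed for genotypes_and_distances can be sketched with plain NumPy. This is a minimal illustration, not the processor's actual code: the function name, the 0/1 haploid input, placing the distance in the last column, and dividing by position_scaling are all assumptions made for the example.

```python
import numpy as np


def genotypes_with_distances(genotypes, positions, min_freq=0.0, max_freq=1.0,
                             max_snps=2000, position_scaling=1000.0):
    """Return an (n_snps, n_individuals + 1) array; last column = distance to next SNP."""
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions, dtype=float)

    # Frequency filter first, then the size limit.
    # Assumption: rows hold 0/1 calls, so the row mean is the allele frequency.
    freq = genotypes.mean(axis=1)
    keep = (freq >= min_freq) & (freq <= max_freq)
    genotypes, positions = genotypes[keep][:max_snps], positions[keep][:max_snps]

    # Distance from each SNP to the next one; the last SNP has no successor, so use 0.
    # Assumption: scaling means dividing by position_scaling.
    distances = np.append(np.diff(positions), 0.0) / position_scaling
    return np.column_stack([genotypes, distances])


g = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0]]
x = genotypes_with_distances(g, positions=[100, 600, 2100], max_freq=0.9)
print(x.shape)  # (2, 5): the fixed site is filtered out, one distance column is added
```

Filtering before truncating to max_snps means the size limit only counts SNPs that survive the frequency filter.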
- cnn_extract
Sophisticated processor using dinf’s HaplotypeMatrix for CNN-compatible features.
Handles single and multiple populations
Creates position and genotype channels
Pads populations to equal sizes when needed
Output shape varies by population structure:
  - Single pop: (2, n_individuals, n_snps)
  - Multiple pops: (n_pops, 2, n_individuals, n_snps)
- class workflow.scripts.ts_processors.cnn_extract(config: dict)[source]
Bases: BaseProcessor
Extract genotype matrices from tree sequences using dinf’s feature extractor. Handles both single and multiple population cases automatically.
- default_config = {'maf_thresh': 0.05, 'n_snps': 500, 'phased': False, 'polarised': False}
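To make the multi-population shape concrete, the sketch below pads per-population matrices to a common number of individuals and stacks them. It does not call dinf; the helper name, the channel order (genotypes first, positions second), and the zero pad value are assumptions chosen for illustration.

```python
import numpy as np


def stack_populations(pop_genotypes, positions, pad_value=0.0):
    """Stack (n_individuals, n_snps) matrices into (n_pops, 2, max_individuals, n_snps)."""
    mats = [np.asarray(g, dtype=float) for g in pop_genotypes]
    positions = np.asarray(positions, dtype=float)
    max_ind = max(m.shape[0] for m in mats)
    out = np.full((len(mats), 2, max_ind, positions.shape[0]), pad_value)
    for i, m in enumerate(mats):
        out[i, 0, : m.shape[0], :] = m          # channel 0: genotypes (assumed order)
        out[i, 1, : m.shape[0], :] = positions  # channel 1: positions, repeated per individual
    return out


pops = [np.ones((3, 4)), np.ones((5, 4))]
features = stack_populations(pops, positions=[0.0, 0.1, 0.4, 0.9])
print(features.shape)  # (2, 2, 5, 4)
```

The smaller population occupies the first three rows of its slice; the remaining rows hold the pad value in both channels.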
Summary Statistics Processors
These processors compute population genetic summary statistics.
- tskit_sfs
Computes site frequency spectra (SFS) for single or joint populations.
Supports normalized and unnormalized SFS
Optional log transformation
Handles both folded and unfolded spectra
Output: 1D array (single pop) or multi-dimensional (joint SFS)
- class workflow.scripts.ts_processors.tskit_sfs(config: dict)[source]
Bases: BaseProcessor
Site frequency spectrum processor that handles both single and multiple populations. For single population: returns normalized SFS. For multiple populations: returns normalized joint SFS.
- default_config = {'log1p': False, 'mode': 'site', 'normalised': True, 'polarised': False, 'sample_sets': None, 'span_normalise': False, 'windows': None}
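The mode, polarised, sample_sets, span_normalise, and windows keys match the arguments of tskit's TreeSequence.allele_frequency_spectrum, so they are presumably forwarded to it; that is an inference from the names, not something stated here. The NumPy sketch below only illustrates what polarised, normalised, and log1p plausibly do to a single-population spectrum. The function name and the order of operations (normalise, then log1p) are assumptions.

```python
import numpy as np


def sfs_from_counts(derived_counts, n_samples, polarised=False,
                    normalised=True, log1p=False):
    """1D site frequency spectrum from per-site derived-allele counts."""
    counts = np.asarray(derived_counts, dtype=int)
    if not polarised:
        # Folded spectrum: count the minor allele, whichever allele that is.
        counts = np.minimum(counts, n_samples - counts)
    sfs = np.bincount(counts, minlength=n_samples + 1).astype(float)
    if normalised:
        total = sfs.sum()
        sfs = sfs / total if total > 0 else sfs
    if log1p:
        sfs = np.log1p(sfs)  # assumed to be applied after normalisation
    return sfs


folded = sfs_from_counts([1, 1, 2, 5], n_samples=6)
unfolded = sfs_from_counts([1, 1, 2, 5], n_samples=6, polarised=True, normalised=False)
print(folded[:3])    # [0.   0.75 0.25]
print(unfolded)      # [0. 2. 1. 0. 0. 1. 0.]
```

In the folded case the site with five derived copies out of six is counted as a singleton, which is why three of the four sites land in the first frequency class.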
- tskit_windowed_sfs_plus_ld
Advanced processor combining SFS with linkage disequilibrium (LD) statistics.
Computes mean r² across distance bins
Calculates windowed SFS
Aggregates statistics across genomic windows
Currently supports single population only
- class workflow.scripts.ts_processors.tskit_windowed_sfs_plus_ld(config: dict)[source]
Bases: BaseProcessor
Summary statistics processor that returns a vector of the mean r² across distances and the mean AFS, where the mean is taken over windows.
The mean is currently computed only for the single-population case.
- default_config = {'mode': 'site', 'polarised': True, 'sample_sets': None, 'span_normalise': False, 'window_size': 1000000}
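The "mean r² across distance bins" statistic can be illustrated as follows. This is a sketch under stated assumptions, not the processor's implementation: the function name and the bin edges are hypothetical, the input is a 0/1 haplotype matrix with no monomorphic sites, and r² is computed as the squared Pearson correlation between SNP rows.

```python
import numpy as np


def mean_r2_by_distance(haplotypes, positions, bin_edges):
    """Mean pairwise r^2 between SNPs, grouped by inter-SNP distance bin."""
    h = np.asarray(haplotypes, dtype=float)   # shape (n_snps, n_haplotypes), entries 0/1
    pos = np.asarray(positions, dtype=float)
    r2 = np.corrcoef(h) ** 2                  # squared correlation between SNP rows
    i, j = np.triu_indices(len(pos), k=1)     # each unordered SNP pair once
    dist = np.abs(pos[j] - pos[i])
    which = np.digitize(dist, bin_edges) - 1  # bin index for every pair
    out = np.full(len(bin_edges) - 1, np.nan) # NaN marks bins with no SNP pairs
    for b in range(len(out)):
        vals = r2[i, j][which == b]
        if vals.size:
            out[b] = vals.mean()
    return out


haps = [[0, 0, 1, 1],   # SNP 0
        [0, 0, 1, 1],   # SNP 1: identical to SNP 0, so r^2 = 1
        [0, 1, 0, 1]]   # SNP 2: uncorrelated with both, so r^2 = 0
ld = mean_r2_by_distance(haps, positions=[0, 100, 5000], bin_edges=[0, 1000, 10000])
```

Here the first bin contains only the tightly linked pair and the second bin contains the two unlinked pairs, so the vector falls from 1 to 0 with distance.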
Network-specific Processors
These processors format data for specific embedding network architectures.
- SPIDNA_processor
Formats data specifically for SPIDNA embedding networks.
Creates position channel and SNP channels
Applies MAF filtering and pads the SNP axis to n_snps (default: 400)
Output shape: (n_samples + 1, n_snps); the sample dimension reflects the number of individuals generated by the simulator, not a fixed constant
- class workflow.scripts.ts_processors.SPIDNA_processor(config: dict)[source]
Bases: BaseProcessor
- default_config = {'maf': 0.05, 'n_snps': 400, 'phased': True, 'polarised': True, 'relative_position': True}
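A rough sketch of how an (n_samples + 1, n_snps) array of this kind could be assembled is shown below. The function name is hypothetical, and several details are assumptions: the position channel is row 0, relative_position is read as "distance to the previous retained SNP", and padding uses zeros.

```python
import numpy as np


def spidna_style_input(haplotypes, positions, n_snps=400, maf=0.05,
                       relative_position=True):
    """Return (n_samples + 1, n_snps): row 0 holds positions, other rows hold samples."""
    h = np.asarray(haplotypes, dtype=float)      # (n_samples, n_sites), entries 0/1
    pos = np.asarray(positions, dtype=float)
    freq = h.mean(axis=0)
    keep = np.minimum(freq, 1.0 - freq) >= maf   # minor-allele-frequency filter
    h, pos = h[:, keep][:, :n_snps], pos[keep][:n_snps]
    if relative_position:
        # Assumed meaning: distance to the previous retained SNP; the first SNP gets 0.
        pos = np.diff(pos, prepend=pos[:1])
    block = np.vstack([pos, h])
    pad = n_snps - block.shape[1]
    return np.pad(block, ((0, 0), (0, pad)))     # zero-pad the SNP axis (assumed value)


haps = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
spidna_x = spidna_style_input(haps, positions=[10, 20, 30, 50], n_snps=5)
print(spidna_x.shape)  # (4, 5): 3 samples plus one position row, padded to 5 SNPs
```

Only the first and last sites pass the filter in this toy input, so three of the five SNP columns are padding.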
- ReLERNN_processor
Formats data for ReLERNN architecture.
Requires phased genotypes
Recodes alleles to -1/1
Normalizes positions to [0,1]
Pads to fixed SNP count
Output shape: (n_snps, n_samples + 1)
- class workflow.scripts.ts_processors.ReLERNN_processor(config: dict)[source]
Bases: BaseProcessor
- default_config = {'max_freq': 1.0, 'min_freq': 0.0, 'n_snps': 2000, 'phased': True, 'polarised': True}
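The recoding, position normalisation, and padding steps listed for ReLERNN_processor can be sketched as below. The function name, the sequence_length argument, the position of the extra column (last), and the zero pad value are assumptions for illustration only.

```python
import numpy as np


def relernn_style_input(haplotypes, positions, sequence_length, n_snps=2000):
    """Return (n_snps, n_samples + 1): recoded haplotypes plus a position column."""
    h = np.asarray(haplotypes, dtype=float)[:n_snps]   # (n_sites, n_samples), entries 0/1
    pos = np.asarray(positions, dtype=float)[:n_snps]
    h = 2.0 * h - 1.0                          # recode 0 -> -1 and 1 -> 1
    pos = pos / sequence_length                # positions now lie in [0, 1]
    block = np.column_stack([h, pos])
    pad = n_snps - block.shape[0]
    return np.pad(block, ((0, pad), (0, 0)))   # zero rows mark padding (assumed value)


relernn_x = relernn_style_input([[0, 1], [1, 1], [0, 0]],
                                positions=[0, 500, 1000],
                                sequence_length=1000, n_snps=5)
print(relernn_x.shape)  # (5, 3)
```

One plausible benefit of the -1/1 coding is that 0 never occurs as an allele value, so zero-filled rows can be told apart from real SNPs.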
Configuration
Processors are configured in the workflow YAML files:
processor:
  class_name: cnn_extract
  n_snps: 500
  phased: False
  maf_thresh: 0.05
Processor configuration values live directly under the processor block. Any key other than class_name must appear in the corresponding default_config inside workflow/scripts/ts_processors.py; unsupported keys cause a configuration error.
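The key-checking rule can be illustrated with a small helper. This is a hedged sketch of the behaviour described, not the project's code: resolve_config is a hypothetical name, and raising KeyError is an assumption about how the configuration error surfaces.

```python
def resolve_config(user_config, default_config):
    """Merge a `processor:` block with a default_config, rejecting unknown keys."""
    # class_name selects the processor and is not a processor parameter.
    overrides = {k: v for k, v in user_config.items() if k != "class_name"}
    unknown = sorted(set(overrides) - set(default_config))
    if unknown:
        raise KeyError(f"unsupported processor keys: {unknown}")
    # User values take precedence over the defaults.
    return {**default_config, **overrides}


cnn_defaults = {"maf_thresh": 0.05, "n_snps": 500, "phased": False, "polarised": False}
resolved = resolve_config({"class_name": "cnn_extract", "n_snps": 1000}, cnn_defaults)
print(resolved["n_snps"], resolved["maf_thresh"])  # 1000 0.05
```

With this rule, passing max_snps to cnn_extract would be rejected, because that key belongs to genotypes_and_distances rather than to cnn_extract's default_config.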
Common Parameters
Genotype processors (genotypes_and_distances, cnn_extract):
max_snps/n_snps: Upper bound on SNPs retained from each simulation
phased: Toggle between haploid and diploid encodings (cnn_extract must stay unpolarised)
min_freq/max_freq/maf_thresh: Allele-frequency filters before downsampling
position_scaling: Scale factor for inter-SNP distances (genotypes_and_distances only)
Summary statistics processors (tskit_sfs, tskit_windowed_sfs_plus_ld):
normalised/span_normalise: Normalisation flags for frequency spectra
polarised: Whether to use ancestral-state information
mode: "site" or "branch" tallying for SFS
window_size: Genomic window used when aggregating LD + SFS statistics
Processor-Network Compatibility
Each processor is designed to work with specific embedding networks:
| Processor | Compatible Networks |
|---|---|
| genotypes_and_distances | RNN, generic MLPs |
| cnn_extract | ExchangeableCNN |
| tskit_sfs | SummaryStatisticsEmbedding |
| tskit_windowed_sfs_plus_ld | SummaryStatisticsEmbedding |
| SPIDNA_processor | SPIDNA_embedding_network |
| ReLERNN_processor | ReLERNN |
Usage Examples
Single Population CNN
processor:
  class_name: cnn_extract
  n_snps: 1000
  phased: True
  maf_thresh: 0.01
Multiple Population CNN
processor:
  class_name: cnn_extract
  n_snps: 500
  phased: False
  maf_thresh: 0.05
Summary Statistics
processor:
  class_name: tskit_sfs
  normalised: True
  polarised: False
  log1p: True  # Log transform
Custom Processors
To create a custom processor:
Inherit from BaseProcessor
Define default_config with parameters
Implement __call__(self, ts) to return a tensor
Example:
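import numpy as np
import tskit

# BaseProcessor is assumed to come from workflow/scripts/ts_processors.py


class MyProcessor(BaseProcessor):
    default_config = {
        "my_param": 42,
        "filter_singletons": True,
    }

    def __init__(self, config: dict):
        # Pass the user config together with the class defaults
        super().__init__(config, self.default_config)

    def __call__(self, ts: tskit.TreeSequence) -> np.ndarray:
        # Extract features from tree sequence (extract_features is a placeholder)
        features = self.extract_features(ts)
        # Apply preprocessing (filter is a placeholder)
        if self.filter_singletons:
            features = self.filter(features)
        return features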
Technical Notes
All processors return numpy arrays or torch tensors
Output shapes must be consistent for batching
Variable-length outputs are padded with -1 or 0
Processors handle both haploid and diploid data
Population structure is preserved in multi-population processors
MAF filtering is applied before size limits