API Reference
This section provides detailed API documentation for the popgen-npe package modules.
Tree Sequence Processors
The ts_processors module transforms tree sequences into tensor representations for neural networks.
BaseProcessor
genotypes_and_distances
- class workflow.scripts.ts_processors.genotypes_and_distances(config: dict)
Bases: BaseProcessor
Extracts the genotype matrix together with the distance from each SNP to the next.
- default_config = {'max_freq': 1.0, 'max_snps': 2000, 'min_freq': 0.0, 'phased': False, 'polarised': True, 'position_scaling': 1000.0}
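To make "distance to next SNP" concrete, here is a minimal NumPy sketch. It is not the package's implementation: the helper name distances_to_next_snp is hypothetical, and both dividing by position_scaling and giving the last SNP a distance of 0 are assumptions made for illustration.

```python
import numpy as np

def distances_to_next_snp(positions, position_scaling=1000.0):
    """Distance from each SNP to the following one (hypothetical helper)."""
    positions = np.asarray(positions, dtype=float)
    # np.diff yields one gap fewer than there are SNPs; append 0 so that
    # every SNP, including the last, has a value (an assumed convention).
    gaps = np.append(np.diff(positions), 0.0)
    # Assumed use of position_scaling: express gaps in units of 1000 bp.
    return gaps / position_scaling

positions = [100.0, 600.0, 2600.0]
dist = distances_to_next_snp(positions)
print(dist)
```

The resulting vector has one entry per SNP, so it can be stacked alongside the genotype matrix as an extra row or channel.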
cnn_extract
- class workflow.scripts.ts_processors.cnn_extract(config: dict)
Bases: BaseProcessor
Feature extractor for CNN architectures. Extracts genotype matrices from tree sequences using dinf’s HaplotypeMatrix feature extractor and handles both single- and multiple-population cases automatically.
- default_config = {'maf_thresh': 0.05, 'n_snps': 500, 'phased': False, 'polarised': False}
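The maf_thresh option suggests that rare variants are dropped before the matrix is built. The sketch below shows one plausible reading of such a filter in plain NumPy. It does not call dinf, the function name filter_by_maf is hypothetical, and whether the real threshold is inclusive is an assumption.

```python
import numpy as np

def filter_by_maf(haplotypes, maf_thresh=0.05):
    """Keep SNP columns whose minor-allele frequency is at least maf_thresh."""
    haplotypes = np.asarray(haplotypes)
    freq = haplotypes.mean(axis=0)        # derived-allele frequency per SNP
    maf = np.minimum(freq, 1.0 - freq)    # fold to minor-allele frequency
    keep = maf >= maf_thresh              # inclusive threshold (assumed)
    return haplotypes[:, keep], keep

# Rows are haplotypes, columns are SNPs.
haps = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])
filtered, keep = filter_by_maf(haps)
print(keep, filtered.shape)
```

Columns 0 and 2 are monomorphic in this toy matrix, so only the two segregating SNPs survive the filter.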
tskit_sfs
- class workflow.scripts.ts_processors.tskit_sfs(config: dict)
Bases: BaseProcessor
Computes site frequency spectra (SFS) for single or multiple populations.
For a single population: returns the normalized SFS.
For multiple populations: returns the normalized joint SFS.
- default_config = {'log1p': False, 'mode': 'site', 'normalised': True, 'polarised': False, 'sample_sets': None, 'span_normalise': False, 'windows': None}
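As an illustration of what a normalized single-population SFS looks like, the sketch below computes one directly from a small haplotype matrix with NumPy. This is a stand-in for intuition only: the processor itself works on tree sequences, and the helper name normalised_sfs and the choice to normalize by the total count are assumptions.

```python
import numpy as np

def normalised_sfs(haplotypes):
    """SFS from a (num_samples, num_snps) 0/1 matrix, scaled to sum to 1."""
    haplotypes = np.asarray(haplotypes)
    num_samples = haplotypes.shape[0]
    derived_counts = haplotypes.sum(axis=0)   # derived alleles per SNP
    # One bin per possible count 0..num_samples, hence num_samples + 1 entries.
    sfs = np.bincount(derived_counts, minlength=num_samples + 1).astype(float)
    total = sfs.sum()
    return sfs / total if total > 0 else sfs

haps = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
])
sfs = normalised_sfs(haps)
print(sfs)
```

With three samples the spectrum has four entries, matching the (num_samples + 1,) shape used for single-population input elsewhere in this reference.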
tskit_windowed_sfs_plus_ld
- class workflow.scripts.ts_processors.tskit_windowed_sfs_plus_ld(config: dict)
Bases: BaseProcessor
Summary statistics processor that combines the windowed SFS with linkage disequilibrium statistics. It returns a vector holding the mean r2 across distances and the mean AFS, where each mean is taken over windows.
Currently implemented only for the single-population case.
- default_config = {'mode': 'site', 'polarised': True, 'sample_sets': None, 'span_normalise': False, 'window_size': 1000000}
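The "mean r2 across distances" part of this summary can be pictured with the following NumPy sketch, which averages pairwise r2 within distance bins. The function name mean_r2_by_distance, the bin edges, and the use of the squared Pearson correlation between SNP columns are assumptions for illustration and may differ from what the processor computes.

```python
import numpy as np

def mean_r2_by_distance(haplotypes, positions, bin_edges):
    """Average pairwise r^2 between SNPs, grouped by physical distance."""
    haplotypes = np.asarray(haplotypes, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # rowvar=False treats each SNP column as a variable.
    r2 = np.corrcoef(haplotypes, rowvar=False) ** 2
    i, j = np.triu_indices(len(positions), k=1)    # each SNP pair once
    dist = positions[j] - positions[i]
    pair_r2 = r2[i, j]
    bin_idx = np.digitize(dist, bin_edges) - 1     # 0-based bin index
    means = np.full(len(bin_edges) - 1, np.nan)    # NaN marks empty bins
    for b in range(len(bin_edges) - 1):
        in_bin = bin_idx == b
        if in_bin.any():
            means[b] = pair_r2[in_bin].mean()
    return means

# Four haplotypes, three segregating SNPs; SNPs 0 and 1 are in perfect LD.
haps = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
])
means = mean_r2_by_distance(haps, positions=[0, 100, 1000], bin_edges=[0, 500, 2000])
print(means)
```

Monomorphic columns would produce NaN correlations in this sketch, so a real pipeline would need to filter them first.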
SPIDNA_processor
- class workflow.scripts.ts_processors.SPIDNA_processor(config: dict)
Bases: BaseProcessor
Processor specifically designed for SPIDNA embedding networks.
- default_config = {'maf': 0.05, 'n_snps': 400, 'phased': True, 'polarised': True, 'relative_position': True}
ReLERNN_processor
- class workflow.scripts.ts_processors.ReLERNN_processor(config: dict)
Bases: BaseProcessor
Processor for the ReLERNN architecture, which requires phased genotypes.
- default_config = {'max_freq': 1.0, 'min_freq': 0.0, 'n_snps': 2000, 'phased': True, 'polarised': True}
Embedding Networks
The embedding_networks module provides neural network architectures that process tensor outputs from processors.
RNN
- class workflow.scripts.embedding_networks.RNN(*args: Any, **kwargs: Any)
Bases: Module
A recurrent neural network using bidirectional GRU layers for processing sequential genetic data.
Parameters:
input_size (int) – The input size of the GRU layer (e.g., num_individuals * ploidy)
output_size (int) – The dimension of the output feature vector
num_layers (int, optional) – Number of GRU layers (default: 2)
dropout (float, optional) – Dropout probability (default: 0.0)
Architecture:
Bidirectional GRU with configurable layers
MLP head with dropout for final embedding
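The following PyTorch sketch shows the general shape of a bidirectional GRU followed by an MLP head, as described above. It is a hypothetical stand-in and not the package's RNN class: the class name BiGRUEmbedding, the hidden_size argument, and the exact layout of the head are assumptions.

```python
import torch
from torch import nn

class BiGRUEmbedding(nn.Module):
    """Hypothetical stand-in for the documented RNN; hidden_size is assumed."""

    def __init__(self, input_size, output_size, num_layers=2, dropout=0.0, hidden_size=32):
        super().__init__()
        self.gru = nn.GRU(
            input_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # MLP head with dropout that maps the GRU summary to the embedding.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # x: (batch, sequence_length, input_size)
        _, h_n = self.gru(x)
        # h_n: (num_layers * 2, batch, hidden_size). The last two entries are
        # the final forward and backward states of the top layer.
        summary = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.head(summary)

model = BiGRUEmbedding(input_size=20, output_size=8)
embedding = model(torch.randn(4, 50, 20))  # 4 sequences, 50 SNPs, 20 haplotypes
print(embedding.shape)
```

Here input_size plays the role of num_individuals * ploidy, and each SNP is one step of the sequence.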
ExchangeableCNN
- class workflow.scripts.embedding_networks.ExchangeableCNN(*args: Any, **kwargs: Any)
Bases: Module
Implements the Exchangeable CNN (permutation-invariant CNN) from Chan et al. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/, which builds in the invariance of haplotype matrices to permutations of the individuals.
The main difference from the original is that the first CNN layer uses a wider kernel and stride to capture long-range LD.
If the input features come from multiple populations that may differ in num_snps and/or num_individuals, provide a list of tuples with each population's haplotype matrix shape in unmasked_x_shps. The forward pass then masks out all padded values of -1, which pad each haplotype matrix to the shape of the largest in the set.
The network has two CNN layers, followed by a symmetric layer that pools over the individual axis and a feature extractor (fully connected network). Each CNN layer consists of a 2D convolution with kernel and stride height = 1, an ELU activation, and a batch normalization layer. If the number of populations is greater than one, the output of the first CNN layer is concatenated along the last axis (as in pg-gan by Mathieson et al.). Global pooling then reduces the output to shape (batch_size, outchannels2, 1, 1), which is passed to the feature extractor.
Parameters:
output_dim (int, optional) – Dimension of the final output vector (default: 64)
input_rows (list of int, optional) – Number of rows (samples) per population
input_cols (list of int, optional) – Number of cols (SNPs) per population
channels (int, optional) – Number of input channels (default: 2)
symmetric_func (str, optional) – Symmetric pooling function: “max”, “mean”, or “sum” (default: “max”)
Architecture:
Two CNN layers with 2D convolutions (kernel heights = 1)
ELU activation and batch normalization
Symmetric pooling layer for permutation invariance
Global average pooling
Feature extractor MLP
Notes:
Supports multiple populations with different dimensions
Automatically masks padded values (-1) when processing multiple populations
First CNN layer uses wider kernel and stride for long-range LD capture
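The padding convention for multiple populations can be sketched in NumPy as follows. This only illustrates padding with -1 and recovering a mask from it. The helper pad_to_shape is hypothetical, and how the network applies the mask internally is not shown here.

```python
import numpy as np

def pad_to_shape(matrix, shape, pad_value=-1):
    """Place matrix in the top-left corner of an array filled with pad_value."""
    padded = np.full(shape, pad_value, dtype=matrix.dtype)
    padded[: matrix.shape[0], : matrix.shape[1]] = matrix
    return padded

# Two populations with different numbers of individuals and SNPs.
pop1 = np.ones((4, 6), dtype=np.int8)
pop2 = np.zeros((2, 3), dtype=np.int8)

target = (
    max(pop1.shape[0], pop2.shape[0]),
    max(pop1.shape[1], pop2.shape[1]),
)
stacked = np.stack([pad_to_shape(p, target) for p in (pop1, pop2)])

# True where a real genotype is present, False where the entry is padding.
mask = stacked != -1
# Zero out padded entries so they cannot contribute to later sums.
masked = np.where(mask, stacked, 0)
print(stacked.shape, mask[1].sum())
```

Because valid genotypes are never negative, the value -1 can serve as an unambiguous padding marker.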
SummaryStatisticsEmbedding
- class workflow.scripts.embedding_networks.SummaryStatisticsEmbedding(*args: Any, **kwargs: Any)
Bases: Module
Identity embedding layer for pre-computed summary statistics of a tree sequence. It takes in a tensor of summary statistics (e.g., SFS) and outputs the same tensor.
Parameters:
output_dim (int, optional) – Not used, maintained for API consistency
Input Formats:
Single population SFS: shape (num_samples + 1,)
Joint SFS: shape (num_samples_pop1 + 1, num_samples_pop2 + 1)
Notes:
Simply passes through pre-computed summary statistics
Automatically flattens multi-dimensional statistics
Converts numpy arrays to torch tensors if needed
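A pass-through embedding of this kind might look like the sketch below. The function identity_embedding is hypothetical, it handles a single unbatched input, and the cast to float32 is an assumption rather than documented behaviour.

```python
import numpy as np
import torch

def identity_embedding(stats):
    """Return the statistics unchanged, as a flat float tensor."""
    # torch.as_tensor accepts both NumPy arrays and existing tensors.
    tensor = torch.as_tensor(stats, dtype=torch.float32)
    return tensor.flatten()

# Joint SFS for 3 and 2 samples: shape (3 + 1, 2 + 1).
joint_sfs = np.arange(12, dtype=np.float64).reshape(4, 3)
embedded = identity_embedding(joint_sfs)
print(embedded.shape)
```

The flattened length is the product of the joint SFS dimensions, which is what a downstream density estimator would see as the embedding size.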
SPIDNA_embedding_network
- class workflow.scripts.embedding_networks.SPIDNA_embedding_network(*args: Any, **kwargs: Any)
Bases: Module
SPIDNA (Sequence Position Informed Deep Neural Architecture) network for processing genetic data with positional information.
Parameters:
output_dim (int, optional) – Dimension of output features (default: 64)
num_block (int, optional) – Number of SPIDNA blocks (default: 3)
num_feature (int, optional) – Number of convolutional features (default: 32)
Architecture:
Separate convolutional processing for position and SNP data
Sequential SPIDNA blocks with residual connections
Progressive feature aggregation across blocks
Input Format:
Shape: (batch, channels, samples, snps)
Channel 0: positional information
Channels 1+: SNP/haplotype data
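The channel layout above can be assembled and split apart as in this NumPy sketch. Repeating the position vector across every sample row is an assumption about how channel 0 is filled, and all variable names are illustrative.

```python
import numpy as np

num_samples, num_snps = 4, 6
rng = np.random.default_rng(0)
haplotypes = rng.integers(0, 2, size=(num_samples, num_snps)).astype(np.float32)
positions = np.sort(rng.random(num_snps)).astype(np.float32)

# Channel 0: positions repeated for every sample row (assumed layout).
position_channel = np.broadcast_to(positions, (num_samples, num_snps))
# Stack channels, then add a leading batch axis: (batch, channels, samples, snps).
x = np.stack([position_channel, haplotypes])[np.newaxis]

# Split the input back into its positional and SNP parts.
pos_part = x[:, :1]
snp_part = x[:, 1:]
print(x.shape, pos_part.shape, snp_part.shape)
```

Slicing with :1 and 1: keeps the channel axis, so both parts remain four-dimensional and can be fed to 2D convolutions.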
ReLERNN
- class workflow.scripts.embedding_networks.ReLERNN(*args: Any, **kwargs: Any)
Bases: Module
This module constructs a bidirectional GRU-based RNN following the architecture from https://github.com/kr-colab/ReLERNN/blob/master/ReLERNN/networks.py#L7.
It processes haplotype data along with the corresponding positional information to produce a feature embedding. Its output is consistent with the other embedding networks in this module.
- Parameters:
input_size (int) – The input size for the GRU layer (typically num_individuals * ploidy).
num_positions (int) – The number of genome positions in the input data.
output_dim (int, optional) – The dimension of the final embedded feature vector (default: 64).
Input:
x (torch.Tensor, shape (batch, sequence_length, 1 + input_size)) – The first feature along the last dimension is assumed to be positional data, while the remaining features are the haplotype representation.
Output:
torch.Tensor, shape (batch, output_dim) – The embedded feature vector.
ReLERNN architecture following the design from https://github.com/kr-colab/ReLERNN/. Combines recurrent processing of haplotypes with positional information.
Parameters:
input_size (int) – Input size for GRU (num_individuals * ploidy)
n_snps (int) – Number of SNPs in the input data
output_size (int, optional) – Output embedding dimension (default: 64)
shuffle_genotypes (bool, optional) – Shuffle genotypes during training (default: False)
Architecture:
Bidirectional GRU for haplotype processing
Separate linear layer for positional encoding
Concatenated features passed through MLP
Dropout for regularization
Input Format:
Shape: (batch, sequence_length, 1 + input_size)
First feature: positional data
Remaining features: haplotype representation
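To illustrate the input layout, the NumPy sketch below builds an array of shape (batch, sequence_length, 1 + input_size) and separates the positional feature from the haplotypes. The variable names are illustrative, and treating each SNP as one sequence step is an assumption drawn from the shape description.

```python
import numpy as np

num_haplotypes, num_snps = 6, 10   # input_size = 6, sequence_length = 10
rng = np.random.default_rng(1)
haplotypes = rng.integers(0, 2, size=(num_haplotypes, num_snps)).astype(np.float32)
positions = np.sort(rng.random(num_snps)).astype(np.float32)

# One row per SNP: [position, haplotype_0, ..., haplotype_5].
features = np.concatenate([positions[:, None], haplotypes.T], axis=1)
x = features[None]                 # add the batch axis

# Separate the two parts again, as a forward pass would need to.
pos = x[..., 0]                    # (batch, sequence_length)
haps = x[..., 1:]                  # (batch, sequence_length, input_size)
print(x.shape, pos.shape, haps.shape)
```

The haplotype matrix is transposed so that SNPs run along the sequence axis and individuals along the feature axis.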
Supporting Classes
- class workflow.scripts.embedding_networks.SymmetricLayer(*args: Any, **kwargs: Any)
Bases: Module
Permutation-invariant pooling layer. Applies a permutation-invariant function (max, mean, or sum) along a specified axis of the input data.
Parameters:
axis (int) – Dimension along which to apply the symmetric function
func (str, optional) – Function type: “max”, “mean”, or “sum” (default: “max”)
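The permutation invariance that this layer provides can be checked numerically. The sketch below uses NumPy reductions as a stand-in for the layer. The helper symmetric_pool is hypothetical, and the axis ordering (batch, channels, individuals, snps) is an assumption.

```python
import numpy as np

def symmetric_pool(x, axis, func="max"):
    """Apply a permutation-invariant reduction along one axis."""
    reducers = {"max": np.max, "mean": np.mean, "sum": np.sum}
    return reducers[func](x, axis=axis)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 5, 7))           # (batch, channels, individuals, snps)
shuffled = x[:, :, rng.permutation(5), :]   # reorder the individuals

# Pooling over the individual axis gives the same result for any ordering.
results = {
    func: np.allclose(symmetric_pool(x, 2, func), symmetric_pool(shuffled, 2, func))
    for func in ("max", "mean", "sum")
}
print(results)
```

The comparison uses np.allclose because mean and sum can differ in the last floating-point digits when the summation order changes.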
- class workflow.scripts.embedding_networks.SPIDNABlock(*args: Any, **kwargs: Any)
Bases: Module
Basic building block of the SPIDNA architecture.
Parameters:
num_feature (int) – Number of feature channels
output_dim (int) – Output dimension for feature aggregation
Architecture:
Convolutional layer with batch normalization
Sample-wise averaging for feature extraction
Residual connection to output
Max pooling for spatial dimension reduction