API Reference

This section provides detailed API documentation for the popgen-npe package modules.

Tree Sequence Processors

The ts_processors module transforms tree sequences into tensor representations for neural networks.

BaseProcessor

class workflow.scripts.ts_processors.BaseProcessor(config: dict, default: dict)[source]

Bases: object

Base class for all processors. Handles configuration and default parameters.
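
The constructor signature suggests that a user-supplied config is combined with a per-processor set of defaults. A minimal sketch of that pattern follows; `SketchProcessor` is a hypothetical stand-in, and the assumption that user keys override defaults is not confirmed by this reference.

```python
from copy import deepcopy


class SketchProcessor:
    """Hypothetical stand-in for a BaseProcessor-style class."""

    def __init__(self, config: dict, default: dict):
        # Start from a copy so the shared defaults dict is never mutated.
        merged = deepcopy(default)
        # Assumption: user-supplied keys take precedence over defaults.
        merged.update(config)
        self.config = merged


default_config = {"max_snps": 2000, "phased": False, "position_scaling": 1000.0}
proc = SketchProcessor({"max_snps": 500}, default_config)
print(proc.config)
```

Only `max_snps` changes in the merged configuration; the other keys keep their default values.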

genotypes_and_distances

class workflow.scripts.ts_processors.genotypes_and_distances(config: dict)[source]

Bases: BaseProcessor

Genotype matrix and distance to next SNP

Extracts genotype matrix with inter-SNP distances.

default_config = {'max_freq': 1.0, 'max_snps': 2000, 'min_freq': 0.0, 'phased': False, 'polarised': True, 'position_scaling': 1000.0}
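
To make the output format concrete, here is a NumPy sketch of pairing a genotype matrix with the scaled distance to the next SNP. The function name, the column layout (distance first), the truncation to `max_snps`, and the zero used for the last SNP are all assumptions for illustration, not the package's implementation.

```python
import numpy as np


def genotypes_with_distances(genotypes, positions, max_snps=2000, position_scaling=1000.0):
    """genotypes: (num_snps, num_samples); positions: sorted (num_snps,) array."""
    genotypes = np.asarray(genotypes)[:max_snps]
    positions = np.asarray(positions, dtype=float)[:max_snps]
    # Distance from each SNP to the next; the last SNP has no successor, so use 0.
    distances = np.append(np.diff(positions), 0.0) / position_scaling
    # Place the scaled distance in the first column, genotypes in the rest.
    return np.column_stack([distances, genotypes])


geno = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 0]])
pos = np.array([100.0, 600.0, 2600.0])
features = genotypes_with_distances(geno, pos)
print(features.shape)  # one row per SNP, one extra column for the distance
```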

cnn_extract

class workflow.scripts.ts_processors.cnn_extract(config: dict)[source]

Bases: BaseProcessor

Extract genotype matrices from tree sequences using dinf’s feature extractor. Handles both single and multiple population cases automatically.

Feature extractor for CNN architectures using dinf’s HaplotypeMatrix.

default_config = {'maf_thresh': 0.05, 'n_snps': 500, 'phased': False, 'polarised': False}

tskit_sfs

class workflow.scripts.ts_processors.tskit_sfs(config: dict)[source]

Bases: BaseProcessor

Site frequency spectrum processor that handles both single and multiple populations. For a single population, it returns the normalized SFS; for multiple populations, it returns the normalized joint SFS.

Computes site frequency spectra for single or multiple populations.

default_config = {'log1p': False, 'mode': 'site', 'normalised': True, 'polarised': False, 'sample_sets': None, 'span_normalise': False, 'windows': None}
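
The option names `mode`, `polarised`, `sample_sets`, `span_normalise`, and `windows` match arguments of tskit's `TreeSequence.allele_frequency_spectrum`, so the processor presumably forwards them to that method. As a library-free illustration of what a normalised, polarised single-population SFS contains, here is a NumPy sketch; `simple_sfs` is a hypothetical helper, and applying `log1p` after normalisation is an assumption.

```python
import numpy as np


def simple_sfs(haplotypes, normalised=True, log1p=False):
    """haplotypes: (num_sites, num_samples) array of 0/1 derived-allele flags."""
    haplotypes = np.asarray(haplotypes)
    num_samples = haplotypes.shape[1]
    derived_counts = haplotypes.sum(axis=1)
    # Entry k counts the sites where exactly k samples carry the derived allele.
    sfs = np.bincount(derived_counts, minlength=num_samples + 1).astype(float)
    if normalised:
        sfs = sfs / sfs.sum()
    if log1p:
        # Assumption: the transform is applied after normalisation.
        sfs = np.log1p(sfs)
    return sfs


haps = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 0], [1, 1, 1, 0]])
sfs = simple_sfs(haps)
print(sfs)
```

The result has `num_samples + 1` entries, which is the single-population input shape expected by `SummaryStatisticsEmbedding`.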

tskit_windowed_sfs_plus_ld

class workflow.scripts.ts_processors.tskit_windowed_sfs_plus_ld(config: dict)[source]

Bases: BaseProcessor

Summary statistics processor that returns a vector containing the mean r² across distances and the mean allele frequency spectrum (AFS), where each mean is taken over windows.

The mean is currently computed only for the single-population case.

Combines windowed SFS with linkage disequilibrium statistics.

default_config = {'mode': 'site', 'polarised': True, 'sample_sets': None, 'span_normalise': False, 'window_size': 1000000}
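
For readers unfamiliar with the LD component, the sketch below shows one way to average r² between pairs of sites within distance bins, using NumPy only. The function name, the binning scheme, and the use of Pearson correlation on 0/1 haplotypes are assumptions; the package's own windowing and distance classes may differ.

```python
import numpy as np


def mean_r2_by_distance(haplotypes, positions, bin_edges):
    """haplotypes: (num_sites, num_samples) 0/1 array with no monomorphic sites."""
    haplotypes = np.asarray(haplotypes, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # Squared Pearson correlation between every pair of sites (rows).
    r2 = np.corrcoef(haplotypes) ** 2
    # Keep each unordered pair once: upper triangle without the diagonal.
    i, j = np.triu_indices(len(positions), k=1)
    pair_dist = np.abs(positions[j] - positions[i])
    pair_r2 = r2[i, j]
    bin_index = np.digitize(pair_dist, bin_edges) - 1
    means = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        in_bin = bin_index == b
        if in_bin.any():
            means[b] = pair_r2[in_bin].mean()
    return means


haps = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]])
pos = np.array([0.0, 100.0, 5000.0])
ld = mean_r2_by_distance(haps, pos, bin_edges=[0, 1000, 10000])
print(ld)
```

In this toy input the two nearby sites are identical (r² of 1) and the distant site is uncorrelated with both, so the short-distance bin is high and the long-distance bin is zero.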

SPIDNA_processor

class workflow.scripts.ts_processors.SPIDNA_processor(config: dict)[source]

Bases: BaseProcessor

Processor specifically designed for SPIDNA embedding networks.

default_config = {'maf': 0.05, 'n_snps': 400, 'phased': True, 'polarised': True, 'relative_position': True}

ReLERNN_processor

class workflow.scripts.ts_processors.ReLERNN_processor(config: dict)[source]

Bases: BaseProcessor

Processor for ReLERNN architecture with phased genotype requirements.

default_config = {'max_freq': 1.0, 'min_freq': 0.0, 'n_snps': 2000, 'phased': True, 'polarised': True}

Embedding Networks

The embedding_networks module provides neural network architectures that process tensor outputs from processors.

RNN

class workflow.scripts.embedding_networks.RNN(*args: Any, **kwargs: Any)[source]

Bases: Module

A recurrent neural network using bidirectional GRU layers for processing sequential genetic data.

Parameters:

input_size (int) – The input size of the GRU layer (e.g., num_individuals * ploidy)
output_size (int) – The dimension of the output feature vector
num_layers (int, optional) – Number of GRU layers (default: 2)
dropout (float, optional) – Dropout probability (default: 0.0)

Architecture:

Bidirectional GRU with configurable layers
MLP head with dropout for final embedding

forward(x)[source]

ExchangeableCNN

class workflow.scripts.embedding_networks.ExchangeableCNN(*args: Any, **kwargs: Any)[source]

Bases: Module

This implements the Exchangeable CNN, or permutation-invariant CNN, from Chan et al. 2018 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/), which builds in the invariance of the haplotype matrices to permutations of the individuals.

The main difference is that the first CNN layer has a wider kernel and stride to capture long-range LD.

If input features come from multiple populations that may differ in num_snps and/or num_individuals, provide a list of tuples with each population's haplotype matrix shape in unmasked_x_shps. The forward pass then masks out all padded values of -1, which pad each haplotype matrix to the shape of the largest in the set.

It has two CNN layers, followed by a symmetric layer that pools over the individual axis and a feature extractor (fully connected network). Each CNN layer has a 2D convolution layer with kernel and stride height = 1, ELU activation, and a batch normalization layer. If the number of populations is greater than one, the output of the first CNN layer is concatenated along the last axis (same as pg-gan by Mathieson et al.). Global pooling then reduces the output to shape (batch_size, outchannels2, 1, 1), which is passed to the feature extractor.

Implements the Exchangeable CNN (permutation-invariant CNN) from Chan et al. 2018. This architecture builds in invariance to permutations of individuals in haplotype matrices.

Parameters:

output_dim (int, optional) – Dimension of the final output vector (default: 64)
input_rows (list of int, optional) – Number of rows (samples) per population
input_cols (list of int, optional) – Number of cols (SNPs) per population
channels (int, optional) – Number of input channels (default: 2)
symmetric_func (str, optional) – Symmetric pooling function: “max”, “mean”, or “sum” (default: “max”)

Architecture:

Two CNN layers with 2D convolutions (kernel heights = 1)
ELU activation and batch normalization
Symmetric pooling layer for permutation invariance
Global average pooling
Feature extractor MLP

Notes:

Supports multiple populations with different dimensions
Automatically masks padded values (-1) when processing multiple populations
First CNN layer uses wider kernel and stride for long-range LD capture
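
The padding convention for multiple populations can be illustrated without the network itself. The NumPy sketch below pads each population's matrix with -1 to the largest shape, builds a boolean mask, and recovers the originals from a list of unmasked shapes. `pad_to_common_shape` is a hypothetical helper; only the -1 pad value and the idea of a per-population shape list (`unmasked_x_shps`) come from the description above.

```python
import numpy as np

PAD_VALUE = -1


def pad_to_common_shape(matrices):
    """Pad each (num_individuals, num_snps) matrix with -1 to the largest shape."""
    max_rows = max(m.shape[0] for m in matrices)
    max_cols = max(m.shape[1] for m in matrices)
    padded = np.full((len(matrices), max_rows, max_cols), PAD_VALUE, dtype=int)
    for k, m in enumerate(matrices):
        padded[k, : m.shape[0], : m.shape[1]] = m
    return padded


pop1 = np.array([[0, 1, 1], [1, 0, 1]])  # 2 individuals, 3 SNPs
pop2 = np.array([[1, 0]])                # 1 individual, 2 SNPs
stacked = pad_to_common_shape([pop1, pop2])

# True where real data is present, False where padding was inserted.
mask = stacked != PAD_VALUE

# Slicing with the unmasked shapes recovers each population's original matrix.
unmasked_x_shps = [m.shape for m in (pop1, pop2)]
recovered = [stacked[k, :r, :c] for k, (r, c) in enumerate(unmasked_x_shps)]
```

Because genotypes are coded as non-negative values, -1 is unambiguous as a pad marker.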

forward(x)[source]

SummaryStatisticsEmbedding

class workflow.scripts.embedding_networks.SummaryStatisticsEmbedding(*args: Any, **kwargs: Any)[source]

Bases: Module

Embed summary statistics of a tree sequence. This is simply an identity layer that takes in a tensor of summary statistics (e.g., SFS) and outputs the same tensor.

For a single-population SFS, the input shape is (num_samples + 1,). For a joint SFS, the input shape is (num_samples_pop1 + 1, num_samples_pop2 + 1).

Identity embedding layer for pre-computed summary statistics.

Parameters:

output_dim (int, optional) – Not used, maintained for API consistency

Input Formats:

Single population SFS: shape (num_samples + 1,)
Joint SFS: shape (num_samples_pop1 + 1, num_samples_pop2 + 1)

Notes:

Simply passes through pre-computed summary statistics
Automatically flattens multi-dimensional statistics
Converts numpy arrays to torch tensors if needed
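
The pass-through behaviour can be sketched in NumPy as below. This is an illustration rather than the module's code: it assumes a leading batch axis, and it omits the conversion to torch tensors so that the example has no PyTorch dependency.

```python
import numpy as np


def identity_embedding(stats):
    """Return the statistics unchanged, flattened to one vector per batch item."""
    stats = np.asarray(stats, dtype=float)
    # Assumption: the leading axis is the batch axis; all others are flattened.
    return stats.reshape(stats.shape[0], -1)


# Batch of 2 joint SFS arrays for populations with 2 and 3 samples: (2, 3, 4).
joint_sfs = np.arange(2 * 3 * 4).reshape(2, 3, 4)
embedded = identity_embedding(joint_sfs)
print(embedded.shape)
```

The values are untouched; only the shape changes, so a downstream density estimator sees a flat feature vector.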

embedding(x)[source]: Consistent with other embedding networks, provide an embedding method that returns the same output as forward() since this is an identity layer

forward(x)[source]

SPIDNA_embedding_network

class workflow.scripts.embedding_networks.SPIDNA_embedding_network(*args: Any, **kwargs: Any)[source]

Bases: Module

SPIDNA architecture for processing genetic data.

Parameters:

output_dim (int) – Dimension of the output feature vector
num_block (int) – Number of SPIDNA blocks in the network
num_feature (int) – Number of features in the convolutional layers

SPIDNA (Sequence Position-Informed Deep Neural Architecture) network for processing genetic data together with positional information.

Parameters:

output_dim (int, optional) – Dimension of output features (default: 64)
num_block (int, optional) – Number of SPIDNA blocks (default: 3)
num_feature (int, optional) – Number of convolutional features (default: 32)

Architecture:

Separate convolutional processing for position and SNP data
Sequential SPIDNA blocks with residual connections
Progressive feature aggregation across blocks

Input Format:

Shape: (batch, channels, samples, snps)
Channel 0: positional information
Channels 1+: SNP/haplotype data
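
The channel layout above can be checked with a small NumPy array. The sizes are arbitrary, and repeating the position vector across the sample axis is an assumption made so that both channels share one shape.

```python
import numpy as np

batch, channels, samples, snps = 2, 2, 4, 5
rng = np.random.default_rng(0)

x = np.empty((batch, channels, samples, snps))
# Channel 0: positions, broadcast across the sample axis (an assumption).
x[:, 0] = rng.random((batch, 1, snps))
# Channel 1: 0/1 SNP data for each sample.
x[:, 1] = rng.integers(0, 2, size=(batch, samples, snps))

# Split into the two streams that are processed separately.
positions = x[:, :1]  # shape (batch, 1, samples, snps)
snp_data = x[:, 1:]   # shape (batch, channels - 1, samples, snps)
```

Slicing with `:1` and `1:` rather than integer indices keeps the channel axis, which convolutional layers expect.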

embedding(x)[source]

forward(x)[source]

ReLERNN

class workflow.scripts.embedding_networks.ReLERNN(*args: Any, **kwargs: Any)[source]

Bases: Module

This module constructs a bi-directional GRU based RNN following the architecture from https://github.com/kr-colab/ReLERNN/blob/master/ReLERNN/networks.py#L7.

It processes haplotype data along with corresponding positional information to produce a feature embedding. Its output is consistent with the other embedding networks in this module.

Parameters:

input_size (int) – The input size for the GRU layer (typically num_individuals * ploidy).
num_positions (int) – The number of genome positions in the input data.
output_dim (int, optional) – The dimension of the final embedded feature vector (default: 64).

Input:

x (torch.Tensor, shape (batch, sequence_length, 1 + input_size)) – The first feature along the last dimension is assumed to be positional data, while the remaining features are the haplotype representation.

Output:

torch.Tensor, shape (batch, output_dim) – The embedded feature vector.

ReLERNN architecture following the design from https://github.com/kr-colab/ReLERNN/. Combines recurrent processing of haplotypes with positional information.

Parameters:

input_size (int) – Input size for GRU (num_individuals * ploidy)
n_snps (int) – Number of SNPs in the input data
output_size (int, optional) – Output embedding dimension (default: 64)
shuffle_genotypes (bool, optional) – Shuffle genotypes during training (default: False)

Architecture:

Bidirectional GRU for haplotype processing
Separate linear layer for positional encoding
Concatenated features passed through MLP
Dropout for regularization

Input Format:

Shape: (batch, sequence_length, 1 + input_size)
First feature: positional data
Remaining features: haplotype representation
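
A NumPy sketch of assembling and splitting an input with this layout follows. The sizes are arbitrary illustrations, and the sorted random positions are synthetic data, not output from `ReLERNN_processor`.

```python
import numpy as np

num_individuals, ploidy, sequence_length, batch = 3, 2, 5, 4
input_size = num_individuals * ploidy

rng = np.random.default_rng(1)
# Synthetic positions, sorted along the sequence axis.
positions = np.sort(rng.random((batch, sequence_length, 1)), axis=1)
haplotypes = rng.integers(0, 2, size=(batch, sequence_length, input_size))

# Positions occupy the first feature; haplotypes fill the remaining ones.
x = np.concatenate([positions, haplotypes], axis=-1)

pos_part = x[..., 0]   # shape (batch, sequence_length)
hap_part = x[..., 1:]  # shape (batch, sequence_length, input_size)
```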

forward(x)[source]

Supporting Classes

class workflow.scripts.embedding_networks.SymmetricLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

Layer that performs some permutation-invariant function along a specified axis of input data.

The permutation-invariant function can be any of max, mean, or sum.

Permutation-invariant pooling layer.

Parameters:

axis (int) – Dimension along which to apply the symmetric function
func (str, optional) – Function type: “max”, “mean”, or “sum” (default: “max”)
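
Why pooling with max, mean, or sum gives permutation invariance can be verified numerically. The NumPy sketch below is an illustration, not the PyTorch layer; `symmetric_pool` is a hypothetical function, and keeping the pooled axis with size 1 is an assumption.

```python
import numpy as np

SYMMETRIC_FUNCS = {"max": np.max, "mean": np.mean, "sum": np.sum}


def symmetric_pool(x, axis, func="max"):
    """Reduce `axis` with an order-independent function, keeping the axis."""
    return SYMMETRIC_FUNCS[func](x, axis=axis, keepdims=True)


rng = np.random.default_rng(2)
x = rng.random((2, 8, 6, 10))  # (batch, channels, individuals, snps)

# Reorder the individuals; a symmetric function must give the same result.
shuffled = x[:, :, rng.permutation(6), :]

pooled = symmetric_pool(x, axis=2, func="mean")
invariant = all(
    np.allclose(symmetric_pool(x, 2, f), symmetric_pool(shuffled, 2, f))
    for f in SYMMETRIC_FUNCS
)
```

Any reduction whose result does not depend on the order of its inputs would work here, which is why the layer can offer all three options.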

forward(x)[source]

class workflow.scripts.embedding_networks.SPIDNABlock(*args: Any, **kwargs: Any)[source]

Bases: Module

Basic unit of the SPIDNA architecture for processing genetic data.

Basic building block for SPIDNA architecture.

Parameters:

num_feature (int) – Number of feature channels
output_dim (int) – Output dimension for feature aggregation

Architecture:

Convolutional layer with batch normalization
Sample-wise averaging for feature extraction
Residual connection to output
Max pooling for spatial dimension reduction

forward(x, output)[source]