API Reference

This section provides detailed API documentation for the popgen-npe package modules.

Tree Sequence Processors

The ts_processors module transforms tree sequences into tensor representations for neural networks.

BaseProcessor

class workflow.scripts.ts_processors.BaseProcessor(config: dict, default: dict)[source]

Bases: object

Base class for all processors. Handles configuration and default parameters.
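
As a rough illustration of what "handles configuration and default parameters" can mean, the sketch below merges a user configuration over a set of defaults. The class name SketchProcessor and the merge-then-set-attributes behaviour are assumptions made for this example, not the package's actual implementation.

```python
class SketchProcessor:
    """Hypothetical stand-in for a processor base class (illustration only)."""

    def __init__(self, config: dict, default: dict):
        # Start from the defaults, then let user-supplied keys override them.
        self.config = {**default, **config}
        # Expose each setting as an attribute for convenient access.
        for key, value in self.config.items():
            setattr(self, key, value)


defaults = {"n_snps": 500, "phased": False}
proc = SketchProcessor({"n_snps": 1000}, defaults)
print(proc.config)
```

Building a new dictionary rather than updating the defaults in place keeps a shared default_config unchanged across instances.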

genotypes_and_distances

class workflow.scripts.ts_processors.genotypes_and_distances(config: dict)[source]

Bases: BaseProcessor

Genotype matrix and distance to next SNP

Extracts genotype matrix with inter-SNP distances.

default_config = {'max_freq': 1.0, 'max_snps': 2000, 'min_freq': 0.0, 'phased': False, 'polarised': True, 'position_scaling': 1000.0}
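
The following sketch shows one way a genotype matrix could be paired with the distance to the next SNP. The helper name genotypes_with_distances, the (snps, individuals, 2) output layout, and the zero distance assigned to the last SNP are all assumptions for illustration; only max_snps and position_scaling are taken from the default configuration above.

```python
import numpy as np


def genotypes_with_distances(genotypes, positions, max_snps=2000, position_scaling=1000.0):
    """Stack a (snps, individuals) genotype matrix with distance-to-next-SNP.

    Hypothetical helper: the output layout is an assumption for illustration.
    """
    genotypes = np.asarray(genotypes, dtype=float)[:max_snps]
    positions = np.asarray(positions, dtype=float)[:max_snps]
    # Distance from each SNP to the next one; the last SNP gets 0.
    distances = np.append(np.diff(positions), 0.0) / position_scaling
    # Repeat the per-SNP distance for every individual so both layers align.
    dist_channel = np.repeat(distances[:, None], genotypes.shape[1], axis=1)
    # Result has shape (snps, individuals, 2): genotype and scaled distance.
    return np.stack([genotypes, dist_channel], axis=-1)


geno = np.array([[0, 1, 2], [1, 1, 0], [0, 0, 1]])  # 3 SNPs x 3 individuals
pos = np.array([100.0, 1100.0, 3100.0])
out = genotypes_with_distances(geno, pos)
print(out.shape)
```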

cnn_extract

class workflow.scripts.ts_processors.cnn_extract(config: dict)[source]

Bases: BaseProcessor

Extract genotype matrices from tree sequences using dinf’s feature extractor. Handles both single and multiple population cases automatically.

Feature extractor for CNN architectures using dinf’s HaplotypeMatrix.

default_config = {'maf_thresh': 0.05, 'n_snps': 500, 'phased': False, 'polarised': False}
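
To make the maf_thresh and n_snps settings concrete, here is a small NumPy sketch of minor allele frequency filtering followed by truncation. It does not call dinf; the helper name filter_by_maf, the (haplotypes, snps) layout, and the use of >= for the threshold are assumptions.

```python
import numpy as np


def filter_by_maf(haplotypes, maf_thresh=0.05, n_snps=500):
    """Keep SNP columns whose minor allele frequency is at least maf_thresh.

    Hypothetical helper; haplotypes has shape (haplotypes, snps) with 0/1 entries.
    """
    haplotypes = np.asarray(haplotypes)
    freq = haplotypes.mean(axis=0)          # derived allele frequency per SNP
    maf = np.minimum(freq, 1.0 - freq)      # fold to minor allele frequency
    kept = haplotypes[:, maf >= maf_thresh]
    return kept[:, :n_snps]                 # truncate to at most n_snps columns


hap = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
])  # first column is monomorphic
kept = filter_by_maf(hap)
print(kept.shape)
```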

tskit_sfs

class workflow.scripts.ts_processors.tskit_sfs(config: dict)[source]

Bases: BaseProcessor

Site frequency spectrum processor that handles both single and multiple populations. For a single population, it returns the normalized SFS; for multiple populations, it returns the normalized joint SFS.

Computes site frequency spectra for single or multiple populations.

default_config = {'log1p': False, 'mode': 'site', 'normalised': True, 'polarised': False, 'sample_sets': None, 'span_normalise': False, 'windows': None}
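
The sketch below illustrates what a normalised single-population SFS looks like, computed directly from a 0/1 haplotype matrix with NumPy rather than from a tree sequence. The helper name normalised_sfs is hypothetical, and the spectrum is unfolded for simplicity, so it does not reproduce the processor's polarised, log1p, or windows options.

```python
import numpy as np


def normalised_sfs(haplotypes):
    """Unfolded SFS from a (haplotypes, snps) 0/1 matrix, scaled to sum to 1."""
    haplotypes = np.asarray(haplotypes)
    n = haplotypes.shape[0]
    counts = haplotypes.sum(axis=0)                       # derived allele count per SNP
    sfs = np.bincount(counts, minlength=n + 1).astype(float)
    total = sfs.sum()
    return sfs / total if total > 0 else sfs


hap = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])  # 4 haplotypes x 4 SNPs with derived counts 1, 1, 3, 2
sfs = normalised_sfs(hap)
print(sfs)
```

The result has num_samples + 1 entries, one for each possible derived allele count from 0 to num_samples.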

tskit_windowed_sfs_plus_ld

class workflow.scripts.ts_processors.tskit_windowed_sfs_plus_ld(config: dict)[source]

Bases: BaseProcessor

Summary statistics processor that returns a vector of the mean r² across distances and the mean AFS, where each mean is taken over windows.

The mean is currently computed only for the single-population case.

Combines windowed SFS with linkage disequilibrium statistics.

default_config = {'mode': 'site', 'polarised': True, 'sample_sets': None, 'span_normalise': False, 'window_size': 1000000}
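
For readers unfamiliar with the r² statistic, this sketch computes the mean squared correlation over all pairs of SNP columns in a haplotype matrix. It is a simplified illustration: the helper name mean_r2 is hypothetical, and the processor's binning by distance and averaging over windows are not reproduced here.

```python
import numpy as np


def mean_r2(haplotypes):
    """Mean squared correlation (r^2) over all pairs of SNP columns.

    Assumes every column is polymorphic, otherwise the correlation is undefined.
    """
    haplotypes = np.asarray(haplotypes, dtype=float)
    corr = np.corrcoef(haplotypes, rowvar=False)    # (snps, snps) correlation matrix
    upper = np.triu_indices_from(corr, k=1)         # each unordered pair once
    return float(np.mean(corr[upper] ** 2))


# Columns 0 and 1 are identical and column 2 is their complement,
# so every pair is in complete linkage disequilibrium.
hap = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
])
value = mean_r2(hap)
print(value)
```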

SPIDNA_processor

class workflow.scripts.ts_processors.SPIDNA_processor(config: dict)[source]

Bases: BaseProcessor

Processor specifically designed for SPIDNA embedding networks.

default_config = {'maf': 0.05, 'n_snps': 400, 'phased': True, 'polarised': True, 'relative_position': True}

ReLERNN_processor

class workflow.scripts.ts_processors.ReLERNN_processor(config: dict)[source]

Bases: BaseProcessor

Processor for ReLERNN architecture with phased genotype requirements.

default_config = {'max_freq': 1.0, 'min_freq': 0.0, 'n_snps': 2000, 'phased': True, 'polarised': True}

Embedding Networks

The embedding_networks module provides neural network architectures that process tensor outputs from processors.

RNN

class workflow.scripts.embedding_networks.RNN(*args: Any, **kwargs: Any)[source]

Bases: Module

A recurrent neural network using bidirectional GRU layers for processing sequential genetic data.

Parameters:

  • input_size (int) – The input size of the GRU layer (e.g., num_individuals * ploidy)

  • output_size (int) – The dimension of the output feature vector

  • num_layers (int, optional) – Number of GRU layers (default: 2)

  • dropout (float, optional) – Dropout probability (default: 0.0)

Architecture:

  • Bidirectional GRU with configurable layers

  • MLP head with dropout for final embedding

forward(x)[source]

ExchangeableCNN

class workflow.scripts.embedding_networks.ExchangeableCNN(*args: Any, **kwargs: Any)[source]

Bases: Module

This implements the Exchangeable CNN, or permutation-invariant CNN, from:

Chan et al. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/

which builds in the invariance of the haplotype matrices to permutations of the individuals.

The main difference is that the first CNN layer has a wider kernel and stride to capture long-range LD.

If input features come from multiple populations that may differ in num_snps and/or num_individuals, provide a list of tuples with each population's haplotype matrix shape in unmasked_x_shps. The forward pass will then mask out all padded values of -1, which pad each haplotype matrix to the shape of the largest in the set.

It has two CNN layers, followed by a symmetric layer that pools over the individual axis and a feature extractor (a fully connected network). Each CNN layer consists of a 2D convolution with kernel and stride height of 1, an ELU activation, and a batch normalization layer. If the number of populations is greater than one, the per-population outputs of the first CNN layer are concatenated along the last axis (as in pg-gan by Mathieson et al.). A global pooling step then reduces the output to shape (batch_size, outchannels2, 1, 1), which is passed to the feature extractor.

Implements the Exchangeable CNN (permutation-invariant CNN) from Chan et al. 2018. This architecture builds in invariance to permutations of individuals in haplotype matrices.

Parameters:

  • output_dim (int, optional) – Dimension of the final output vector (default: 64)

  • input_rows (list of int, optional) – Number of rows (samples) per population

  • input_cols (list of int, optional) – Number of cols (SNPs) per population

  • channels (int, optional) – Number of input channels (default: 2)

  • symmetric_func (str, optional) – Symmetric pooling function: “max”, “mean”, or “sum” (default: “max”)

Architecture:

  • Two CNN layers with 2D convolutions (kernel heights = 1)

  • ELU activation and batch normalization

  • Symmetric pooling layer for permutation invariance

  • Global average pooling

  • Feature extractor MLP

Notes:

  • Supports multiple populations with different dimensions

  • Automatically masks padded values (-1) when processing multiple populations

  • First CNN layer uses wider kernel and stride for long-range LD capture
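
The padding and masking convention for multiple populations can be sketched in NumPy as follows. The helper name pad_populations and the (populations, individuals, snps) layout are assumptions for illustration; only the pad value of -1 comes from the description above.

```python
import numpy as np


def pad_populations(matrices, pad_value=-1):
    """Pad each (individuals, snps) matrix to the largest shape in the set."""
    max_rows = max(m.shape[0] for m in matrices)
    max_cols = max(m.shape[1] for m in matrices)
    padded = np.full((len(matrices), max_rows, max_cols), pad_value, dtype=float)
    for i, m in enumerate(matrices):
        padded[i, : m.shape[0], : m.shape[1]] = m
    return padded


pop_a = np.ones((4, 6))
pop_b = np.zeros((2, 3))
padded = pad_populations([pop_a, pop_b])
mask = padded != -1                      # True where real data is present
# Recover the unpadded shape of the second population from the mask.
rows, cols = mask[1].any(axis=1).sum(), mask[1].any(axis=0).sum()
print(padded.shape, rows, cols)
```

Using -1 as the pad value works because genotype entries are non-negative, so padded cells can never be confused with data.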

forward(x)[source]

SummaryStatisticsEmbedding

class workflow.scripts.embedding_networks.SummaryStatisticsEmbedding(*args: Any, **kwargs: Any)[source]

Bases: Module

Embed summary statistics of a tree sequence. This is simply an identity layer that takes in a tensor of summary statistics (e.g., SFS) and outputs the same tensor.

For a single-population SFS, the input shape is (num_samples + 1,). For a joint SFS, the input shape is (num_samples_pop1 + 1, num_samples_pop2 + 1).

Identity embedding layer for pre-computed summary statistics.

Parameters:

  • output_dim (int, optional) – Not used, maintained for API consistency

Input Formats:

  • Single population SFS: shape (num_samples + 1,)

  • Joint SFS: shape (num_samples_pop1 + 1, num_samples_pop2 + 1)

Notes:

  • Simply passes through pre-computed summary statistics

  • Automatically flattens multi-dimensional statistics

  • Converts numpy arrays to torch tensors if needed
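
A minimal NumPy sketch of an identity embedding that flattens multi-dimensional statistics is shown below. The function name identity_embedding and the presence of a leading batch axis are assumptions; the actual layer operates on torch tensors.

```python
import numpy as np


def identity_embedding(stats):
    """Return the statistics unchanged apart from flattening each example."""
    stats = np.asarray(stats, dtype=float)
    # Keep the leading batch axis and flatten everything after it.
    return stats.reshape(stats.shape[0], -1)


# A batch of two joint SFS arrays, each of shape (3, 4).
joint_sfs = np.arange(2 * 3 * 4).reshape(2, 3, 4)
flat = identity_embedding(joint_sfs)
print(flat.shape)
```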

embedding(x)[source]

Consistent with the other embedding networks, this class provides an embedding method that returns the same output as forward(), since this is an identity layer.

forward(x)[source]

SPIDNA_embedding_network

class workflow.scripts.embedding_networks.SPIDNA_embedding_network(*args: Any, **kwargs: Any)[source]

Bases: Module

SPIDNA architecture for processing genetic data.

Parameters:

  • output_dim (int) – Dimension of the output feature vector

  • num_block (int) – Number of SPIDNA blocks in the network

  • num_feature (int) – Number of features in the convolutional layers

SPIDNA (Sequence Position Informed Deep Neural Architecture) architecture for processing genetic data with positional information.

Parameters:

  • output_dim (int, optional) – Dimension of output features (default: 64)

  • num_block (int, optional) – Number of SPIDNA blocks (default: 3)

  • num_feature (int, optional) – Number of convolutional features (default: 32)

Architecture:

  • Separate convolutional processing for position and SNP data

  • Sequential SPIDNA blocks with residual connections

  • Progressive feature aggregation across blocks

Input Format:

  • Shape: (batch, channels, samples, snps)

  • Channel 0: positional information

  • Channels 1+: SNP/haplotype data
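
The input layout above can be assembled as in the following NumPy sketch. Repeating the position vector across every sample row and using a single SNP channel are assumptions made for illustration; the variable names are hypothetical.

```python
import numpy as np

batch, samples, snps = 2, 4, 5
rng = np.random.default_rng(0)
haplotypes = rng.integers(0, 2, size=(batch, samples, snps)).astype(float)
positions = np.sort(rng.random((batch, snps)), axis=1)

# Channel 0: positions, broadcast to every sample row (an assumption).
pos_channel = np.broadcast_to(positions[:, None, :], (batch, samples, snps))
# Channel 1: SNP/haplotype data.
x = np.stack([pos_channel, haplotypes], axis=1)   # (batch, 2, samples, snps)

# Split the channels back apart the way the input format describes.
position_part, snp_part = x[:, :1], x[:, 1:]
print(x.shape, position_part.shape, snp_part.shape)
```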

embedding(x)[source]

forward(x)[source]

ReLERNN

class workflow.scripts.embedding_networks.ReLERNN(*args: Any, **kwargs: Any)[source]

Bases: Module

This module constructs a bi-directional GRU based RNN following the architecture from https://github.com/kr-colab/ReLERNN/blob/master/ReLERNN/networks.py#L7.

It processes haplotype data along with corresponding positional information to produce a feature embedding. Its output is consistent with the other embedding networks in this module.

Parameters:

  • input_size (int) – The input size for the GRU layer (typically num_individuals * ploidy).

  • num_positions (int) – The number of genome positions in the input data.

  • output_dim (int, optional) – The dimension of the final embedded feature vector (default: 64).

Input:

  • x (torch.Tensor, shape (batch, sequence_length, 1 + input_size)) – The first feature along the last dimension is assumed to be positional data, while the remaining features are the haplotype representation.

Output:

  • torch.Tensor, shape (batch, output_dim) – The embedded feature vector.

ReLERNN architecture following the design from https://github.com/kr-colab/ReLERNN/. Combines recurrent processing of haplotypes with positional information.

Parameters:

  • input_size (int) – Input size for GRU (num_individuals * ploidy)

  • n_snps (int) – Number of SNPs in the input data

  • output_size (int, optional) – Output embedding dimension (default: 64)

  • shuffle_genotypes (bool, optional) – Shuffle genotypes during training (default: False)

Architecture:

  • Bidirectional GRU for haplotype processing

  • Separate linear layer for positional encoding

  • Concatenated features passed through MLP

  • Dropout for regularization

Input Format:

  • Shape: (batch, sequence_length, 1 + input_size)

  • First feature: positional data

  • Remaining features: haplotype representation
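
The sketch below builds an array in the documented (batch, sequence_length, 1 + input_size) layout and splits it back into its positional and haplotype parts. It uses NumPy in place of torch, and the variable names are hypothetical.

```python
import numpy as np

batch, sequence_length, input_size = 2, 6, 8
rng = np.random.default_rng(1)
positions = np.sort(rng.random((batch, sequence_length, 1)), axis=1)
haplotypes = rng.integers(0, 2, size=(batch, sequence_length, input_size)).astype(float)

# Positional data goes first along the last axis, followed by the haplotypes.
x = np.concatenate([positions, haplotypes], axis=-1)

# Recover the two parts by slicing the last axis.
pos_part = x[:, :, 0]
hap_part = x[:, :, 1:]
print(x.shape, pos_part.shape, hap_part.shape)
```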

forward(x)[source]

Supporting Classes

class workflow.scripts.embedding_networks.SymmetricLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

Layer that performs some permutation-invariant function along a specified axis of input data.

The permutation-invariant function can be max, mean, or sum.

Permutation-invariant pooling layer.

Parameters:

  • axis (int) – Dimension along which to apply the symmetric function

  • func (str, optional) – Function type: “max”, “mean”, or “sum” (default: “max”)
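
Permutation invariance can be checked numerically: reordering the entries along the pooled axis should leave the result unchanged. The NumPy sketch below demonstrates this; the function name symmetric_pool and the choice to keep the reduced axis as size 1 are assumptions, not the layer's confirmed behaviour.

```python
import numpy as np

FUNCS = {"max": np.max, "mean": np.mean, "sum": np.sum}


def symmetric_pool(x, axis, func="max"):
    """Apply a permutation-invariant reduction along one axis, keeping its dimension."""
    return FUNCS[func](x, axis=axis, keepdims=True)


rng = np.random.default_rng(2)
x = rng.random((3, 4, 5, 6))                   # (batch, channels, individuals, snps)
shuffled = x[:, :, rng.permutation(5), :]      # reorder the individuals

for name in FUNCS:
    same = np.allclose(symmetric_pool(x, 2, name), symmetric_pool(shuffled, 2, name))
    print(name, same)
```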

forward(x)[source]

class workflow.scripts.embedding_networks.SPIDNABlock(*args: Any, **kwargs: Any)[source]

Bases: Module

Basic unit of the SPIDNA architecture for processing genetic data.

Basic building block for SPIDNA architecture.

Parameters:

  • num_feature (int) – Number of feature channels

  • output_dim (int) – Output dimension for feature aggregation

Architecture:

  • Convolutional layer with batch normalization

  • Sample-wise averaging for feature extraction

  • Residual connection to output

  • Max pooling for spatial dimension reduction

forward(x, output)[source]