Soup-to-Nuts Tutorial: Rate Landscape Estimation ================================================== In this tutorial we will build a complete simulation-based inference pipeline that estimates mutation rate and recombination rate across a genomic region. By the end we will have: - A custom **simulator** that generates tree sequences under a constant-size population model with two free parameters (recombination rate and mutation rate) - A **YAML config** that wires the simulator to a CNN-based processor and embedding network - A trained neural posterior estimator - **Windowed posterior estimates** of both rates across a simulated genome with known, spatially varying rate landscapes, so we can check that the model recovers the ground truth In this example training uses windows with *constant* rates (drawn from a prior), while prediction applies the trained model window-by-window to a VCF where rates *vary spatially* — producing a rate landscape. .. figure:: _static/rate_landscape_example.png :width: 100% :alt: Windowed rate landscape estimates vs. ground truth The final result: posterior estimates of recombination rate (top) and mutation rate (bottom) across 30 genomic windows, compared against the true rates used to simulate the example data. Blue shading shows the 90% credible interval; dashed red line is the ground truth. Vertical dotted lines mark chromosome boundaries. .. note:: With 5,000 training simulations and the default config, the full tutorial (training + prediction) takes about **15 minutes** on a machine with an NVIDIA A100 GPU using 4 cores. Simulation is the bottleneck; on a cluster you can parallelize it further with ``n_chunk``. Training on CPU will be slower but still feasible. Prerequisites ------------- Make sure you have installed popgen-npe and activated the environment following the instructions in :doc:`installation`. All commands below assume the conda environment is active and that you are working from the repository root. What we are building -------------------- .. code-block:: text Files we will create ───────────────────── workflow/ ├── scripts/ │ └── ts_simulators.py ← add MutRecRate class here └── config/ └── MutRecRate_cnn.yaml ← new config file Files shipped with the repo ─────────────────────────── example_data/ └── MutRecRate/ ├── test.vcf.gz ← synthetic VCF with variable rates ├── test.vcf.gz.tbi ├── popmap.yaml ├── windows.bed └── true_rates.tsv ← ground truth for comparison Step 1: Write the simulator ---------------------------- Open ``workflow/scripts/ts_simulators.py`` in your favorite editor and add the following class. The simulator API ^^^^^^^^^^^^^^^^^ Every simulator must: 1. Inherit from ``BaseSimulator`` 2. Define a ``default_config`` dict whose keys are either **fixed** parameters (single values) or **random** parameters (two-element ``[low, high]`` lists defining uniform prior bounds) 3. In ``__init__``, build ``self.parameters`` (an ordered list of random parameter names) and ``self.prior`` (a ``BoxUniform`` over those ranges) 4. Implement ``__call__(self, seed) -> (tskit.TreeSequence, np.ndarray)`` which samples parameters, simulates a tree sequence with mutations, and returns both Full code listing ^^^^^^^^^^^^^^^^^ .. code-block:: python class MutRecRate(BaseSimulator): """ Constant-size population with variable mutation and recombination rates. Designed for windowed estimation of rate landscapes we train on single windows with constant rates, then predict per-window on real data. """ default_config = { # Fixed parameters "samples": {"pop": 20}, "sequence_length": 500000, "pop_size": 10000, # Random parameters — [low, high] uniform prior bounds "recombination_rate": [1e-9, 1e-7], "mutation_rate": [1e-9, 1e-7], } def __init__(self, config: dict): super().__init__(config, self.default_config) self.parameters = ["recombination_rate", "mutation_rate"] self.prior = BoxUniform( low=torch.tensor([self.recombination_rate[0], self.mutation_rate[0]]), high=torch.tensor([self.recombination_rate[1], self.mutation_rate[1]]), ) def __call__(self, seed: int = None) -> (tskit.TreeSequence, np.ndarray): torch.manual_seed(seed) theta = self.prior.sample().numpy() recomb_rate, mut_rate = theta demography = msprime.Demography() demography.add_population(name="pop", initial_size=self.pop_size) ts = msprime.sim_ancestry( samples={"pop": self.samples["pop"]}, demography=demography, sequence_length=self.sequence_length, recombination_rate=recomb_rate, random_seed=seed, ) ts = msprime.sim_mutations(ts, rate=mut_rate, random_seed=seed) return ts, theta **Key points:** - ``self.parameters`` defines the order of values in the returned ``theta`` array. This order is used everywhere downstream (training targets, posterior samples, diagnostic plots), so it must be consistent. - ``BoxUniform`` (from ``sbi.utils``) is already imported at the top of ``ts_simulators.py``. - The population here is named ``"pop"`` (a string, not an integer) — this is required for the prediction pipeline's population-name matching to work. - The ``sequence_length`` (500 kb) matches the window size in the BED file used for prediction. Training windows and prediction windows must be the same size. - The prior spans two orders of magnitude (1e-9 to 1e-7), covering the biologically relevant range for most eukaryotes. Quick sanity check ^^^^^^^^^^^^^^^^^^ .. code-block:: python from workflow.scripts.ts_simulators import MutRecRate sim = MutRecRate({"class_name": "MutRecRate"}) ts, theta = sim(seed=42) print(f"Parameters: {dict(zip(sim.parameters, theta))}") print(f"Num sites: {ts.num_sites}") print(f"Num trees: {ts.num_trees}") Step 2: Configure the processor --------------------------------- We use the built-in ``cnn_extract`` processor, which converts a tree sequence into a genotype matrix suitable for a convolutional neural network. No new code is needed here, but one can introduce there own custom processors. ``cnn_extract`` produces an array of shape ``(2, n_individuals, n_snps)`` for a single population. The two channels are SNP positions and genotype values. The ``ExchangeableCNN`` embedding network expects exactly this format. Key parameters ^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 20 60 20 * - Parameter - Meaning - Our value * - ``n_snps`` - Maximum number of SNPs to retain - 1000 * - ``maf_thresh`` - Minor allele frequency filter; 0.0 keeps all sites - 0.0 * - ``phased`` - Whether to keep haplotype-level data (True) or collapse to diploid genotypes (False) - False * - ``polarised`` - Must be False for ``cnn_extract`` - False We use ``maf_thresh: 0.0`` because the full site frequency spectrum (including rare variants) carries information about the mutation rate. Step 3: Write the config YAML ------------------------------- Create the file ``workflow/config/MutRecRate_cnn.yaml`` with the contents below. Every field is annotated. .. code-block:: yaml # ── Project location ───────────────────────────────────────────── # All outputs go under this directory, inside a UID-based # subdirectory. Change this to wherever you want results written. project_dir: "/path/to/your/project/dir" # ── Resource allocation, if you want to run on a slurm cluster cpu_resources: runtime: "2h" mem_mb: 16000 gpu_resources: runtime: "4h" mem_mb: 50000 gpus: 1 slurm_partition: "gpu" slurm_extra: "--gres=gpu:1" # ── Simulation settings ────────────────────────────────────────── random_seed: 42 n_chunk: 10 # number of parallel simulation jobs n_train: 5000 # training simulations n_val: 500 # validation simulations n_test: 500 # test simulations (used for diagnostics) # ── Training hyperparameters ───────────────────────────────────── train_embedding_net_separately: True # two-stage training use_cache: True # load features into CPU memory optimizer: "Adam" batch_size: 64 learning_rate: 0.0005 max_num_epochs: 200 stop_after_epochs: 50 # early stopping patience clip_max_norm: 5 packed_sequence: False # ── Simulator ──────────────────────────────────────────────────── # class_name must match the class you added to ts_simulators.py. # sequence_length must equal the window size in the BED file. simulator: class_name: "MutRecRate" samples: pop: 20 sequence_length: 500000 pop_size: 10000 # ── Processor ──────────────────────────────────────────────────── processor: class_name: "cnn_extract" n_snps: 1000 maf_thresh: 0.0 # ── Embedding network ──────────────────────────────────────────── # input_rows: number of individuals per population (list, one per pop) # input_cols: number of SNPs per population (list, one per pop) embedding_network: class_name: "ExchangeableCNN" output_dim: 64 input_rows: [20] input_cols: [1000] # ── Prediction ─────────────────────────────────────────────────── prediction: n_chunk: 5 vcf: "example_data/MutRecRate/test.vcf.gz" population_map: "example_data/MutRecRate/popmap.yaml" windows: "example_data/MutRecRate/windows.bed" min_snps_per_window: 10 **What are all these yaml fields anyway? Some explanation:** - ``project_dir`` — the workflow creates a subdirectory named ``MutRecRate-cnn_extract-ExchangeableCNN-42-5000-sep`` under this path (the naming scheme is ``{simulator}-{processor}-{embedding}-{seed}-{n_train}-{sep|e2e}``). - ``n_chunk`` controls parallelism. With ``n_train: 5000`` and ``n_chunk: 10``, each chunk simulates 500 tree sequences. On a laptop, ``n_chunk: 1`` or ``2`` is fine; on a cluster you can go much higher. - ``train_embedding_net_separately: True`` means the CNN embedding is pre-trained, then the normalizing flow is trained on the learned embeddings. This is can be more stable than end-to-end training. - ``sequence_length: 500000`` — this **must match** the window size in ``windows.bed``. Training windows and prediction windows must be the same size so the network sees the same scale of data. - ``input_rows`` and ``input_cols`` must match the processor output: 20 individuals and 1000 SNPs. Step 4: Run the training workflow ---------------------------------- From the repository root: .. code-block:: bash snakemake \ --configfile workflow/config/MutRecRate_cnn.yaml \ --snakefile workflow/training_workflow.smk \ --cores 4 .. tip:: On a local machine with no GPU, training will still work (PyTorch falls back to CPU) but will be slower. With 5,000 training simulations and the network sizes above, expect roughly 10–30 minutes depending on your machine. What happens ^^^^^^^^^^^^ The workflow runs these stages in order: 1. **Setup** — creates the Zarr data store and divides the simulations into chunks. 2. **Simulate** — each chunk calls ``MutRecRate(seed=...)`` repeatedly, drawing random rate pairs from the prior and simulating constant-rate windows. 3. **Process** — each chunk calls ``cnn_extract`` on every tree sequence, storing the resulting tensors in Zarr. 4. **Train embedding network** — trains the ``ExchangeableCNN`` to produce useful summary statistics from genotype matrices. 5. **Train normalizing flow** — trains a neural posterior estimator conditioned on the learned embeddings. 6. **Diagnostics** — generates posterior calibration, concentration, and simulation summary plots. Output files ^^^^^^^^^^^^ After a successful run you will find (under your ``project_dir``): .. code-block:: text MutRecRate-cnn_extract-ExchangeableCNN-42-5000-sep/ ├── tensors/zarr/ # Zarr store with features and targets ├── trees/ # Simulated tree sequences (.trees) ├── logs/ # TensorBoard training logs ├── pretrain_embedding_network # Pickled trained CNN ├── pretrain_normalizing_flow # Pickled trained flow └── plots/ ├── posterior_calibration.png ├── posterior_concentration.png ├── posterior_expectation.png ├── posterior_at_prior_mean.png ├── posterior_at_prior_low.png ├── posterior_at_prior_high.png ├── simulation_stats.png ├── stats_vs_params_pairplot.png └── stats_heatmaps.png Inspecting results ^^^^^^^^^^^^^^^^^^ The diagnostic plots are the quickest way to check that inference is working: - **posterior_calibration.png** — simulation-based calibration plot; points falling on the diagonal indicate well-calibrated posteriors. - **posterior_concentration.png** — shows how ratio of posterior to prior width change as a function of coverage for each parameter. - **posterior_expectation.png** — posterior means vs. true parameter values; points near the diagonal mean the model is recovering the parameters. - **posterior_at_prior_mean.png / _low.png / _high.png** — posterior distributions evaluated at specific points in the prior (mean, lower bound, upper bound), useful for spotting bias at the edges of the parameter space. - **simulation_stats.png** — summary statistics of the simulated datasets. - **stats_vs_params_pairplot.png** — pairwise relationships between simulation summary statistics and the underlying parameters. - **stats_heatmaps.png** — heatmaps showing correlations between summary statistics and parameters. You can also monitor training loss in real time with TensorBoard: .. code-block:: bash tensorboard --logdir /path/to/your/project/dir/MutRecRate-cnn_extract-ExchangeableCNN-42-5000-sep/logs Step 5: Run prediction on the example data -------------------------------------------- The prediction workflow applies your trained model to genomic data stored in VCF format, producing windowed posterior estimates across the genome. The bundled example data in ``example_data/MutRecRate/`` contains a synthetic VCF simulated with spatially varying recombination and mutation rates across three chromosomes (5 Mb each, 10 windows of 500 kb per chromosome). This lets you verify that the trained model recovers the known rate landscape. The example data includes: - **test.vcf.gz** — simulated with ``msprime.RateMap`` objects so that rates vary across windows (e.g., chr0 has varying recombination but constant mutation; chr1 has varying mutation but constant recombination; chr2 has both varying in opposite directions). - **popmap.yaml** — maps all 20 VCF samples to population ``"pop"``. - **windows.bed** — 30 windows (10 per chromosome, 500 kb each). - **true_rates.tsv** — ground truth recombination and mutation rates per window, for comparison with the model's predictions. Requirements for prediction data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you are bringing your own data instead of using the example, note: - The VCF must be bgzipped and tabix-indexed. - Every contig must have a length in the VCF header (``##contig=``). If contig lengths are missing, ``tsinfer`` will fail with ``sequence_length cannot be zero or less``. - The number of samples in the VCF (after filtering via the population map) must **exactly match** the simulator's ``samples`` config — in our case, 20 diploid individuals assigned to population ``"pop"``. - Window sizes in the BED file should match the simulator's ``sequence_length``. The population map assigns VCF sample names to simulator population names: .. code-block:: yaml # popmap.yaml IND0: "pop" IND1: "pop" IND2: "pop" # ... one entry per VCF sample Run the prediction workflow ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``prediction`` block is already in the config from Step 3. Run: .. code-block:: bash snakemake \ --configfile workflow/config/MutRecRate_cnn.yaml \ --snakefile workflow/prediction_workflow.smk \ --cores 4 What the prediction workflow does ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. **Convert VCF** — converts the bgzipped VCF to Zarr format (``vcf2zarr``). 2. **Setup** — validates that VCF samples match the simulator, defines windows, and creates Zarr storage. 3. **Infer trees** — for each window, infers a tree sequence from the genotype data using ``tsinfer``. 4. **Process** — applies the same ``cnn_extract`` processor to each inferred tree sequence. 5. **Predict** — loads the trained embedding network and normalizing flow, and draws 1,000 posterior samples per window. 6. **Diagnostics** — generates summary plots. Prediction output ^^^^^^^^^^^^^^^^^ .. code-block:: text test.vcf.gz/ ├── vcz/ # Zarr-encoded VCF ├── trees/ # Inferred tree sequences (one per window) ├── tensors/zarr/ # Features and posterior samples │ └── predictions # Shape: (n_windows, n_parameters, 1000) └── plots/ ├── posteriors-across-windows.png └── tree_stats_hist.png - **posteriors-across-windows.png** — a heatmap showing the posterior distribution for each parameter across all 30 genomic windows. This is the main result: you should see the posterior mode tracking the true rate landscape across windows and chromosomes (separated by red dashed lines). - **tree_stats_hist.png** — summary statistics of the inferred tree sequences compared with simulated training data, useful for checking that the real data falls within the range the model was trained on. Step 6: Interpret the results ------------------------------ The raw posterior samples are stored in ``tensors/zarr/predictions`` as a NumPy array of shape ``(n_windows, n_parameters, 1000)``. You can load them for downstream analysis and compare against the known ground truth: .. code-block:: python import zarr import numpy as np import matplotlib.pyplot as plt # Load predictions and ground truth project = "/path/to/your/project/dir/MutRecRate-cnn_extract-ExchangeableCNN-42-5000-sep" predictions = zarr.load(f"{project}/test.vcf.gz/tensors/zarr/predictions") true_rates = np.loadtxt("example_data/MutRecRate/true_rates.tsv", skiprows=1, usecols=[3, 4]) param_names = ["recombination_rate", "mutation_rate"] fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True) for i, (ax, name) in enumerate(zip(axes, param_names)): # Plot posterior medians and 90% credible intervals medians = np.median(predictions[:, i, :], axis=1) lo = np.quantile(predictions[:, i, :], 0.05, axis=1) hi = np.quantile(predictions[:, i, :], 0.95, axis=1) windows = np.arange(predictions.shape[0]) ax.fill_between(windows, lo, hi, alpha=0.3, label="90% CI") ax.plot(windows, medians, label="Posterior median") ax.step(windows, true_rates[:, i], where="mid", color="red", linestyle="--", label="True rate") ax.set_ylabel(name) ax.legend(loc="upper right") # Mark chromosome boundaries for b in [10, 20]: ax.axvline(b - 0.5, color="gray", linestyle=":") axes[-1].set_xlabel("Window index") plt.tight_layout() plt.savefig("rate_landscape_comparison.png", dpi=150) Next steps ---------- - **Try log-scale priors** — rates spanning orders of magnitude are more naturally modeled with log-uniform priors. You could sample in log space (e.g., ``log10_recomb_rate: [-9, -7]``) and exponentiate inside ``__call__``, similar to the ``VariablePopulationSize`` simulator. - **Try different architectures** — swap ``cnn_extract`` / ``ExchangeableCNN`` for ``genotypes_and_distances`` / ``RNN``, or use summary statistics with ``tskit_sfs`` / ``SummaryStatisticsEmbedding``. See the :doc:`processor–network compatibility table ` for valid combinations. - **Explore existing configs** — the ``workflow/config/`` directory has examples for multi-population models (``YRI_CEU_cnn.yaml``), variable population size (``variable_popnSize_spidna.yaml``), and more. - **Scale up** — increase ``n_train`` and ``n_chunk`` for better inference, and use SLURM resource blocks for cluster execution. See :doc:`usage` for details. - **Consult the API reference** — :doc:`simulators` and :doc:`processors` document every built-in class, their parameters, and their configuration options.