iTrop Workshop, Day 2: Workshop on demographic inference in population genomics

An introduction to Approximate Bayesian Computation and random forests

Why do we start with ABC? Because it is an excellent way to learn a bit more about simulations in population genetics. And why is that latter point important?

Very often, we tend to interpret the value of summary statistics based on assumptions that we do not actually test. For example, we may find that a given genomic region displays a strongly negative Tajima’s D. This is often interpreted as evidence of (positive or negative) selection, or of population expansion. We may look at other genomic windows along the genome and see that they all display values near zero, or even positive ones, in which case we may feel more confident about selection as the reason for our focal window being an outlier. But what if there is a lot of variance? Or if the pattern is not that clear? What about other summary statistics, or properties? Can we produce such a negative Tajima’s D through purely neutral processes? This is a question that we can start answering with ABC (or machine learning, which shares the same philosophy). On the other hand, one may wonder: can we reconstruct the processes that generated the genome-wide distribution of Tajima’s D? Does it contain enough information about these processes?

Originally, ABC was developed to investigate the demographic history of a population. By simulating many trajectories (changes in the population size over time, changes in the migration rate from and to neighbouring populations, time since the split between populations, etc.), one can obtain expectations for (well-chosen) summary statistics, and test whether they allow distinguishing between various situations. If that is the case, we can then obtain the same summary statistics from our observed dataset, and check where it falls among our simulations, then derive the generating parameters that produce simulations close to our own observation.
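To make the rejection idea concrete, here is a minimal R sketch of the rejection step (not the abc package we will use later, just the principle). The objects sim_stats (a matrix of simulated summary statistics), sim_params (the parameters that generated them) and obs_stats (the observed statistics) are hypothetical placeholders:

## Minimal rejection-ABC sketch; sim_stats, sim_params and obs_stats are hypothetical inputs
rejection_abc <- function(obs_stats, sim_stats, sim_params, tol = 0.05) {
  # Standardise each statistic so that they contribute comparably to the distance
  sds <- apply(sim_stats, 2, sd)
  scaled_sims <- sweep(sim_stats, 2, sds, "/")
  scaled_obs <- obs_stats / sds
  # Euclidean distance between each simulation and the observation
  dists <- sqrt(rowSums(sweep(scaled_sims, 2, scaled_obs, "-")^2))
  # Keep the closest fraction (tol) of simulations: their parameters form the posterior sample
  sim_params[dists <= quantile(dists, tol), , drop = FALSE]
}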

Approximate Bayesian Computation relies on simulations. At the moment, there are two simulators that I would encourage you to read about. The one we will use today is called ms. Its syntax may seem a bit difficult, but it has the advantage of being extremely fast. You can also look at fastsimcoal2, a simulator which may prove a bit more intuitive when it comes to coding your demographic history. I tend to think that if you can do it with ms, you are set for any nightmarish software that comes next. These programs are fast because they run coalescent simulations: they go backwards in time and change the rate at which lineages coalesce based on neutral theory, assuming what we call a Wright-Fisher model. These simplifying assumptions are what make them fast, but your study species may deviate from them!

A basic description of ABC. We simulate the same statistics as the ones obtained from an observed dataset (green), using a simulator. We then use algorithms which identify the closest set of simulations to our observed data, and extract the parameters that were used to generate them as a posterior probability. This can also be used to weigh the relative probabilities of distinct scenarios (blue v. orange circles here).

If you are working with a species with a very unusual life history, you may want to look at SLiM5. This simulator allows you to do essentially anything biologically realistic. For neutral simulations (no selection), it can be made very fast using several tricks (such as tree-sequence recording and recapitation). However, this is a very advanced feature that we are not going to cover today. We can discuss it together during the workshop if this is something you may find interesting.

The first step: simulating data

In itself, this could be a whole course. Here, we will simulate 100 genomic windows of 10 kb under three very basic scenarios: one with a population of constant Ne, one where the population shrank to its present-day size, and one where it expanded to its present size, at a time that is free to vary. Our simulations do not take linkage between windows into account (each locus is simulated independently). This is not necessarily an issue, but for an actual dataset, you may want to ensure that the observed data come from effectively independent windows (for example, based on the LD-decay distance; see tomorrow’s workshop).

We have an observed dataset consisting of 100 distinct loci for 25 diploid individuals, so we simulate 100 loci for 50 haplotypes. The recombination rate is the same as the mutation rate. We assume a mutation rate of 5e-9 per site per generation, a generation time of one year, and a locus length of 10 kb.
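To see where the scaled ms parameters used below come from, here is a small R sketch computing theta and rho per 10 kb locus, assuming the reference size N0 = 10,000 diploids used throughout this workshop (these values match the t and r used in the simulation script in the Annex):

N0 <- 10000   # reference diploid population size (our choice of scaling)
mu <- 5e-9    # mutation rate per site per generation
r  <- 5e-9    # recombination rate per site per generation (same as mu here)
L  <- 10000   # locus length in bp
theta <- 4 * N0 * mu * L  # value passed to -t, here 2
rho   <- 4 * N0 * r * L   # value passed to -r, here 2
c(theta = theta, rho = rho)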

The following script will run 1000 simulations with ms for each of three distinct scenarios, using R to draw parameters from log-uniform distributions. Note that variables are declared at the top of the script, which should make it easier to recycle for your own project. An actual analysis would require something like 100,000 to 1,000,000 simulations per scenario.

ms is a bit barbaric, and we will talk a bit about how it works (don’t hesitate to ask questions). You should find the manual along with this workshop. You will find a list of options below:

A quick explanation of ms syntax
Option : Meaning
-t θ : sets θ = 4 x N0 x u, where u is the mutation rate over the whole locus and N0 is the reference diploid population size. You need to pick an N0 and scale everything with it. You can of course change the population size right away at time 0.
-r ρ nsites : recombination, with ρ = 4 x N0 x r, where r is the recombination rate between the two ends of the simulated segment (not the per-site rate) and nsites is the number of sites between which recombination can occur.
-ej t i j : at time t, population i merges into population j, backward in time. Forward in time, this means that population i split off from population j at time t in the past.
-en t i x : at time t, the size of population i is set to x times N0 (and its growth rate to 0), backward in time. If the present size is N0, you would therefore code a population expansion (forward in time) with x < 1.
-eg t i αi : sets the exponential growth rate of population i to αi at time t.

When using -e followed by a capital letter, the change applies to all populations and you drop the i index. For example, -eN 0.2 0.1 sets the size of all populations to 0.1 x N0 at time 0.2 (in units of 4 x N0 generations), backward in time (so, forward in time, it is an expansion event).
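As an illustration of this scaling, here is a small R sketch converting a hypothetical event (a size change 20,000 generations ago, from an ancestral size of 50,000 diploids to a present size of N0 = 10,000) into the units expected by ms:

N0     <- 10000   # reference diploid size used for scaling
t_gen  <- 20000   # time of the size change, in generations (hypothetical value)
N_past <- 50000   # ancestral diploid size (hypothetical value)
t_ms <- t_gen / (4 * N0)  # ms times are in units of 4 x N0 generations, here 0.5
x_ms <- N_past / N0       # ms sizes are expressed relative to N0, here 5
cat(sprintf("-eN %.2f %.1f\n", t_ms, x_ms))  # gives "-eN 0.50 5.0", a contraction forward in time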

WE CANNOT LAUNCH THE FOLLOWING BASH SCRIPT FROM THE QUARTO DOCUMENT. YOU NEED TO RUN IT FROM A TERMINAL. The easiest way is to copy and paste the content into the Terminal tab:

You can also copy it into a bash file and execute it with bash.

##Loading modules, cluster-specific! EXECUTE THIS IN THE TERMINAL !
module load r/4.5.1
module load vcftools/0.1.16

#Setting up our environment.
mkdir ABC
#Move the files we need to the right folder. This includes ms and ms2vcf
cp ms ABC
cp ms2vcf ABC
cp simulations_complicated_script.sh ABC
cd ABC
chmod +x ms2vcf

# Bash script to generate simulations. 
bash simulations_complicated_script.sh

You should have a look at the simulations_complicated_script.sh to understand what it is doing. You can also find it at the end of this workshop, in the Annex section. We will cover it together.

Here we use vcftools to compute two statistics, nucleotide diversity and Tajima’s D, for the sake of simplicity. This works because we are dealing with ‘perfect’ datasets, with no missing data, so we can assume that sites with no variation are monomorphic (and not just missing/not sequenced). Otherwise, there are many (and probably better) ways to obtain summary statistics. A good option is pixy (an older version of which exists on the cluster). The newest version computes many relevant summary statistics, such as Tajima’s D, Watterson’s theta, nucleotide diversity, FST and dXY (for two or more populations). Tomorrow we will also use functions from the scikit-allel Python module (https://scikit-allel.readthedocs.io/en/stable/#). Overall, try to use the same method on both observed and simulated data, and be aware of possible biases (for example, how do these methods handle missing data?). You could also add a step to incorporate systematic biases present in your observed data, such as low sequencing depth or a high amount of missing data. But today, let’s keep it ‘simple’.

What do you think of the priors we are using? Of the models? What could be a criticism?

Second step: ABC inference proper

Checking that simulations are close to our observation

We can first import the data and the simulations. We assume that we are in the main folder (you should see the ABC folder we just created in the bottom-right panel of RStudio):

library(ggplot2)

observed=read.table("observed_stats.txt")
sims=read.table("ABC/simulated_stats.txt",h=T)
hist(sims$pi_mean,main="",xlab="Nucleotide diversity")
abline(v=observed[,3]) #another way to select a column, the third column here, which is pi_mean.

###Check our summary statistics with base R:
plot(sims$pi_mean,sims$TajimaD_mean,col=as.factor(sims$model),xlab="Nucleotide diversity",ylab="Tajima's D")
points(observed[,3],observed[,1])

####Check our priors:
hist(sims$N_PRESENT,main="Current effective population size Ne",xlab="Current Ne (in units of N0)")

boxplot(sims$N_PRESENT ~ sims$model,ylab="Current Ne",xlab="Scenario")

Feel free to explore the summary statistics and their distributions a bit more; you can play around here, or use ggplot2.

A good way to evaluate an ABC setup is to check whether it produces simulations that cover our observation. A Principal Component Analysis on the summary statistics can do the trick:

stats <- sims[, c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")]

#We can run a PCA on summary statistics
pca <- prcomp(stats, scale. = TRUE)
pca_df <- cbind(sims[, c("model", "simID")], pca$x)

# We can project our observed data on the PCA, to see the fit:
obs_scaled <- scale(observed, center = pca$center, scale = pca$scale)
obs_proj <- obs_scaled %*% pca$rotation
obs_proj <- as.data.frame(obs_proj)
obs_proj$label <- "Observed"


ggplot() +
  geom_point(data = pca_df, aes(PC1, PC2, color = model), alpha = 0.6) +
  geom_point(data = obs_proj, aes(PC1, PC2), color = "black", size = 4, shape = 8) +
  theme_minimal(base_size = 14) +
  labs(title = "PCA of simulated vs observed summary statistics")

What do you think?

Predictive power for model comparison

First, we need to check that we can differentiate between our three scenarios. The method (roughly) takes simulations from each of the three categories and checks whether they can be correctly classified. Typically, this is easier if the three scenarios produce clearly distinct sets of summary statistics. You can already check that with dimension reduction methods on your summary statistics, such as the PCA we did earlier.

if(!require("abc",character.only = TRUE)) install.packages("abc")
Loading required package: abc
Loading required package: abc.data
Loading required package: nnet
Loading required package: quantreg
Loading required package: SparseM
Loading required package: MASS
Loading required package: locfit
locfit 1.5-9.12      2025-03-05
library(abc)

stats <- sims[, c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")]

cv.modsel.mnlog <- cv4postpr(sims$model,stats, nval=100, tol=.05, method = "mnlogistic")
summary(cv.modsel.mnlog)
Confusion matrix based on 100 samples for each model.

$tol0.05
            contraction expansion stable
contraction          65        10     25
expansion             2        69     29
stable               25        15     60


Mean model posterior probabilities (mnlogistic)

$tol0.05
            contraction expansion stable
contraction      0.6324    0.1084 0.2592
expansion        0.1121    0.6760 0.2119
stable           0.2877    0.1940 0.5183
#neural net is slower, but can be better, especially with more summary statistics. Uncomment to try
#cv.modsel.nnet <- cv4postpr(sims$model,stats, nval=100, tol=.05, method = "neuralnet")
#summary(cv.modsel.nnet)

##To plot confusion matrices
plot(cv.modsel.mnlog)

There are three main categories of algorithms in the R package abc: rejection, regression adjustment (mnlogistic for model choice, loclinear for parameter estimation), and neural networks (neuralnet). There is another R package, abcrf, which uses random forests (a type of machine learning) to perform the same tasks.

In a nutshell, if you have many summary statistics and/or cannot obtain 100,000 to 1,000,000 simulations in a reasonable amount of time, go for random forests, but be aware that they can only estimate one parameter at a time, not the joint posterior distribution.

If you have a moderately high number of summary statistics (say 20), you may want to try neural networks. Beware of p-hacking: if one method gives you very good results and the others do not, it is better to investigate exactly why…

It is worth knowing about abcrf. Let’s see how this one performs:

library(abcrf)

stats <- sims[, c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")]

modsel_rf <- abcrf(model ~ ., data = data.frame(model = as.factor(sims$model), stats), ntree = 500)

#The confusion matrix:
modsel_rf$model.rf$confusion.matrix
            contraction expansion stable class.error
contraction         672        96    232       0.328
expansion            58       702    240       0.298
stable              109       184    707       0.293

Model comparison and Goodness of fit (g.o.f.)

Now, we are reasonably confident of our ability to use these four statistics to discriminate between our three scenarios. So, what are our observed data telling us? First, we obtain the posterior probability of each of our three models. Then, we check whether each model can provide a good fit to the observed data. At the very least, we hope that the most likely model provides a decent fit!

##just to make sure that you have the right stats:
stats=sims[,c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")]

modsel.mnlog <- postpr(observed, sims$model, stats, tol=.05, method="mnlogistic")
summary(modsel.mnlog)
Call: 
postpr(target = observed, index = sims$model, sumstat = stats, 
    tol = 0.05, method = "mnlogistic")
Data:
 postpr.out$values (150 posterior samples)
Models a priori:
 contraction, expansion, stable
Models a posteriori:
 contraction, expansion, stable

Proportion of accepted simulations (rejection):
contraction   expansion      stable 
     0.0133      0.9733      0.0133 

Bayes factors:
            contraction expansion  stable
contraction      1.0000    0.0137  1.0000
expansion       73.0000    1.0000 73.0000
stable           1.0000    0.0137  1.0000


Posterior model probabilities (mnlogistic):
contraction   expansion      stable 
          0           1           0 

Bayes factors:
             contraction    expansion       stable
contraction 1.000000e+00 0.000000e+00 2.924308e+53
expansion   2.774605e+23 1.000000e+00 8.113801e+76
stable      0.000000e+00 0.000000e+00 1.000000e+00
## neural nets are taking a bit longer, but try it if you want!
#modsel.nnet <- postpr(observed, sims$model, stats, tol=.05, method="neuralnet")
#summary(modsel.nnet)

##Do the models provide a good fit?
fit.expansion=gfit(target=observed, sumstat=sims[sims$model=="expansion",c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")],statistic=median, nb.replicate=100,tol=0.05)

fit.stable=gfit(target=observed, sumstat=sims[sims$model=="stable",c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")],statistic=median, nb.replicate=100,tol=0.05)


fit.contraction=gfit(target=observed, sumstat=sims[sims$model=="contraction",c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")],statistic=median, nb.replicate=100,tol=0.05)


summary(fit.expansion)
$pvalue
[1] 0.94

$s.dist.sim
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.673   2.114   2.578   4.251   4.293  29.734 

$dist.obs
[1] 1.834603
summary(fit.stable)
$pvalue
[1] 0.24

$s.dist.sim
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.927   2.408   3.219   6.662   7.548  30.122 

$dist.obs
[1] 8.787899
summary(fit.contraction)
$pvalue
[1] 0.27

$s.dist.sim
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.639   2.122   2.353   2.580   3.002   5.190 

$dist.obs
[1] 2.964216

We will discuss this together, but which model seems to perform best? Does any model show a significantly poor fit?

With abcrf, we trained a classification algorithm; we can now predict.

colnames(observed)=colnames(stats)
estimate = predict(modsel_rf, observed, data.frame(model = as.factor(sims$model), stats),paral=T, ncores=4)
estimate
  selected model votes model1 votes model2 votes model3 post.proba
1      expansion            4          996            0      0.997

Ability to actually infer correct parameters (cross-validation and out-of-bag).

Now, we stick to one model, but want to determine whether it can actually properly estimate parameters. To do this, we use cross-validation:

parameters=sims[sims$model=="expansion",c("N_PRESENT", "T_EVENT", "FACTOR", "SIZE")]
sumstats_expansion=sims[sims$model=="expansion",c("TajimaD_mean", "TajimaD_sd", "pi_mean", "pi_sd")]

crossval.expansion.loclinear = cv4abc(parameters, sumstats_expansion, nval=100,tols=c(0.05), method = "loclinear")
#diagnostic plot:
plot(crossval.expansion.loclinear)

What do you think of our ability to infer the correct parameters? What could we do to improve that?

In abcrf, the procedure is slightly different. First we train a regression task on the set of statistics for our model of interest. The limitation of abcrf, at the moment, is its inability to infer more than one parameter at a time: it requires a separate training run for each parameter in turn.

#par(mfrow=c(4,2))
for (i in 1:4){
parameter=parameters[,i]
data=data.frame(parameter,sumstats_expansion) 
mod = regAbcrf(parameter~., data, ntree=1000, paral=T, ncores=4)
err.regAbcrf(mod, data)
estimate2 <- predictOOB(mod, data, paral = TRUE,ncores =8)
plot(parameter,estimate2$expectation,xlab="True",ylab="Estimate")
abline(0,1)
}

Parameter inference

res <- abc(target=observed, param=parameters,
sumstat=sumstats_expansion, tol=0.05, transf=c("log"), method = "loclinear")
Warning: All parameters are "log" transformed.
posteriors=as.data.frame(res$adj.values)
#joint distribution
plot(posteriors$N_PRESENT*10000,posteriors$SIZE*10000,xlab="Present size",ylab="Past size",pch=16)

hist(posteriors$T_EVENT*4*10000,xlab="Time since expansion, in years (posterior)",main="Posterior of expansion time")

#posterior distribution for one parameter

You can play with the transformation. Overall, if a parameter is not expected to affect your summary statistics in a strictly linear way, it may be good to use a log or logit transformation; it is usually a good idea to compare transformations and check their behaviour.
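For instance, one could rerun the estimation with a logit transformation bounded by the prior ranges used in the simulation script (the SIZE bounds below are simply derived from the N_PRESENT and FACTOR priors). Treat this as a sketch to be adapted; see ?abc for the exact behaviour of transf and logit.bounds:

##Bounds follow the priors in simulations_complicated_script.sh; SIZE = N_PRESENT x FACTOR
bounds <- rbind(N_PRESENT = c(0.02, 20),
                T_EVENT   = c(0.01, 5),
                FACTOR    = c(0.01, 0.5),
                SIZE      = c(0.02 * 0.01, 20 * 0.5))
res_logit <- abc(target=observed, param=parameters,
                 sumstat=sumstats_expansion, tol=0.05,
                 transf=rep("logit", 4), logit.bounds=bounds,
                 method="loclinear")
hist(as.data.frame(res_logit$adj.values)$T_EVENT*4*10000,
     xlab="Time since expansion, in years (posterior)", main="Posterior, logit transform")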

We can also use random forests. It starts the same way as the previous abcrf command, since we need to train the model before making the inference. One could put everything in a single loop.

res_rf = list()
for (i in 1:4){
parameter=parameters[,i]
data=data.frame(parameter,sumstats_expansion) 

mod = regAbcrf(parameter~., data, ntree=1000, paral=T, ncores=4)
estimate = predict(mod, observed, data, paral=T, ncores=4)
param_name = colnames(parameters)[i]
res_rf[[param_name]] = list()
res_rf[[param_name]][['expectation']] = estimate$expectation
res_rf[[param_name]][['variance']] = estimate$variance
res_rf[[param_name]][['quantile025']] = estimate$quantiles[1]
res_rf[[param_name]][['quantile975']] = estimate$quantiles[2] 
}   

##To check a given parameter
res_rf$N_PRESENT
$expectation
[1] 1.514154

$variance
[1] 7.940509

$quantile025
[1] 0.2258075

$quantile975
[1] 11.60039

What is the demographic history of this population? How confident are we about it?

Now that we have our estimated parameters and their posterior distributions, what do you think we could do to further check that the model performs well compared to others?

DILS (https://github.com/dils-popgen/dils) is a good example of a more user-friendly approach, which automates the process we just described for one or two populations. However, I would not recommend using it without a proper understanding of what it does. But now, you should be better equipped to use ABC, and even create your own models!

But wait… What about the ‘truth’?

Below is the script that was used to generate the observed dataset (yes, we cheated… it is always easier when you use simulations to generate a pseudo-observed dataset). What do you notice? What do you think of our inference?

###creating the actual demography: bottleneck 10000 generations ago, assuming a N0 of 10000. The population fell to 100 in size during the bottleneck, used to be 100000, and goes back to N0 of 10000 following an exponential growth. We simulate 100 loci for 50 haplotypes. Recombination rate is the same as mutation rate. We assume a mutation rate of 5e-9/generation, a generation time of one year, a locus length of 10kb.

ms 50 100 -t 2.0 -r 2.0 10000 -eN 0 1 -eG 0 18.42 -eN 0.25 0.01 | ./ms2vcf -length 10000 > simulation_neutral.vcf

##Obtain statistics per locus
#module load vcftools
vcftools --vcf simulation_neutral.vcf --TajimaD 10000 --out Tajima
vcftools --vcf simulation_neutral.vcf --window-pi 10000 --out pi
##Average statistics per simulation/dataset

awk '{x[NR]=$4; s+=$4} END{mean=s/NR; for(i=1;i<=NR;i++){ss+=(x[i]-mean)^2} sd=sqrt(ss/NR); print mean,sd}' Tajima.Tajima.D > tmp1
awk '{x[NR]=$5; s+=$5} END{mean=s/NR; for(i=1;i<=NR;i++){ss+=(x[i]-mean)^2} sd=sqrt(ss/NR); print mean,sd}' pi.windowed.pi > tmp2
paste tmp1 tmp2 > observed_stats.txt

You can check the demography with POPdemog (https://github.com/YingZhou001/POPdemog), an R package that visualizes an ms command line as a demographic scenario.

if(!require("POPdemog",character.only = TRUE)) install.packages("https://github.com/YingZhou001/POPdemog/raw/master/POPdemog_1.0.3.tar.gz", repos=NULL)
Loading required package: POPdemog
library(POPdemog)
##Type
?PlotMS()
###to get a list of options
PlotMS(input.cmd = "ms 50 100 -t 2.0 -r 2.0 10000 -eG 0 18.42 -eN 0.25 0.01",type="ms",time.scale="log10year",N4=40000,gen=1,col.pop=c("coral3"),pops=c("Fake population"),demo.out=F)
There are  2 time events for  1  populations
read N and g, done!
read m, done!
read pos and update, done!
demographic initiation, done!
plot initiation, done!

If you have time, you can rerun the ABC pipeline with the dataset containing 100,000 simulations per scenario.

Methods based on the Sequentially Markovian coalescent (SMC).

These methods are interesting because they can fit models with many changes in population size, and they take care of placing the times (‘epochs’) at which the size changes. In a sense, they are more model-agnostic, and they can be really useful on whole-genome sequencing data to get a first sense of what is happening.

We will use two methods derived from PSMC, MSMC2 and SMC++. We will discuss them in our lecture introducing the topic, but in a nutshell:

  • the old PSMC algorithm works with a single, unphased diploid individual, not more.

  • you can use MSMC2 instead of PSMC on a single diploid genome to obtain estimates of coalescence rates. There are a few differences between the algorithms, but MSMC2 is much faster and addresses a few issues related to variation in recombination rates.

  • MSMC2 can deal with up to 8 diploid individuals (realistically), as well as with two populations.

  • There are more recent programs based on the same idea, such as cobraa, Relate, or an adaptation of the PSMC algorithm to background selection. ARG-based methods such as Relate are much more appropriate if you are dealing with really large datasets.

Lastly (Rémi Tournebize could tell you a lot more), keep in mind that while we often interpret changes in coalescence rates as changes in effective population size, there are many other ways to interpret these changes in coalescence times. In particular, one can interpret them as changes in connectivity between populations in an island model. SNIF is a method that can fit such models to your data.

For today, given our limited time, we will mostly focus on interpreting the output of SMC++/MSMC2.

First, let’s simulate a 10Mb chromosome! Note that here we use the -p option, to increase the precision of the ms output to the 6th digit.

./ms 50 1 -t 2000 -r 2000 10000000 -p 6 -eN 0 1 -eG 0 18.42 -eN 0.25 0.01 | ./ms2vcf -length 10000000 > simulation_chromosome.vcf

We also provide a mask file, mask.bed, which we will use as an illustration. For an actual dataset, this would correspond to the regions that we can trust. Typically, we exclude regions where mapping short reads would be difficult; good tools for that task are genmap and SNPable. For MSMC2, we can also specify an individual mask, to further remove regions that are poorly covered in a given individual from the analysis. Therefore, pick a set of individuals with high sequencing depth!

MSMC2

For MSMC2, we cannot use all 25 diploid individuals. But we DO need many individuals for phasing: MSMC2 can only take multiple individuals if they are phased. The ms output is already phased, but I provide scripts for phasing so you can reuse this workshop directly on your own data.

Here is a pipeline that phases the data with SHAPEIT4, then uses bcftools to create one VCF per individual for five samples, and finally combines these five individuals with the mask file to create the MSMC2 input. This is of course only a template; you could adjust it to loop over chromosomes and individuals. Execute it in the terminal!


module load shapeit4
module load msmc2/2.1.4
module load bcftools
module load tabix

bgzip simulation_chromosome.vcf
tabix -p vcf simulation_chromosome.vcf.gz
shapeit4 --input simulation_chromosome.vcf.gz --region 1 --thread 4 --output phased_simulation_chromosome.vcf.gz
tabix -p vcf phased_simulation_chromosome.vcf.gz

rm -f samples.txt
echo HG01 >> samples.txt
echo HG02 >> samples.txt
echo HG03 >> samples.txt
echo HG04 >> samples.txt
echo HG05 >> samples.txt

while read sample; do
  bcftools view -s "$sample" -Oz -o "${sample}.vcf.gz" phased_simulation_chromosome.vcf.gz
done < samples.txt

##get msmctools
git clone https://github.com/stschiff/msmc-tools.git

./msmc-tools/generate_multihetsep.py --chr 1 --mask mask.bed HG01.vcf.gz  HG02.vcf.gz  HG03.vcf.gz HG04.vcf.gz HG05.vcf.gz> five_samples.multihetsep.txt

##create bootstraps (optional):

./msmc-tools/multihetsep_bootstrap.py -n 100 --chunk_size 50000 --chunks_per_chromosome 20 --nr_chromosomes 1 bootstraps five_samples.multihetsep.txt

Now, we can run MSMC2 proper (execute in the Terminal, not by clicking on the green arrow). To save time, we only run on the first six haplotypes, but you could use more:

module load msmc2/2.1.4

msmc2_Linux -I 0,1,2,3,4,5 -p 1*2+10*1+1*2 -o five_samples_inference five_samples.multihetsep.txt -t 7
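##In the command above: -I 0,1,2,3,4,5 selects the first six haplotypes (three diploid individuals),
##-p sets the pattern of time segments, and -t the number of CPU threads.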

Let’s now plot our results (in R, come back here!).

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:MASS':

    select
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Read input data
table <- read.table("five_samples_inference.final.txt", header = TRUE)

# Define constants
mut_rate_year <- 5.e-9
generation_time <- 1  # Adjust as needed

# Compute time and Ne estimates
table <- table %>%
  mutate(left_time_demo = left_time_boundary / mut_rate_year,
         right_time_demo = right_time_boundary / mut_rate_year,
         Ne = (1 / lambda) / (2 * mut_rate_year * generation_time)
         )

# Create the plot
ggplot(table, aes(x = right_time_demo, y = Ne)) +
    geom_step(size = 1.2) +  # Step plot to match the "type=s" in base R
    scale_x_log10() +
    scale_y_log10(limits = c(100, 2000000)) +
    labs(x = "Time in years", y = "Ne", title = "Effective Population Size Over Time") + geom_vline(xintercept = 10000,linetype="dotted")+
    theme_minimal()  +
    theme(legend.position = "right", legend.title = element_blank())
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_step()`).

What do you think? Does it work?

What you could try now: run on the bootstraps we generated before, and assess our confidence. You can also try to run directly on the ms output, to check whether phasing has a strong impact.

SMC++

Here, we can use all individuals and skip phasing, which is very neat. Note that if you trust the ancestral/derived state in your VCF, you can use the so-called unfolded frequency spectrum, with the --unfold option. Have a look at https://github.com/popgenmethods/smcpp for a list of options. SMC++ uses ‘distinguished lineages’, and it is usually recommended to estimate a composite likelihood by combining inputs generated using different distinguished lineages. The script below does just that, looping over the individuals listed in the for statement (only HG01 by default; add more if you wish). We will not use the phasing here.

Be careful! The mask file in SMC++ gives the regions one can NOT trust!

#module load smcpp/1.15.4 does not work on IFB for some reason.
module load tabix
module load bedtools


echo -e '1\t0\t10000000' | bedtools subtract -a -  -b mask.bed.gz > mask_smcpp.bed
##SMC++ takes zipped and tabixed (~indexed) bed files. Do not forget that step.
bgzip mask_smcpp.bed
tabix -p bed mask_smcpp.bed.gz

list_pop_ind=$(zcat phased_simulation_chromosome.vcf.gz | grep CHR | cut -f10- | sed "s:\t:,:g")
chrom=1
pop=sim
rm -rf smcpp_input
mkdir smcpp_input
for ind in HG01 
##You could add more individuals!! HG02 HG03
do
singularity exec docker://terhorst/smcpp:latest smc++ vcf2smc -d ${ind} ${ind} phased_simulation_chromosome.vcf.gz --mask mask_smcpp.bed.gz smcpp_input/${ind}_${pop}_${chrom}.smc.gz ${chrom} ${pop}:${list_pop_ind} --length 10000000

##You could use docker on your own laptop:
#sudo docker run --rm -v $PWD:/mnt terhorst/smcpp:latest vcf2smc -d ${ind} ${ind} phased_simulation_chromosome.vcf.gz --mask mask.bed.gz smcpp_input/${ind}_${pop}_${chrom}.smc.gz ${chrom} ${pop}:${list_pop_ind} --length 10000000

done


###SMCPP inference proper
LIST_FILES=$(ls smcpp_input/*.smc.gz | tr '\n' ' ')

singularity exec docker://terhorst/smcpp:latest smc++ estimate --knots 10 -o smcpp_output/ 5e-09 ${LIST_FILES}

##You could use docker on your own laptop:
#sudo docker run --rm -v $PWD:/mnt terhorst/smcpp:latest estimate --knots 10 -o smcpp_output/ 5e-09 ${LIST_FILES}

singularity exec docker://terhorst/smcpp:latest smc++ plot smcpp.pdf -c -g 1 smcpp_output/model.final.json 

##You could use docker on your own laptop:
#sudo docker run --rm -v $PWD:/mnt terhorst/smcpp:latest plot smcpp.pdf -c -g 1 smcpp_output/model.final.json 

In R:

library(ggplot2)
SMCPP=read.csv("smcpp.csv",h=T)
plot(SMCPP$x,SMCPP$y,lty=2,lwd=3,type="s",col="red",log="xy",xlab="Time in years",ylab="Ne estimate")
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0 omitted from
logarithmic plot

How does it perform? What could be a reason?

SMC++ does not come with a built-in way to perform bootstraps, but someone (https://github.com/popgenmethods/smcpp/issues/37) kindly designed a block-bootstrap tool that resamples from a generated SMC++ input file, which I provide here as bootstrap_smcpp.py.

The inference is a bit long, but if you want to test it, do:

python bootstrap_smcpp.py --nr_bootstraps 10 --nr_chromosomes 1 --chunks_per_chromosome 100 SMCPP_boots smcpp_input/HG01_sim_1.smc.gz

##Then you run smc++ on each bootstrap, for example, on the second bootstrapped dataset:
sudo docker run --rm -v $PWD:/mnt terhorst/smcpp:latest estimate --knots 10 -o smcpp_output/ 5e-09 SMCPP_boots_2/*

Likelihood-based methods (allele/site frequency spectrum, AFS/SFS).

With the advent of massive NGS data, collecting millions of SNPs for thousands of individuals has become feasible, although this type of dataset remains difficult to handle. SFS-based likelihood methods are quite useful here: compared to ABC, they deal more easily with whole-genome data and large sample sizes, and they are less prone to human error. But we need enough data, at the very least what we obtained from our 10 Mb chromosome above (if you wish, you can simulate more SNPs).

Let’s try on the dataset we generated for SMC methods. First, we need to extract the allele frequency spectrum!

We directly use Python functions implemented in dadi here (https://dadi.readthedocs.io/en/latest/, another method that can infer demography from the AFS; tell me if you want to know more than what I covered in the lecture), but for an actual dataset, do not forget to use bcftools first, to filter on missing data per population, remove low-quality genotypes, or restrict the analysis to a whitelist of loci.

Note that an alternative is realSFS from the ANGSD suite (https://github.com/ANGSD/angsd). It starts with BAM files, and the output can be directly used for dadi or fastsimcoal (with a few modifications). This can be useful in case of low sequencing depth, otherwise the approach above is fine. We will cover this in tomorrow’s session.

If you have large whole-genome data, I would rather recommend winsfs: https://github.com/malthesr/winsfs

We first need to create a file linking individuals to their population of origin. In our case, we can do (in the terminal):

zcat simulation_chromosome.vcf.gz | grep CHROM  | cut -f10- | sed "s:\t:\tpop1\n:g" | sed "s:HG25:HG25\tpop1:" > list_pops_individuals.txt

##Prepare! Import the dadi module!
module load dadi/2.2.0 

mkdir boot ##to save our bootstraps
#execute the script
python script_dadi_generate_SFS.py

This is just to avoid editing files by hand as much as possible, but it gives an idea of what can be done with a few bash commands when needed. You can run a Jupyter notebook for that part (scripts_dadi_import_manipulate_spectra.ipynb); the Python script launched in the cell above is shown below (do not execute the following cell, and do not forget to import the dadi module first):

import dadi
import numpy
from numpy import array
import nlopt


# Parse the VCF file to generate a data dictionary
datafile = 'simulation_chromosome.vcf.gz'
dd = dadi.Misc.make_data_dict_vcf(datafile, 'list_pops_individuals.txt')
dd

# Extract the spectra, we do NOT project here, but we could. If so, we want to project to the minimum 
# number of sampled alleles in the population.
pop_ids, ns = ['pop1'], [50]

fs_obs = dadi.Spectrum.from_data_dict(dd, pop_ids, ns,polarized=False)
print(fs_obs.S()) ##About 22,000 SNPs.

fs_obs.to_file('simulation_neutral.fs')

# Generate 100 bootstrap datasets, by dividing the genome into 10 kb chunks and
# resampling from those chunks.
Nboot, chunk_size = 100, 1e4
chunks = dadi.Misc.fragment_data_dict(dd, chunk_size)
boots = dadi.Misc.bootstraps_from_dd_chunks(chunks, Nboot, pop_ids, ns)

for i, fs_boot in enumerate(boots):
    filename = f"boot/simulation_neutral_boot_{i+1}.fs"
    fs_boot.to_file(filename)
    
    
###We can visualize our spectra like this:

dadi.Plotting.plot_1d_fs(fs_obs)

Note that our spectrum is folded: we assume that we do not know the ancestral state of each allele (and the VCF does not contain any information dadi could use). We therefore use the frequency of the minor (less frequent) allele instead. This is not actually true here: ms variants are derived variants, so we do know their status. But for many species, it may be safer to assume that we do not know the ancestral state, and to work with the folded frequency spectrum.
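To make the folding operation concrete, here is a small R sketch (just the arithmetic, not dadi code) folding a toy unfolded spectrum for n = 6 haplotypes with made-up counts:

n <- 6
unfolded <- c(30, 12, 8, 5, 2)   # made-up counts of SNPs with derived-allele count 1..(n-1)
folded <- sapply(1:(n %/% 2), function(i) {
  if (i < n - i) unfolded[i] + unfolded[n - i]  # minor-allele class i pools derived counts i and n-i
  else unfolded[i]                              # the middle class (i = n/2) maps to itself
})
folded  # minor-allele counts 1..n/2, here 32 17 8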

We have our spectrum. It could be directly used with dadi, but we are going to use fastsimcoal2, which is (arguably) faster and can deal with scenarios of any complexity. We do have a few example scripts for dadi and similar methods (see GADMA), contact me if you want to work with those.

##header_fsc.txt contains the two fastsimcoal2 header lines below (shown here as comments):
#1 observations
#1 50

sed -n '2p' simulation_neutral.fs | cat header_fsc.txt - | sed "s:^0:10000000:" > simulation_neutral_MSFS.obs
##10000000 should actually be 10000000 minus the number of polymorphic sites, but each of you may have a different number... Feel free to adjust, or leave it as is; it will not change the result much.

Here, keep in mind that we simulated a single 10 Mb chromosome, so the number of monomorphic sites that produced these SNPs is 10 Mb minus the number of polymorphic sites. We need to include the number of invariant sites in our analysis if we are to use the mutation rate!
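As a quick sanity check of that number in R (the SNP count below is the approximate value printed by fs_obs.S() above; replace it with your own):

total_length <- 1e7    # the 10 Mb simulated chromosome
n_snps <- 22000        # approximate number of SNPs from fs_obs.S(); use your own value
total_length - n_snps  # roughly 9,978,000 monomorphic sites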

Now, we want to fit an expansion model to our SFS.

Here is the simulation_neutral.tpl file:

//Parameters for the coalescence simulation program : simcoal.exe
1 samples to simulate :
//Population effective sizes (number of genes)
NPRESENT
//Samples sizes and samples age
50 0
//Growth rates: negative growth implies population expansion
0
//Number of migration matrices : 0 implies no migration between demes
0
//historical event: time, source, sink, migrants, new deme size, growth rate, migr mat index
1 historical event
TCHANGE 0 0 1 RESIZE1 0 0
//Number of independent loci [chromosome]
1 0
//Per chromosome: Number of contiguous linkage Block: a block is a set of contiguous loci
1
//per Block:data type, number of loci, per gen recomb and mut rates
FREQ 1 0 5e-9 OUTEXP

And the prior file (.est)

// Priors and rules file
// *********************
[PARAMETERS]
//#isInt? #name #dist.#min #max
//all N are in number of haploid individuals
1 NPRESENT unif 100 100000 output
1 ANCSIZE1 unif 100 NPRESENT output paramInRange
1 TCHANGE unif 10 50000 output

[COMPLEX PARAMETERS]
0 RESIZE1 = ANCSIZE1/NPRESENT hide

We are ready to go. Let’s try:

./fsc28 -t simulation_neutral.tpl -n 100000 -m -e simulation_neutral.est --multiSFS -M -L 20 -q -c6

It is recommended to perform several independent runs to avoid getting stuck in a local optimum; the bootstrap example below shows how such runs can be organised. Today, we can simply let each of you run a few runs and keep the one with the best (highest) likelihood. So, how well does it work?

Note that you could also use the SFS as a summary statistic for ABC inference! It works rather well with random forests (thanks to their ability to handle many summary statistics).

To run bootstraps, you basically need to run the inference on each bootstrapped dataset and obtain confidence intervals for each parameter. Note that you can use an option to avoid running, say, 50 replicates for each of the 100 bootstraps: you can provide fastsimcoal2 with a set of starting parameter values that should be close enough to the true ones, which helps it reach the optimum faster. This way you only need to run each bootstrapped inference once.

For example, to obtain the inference on the first bootstrap and save it in its own folder, one could adjust the following script and run a loop over all the bootstraps (or use a SLURM array job; we can discuss this together).

##first use the fsc header to edit the .fs bootstrap then, for example
mkdir boot_1
cp simulation_neutral_boot_1_MSFS.obs boot_1/simulation_neutral_MSFS.obs
cp simulation_neutral.* boot_1
cp simulation_neutral/simulation_neutral.pv boot_1
cd boot_1
./fsc28 -t simulation_neutral.tpl -n 100000 -m -e simulation_neutral.est simulation_neutral.pv --multiSFS -M -L 20 -q -c6

Annex

Below is the script we used to generate the 3000 simulations (1000 per scenario) under three distinct demographic scenarios for ABC. It is a bit hard to digest, and certainly perfectible. Try to understand what it does.

#!/usr/bin/env bash
N_SIM=1000
nHAP=50
nLOCI=100
L=10000    # sequence length
t=2        # theta = 4*N0*mu*L, with N0 = 10,000 diploid individuals, mu = 5e-9, L = 10 kb
r=2        # recombination rate (4N0rL)

#header for the simulations
echo -e "model\tsimID\tN_PRESENT\tT_EVENT\tFACTOR\tSIZE\tTajimaD_mean\tTajimaD_sd\tpi_mean\tpi_sd" > simulated_stats.txt

for MODEL in stable contraction expansion; do
  for i in $(seq 1 $N_SIM); do

    # Draw parameters using R (log-uniform distributions)
read N_PRESENT T_EVENT CONTRACTION_FACTOR EXPANSION_FACTOR <<< $(Rscript -e '
  set.seed(as.integer(Sys.time()) + Sys.getpid())

  log_runif <- function(n, min, max) exp(runif(n, log(min), log(max)))

  N_PRESENT <- log_runif(1, 0.02, 20)
  T_EVENT <- log_runif(1, 0.01, 5)
  CONTRACTION_FACTOR <- log_runif(1, 2, 10)  # present size at least 2 times smaller than before T_EVENT
  EXPANSION_FACTOR <- log_runif(1, 0.01, 0.5) # present size at least 2 times larger than before T_EVENT

  cat(N_PRESENT, T_EVENT, CONTRACTION_FACTOR, EXPANSION_FACTOR)
')

    # Calculate ancestral sizes (in units of N0)
    SIZE_CONTRACTION=$(awk -v n0="$N_PRESENT" -v f="$CONTRACTION_FACTOR" 'BEGIN{print n0*f}')
    SIZE_EXPANSION=$(awk -v n0="$N_PRESENT" -v f="$EXPANSION_FACTOR" 'BEGIN{print n0*f}')

    OUTPREFIX="sim_${MODEL}_${i}"

    if [[ "$MODEL" == "contraction" ]]; then
      echo "[${i}] contraction N0=${N_PRESENT} T=${T_EVENT} F=${CONTRACTION_FACTOR} SIZE=${SIZE_CONTRACTION}"
      ./ms $nHAP $nLOCI -t $t -r $r $L \
        -eN 0 $N_PRESENT \
        -eN $T_EVENT $CONTRACTION_FACTOR \
        | ./ms2vcf -length $L > ${OUTPREFIX}.vcf

    elif [[ "$MODEL" == "expansion" ]]; then
      echo "[${i}] expansion N0=${N_PRESENT} T=${T_EVENT} F=${EXPANSION_FACTOR} SIZE=${SIZE_EXPANSION}"
      ./ms $nHAP $nLOCI -t $t -r $r $L \
        -eN 0 $N_PRESENT \
        -eN $T_EVENT $EXPANSION_FACTOR \
        | ./ms2vcf -length $L > ${OUTPREFIX}.vcf

    else
      echo "[${i}] stable N0=${N_PRESENT}"
      ./ms $nHAP $nLOCI -t $t -r $r $L \
        -eN 0 $N_PRESENT \
        | ./ms2vcf -length $L > ${OUTPREFIX}.vcf
    fi

    #summary stats
    vcftools --vcf ${OUTPREFIX}.vcf --TajimaD 10000 --out Tajima
    vcftools --vcf ${OUTPREFIX}.vcf --window-pi 10000 --out pi

    awk 'NR>1{vals[NR]=$4; sum+=$4} END{mean=sum/(NR-1); for(i in vals){ss+=(vals[i]-mean)^2} sd=sqrt(ss/(NR-2)); print mean,sd}' Tajima.Tajima.D > tmp1
    awk 'NR>1{vals[NR]=$5; sum+=$5} END{mean=sum/(NR-1); for(i in vals){ss+=(vals[i]-mean)^2} sd=sqrt(ss/(NR-2)); print mean,sd}' pi.windowed.pi > tmp2

    if [[ "$MODEL" == "contraction" ]]; then
      paste <(echo -e "${MODEL}\t${i}\t${N_PRESENT}\t${T_EVENT}\t${CONTRACTION_FACTOR}\t${SIZE_CONTRACTION}") tmp1 tmp2 >> simulated_stats.txt
    elif [[ "$MODEL" == "expansion" ]]; then
      paste <(echo -e "${MODEL}\t${i}\t${N_PRESENT}\t${T_EVENT}\t${EXPANSION_FACTOR}\t${SIZE_EXPANSION}") tmp1 tmp2 >> simulated_stats.txt
    else
      paste <(echo -e "${MODEL}\t${i}\t${N_PRESENT}\tNA\tNA\tNA") tmp1 tmp2 >> simulated_stats.txt
    fi

    #clean up files
    rm -f Tajima.* pi.* tmp1 tmp2
   rm -f *.vcf
  done
done