---
title: 'Introduction to Population Genetics'
author: "International Statical Genetics Workshop"
#date: '2023-03-07'
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
**This practical runs on the workshop RStudio server.**
First login to RStudio
```{bash, eval=FALSE}
https://workshop.colorado.edu/rstudio/
```
Copy the Practical document into your home directory using the following `R` commands.
```{R, eval=FALSE}
setwd("~/.")
system("mkdir Day2")
system("cp /home/loic/2023/Practical_PopGen.html ~/Day2/.")
setwd("Day2")
```
## Part 1: Visualize changes in allele frequencies
The `R` commands below define two functions: (1: `generateNextGeneration()`) generates genotypes for the next generation from a set of genotypes in the current population, and (2: `sim`), which uses (1) to simulate the evolution over $g$ generations of a population with a constant sample size ($N$), a certain number $m$ of SNPs segregating in the founding (or ancestral) population with a frequency ($p$). The `sim` function then plot the frequency of each of the `m` alleles over the course of `g` generations (each curve corresponds to a SNP).
```{R, eval=TRUE}
generateNextGeneration <- function(x,n){
m <- ncol(x)
N <- nrow(x)
iM <- sample(1:N,n,replace = TRUE)
iF <- sample(1:N,n,replace = TRUE)
xm <- x[iM,]
xf <- x[iF,]
xa <- t(sapply(1:n,function(k) rbinom(m,1,prob=0.5*xm[k,])))
xb <- t(sapply(1:n,function(k) rbinom(m,1,prob=0.5*xf[k,])))
xn <- xa + xb
return(xn)
}
sim <- function(N, # sample size
p, # allele frequency in the founding population
m, # number of marker
g){ # number of generations
## Simulate genotypes in the founding population
x <- do.call("cbind",lapply(1:m,function(j) rbinom(N,size=2,prob=p)))
alleleFreq <- matrix(NA,nrow=g+1,ncol=m)
alleleFreq[1,] <- colMeans(x)/2
for(gen in 1:g){
# cat(paste0("\tGen #",gen,"\n"))
x <- generateNextGeneration(x,N)
alleleFreq[gen+1,] <- colMeans(x)/2
}
par(mar=c(5,5,3,2))
matplot(1:(1+g),alleleFreq,type="l",lty=1,axes=FALSE,
main=paste0("Size of the population: N = ",N),
xlab="Number of generations",ylab="Allele frequencies",
col=sample(colors(),m),log="x",cex.lab=1.2)
axis(1);axis(2)
abline(h=p,col="grey")
return(alleleFreq)
}
```
**Execute these two functions in your `R` terminal then run the following command.**
```{R, eval=FALSE}
t1 <- system.time( m50 <- sim(N=50 ,p=0.05,m=20,g=1000) )
```
*Question 1a. What is the allele frequency after 1000 generations?*
*Question 1b. Do you see the same patterns if you run this command multiple times?*
*Question 1c. Try different sizes for the ancestral population (e.g., $N=100$, or $N=200$). How does it affect the frequency trajectories?*
*Question 1d. Try different allele frequency in the ancestral population (e.g., $p=0.1$, or $p=0.01$). How does it affect the frequency trajectories?*
*Question 1e. Run the following command. What can you say about the effect of the ancestral population size on the probability of fixation?*
```{R, eval=FALSE}
t1 <- system.time( m50 <- sim(N=50 ,p=0.05,m=20,g=1000) )
t2 <- system.time( m100 <- sim(N=100 ,p=0.5,m=20,g=1000) )
t3 <- system.time( m200 <- sim(N=200 ,p=0.5,m=20,g=1000) )
```
The objects `m50`, `m100` and `m200` contains the frequency trajectories of three simulated populations with sizes 50, 100 and 200 respectively. The following `R` commands count and visualize the proportion of fixed alleles over 1000 generations.
**Run the following command. What can you can**
```{R, eval=FALSE}
f50 <- apply(m50*(1-m50)==0 ,1,mean)
f100 <- apply(m100*(1-m100)==0,1,mean)
f200 <- apply(m200*(1-m200)==0,1,mean)
par(mar=c(5,5,3,2))
matplot(cbind(f50,f100,f200),type="l",lwd=2,lty=1,
xlab="Number of generations",axes=FALSE,cex.lab=2,
ylab="Fraction of fixed allele frequencies")
axis(1);axis(2)
legend(800,0.25,legend=c("N=50","N=100","N=200"),
box.lty=0,fill=1:3,horiz=FALSE,border=0,cex=1)
```
## Part 2: Genetic drift creates more differentiation
The `R` commands below define a function called `FstSim`, which tracks $F_{ST}(t)$ between the ancestral population and the populuation at time $t$. This functions reuses the `generateNextGeneration` function defined above. Input parameters for that function are size of the ancestral population ($N$), the allele frequency ($p$) in the ancestral population, the number $m$ of SNPs segregating in the ancestral population, the number $g$ of generations, the grow rate $r$ (default $r=0$, i.e. the population does not grow over time), and $k$ an index for the replicate.
**Run the following command to define the function.**
```{R, eval=TRUE}
FstSim <- function(N=100, # sample size
p=0.5, # allele frequency in the ancestral population
m=100, # number of markers
g=500, # number of generations
r=0.0, # growth rate (r=0, no growth; r=0.01: 1% growth per generation)
k=0, # replicate index
plotIt=TRUE){
## Simulate base population
x <- do.call("cbind",lapply(1:m,function(j) rbinom(N,size=2,prob=p)))
alleleFreq <- matrix(NA,nrow=g+1,ncol=m)
alleleFreq[1,] <- colMeans(x)/2
n <- N
for(gen in 1:g){
n <- round(n*(1+r))
x <- generateNextGeneration(x,n)
alleleFreq[gen+1,] <- colMeans(x)/2
}
cat(paste0("\tReplicate #",k,"\n"))
fst <- apply(alleleFreq,1,function(x) mean((x-p)^2) / (p*(1-p)) )
return(fst)
}
```
*Question 2a. Under the Wright-Fisher model, we can predict that $F_{ST}(t) \approx 1-e^{-t/(2N)}$. Run the following commands to verify how well this prediction works. If $N=100$, then how many years do we have to wait for $F_{ST}(t)$ to reach 0.1 ? What if $N=10,000$? (We assume that 1 generation $\approx$ 25 years).*
```{R, eval=FALSE}
r <- 0.0 # growth rate of population size
m <- 100 # number of SNPs
p0 <- 0.5 # allele frequency in the ancestral population
nG <- 500 # number of Generations
N <- 100 # size of the ancestral population
dt <- 0:nG # Time (in number of generations)
# Run simulation
Fst <- FstSim(N=N,p=p0,m=m,g=nG,r=r,k=0)
plot(dt,Fst,pch=19,cex=0.5,ylim=c(0,1),cex.lab=1.0,
xlab="Number of generations",ylab="Fst between Current and Ancestral population",
axes=FALSE);axis(1);axis(2)
ExpectedFst <- sapply(dt,function(t) 1-exp(-t/(2*N)))
lines(dt,ExpectedFst,col=2,lwd=2)
legend(0.5*nG,0.2,legend=c("Theoretical Expectation"),col=c(2,4),box.lty=0,lwd=2)
```
*Question 2b. Change the growth rate of the population to $r=0.01$ (1%). What do you see?* **[WARNING: Setting growth rate beyond 1% will lead to extremely slow runs].**
*Question 2c. Change the growth rate of the population to $r=-0.05$ (-5%). What do you see?*