# Create a new folder in your account and copy the material
mkdir Day4/PGS
cp -r /home/perline/2024/PGSpractical Day4/PGS
cd Day4/PGS/PGSpractical # go to your copy of the folder
ls #look at what the folder contains and how it is structured
Creating and using polygenic scores
Set up
We will use bash and R. You can use both from RStudio, bash in the Terminal and R in the Console.
Step 1: Obtain GWAS summary statistics from a large discovery sample
For this analysis we will be using the latest GWAS summary statistics for body height from Yengo et al. (2022), downloaded from GIANT consortium results.
The GWAS summary statistics for predominantly East Asian ancestry sub-sample can be found among the files you copied into your working directory. To make it manageable, we already reduce the file to only SNPs associated with height with p <1e-2.
# Decompress the summary statistics
gunzip input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2.gz
# Count the number of lines in the file
wc -l input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2
# Look at the first 10 lines
head input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2
What is the number of variants in this sumstats file?
When the phenotype is binary, the GWAS effect size might be given as odd ratio (OR), rather than BETA (for continuous traits). You might want to transform this effect to log(OR) so that the PRS can be computed using summation. See 3.3 PRS Unit in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7612115/
Step 2: Obtain independent sample that has genome-wide data
We will use some made-up data, based on the individuals with East-Asian ancestry present in the 1000Genome phase 3 and some simulated phenotype. You need a sample with enough genotyped participants (with QC’d data) and a phenotype of interest.
The genotype data is in the plink files 1kg_ph3_EAS. The phenotype data is in the file HEIGHT.pheno.
Step 3: Select SNPs in common between the two samples
We want to maximize the number of SNPs potentially used for our PGS. For this, we select SNPs present in both the base sample and the target sample, before clumping.
Because we use a toy target sample that is based on the reference sample, we are skipping this step as this will be done while clumping.
Step 4: Select independent SNPs by pruning based on LD
Clumping is a greedy algorithm that reduces the full set of SNPs to a set of ‘clumps’, where each clump is represented by one index SNP, the SNP that has the strongest association with the outcome among all SNPs in that clump.
Choosing a reference genome to infer the LD structure
The correlation between the SNPs is calculated using a reference genotype dataset in plink bed/bim/fam format. These genotypes should be representative of the GWAS sample in terms of genetic ancestry. Therefore, we use the East-Asian ancestry samples from the 1000 Genomes Phase 3 data. The files (which you copied to your working directory) were downloaded from https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg, and selection of EA ancestry and reformating to plink format was already done (you can find the code in the scripts folder).
Clumping
The plink clumping approach takes the GWAS results as starting point. Clumps (i.e. groups of SNPs in LD) are formed around central ‘index SNPs’. Index SNPs are chosen greedily starting with the lowest P-value. SNPs that meet the –clump-p1 threshold, but have already been assigned to another clump, do not form their own clump unless –clump-best is specified. By default, SNPs that are less than 250k base pairs away from an index SNP, that have and r2 larger than 0.5 with that index SNP, and have an association P-value < 0.0001 are assigned to that index variant’s clump (unless they have already been assigned to another clump). These arguments can be changed, how strict you should be with these vlaues will depend on why you need the PGS. The function outputs a list of the index SNPs.
#Info on the arguments:
# --bfile: the reference files in plink format
# --clump: the sumstats containing the SNPs to clump
# --out: name you want to give to the output
# --clump-snp-field: name of the SNP ID column in the sumstats
# --clump-field: name of pvalue column in the sumstats
# --clump-p1: p-value threshold for a SNP to be included as index
# --clump-kb: will clump any SNPs that are within X kb to both ends of the index SNP, that is a 2xX kb window
# --clump-r2: removes SNPs with a r2 higher than X with the index SNPs. The r2 measure is simply calculated as the square of the correlation coefficient of the alleles between two SNPs, and it is therefore constrained to take values between 0 and 1.
#careful the bfiles are in a shared folder to make it manageable
plink --bfile /usr/local/data/1kG/1kG_ph3_EAS \
--clump input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 \
--out output/HEIGHT_EAS_p1e-2_rsq60 \
--clump-snp-field RSID \
--clump-field P \
--clump-p1 1e-2 \
--clump-kb 1000 \
--clump-r2 0.6
head output/HEIGHT_EAS_p1e-2_rsq60.clumped
# Extract the list of index SNPs (independently associated SNPs)
# the column with the SNP IDs is column 3
awk 'NR>1{print $3}' output/HEIGHT_EAS_p1e-2_rsq60.clumped > output/HEIGHT_EAS_p1e-2_rsq60.snps
How many clumps were made?
Step 5: Construct the scores
Now we can make the PGIs. We will make the PGIs for the same sample we used to estimate LD (1000 Genomes EAS ancestry). Of course, in reality, this does not have to be the case, i.e. the reference panel is usually not the same as the target sample (But it is with the same ancestry!).
We will make several scores for each participant, to investigate the effect of the pvalue threshold chosen on how predicitve the PGI is.
# Make the PGI for P<1e-2 #take all the snps in the .snps file, which we have earlier selected to be with P<1e-2
# Have a look at plink documentation if you are confused about the functions and arguments e.g. https://www.cog-genomics.org/plink/2.0/score
# Info on the functions:
# --extract: only keeps the SNPs listed in this file
# --score: 2nd column is SNPID, 5th is the effect allele, 8th is the effect size,
# list-variants tells it to save the list of SNPs used to create the PGS
plink2 --bfile /usr/local/data/1kG/1kG_ph3_EAS \
--extract output/HEIGHT_EAS_p1e-2_rsq60.snps \
--score input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 2 5 8 list-variants\
--out output/PGI_height_1kG_ph3_EAS.p1e-2
#We actually want to adapt the p-value threshold and test different thresholds.
# Create range file
echo -e "p5e-8 0.00 5e-8\np1e-5 0.00 1e-5\np1e-3 0.00 1e-3" > output/range.txt
# Make PGIs for P<1e-3, P<1e-5, P<5e-8
# --q-score-range: 1st arg defines the rnage of pvalues wanted, 2nd arg takes the file where the pvalues are defined, here 2nd column is SNPID and 10th column is the pvalue
plink2 --bfile /usr/local/data/1kG/1kG_ph3_EAS \
--extract output/HEIGHT_EAS_p1e-2_rsq60.snps \
--score input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 2 5 8 list-variants\
--out output/PGI_height_1kG_ph3_EAS \
--q-score-range output/range.txt input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 2 10 'header'
How many did you just created? What are they?
You should now have in your output folder 4 files ending in .sscore, which contain the PGS for each individual, created with the pvalue threshold writen in the file name. You also get a .log file and a .sccore.vars file which contains the IDs of the SNPs used.
How many SNPs were used to make each PGS?
Hum not that easy… The log files tell you only one number “22378 variants processed”, but does not tell you how many variants were actually used for each p-threshold. We have to check it by ourselves.
One way of doing it is to get the list of SNPs used (which should amount to 22378), and check how many are in each range of pvalues. This is seriously a pain to do in unix, so lets do it in R instead:
setwd("2024/PGSpractical")
library(data.table)
#BONUS get number of SNPs#####
<- fread("output/PGI_height_1kG_ph3_EAS.sscore.vars", header=F)
snps_used <- fread("input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2")
gwas <- merge(snps_used, gwas, by.x= "V1", by.y="RSID")
snps_used_pval head(snps_used_pval)
# Number of SNPs with p<0.01
nrow(snps_used_pval[snps_used_pval$P<0.01,])
nrow(snps_used_pval[snps_used_pval$P<0.001,])
nrow(snps_used_pval[snps_used_pval$P<1e-5,])
nrow(snps_used_pval[snps_used_pval$P<5e-8,])
So the results are:
P<1e-2: 22378
P<1e-3: 9762
P<1e-5: 3530
P<5e-8: 1724
Step 6: Building a prediction model
Now we will use these PGIs to run some prediction analyses in R. In this example we will test if this PGI of height is indeed predictive of height in our independent sample.
Let’s move to R! We first read in the files containing the PGIs we just created.
setwd("Day4/PGS/PGSpractical")
library(data.table)
# Read the PGS files
<- fread("output/PGI_height_1kG_ph3_EAS.p1e-2.sscore", header=T)
df_PGI_p1e2 <- fread("output/PGI_height_1kG_ph3_EAS.p1e-3.sscore", header=T)
df_PGI_p1e3 <- fread("output/PGI_height_1kG_ph3_EAS.p1e-5.sscore", header=T)
df_PGI_p1e5 <- fread("output/PGI_height_1kG_ph3_EAS.p5e-8.sscore", header=T)
df_PGI_p5e8
# How do these files look like?
head(df_PGI_p1e2)
# Make histograms
hist(df_PGI_p1e2$SCORE1_AVG, breaks=30)
hist(df_PGI_p1e3$SCORE1_AVG, breaks=30)
hist(df_PGI_p1e5$SCORE1_AVG, breaks=30)
hist(df_PGI_p5e8$SCORE1_AVG, breaks=30)
For how many participants did we just create PGS? What does the distribution of PGS look like?
Standardize the PGS
It is common to standardize the PGI to have mean zero and standard deviation one in order to interpret effect estimates in terms of “a one standard-deviation increase in PGI is expected to increase Y by β”. Standardize all PGIs using the scale function:
# Standardize the PGS
$PGI_p1e2_std <- scale(df_PGI_p1e2$SCORE1_AVG)
df_PGI_p1e2$PGI_p1e3_std <- scale(df_PGI_p1e3$SCORE1_AVG)
df_PGI_p1e3$PGI_p1e5_std <- scale(df_PGI_p1e5$SCORE1_AVG)
df_PGI_p1e5$PGI_p5e8_std <- scale(df_PGI_p5e8$SCORE1_AVG) df_PGI_p5e8
Join with phenotype file
We have prepared a simulated phenotype for this tutorial: HEIGHT.pheno. Read the file into R. The file also contains sex and 10 principal components. We want to merge this phenotype data frame (let’s assume it’s called HEIGHT) with the PGI data frames.
# Read in phenotype data
<- read.table("input/HEIGHT.pheno", header=T)
HEIGHT # notice that the phenotype data contains less participants than the genetic data
# Merge all data frames
<- merge(HEIGHT, df_PGI_p1e2, by="IID")
df <- merge(df, df_PGI_p1e3, by="IID")
df <- merge(df, df_PGI_p1e5, by="IID")
df <- merge(df, df_PGI_p5e8, by="IID")
df # You might get a warning as some columns have the same name, but this does not affect us (ideally you would clean up this output)
We created the principal components for this practical, you can find more information how to create those yourself following this tutorial: https://choishingwan.github.io/PRS-Tutorial/plink/#accounting-for-population-stratification
Regression
We can now use ordinary least squares (OLS) to estimate a linear model, where we regress Height on sex and the PGI, with an intercept in the model. In R this can be done using the lm function (lm stands for linear model):
# Run linear regression of height on each PGI and sex
<- lm(height ~ PGI_p1e2_std + sex, data=df)
model_p1e2 <- lm(height ~ PGI_p1e3_std + sex, data=df)
model_p1e3 <- lm(height ~ PGI_p1e5_std + sex, data=df)
model_p1e5 <- lm(height ~ PGI_p5e8_std + sex, data=df)
model_p5e8
summary(model_p5e8)
Is the coefficient of PGI_p5e8_std statistically significant at the 1% level? How many centimeters increase in height is one standard deviation increase in PGI_p5e8 associated with?
R squared
The coefficient of the PGI is not a metric to evaluate the prediction performance that is normally used. A popular measure for continuous phenotypes is incremental-R2 : the increase in the coefficient of determination when the PGI is added to a regression with only control variables. This can be computed as below:
# Run linear regression of height on sex as the baseline model
<- lm(height ~ sex, data=df)
model_baseline
# Get incremental Rsq
summary(model_p1e2)$r.squared - summary(model_baseline)$r.squared
summary(model_p1e3)$r.squared - summary(model_baseline)$r.squared
summary(model_p1e5)$r.squared - summary(model_baseline)$r.squared
summary(model_p5e8)$r.squared - summary(model_baseline)$r.squared
What is the incremental-R2 of the P<0.01 PGI?
This incremental-R2 seems somewhat large. The reason, as you may have guessed, is that we have not controlled for the principal components. Even if your GWAS results are free from confounding due to population stratification, your PGI may still be correlated to the PCs (and thereby population structure) in your hold-out sample. Thus, not controlling for the lead PCs of the genetic data in your hold-out sample may cause the estimated coefficient for the PGI to be biased due to population stratification.
Principal components
Now let’s obtain the incremental-R2 of the PGI when they are added to a model that controls for the first 10 PCs.
Instead of typing the PCs one by one, we can do some coding magic:
# Add PCs to the models
<- paste0("PC", 1:10)
PC <- paste0(PC, collapse="+")
PC <- lm(paste0("height ~ sex +", PC), data=df)
model_baseline <- lm(paste0("height ~ PGI_p1e2_std + sex +",PC), data=df)
model_p1e2 <- lm(paste0("height ~ PGI_p1e3_std + sex +",PC), data=df)
model_p1e3 <- lm(paste0("height ~ PGI_p1e5_std + sex +",PC), data=df)
model_p1e5 <- lm(paste0("height ~ PGI_p5e8_std + sex +",PC), data=df)
model_p5e8
# Get incremental-Rsq of adding the PGI to a model that controls for sex and 10 PCs
<- summary(model_p1e2)$r.squared - summary(model_baseline)$r.squared
r2_p1e2 <-summary(model_p1e3)$r.squared - summary(model_baseline)$r.squared
r2_p1e3 <-summary(model_p1e5)$r.squared - summary(model_baseline)$r.squared
r2_p1e5 <-summary(model_p5e8)$r.squared - summary(model_baseline)$r.squared r2_p5e8
What is the new incremental-R2 of the P<0.01 PGI?
Other covariates that you will most likely need to include in your real analyses are the genotyping chips and/or batches. These can capture measurement variations and errors, and might also be correlated with specific phenotypes if participants were recruited at different times for different reasons.
Note that if you look at a binary outcome, you would need to use glm instead of lm and compute a Nagelkerke’s R.
BONUS: Getting SEs for Rsquared
It is also informative to report a 95% confidence interval for the incremental-R2. This is usually done by boostrapping. First, define a function that resamples the sample with replacement and outputs incremental-R2 from that sample:
# Bootstrapping function to get 95% CIs
# set seed number
set.seed(42)
<- function(s,PGI){
bs_func # draw indexes for resampling (with replacement)
= sample(1:nrow(df), nrow(df), replace=TRUE)
BOOT_IN
# compute incremental R2 given the drawn indexes.
summary(lm(paste0("height ~ sex +", PGI, "+", PC), df[BOOT_IN,]))$r.squared - summary(lm(paste0("height ~ sex + ", PC), df[BOOT_IN,]))$r.squared
}
# Run the bootstrapping function for each PGI
= 1000
rep = c()
pgi = c()
lci = c()
uci for (p in c("1e2", "1e3", "1e5", "5e8")) {
= sapply(1:rep, bs_func, paste0("PGI_p",p,"_std"))
BOOT = sort(BOOT)
BOOT = c(pgi,p)
pgi = c(lci,BOOT[25])
lci = c(uci,BOOT[975])
uci
}
#combine the results in a nice dataframe to make plotting easier
<- data.frame(pgi,lci, uci)
rci names(rci) <- c("pgi","lci","uci")
$estimate <- c(r2_p1e2,r2_p1e3,r2_p1e5,r2_p5e8) rci
Plotting
Let’s create a barplot to visualize the association between the different PGS and height.
# Create the barplot with error bars
library(ggplot2)
<- ggplot(rci, aes(x = pgi, y = estimate)) +
p geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
geom_errorbar(aes(ymin = lci, ymax = uci), width = 0.2, color = "black", size = 0.7) +
labs(x = "PGI", y = "R-squared", title = "R-squared with 95% CI") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Optional: Rotate x-axis labels
# Print the plot
p
What does the plot tell us about the genetic architecture of Height?
Which PGS p-value threshold would you use for your main analysis?
Resources
PGS practical with plink, ldpred, PRSice and lassosum: https://choishingwan.github.io/PRS-Tutorial/
Free statistical genetics texbook including PRS practical with PRSice and LDpred: https://mitpress.mit.edu/9780262357449/an-introduction-to-statistical-genetic-data-analysis/
Aysu and Dan’s PGI practical for 2023 Boulder workshop