---
title: "Creating and using polygenic scores"
author: "Perline Demange"
format: 
  html:
    self-contrained: true 
    embed-resources: true
editor: visual
execute: 
  eval: false
---

## Set up

We will use bash and R. You can use both from RStudio, bash in the Terminal and R in the Console.

```{bash}
# Create a new folder in your account and copy the material 
mkdir Day4/PGS 
cp -r /home/perline/2024/PGSpractical Day4/PGS
cd Day4/PGS/PGSpractical # go to your copy of the folder
ls #look at what the folder contains and how it is structured
```

## Step 1: Obtain GWAS summary statistics from a large discovery sample

For this analysis we will be using the latest GWAS summary statistics for body height from Yengo et al. (2022), downloaded from [GIANT consortium results](https://www.joelhirschhornlab.org/giant-consortium-results).

The GWAS summary statistics for predominantly East Asian ancestry sub-sample can be found among the files you copied into your working directory. To make it manageable, we already reduce the file to only SNPs associated with height with p \<1e-2.

```{bash}
# Decompress the summary statistics
gunzip input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2.gz

# Count the number of lines in the file
wc -l input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2

# Look at the first 10 lines
head input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2
```

::: callout-tip
## Question 1

What is the number of variants in this sumstats file?
:::

::: {.callout-note collapse="true" icon="false"}
## Hint

You can use wc -l to count the number of lines in a file. There is also a header in the file, so the answer is number of lines minus one)
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

87904 SNPs.
:::

::: callout-important
## Type of effect

When the phenotype is binary, the GWAS effect size might be given as odd ratio (OR), rather than BETA (for continuous traits). You might want to transform this effect to log(OR) so that the PRS can be computed using summation. See 3.3 PRS Unit in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7612115/
:::

## Step 2: Obtain independent sample that has genome-wide data

We will use some made-up data, based on the individuals with East-Asian ancestry present in the 1000Genome phase 3 and some simulated phenotype. You need a sample with enough genotyped participants (with QC'd data) and a phenotype of interest.

The genotype data is in the plink files 1kg_ph3_EAS. The phenotype data is in the file HEIGHT.pheno.

## Step 3: Select SNPs in common between the two samples

We want to maximize the number of SNPs potentially used for our PGS. For this, we select SNPs present in both the base sample and the target sample, before clumping.

Because we use a toy target sample that is based on the reference sample, we are skipping this step as this will be done while clumping.

## Step 4: Select independent SNPs by pruning based on LD

Clumping is a greedy algorithm that reduces the full set of SNPs to a set of ‘clumps’, where each clump is represented by one index SNP, the SNP that has the strongest association with the outcome among all SNPs in that clump.

### Choosing a reference genome to infer the LD structure

The correlation between the SNPs is calculated using a reference genotype dataset in plink bed/bim/fam format. These genotypes should be representative of the GWAS sample in terms of genetic ancestry. Therefore, we use the East-Asian ancestry samples from the 1000 Genomes Phase 3 data. The files (which you copied to your working directory) were downloaded from [https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg](#0), and selection of EA ancestry and reformating to plink format was already done (you can find the code in the scripts folder).

### Clumping

The plink clumping approach takes the GWAS results as starting point. Clumps (i.e. groups of SNPs in LD) are formed around central ‘index SNPs’. Index SNPs are chosen greedily starting with the lowest P-value. SNPs that meet the --clump-p1 threshold, but have already been assigned to another clump, do not form their own clump unless --clump-best is specified. By default, SNPs that are less than 250k base pairs away from an index SNP, that have and r^2^ larger than 0.5 with that index SNP, and have an association P-value \< 0.0001 are assigned to that index variant's clump (unless they have already been assigned to another clump). These arguments can be changed, how strict you should be with these vlaues will depend on why you need the PGS. The function outputs a list of the index SNPs.

```{bash}
#Info on the arguments: 
# --bfile: the reference files in plink format
# --clump: the sumstats containing the SNPs to clump 
# --out: name you want to give to the output 
# --clump-snp-field: name of the SNP ID column in the sumstats
# --clump-field: name of pvalue column in the sumstats 
# --clump-p1: p-value threshold for a SNP to be included as index
# --clump-kb: will clump any SNPs that are within X kb to both ends of the index SNP, that is a 2xX kb window
# --clump-r2: removes SNPs with a r2 higher than X with the index SNPs. The r2 measure is simply calculated as the square of the correlation coefficient of the alleles between two SNPs, and it is therefore constrained to take values between 0 and 1.

#careful the bfiles are in a shared folder to make it manageable  

plink --bfile /usr/local/data/1kG/1kG_ph3_EAS \
    --clump input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 \
    --out output/HEIGHT_EAS_p1e-2_rsq60 \
    --clump-snp-field RSID \
    --clump-field P \
    --clump-p1 1e-2 \
    --clump-kb 1000 \
    --clump-r2 0.6 
    
head output/HEIGHT_EAS_p1e-2_rsq60.clumped 

# Extract the list of index SNPs (independently associated SNPs)
# the column with the SNP IDs is column 3 
awk 'NR>1{print $3}' output/HEIGHT_EAS_p1e-2_rsq60.clumped >  output/HEIGHT_EAS_p1e-2_rsq60.snps

```

::: callout-tip
## Question

How many clumps were made?
:::

::: {.callout-note collapse="true" icon="false"}
## Hint

Have a look at the log file.
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

22412 clumps formed from 84101 top variants.
:::

## Step 5: Construct the scores

Now we can make the PGIs. We will make the PGIs for the same sample we used to estimate LD (1000 Genomes EAS ancestry). Of course, in reality, this does not have to be the case, i.e. the reference panel is usually not the same as the target sample (But it is with the same ancestry!).

We will make several scores for each participant, to investigate the effect of the pvalue threshold chosen on how predicitve the PGI is.

```{bash}
# Make the PGI for P<1e-2 #take all the snps in the .snps file, which we have earlier selected to be with P<1e-2
# Have a look at plink documentation if you are confused about the functions and arguments e.g. https://www.cog-genomics.org/plink/2.0/score
# Info on the functions: 
# --extract: only keeps the SNPs listed in this file
# --score: 2nd column is SNPID, 5th is the effect allele, 8th is the effect size, 
# list-variants tells it to save the list of SNPs used to create the PGS
plink2 --bfile /usr/local/data/1kG/1kG_ph3_EAS \
    --extract output/HEIGHT_EAS_p1e-2_rsq60.snps \
    --score input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 2 5 8 list-variants\
    --out output/PGI_height_1kG_ph3_EAS.p1e-2 

#We actually want to adapt the p-value threshold and test different thresholds.
# Create range file
echo -e "p5e-8 0.00 5e-8\np1e-5 0.00 1e-5\np1e-3 0.00 1e-3" > output/range.txt

# Make PGIs for  P<1e-3, P<1e-5, P<5e-8
# --q-score-range: 1st arg defines the rnage of pvalues wanted, 2nd arg takes the file where the pvalues are defined, here 2nd column is SNPID and 10th column is the pvalue 
plink2 --bfile /usr/local/data/1kG/1kG_ph3_EAS \
    --extract output/HEIGHT_EAS_p1e-2_rsq60.snps \
    --score input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 2 5 8 list-variants\
    --out output/PGI_height_1kG_ph3_EAS \
    --q-score-range output/range.txt input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2 2 10 'header' 
    

```

::: callout-tip
## Question

How many files did you just created? What are they?
:::

::: {.callout-note collapse="true"}
## Solution

You should now have in your output folder 4 files ending in .sscore, which contain the PGS for each individual, created with the pvalue threshold writen in the file name. You also get a .log file and a .sccore.vars file which contains the IDs of the SNPs used.
:::

::: callout-tip
## Bonus question

How many SNPs were used to make each PGS?
:::

::: {.callout-note collapse="true"}
## Solution

Hum not that easy... The log files tell you only one number "22378 variants processed", but does not tell you how many variants were actually used for each p-threshold. We have to check it by ourselves.\
One way of doing it is to get the list of SNPs used (which should amount to 22378), and check how many are in each range of pvalues. This is seriously a pain to do in unix, so lets do it in R instead:

``` r
setwd("2024/PGSpractical")
library(data.table)
#BONUS get number of SNPs#####
snps_used <- fread("output/PGI_height_1kG_ph3_EAS.sscore.vars", header=F)
gwas <- fread("input/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EAS_1e-2")
snps_used_pval <- merge(snps_used, gwas, by.x= "V1", by.y="RSID")
head(snps_used_pval)
# Number of SNPs with p<0.01
nrow(snps_used_pval[snps_used_pval$P<0.01,])
nrow(snps_used_pval[snps_used_pval$P<0.001,])
nrow(snps_used_pval[snps_used_pval$P<1e-5,])
nrow(snps_used_pval[snps_used_pval$P<5e-8,])
```

So the results are:

P\<1e-2: 22378

P\<1e-3: 9762

P\<1e-5: 3530

P\<5e-8: 1724
:::

## Step 6: Building a prediction model

Now we will use these PGIs to run some prediction analyses in R. In this example we will test if this PGI of height is indeed predictive of height in our independent sample.

Let's move to R! We first read in the files containing the PGIs we just created.

```{r}
setwd("Day4/PGS/PGSpractical")
library(data.table)

# Read the PGS files 

df_PGI_p1e2 <- fread("output/PGI_height_1kG_ph3_EAS.p1e-2.sscore", header=T)
df_PGI_p1e3 <- fread("output/PGI_height_1kG_ph3_EAS.p1e-3.sscore", header=T)
df_PGI_p1e5 <- fread("output/PGI_height_1kG_ph3_EAS.p1e-5.sscore", header=T)
df_PGI_p5e8 <- fread("output/PGI_height_1kG_ph3_EAS.p5e-8.sscore", header=T)

# How do these files look like? 
head(df_PGI_p1e2)

# Make histograms
hist(df_PGI_p1e2$SCORE1_AVG, breaks=30)
hist(df_PGI_p1e3$SCORE1_AVG, breaks=30)
hist(df_PGI_p1e5$SCORE1_AVG, breaks=30)
hist(df_PGI_p5e8$SCORE1_AVG, breaks=30)
```

::: callout-tip
## Questions

For how many participants did we just create PGS? What does the distribution of PGS look like?
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

For 584 participants.

The distribution looks normal (bell-shaped)
:::

### Standardize the PGS

It is common to standardize the PGI to have mean zero and standard deviation one in order to interpret effect estimates in terms of “a one standard-deviation increase in PGI is expected to increase Y by β”. Standardize all PGIs using the scale function:

```{r}
# Standardize the PGS
df_PGI_p1e2$PGI_p1e2_std <- scale(df_PGI_p1e2$SCORE1_AVG)
df_PGI_p1e3$PGI_p1e3_std <- scale(df_PGI_p1e3$SCORE1_AVG)
df_PGI_p1e5$PGI_p1e5_std <- scale(df_PGI_p1e5$SCORE1_AVG)
df_PGI_p5e8$PGI_p5e8_std <- scale(df_PGI_p5e8$SCORE1_AVG)
```

### Join with phenotype file

We have prepared a simulated phenotype for this tutorial: **HEIGHT.pheno.** Read the file into R. The file also contains sex and 10 principal components. We want to merge this phenotype data frame (let's assume it's called HEIGHT) with the PGI data frames.

```{r}
# Read in phenotype data
HEIGHT <- read.table("input/HEIGHT.pheno", header=T)
# notice that the phenotype data contains less participants than the genetic data 

# Merge all data frames
df <- merge(HEIGHT, df_PGI_p1e2, by="IID")
df <- merge(df, df_PGI_p1e3, by="IID")
df <- merge(df, df_PGI_p1e5, by="IID")
df <- merge(df, df_PGI_p5e8, by="IID")
# You might get a warning as some columns have the same name, but this does not affect us (ideally you would clean up this output)

```

::: callout-note
## Create PCs

We created the principal components for this practical, you can find more information how to create those yourself following this tutorial: <https://choishingwan.github.io/PRS-Tutorial/plink/#accounting-for-population-stratification>
:::

### Regression

We can now use ordinary least squares (OLS) to estimate a linear model, where we regress Height on sex and the PGI, with an intercept in the model. In R this can be done using the lm function (lm stands for linear model):

```{r}
# Run linear regression of height on each PGI and sex
model_p1e2 <- lm(height ~ PGI_p1e2_std + sex, data=df)
model_p1e3 <- lm(height ~ PGI_p1e3_std + sex, data=df)
model_p1e5 <- lm(height ~ PGI_p1e5_std + sex, data=df)
model_p5e8 <- lm(height ~ PGI_p5e8_std + sex, data=df)

summary(model_p5e8)
```

::: callout-tip
## Question

Is the coefficient of PGI_p5e8_std statistically significant at the 1% level? How many centimeters increase in height is one standard deviation increase in PGI_p5e8 associated with?
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

Yes, the PGI is associated to height with a pvalue of 1.21e-12. One SD increase in PGI is associated with a 2.24 cm increase in height.
:::

### R squared

The coefficient of the PGI is not a metric to evaluate the prediction performance that is normally used. A popular measure for continuous phenotypes is incremental-*R*^2^ : the increase in the coefficient of determination when the PGI is added to a regression with only control variables. This can be computed as below:

```{r}
# Run linear regression of height on sex as the baseline model
model_baseline <- lm(height ~ sex, data=df)

# Get incremental Rsq
summary(model_p1e2)$r.squared - summary(model_baseline)$r.squared
summary(model_p1e3)$r.squared - summary(model_baseline)$r.squared
summary(model_p1e5)$r.squared - summary(model_baseline)$r.squared
summary(model_p5e8)$r.squared - summary(model_baseline)$r.squared
```

::: callout-tip
## Question

What is the incremental-*R*^2^ of the P\<0.01 PGI?
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

The incremental-*R*^2^ of the P\<0.01 PGI is 0.196.
:::

This incremental-*R*^2^ seems somewhat large. The reason, as you may have guessed, is that we have not controlled for the principal components. Even if your GWAS results are free from confounding due to population stratification, your PGI may still be correlated to the PCs (and thereby population structure) in your hold-out sample. Thus, not controlling for the lead PCs of the genetic data in your hold-out sample may cause the estimated coefficient for the PGI to be biased due to population stratification.

## Principal components

Now let's obtain the incremental-*R*^2^ of the PGI when they are added to a model that controls for the first 10 PCs.

Instead of typing the PCs one by one, we can do some coding magic:

```{r}
# Add PCs to the models
PC <- paste0("PC", 1:10)
PC <- paste0(PC, collapse="+")
model_baseline <- lm(paste0("height ~ sex +", PC), data=df)
model_p1e2 <- lm(paste0("height ~ PGI_p1e2_std + sex +",PC), data=df)
model_p1e3 <- lm(paste0("height ~ PGI_p1e3_std + sex +",PC), data=df)
model_p1e5 <- lm(paste0("height ~ PGI_p1e5_std + sex +",PC), data=df)
model_p5e8 <- lm(paste0("height ~ PGI_p5e8_std + sex +",PC), data=df)

# Get incremental-Rsq of adding the PGI to a model that controls for sex and 10 PCs
r2_p1e2 <- summary(model_p1e2)$r.squared - summary(model_baseline)$r.squared
r2_p1e3 <-summary(model_p1e3)$r.squared - summary(model_baseline)$r.squared
r2_p1e5 <-summary(model_p1e5)$r.squared - summary(model_baseline)$r.squared
r2_p5e8 <-summary(model_p5e8)$r.squared - summary(model_baseline)$r.squared
```

::: callout-tip
## Question

What is the new incremental-*R*^2^ of the P\<0.01 PGI?
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

The new incremental-*R*^2^ of the P\<0.01 PGI is 0.111.
:::

::: callout-important
## Other typical covariates: Genotyping chips and/or batches

Other covariates that you will most likely need to include in your real analyses are the genotyping chips and/or batches. These can capture measurement variations and errors, and might also be correlated with specific phenotypes if participants were recruited at different times for different reasons.
:::

::: callout-important
## Binary outcome

Note that if you look at a binary outcome, you would need to use glm instead of lm and compute a Nagelkerke's R.
:::

### BONUS: Getting SEs for Rsquared

It is also informative to report a 95% confidence interval for the incremental-*R*^2^. This is usually done by boostrapping. First, define a function that resamples the sample with replacement and outputs incremental-*R*^2^ from that sample:

```{r}
# Bootstrapping function to get 95% CIs
# set seed number
set.seed(42)

bs_func <- function(s,PGI){
    # draw indexes for resampling (with replacement)
    BOOT_IN = sample(1:nrow(df), nrow(df), replace=TRUE)

    # compute incremental R2 given the drawn indexes.
    summary(lm(paste0("height ~ sex +", PGI, "+", PC), df[BOOT_IN,]))$r.squared -  summary(lm(paste0("height ~ sex + ", PC), df[BOOT_IN,]))$r.squared  
}

# Run the bootstrapping function for each PGI
rep = 1000
pgi = c()
lci = c()
uci = c()
for (p in c("1e2", "1e3", "1e5", "5e8")) {
    BOOT = sapply(1:rep, bs_func, paste0("PGI_p",p,"_std"))
    BOOT = sort(BOOT)
    pgi = c(pgi,p)
    lci = c(lci,BOOT[25])
    uci = c(uci,BOOT[975])
}

#combine the results in a nice dataframe to make plotting easier 
rci <-  data.frame(pgi,lci, uci)
names(rci) <- c("pgi","lci","uci")
rci$estimate <- c(r2_p1e2,r2_p1e3,r2_p1e5,r2_p5e8)
```

### Plotting

Let's create a barplot to visualize the association between the different PGS and height.

```{r}
# Create the barplot with error bars
library(ggplot2)
p <- ggplot(rci, aes(x = pgi, y = estimate)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
  geom_errorbar(aes(ymin = lci, ymax = uci), width = 0.2, color = "black", size = 0.7) +
  labs(x = "PGI", y = "R-squared", title = "R-squared with 95% CI") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Optional: Rotate x-axis labels

# Print the plot
p
```

::: callout-tip
## Question

What does the plot tell us about the genetic architecture of Height?

Which PGS p-value threshold would you use for your main analysis?
:::

::: {.callout-warning collapse="true" icon="false"}
## Solution

Much of the genetic architecture of height appears to be driven by variants with small effect sizes. GWAS significant SNPs (5e-8) are only weakly predictive of height alone, but prediction is improved by including lots of other SNPs, including those with smaller effects (e.g., p values .001 to .5)

If you are trying to have a PGI that is the most predictive of height, then using the one created with a pvalue threshold of 1e-2 seems to be the most advantageous. In our study, it is good to report that you have tested the association with different PGIs and the reason why you pick this one for your main analysis.
:::

## Resources

PGS practical with plink, ldpred, PRSice and lassosum: <https://choishingwan.github.io/PRS-Tutorial/>

Free statistical genetics texbook including PRS practical with PRSice and LDpred: [https://mitpress.mit.edu/9780262357449/an-introduction-to-statistical-genetic-data-analysis/](https://mitpress-mit-edu.ezproxy.uio.no/9780262357449/an-introduction-to-statistical-genetic-data-analysis/)

Aysu and Dan's PGI practical for 2023 Boulder workshop

##