########### 2023 International Statistical Genetics Workshop ########### 

# Learning Goals: This tutorial aims to make you familiar with genome-wide association analysis, the way in which PLINK2 works (and some of the useful options it provides!), and some tools to visualize your results.
# It is important that you explore the outputs and the log files and that you understand what they mean. Please ask if you have any questions, and check the plink file formats page.
# Please note, most of the options that you will be using work the same for logistic and linear regression. 

####### DAY 2. GWAS TUTORIAL ####### 

### Key information can be found here (links accessed on May 2021):
# https://www.cog-genomics.org/plink/2.0/
# https://www.cog-genomics.org/plink/2.0/assoc
# https://www.cog-genomics.org/plink/2.0/formats


###
# You will find the commands and their output in Day2_GWAS_Practical_CHEATSHEET.pdf.
# Feel free to take a look through it during the practical, but a good strategy is to first try the commands yourself or with your group, then troubleshoot things yourself if they don't work, then ask your group, then ask others & faculty, then check the cheatsheet. 
# The cheatsheet has all the expected answers and figures, but the process of doing the GWAS and exploring the output is what will help you learn.
###


### We invite you to work in groups to find the commands that are missing (indicated with XXX), run the analyses, and discuss the questions proposed.


###############################################################
###############################################################
#########  FIRST: Copy the files to your working directory & navigate to that directory:

cp -r /faculty/luke/2023/gwas2/ .
cd gwas2


###############################################################
###############################################################
#########  EXERCISE 1.  LOGISTIC REGRESSION (BINARY TRAIT).


#### 1.0. Go to the case-control folder and check the files you have there. The phenotype is Alzheimer's Disease (AD) status. 

cd casecontrol


#### 1.1. Run a logistic regression for the phenotype (AD status) including the principal components in file adpc.txt as covariates to correct for genetic ancestry.
# [hint: you can find the flags and modifiers that you can include in the logistic and linear regressions on Association analysis > Regression with covariates, https://www.cog-genomics.org/plink/2.0/assoc#glm in the PLINK 2.0 website]

plink2 --bfile adclean.cc --glm sex --covar adpc.txt --out 1.1_logistic.ad.cc --threads 4

# Check the log file. How many cases and controls were detected? How many covariates?


# Check the results (stored in 1.1_logistic.ad.cc.PHENO1.glm.logistic.hybrid) and be sure that you understand the content of each of the columns
# [hint: you can check File formats > .glm.logistic, https://www.cog-genomics.org/plink/2.0/formats#glm_logistic, on the PLINK2.0 website]. Remember, when PLINK 1.9 uses bfiles, the effect allele is A1, which is the minor allele, and may or may not be the ALT allele. So an OR > 1 means A1 is associated with an increased risk relative to A2.

head 1.1_logistic.ad.cc.PHENO1.glm.logistic.hybrid


# Did any SNP associations reach genome-wide significance? What is the grep command doing here?
grep ADD 1.1_logistic.ad.cc.PHENO1.glm.logistic.hybrid | awk '$13<5e-8'

# What if cases and controls had been coded as 1 and 0, respectively? What could have we done to make PLINK interpret this coding appropriately?
# [hint: you can check Standard data input > Phenotypes > Phenotype encoding, https://www.cog-genomics.org/plink/2.0/input#pheno, in the PLINK 2.0 website] 

# What if sex had not been coded in the fam file, or was in both your covariate and fam files? What could have we done to make PLINK interpret this coding appropriately?
# [hint: you can check --glm modifiers in https://www.cog-genomics.org/plink/2.0/assoc#glm, in the PLINK 2.0 website] 


#### 1.2. Run a logistic regression for the case-control variable AD including the principal components as covariates and hiding the results of the covariates. 

plink2 --bfile adclean.cc --glm sex hide-covar --covar adpc.txt --out 1.2_adclean.cc

# What's the difference between the sets of results generated in 1.1 and 1.2? 


## 1.3. Run a logistic regression for the case-control variable AD including the principal components as covariates, hiding the results of the covariates, and getting 95% Confidence intervals of odds ratios.
# [hint: you can check again https://www.cog-genomics.org/plink/2.0/assoc#glm, in the PLINK 2.0 website] 
plink2 --bfile adclean.cc --glm sex hide-covar --ci 0.95 --covar adpc.txt --out 1.3_adclean.cc


## 1.4. Run a logistic regression for the case-control variable AD including the principal components as covariates, hiding the results of the covariates, and getting the allele frequencies.
# [hint: you can check Main functions > Basic statistics > --freq or Allele frequency, https://www.cog-genomics.org/plink/1.9/basic_stats#freq, in the PLINK 1.9 website]. 
plink2 --bfile adclean.cc --glm sex hide-covar --covar adpc.txt --XXX --out 1.4_adclean.cc

# Explore the output files and note you have an extra one, 1.4_adclean.cc.afrq. [hint: you can check File formats > .afrq in the PLINK website]. In this one, we can see REF, ALT, and ALT_Freq. 


###############################################################
###############################################################
#########  EXERCISE 2. LINEAR REGRESSION (CONTINUOUS TRAIT)


## 2.0. Go to the continuous folder and check the files you have there. The phenotype is a transcript probe (gene expression). Pay attention to the file adclean.cont.txt. 

cd ../continuous/


## 2.1. Run a linear regression for the continuous trait including the genetic principal components as covariates, hiding the results of the covariates, and using the --pheno option. The advantage of using an extra file for phenotypes (and the --pheno option) is that if there were several phenotypes, it would be possible to run analyses on them at the same time (something not possible with ped or fam files). Note that when using the --pheno option, the original ped or fam files can still contain a phenotype in the phenotypc column, but it can be missing (NA/nan/-9). See Standard data input > Phenotypes, https://www.cog-genomics.org/plink/2.0/input#pheno.

plink2 --bfile adclean.cont --glm sex hide-covar --pheno adclean.cont.txt --covar adpc.txt --out 2.1_adclean.cont --threads 4


## 2.2. Run a linear regression for the continuous trait including only PC1 as covariate, hiding the results of the covariate, using the --pheno option.
# [hint: again, you can check the commands you can use to run logisic or linear regressions on data input > Covariates,  https://www.cog-genomics.org/plink/2.0/input#pheno. For options related to how parameters are handled in other models (e.g., estimating non-additive effects), you can seehttps://www.cog-genomics.org/plink/2.0/assoc

plink2 --bfile adclean.cont --glm hide-covar --pheno adclean.cont.txt --covar adpc.txt --XXX XXX --out 2.2_adclean.cont --threads 4


## 2.3. Run a linear regression for the continuous trait including the principal components as covariates, hiding the results of the covariates, using the --pheno option, and getting 95% confidence intervals for the beta.

plink2 --bfile adclean.cont --glm sex hide-covar --pheno adclean.cont.txt --covar adpc.txt --XXX XXX --out 2.1_adclean.cont --threads 4

# What is different about these results than 2.1 results? How might you use the added information?


###############################################################
###############################################################
#########  EXERCISE 3. PLOT OUT YOUR RESULTS
## 2.4.  Plot the results from 2.3.

# We will create three plots to explore the results. For that, we will need at least three columns: chromosome, base pair position, and p-value; having a SNP column (containing the rs number) will allow extra options in our plots. We’ll first create a file containing this information, excluding the markers with no results.
# What is "|" doing in the next line? What does the grep command do, and why would you want to use that?

awk '{print $1,$2,$3,$14}' 2.3_adclean.cont.GI_34147330-S.glm.linear | grep -v NA >  plot.adclean.cont.linear.txt

# What’s the SNP with the lowest p-value (or top SNP)?

sort -k4 -g plot.adclean.cont.linear.txt | head

# What is the "-g" doing in the line above? Hint: use "man sort" to check the documentation of commands.


## QQ and Manhattan plots.
# Open in RStudio (https://workshop.colorado.edu/rstudio/) the script Rscript_qqMan.R and plot the results. Set your working directory to the folder where you have your results (you can check first what your working directory is by typing pwd in the command line). Run the script. Explore the QQ plot and the Manhattan plot. What information do you gather from these plots and the lambda value? Do you detect any anomalies?


## Regional plot.
# Still in RStudio, go to the "Files" section (bottom right quadrant of the browser) and check the file plot.adclean.cont.linear.txt and the two jpg images you just created. Then go to More and then Export. The file will download to your local computer.
# You will need to unzip the downloaded folder to get to the files, including the two images & the text file of the results.

# You can explore association p-value results and LD patterns by uploading your results (or a genomic region of interest) to LDassoc in LDlink: https://ldlink.nci.nih.gov/?tab=ldassoc.
# Upload your results using Browser (preferred: Chrome, Safari, Firefox 36+ or Internet Explorer). In real scenarios, you may want to upload only the information of a region of interest that you want to plot or only one chromosome because the upload time will be slow with very large files.
# Select the columns that contain the information on Chromosome, Position, and P-Value.
# LDassoc has three options for visualising regions of association: by gene, by region or by variant. We will select visualising our results by using our lowest p-value variant as the index variant (rs6050598 or chr20:25397257).
# Select the CEU population (European, Utah Residents from North and West Europe) as the 1000 Genomes sub-population of interest. LDlink will calculate measures of linkage disequilibrium according to this population, which is the one that best matches the ancestry of our study population. 
# Leave the rest of the options as default and press Calculate.
# Explore the interactive plot. Is the index variant in high LD with any of the nearby variants? Remember the R2 is a measure of linkage disequilibrium or correlation of alleles for two genetic variants.


#########
### If you have gotten this far with time left, you can go back to your logistic regression results and plot them:
## 1.5. Plot the results from 1.4. You will need to use the commands you ran in section two to generate the intermediate files for RStudio and the interactive browser plots. Your challenge will be to adapt those commands to the right files in the right sequence, run the RStudio script again, and create the plots.