########### 2021 International Statistical Genetics Workshop ########### 

# This tutorial aims to make you familiar with genome-wide association analysis, the way in which PLINK 1.9 works (and some of the useful options it provides!), and some tools to visualise your results.
# It is important that you explore the outputs and the log files and that you understand what they mean. Please ask if you have any questions.
# Please note, most of the options that you will be using work the same for logistic and linear regression. 

####### DAY 3. TUTORIAL – PART 2 ####### 

### Key information can be found here (links accessed on May 2021):
# https://www.cog-genomics.org/plink/1.9/assoc#linear
# https://www.cog-genomics.org/plink/1.9/formats
# https://www.cog-genomics.org/plink/1.9/index
###
# You will find a summary of the most important commands in day3_part2_cheatsheet.docx, feel free to check it.
###


### We invite you to work in groups to find the commands that are missing (indicated with XXX), run the analyses, and discuss the questions proposed.


# Copy the files to your working directory:

cp -r /faculty/lucia/2021/GWAS/* .


### EXERCISE 1.  LOGISTIC REGRESSION (BINARY TRAIT).


## 1.0. Go to the case-control folder and check the files you have there. The phenotype is Alzheimer's Disease (AD) status. 

cd casecontrol


## 1.1. Run a logistic regression for the phenotype (AD status) including the principal components in file adpc.txt as covariates to correct for genetic ancestry.
# [hint: you can find the flags and modifiers that you can include in the logistic and linear regressions on Association analysis > Regression with covariates, https://www.cog-genomics.org/plink/1.9/assoc#linear in the PLINK 1.9 website]

plink --bfile adclean.cc --logistic --covar adpc.txt --out 1.1_adclean.cc

# Check the log file. How many cases and controls were detected? How many covariates?

# Check the results (stored in 1.1_adclean.cc.assoc.logistic) and that you know the content of each of the columns
# [hint: you can check File formats > .assoc.logistic, https://www.cog-genomics.org/plink/1.9/formats#assoc_linear, on the PLINK 1.9 website]. Remember, when PLINK 1.9 uses bfiles, the effect allele is A1, which is the minor allele. So an OR > 1 means A1 is associated with an increased risk relative to A2.

# What if cases and controls had been coded as 1 and 0, respectively? What could have we done to make PLINK interpret this coding appropriately?
# [hint: you can check Standard data input > Phenotypes > Phenotype encoding, https://www.cog-genomics.org/plink/1.9/input#pheno, in the PLINK 1.9 website] 


## 1.2. Run a logistic regression for the case-control variable AD including the principal components as covariates and hiding the results of the covariates. 

plink --bfile adclean.cc --logistic hide-covar --covar adpc.txt --out 1.2_adclean.cc

# What's the difference between the sets of results generated in 1.1 and 1.2? 

# Where can we find the SNP-phenotype association after controlling for the covariates?
# [hint: that is a FAQ that was listed in the PLINK website, check: http://zzz.bwh.harvard.edu/plink/faq.shtml#faq11]


## 1.3. Run a logistic regression for the case-control variable AD including the principal components as covariates, hiding the results of the covariates, and getting regression coefficients (betas) instead of odds ratios.
# [hint: you can check again Association analysis > Regression with covariates,  https://www.cog-genomics.org/plink/1.9/assoc#linear], in the PLINK 1.9 website]

plink --bfile adclean.cc --logistic hide-covar XXX --covar adpc.txt --out 1.3_adclean.cc


## 1.4. Run a logistic regression for the case-control variable AD including the principal components as covariates, hiding the results of the covariates, and getting the allele frequencies.
# [hint: you can check Main functions > Basic statistics > --freq or Allele frequency, https://www.cog-genomics.org/plink/1.9/basic_stats#freq, in the PLINK 1.9 website]. 

plink --bfile adclean.cc --logistic hide-covar --covar adpc.txt --XXX --out 1.4_adclean.cc

# Explore the output files and note you have an extra one, 1.4_adclean.cc.frq. [hint: you can check File formats > .frq in the PLINK website]. In this one, we can see A1, A2, and MAF. We know that A1 is the effect allele or the tested allele. A1 is the minor allele by default when we work with a bfile format. PLINK gives you the MAF, which is the allele frequency for A1. For bi-allelic variants, the allele frequency for A2, should we need it, could be calculated by subtraction (1-MAF). 

# Note we could get the allele frequencies separately for cases and controls by adding the "case control" modifier to --freq; it’d look like: plink --bfile adclean.cc --logistic hide-covar beta --covar adpc.txt --freq case-control --out 1.5_adclean.cc [and you can find more information in Main functions > Basic statistics > --freq or Allele frequency, https://www.cog-genomics.org/plink/1.9/basic_stats#freq, in the PLINK 1.9 website].



### EXERCISE 2. LINEAR REGRESSION (CONTINUOUS TRAIT)


## 2.0. Go to the contiuous folder and check the files you have there. The phenotype is a transcript probe (gene expression). Pay attention to the file adclean.cont.txt. 

cd ../continuous/


## 2.1. Run a linear regression for the continuous trait including the genetic principal components as covariates, hiding the results of the covariates, and using the --pheno option. The advantage of using an extra file for phenotypes (and the --pheno option) is that if there were several phenotypes, it would be possible to run analyses on them at the same time (something not possible with ped or fam files). Note that when using the --pheno option, the original ped or fam files must still contain a phenotype Column 6 (see Standard data input > Phenotypes, https://www.cog-genomics.org/plink/1.9/input#pheno, in the PLINK 1.9 website). 

plink --bfile adclean.cont --linear hide-covar --pheno adclean.cont.txt --covar adpc.txt --out 2.1_adclean.cont


## 2.2. Run a linear regression for the continuous trait including only PC1 as covariate, hiding the results of the covariate, using the --pheno option, and obtaining a standardised beta (to mean zero, unit variance).
# [hint: again, you can check the commands you can use to run logisic or linear regressions on Association analysis > Regression with covariates,  https://www.cog-genomics.org/plink/1.9/assoc#linear in the PLINK 1.9 website; for options related to the covariates, you can see Standard data input > Covariates,https://www.cog-genomics.org/plink/1.9/input#covar]

plink --bfile adclean.cont --linear hide-covar standard-beta --pheno adclean.cont.txt --covar adpc.txt --XXX XXX --out 2.2_adclean.cont


## 2.3. Run a linear regression for the continuous trait including the principal components as covariates, hiding the results of the covariates, using the --pheno option, and getting 95% confidence intervals for the beta.

plink --bfile adclean.cont --linear hide-covar --pheno adclean.cont.txt --covar adpc.txt --ci 0.95 --out 2.3_adclean.cont


## 2.4.  Plot the results from 2.3.

# We will create three plots to explore the results. For that, we will need at least three columns: chromosome, base pair position, and p-value; having a SNP column (containing the rs number) will allow extra options in our plots. We’ll first create a file containing this information, excluding the markers with no results.

awk '{print $1,$2,$3,$12}' 2.3_adclean.cont.assoc.linear | grep -v NA >  plot.adclean.cont.linear.txt

# What’s the SNP with the lowest p-value (or top SNP)?

sort -k4 -g plot.adclean.cont.linear.txt | head

## QQ and Manhattan plots.
# Open in RStudio (https://workshop.colorado.edu/rstudio/) the script Rscript_qqMan.R and plot the results. Set your working directory to the folder where you have your results (you can check first what your working directory is by typing pwd in the command line). Run the script. Explore the QQ plot and the Manhattan plot. What information do you gather from these plots and the lambda value? Do you detect any anomalies?


## Regional plot.
# Still in RStudio, go to the "Files" section and check the file plot.adclean.cont.linear.txt. Then go to More and then Export. The file will download to your local computer.
# You can explore association p-value results and LD patterns by uploading your results (or a genomic region of interest) to LDassoc in LDlink: https://ldlink.nci.nih.gov/?tab=ldassoc.
# Upload your results using Browser (preferred: Chrome, Safari, Firefox 36+ or Internet Explorer). In real scenarios, you may want to upload only the information of a region of interest that you want to plot or only one chromosome.
# Select the columns that contain the information on Chromosome, Position, and P-Value.
# LDassoc has three options for visualising regions of association: by gene, by region or by variant. We will select visualising our results by using our lowest p-value variant as the index variant (rs6050598 or chr20:25397257).
# Select the CEU population (European, Utah Residents from North and West Europe) as the 1000 Genomes sub-population of interest. LDlink will calculate measures of linkage disequilibrium according to this population, which is the one that best matches the ancestry of our study population. 
# Leave the rest of the options as default and press Calculate.
# Explore the interactive plot. Is the index variant in high LD with any of the nearby variants? Remember the R2 is a measure of linkage disequilibrium or correlation of alleles for two genetic variants.


### If you have got here, you can go back to your logistic regression results and plot them:


## 1.5. Plot the results from 1.4. 

# We will create three plots to explore the results. For that, we will need at least three columns: chromosome, base pair position, and p-value; having a SNP column (containing the rs number) will allow extra options in our plots. We’ll first create a file containing this information, excluding the markers with no results.

cd ../casecontrol

awk '{print $1,$2,$3,$9}' 1.4_adclean.cc.assoc.logistic | grep -v NA >  plot.adclean.cc.logistic.txt

# What’s the SNP with the lowest p-value (or top SNP)?

sort -k4 -g plot.adclean.cc.logistic.txt | head

## QQ and Manhattan plots.
# Open in RStudio (https://workshop.colorado.edu/rstudio/) the script Rscript_qqMan.R and plot the results. Set your working directory to the folder where you have your results (you can check first what your working directory is by typing pwd in the command line). Run the script. Explore the QQ plot and the Manhattan plot. What information do you gather from these plots and the lambda value? Do you detect any anomalies?


## Regional plot.
# Still in RStudio, go to the "Files" section and check the file plot.adclean.cc.logistic.txt. Then go to More and then Export. The file will download to your local computer.
# You can explore association p-value results and LD patterns by uploading your results (or a genomic region of interest) to LDassoc in LDlink: https://ldlink.nci.nih.gov/?tab=ldassoc
# Upload your results using Browser (preferred: Chrome, Safari, Firefox 36+ or Internet Explorer). In real scenarios, you may want to upload only the information of a region of interest that you want to plot or only one chromosome.
# Select the columns that contain the information on Chromosome, Position, and P-Value.
# LDassoc has three options for visualising regions of association: by gene, by region or by variant. We will select visualising our results by using our lowest p-value variant as the index variant (rs4420638 or chr19:45422946).
# Select the CEU population (European, Utah Residents from North and West Europe) as the 1000 Genomes sub-population of interest. LDlink will calculate measures of linkage disequilibrium according to this population, which is the one that best matches the ancestry of our study population. 
# Leave the rest of the options as default and press Calculate.
# Explore the interactive plot. Is the index variant in high LD with any of the nearby variants? Remember the R2 is a measure of linkage disequilibrium or correlation of alleles for two genetic variants.


### If you have reached this part of the practical and have spare time, you can explore how the commands would be different if we were using PLINK 2. Did you realise that instead of using --logistic or --linear you can use --glm?