This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Last revision Both sides next revision | ||
homework_6_ancestry [2015/11/13 15:14] scott |
homework_6_ancestry [2015/11/15 18:17] scott /* Code for Homework */ |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | |||
+ | ==== Notes ==== | ||
+ | |||
+ | |||
+ | |||
+ | ==== Code for Homework ==== | ||
+ | |||
# PSYC 7102 -- Statistical Genetics | # PSYC 7102 -- Statistical Genetics | ||
- | # Homework #6: Ancestry | + | # Homework #6: Ancestry |
- | # Due: November 25th | + | # Due: November 25th |
- | + | ||
- | # PLEASE CONTACT DAVID TO GET YOUR NEW VCF. | + | |
- | + | ||
- | # Overview: Using your variant calls contained in your VCF files, | + | |
- | # combined with the existing 1000 Genomes variant calls from the 1000 | + | |
- | # Genomes project, you will conduct a principal components analysis | + | |
- | # of the 1000 Genomes samples using PLINK and estimate your position in | + | |
- | # the 1000G PCA space | + | |
- | + | ||
- | # This homework is a little bit different because the commands are | + | |
- | # slightly complicated, | + | |
- | # to specific conceptual questions, I ask you to run the commands | + | |
- | # below (changing directories appropriately as needed) and then | + | |
- | # explain to me what the command did and why we did it. I don't need | + | |
- | # super technical details but simply a solid conceptual overview of | + | |
- | # the purpose of the command. | + | |
- | + | ||
- | ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT | + | |
- | ### HAVE TO EXPLAIN EVERY COMMAND! | + | |
- | + | ||
- | # For many questions you'll want to run analyses by chromosome. To do | + | |
- | # | + | |
- | + | ||
- | qsub -I -l walltime=23: | + | |
- | + | ||
- | + | ||
- | + | ||
- | ############### | + | |
- | ### STEP 1 ### | + | |
- | ############### | + | |
- | # Here I take a minute to describe the various files you'll use to get started | + | |
- | + | ||
- | # | + | |
- | / | + | |
- | + | ||
- | + | ||
- | # | + | |
- | # | + | |
- | # | + | |
- | # | + | |
- | # | + | |
- | + | ||
- | (zcat chr1.filtered.PASS.beagled.vcf.gz | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | zgrep -v '#' | + | |
- | + | ||
- | ### Add in rsIDs from dbSNP. PLINK needs these to reconcile | + | |
- | ### positions/ | + | |
- | | + | |
- | ### The previous command keeps only variants with rsIDs, otherwise | + | |
- | ### plink throws an error that there are >1 variants with ID = " | + | |
- | ### | + | |
- | ### | + | |
- | zgrep -E '# | + | |
- | + | ||
- | ############## | + | |
- | ### STEP 2 ### | + | |
- | ############## | + | |
- | ### Whole bunch of data cleaning. Retain in the 1000 Genomes | + | |
- | ### | + | |
- | ### | + | |
- | + | ||
- | # | + | |
- | zcat chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | cut -f3 | grep ' | + | |
- | # | + | |
- | # | + | |
- | # | + | |
- | # | + | |
- | ### | + | |
- | for i in {1..22}; do | + | |
- | zgrep -E ' | + | |
- | | + | |
- | + | ||
- | ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common | + | |
- | ### | + | |
- | ### | + | |
- | ### | + | |
- | for i in {1..22}; do | + | |
- | plink --vcf chr$i.1000G.biallelic.vcf.gz \ | + | |
- | | + | |
- | --maf .05 \ | + | |
- | | + | |
- | --out chr$i.1000G.biallelic.common.plink & | + | |
- | | + | |
- | + | ||
- | ### Remove any remaining duplicates from the 1000G reference files | + | |
- | ### A) Identify duplicate sites | + | |
- | ### | + | |
- | ### | + | |
- | ### | + | |
- | for i in {1..22}; do | + | |
- | zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E ' | + | |
- | | + | |
- | + | ||
- | ### B) Remove the duplicate sites | + | |
- | for i in {1..22}; do | + | |
- | plink --bfile chr$i.1000G.biallelic.common.plink \ | + | |
- | | + | |
- | | + | |
- | --out chr$i.1000G.biallelic.common.plink.nodups & | + | |
- | | + | |
- | + | ||
- | rm chr*.1000G.biallelic.vcf.gz | + | |
- | rm chr*.1000G.biallelic.common.plink.bed | + | |
- | rm chr*.1000G.biallelic.common.plink.bim | + | |
- | rm chr*.1000G.biallelic.common.plink.fam | + | |
- | + | ||
- | + | ||
- | ## Identify variants in LD | + | |
- | for i in {1..22}; do | + | |
- | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | + | |
- | | + | |
- | --maf .05 \ | + | |
- | --out chr$i.snpsToExtract & | + | |
- | | + | |
- | + | ||
- | + | ||
- | ## Extract LD-independent variants | + | |
- | for i in {1..22}; do | + | |
- | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | + | |
- | | + | |
- | | + | |
- | --out chr$i.1000G.biallelic.common.plink.nodups.LDpruned & | + | |
- | | + | |
- | + | ||
- | rm chr*.1000G.biallelic.common.plink.nodups.bed | + | |
- | rm chr*.1000G.biallelic.common.plink.nodups.bim | + | |
- | rm chr*.1000G.biallelic.common.plink.nodups.fam | + | |
- | rm dups* | + | |
- | + | ||
- | + | ||
- | ############## | + | |
- | ### STEP 3 ### | + | |
- | ############## | + | |
- | ### Merge the new and improved 1000 Genomes plink files into a single | + | |
- | ### file with all chromosomes. This is easier to do when they' | + | |
- | ### files | + | |
- | # Example Command: | + | # PLEASE CONTACT DAVID TO GET YOUR NEW VCF. |
+ | |||
+ | # Overview: Using your variant calls contained in your VCF files, | ||
+ | # combined with the existing 1000 Genomes variant calls from the 1000 | ||
+ | # Genomes project, you will conduct a principal components analysis | ||
+ | # of the 1000 Genomes samples using PLINK and estimate your position in | ||
+ | # the 1000G PCA space | ||
+ | |||
+ | # This homework is a little bit different because the commands are | ||
+ | # slightly complicated, | ||
+ | # to specific conceptual questions, I ask you to run the commands | ||
+ | # below (changing directories appropriately as needed) and then | ||
+ | # explain to me what the command did and why we did it. I don't need | ||
+ | # super technical details but simply a solid conceptual overview of | ||
+ | # the purpose of the command. | ||
+ | |||
+ | ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT | ||
+ | ### HAVE TO EXPLAIN EVERY COMMAND! But please run all commands to produce | ||
+ | ### in the end a PCA plot containing yourself compared to all 1000 Genomes | ||
+ | ### samples. | ||
+ | |||
+ | # For many questions you'll want to run analyses by chromosome. To do | ||
+ | # | ||
+ | |||
+ | qsub -I -l walltime=23: | ||
+ | |||
+ | ### Load apigenome, plink | ||
+ | module load apigenome_0.0.2 | ||
+ | module load plink_latest | ||
+ | |||
+ | |||
+ | ############### | ||
+ | ### STEP 1 ### | ||
+ | ############### | ||
+ | # Here I take a minute to describe the various files you'll use to get started | ||
+ | |||
+ | # The 1000 Genomes vcfs from the 1000 Genomes Consortium | ||
+ | / | ||
+ | |||
+ | |||
+ | # A vcf file with, depending on your level of comfort, your own | ||
+ | # genotypes or 1kg sample HG000096. All examples that follow use | ||
+ | # HG00096 for convenience, | ||
+ | # your own sample just replace the $10 with $NF, assuming your | ||
+ | # sample is in the last column of the vcf. | ||
+ | |||
+ | (zcat chr1.filtered.PASS.beagled.vcf.gz | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | zgrep -v '#' | ||
+ | |||
+ | ### Add in rsIDs from dbSNP. PLINK needs these to reconcile | ||
+ | ### positions/ | ||
+ | vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db / | ||
+ | ### The previous command keeps only variants with rsIDs, otherwise | ||
+ | ### plink throws an error that there are >1 variants with ID = " | ||
+ | ### | ||
+ | ###------ QUESTION 1: WHAT DOES THIS COMMAND DO? (1 point) | ||
+ | zgrep -E '# | ||
+ | |||
+ | ############## | ||
+ | ### STEP 2 ### | ||
+ | ############## | ||
+ | ### Whole bunch of data cleaning. Retain in the 1000 Genomes | ||
+ | ### | ||
+ | ### | ||
+ | |||
+ | # Get list of your SNPs | ||
+ | zcat chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | cut -f3 | grep ' | ||
+ | # | ||
+ | # Reduce the 1000 Genomes VCFs to bi-allelic SNPs (that' | ||
+ | # can handle anyway). BE SURE TO LOG INTO A COMPUTE NODE BEFORE | ||
+ | # RUNNING ANY OF THESE LOOPS!!!!! | ||
+ | ### | ||
+ | ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points) | ||
+ | for i in {1..22}; do | ||
+ | zgrep -E ' | ||
+ | done | ||
+ | |||
+ | ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common | ||
+ | ### because we're going to conduct PCA on these SNPs and only want common ones. | ||
+ | ### | ||
+ | ###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT | ||
+ | ### | ||
+ | for i in {1..22}; do | ||
+ | plink --vcf chr$i.1000G.biallelic.vcf.gz \ | ||
+ | --extract MyrsIDs.txt \ | ||
+ | --maf .05 \ | ||
+ | --make-bed \ | ||
+ | --out chr$i.1000G.biallelic.common.plink & | ||
+ | done | ||
+ | |||
+ | ### Remove any remaining duplicates from the 1000G reference files | ||
+ | ### A) Identify duplicate sites | ||
+ | ### | ||
+ | ###------ QUESTION 4: WALK ME THROUGH WHAT THIS COMMAND (THE ENTIRE | ||
+ | ### | ||
+ | for i in {1..22}; do | ||
+ | zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E ' | ||
+ | done | ||
+ | |||
+ | ### B) Remove the duplicate sites | ||
+ | for i in {1..22}; do | ||
+ | plink --bfile chr$i.1000G.biallelic.common.plink \ | ||
+ | --exclude dups$i.txt \ | ||
+ | --make-bed \ | ||
+ | --out chr$i.1000G.biallelic.common.plink.nodups & | ||
+ | done | ||
+ | |||
+ | rm chr*.1000G.biallelic.vcf.gz | ||
+ | rm chr*.1000G.biallelic.common.plink.bed | ||
+ | rm chr*.1000G.biallelic.common.plink.bim | ||
+ | rm chr*.1000G.biallelic.common.plink.fam | ||
+ | |||
+ | |||
+ | ## Identify variants in LD | ||
+ | for i in {1..22}; do | ||
+ | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
+ | --indep-pairwise 100k 5 .01 \ #------ QUESTION 5: EXPLAIN WHAT THIS ARGUMENT IS DOING (1 point) | ||
+ | --maf .05 \ | ||
+ | --out chr$i.snpsToExtract & | ||
+ | done | ||
+ | |||
+ | |||
+ | ## Extract LD-independent variants | ||
+ | for i in {1..22}; do | ||
+ | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
+ | --extract chr$i.snpsToExtract.prune.in \ | ||
+ | --make-bed \ | ||
+ | --out chr$i.1000G.biallelic.common.plink.nodups.LDpruned & | ||
+ | done | ||
+ | |||
+ | rm chr*.1000G.biallelic.common.plink.nodups.bed | ||
+ | rm chr*.1000G.biallelic.common.plink.nodups.bim | ||
+ | rm chr*.1000G.biallelic.common.plink.nodups.fam | ||
+ | rm dups* | ||
+ | |||
+ | |||
+ | ############## | ||
+ | ### STEP 3 ### | ||
+ | ############## | ||
+ | ### Merge the new and improved 1000 Genomes plink files into a single | ||
+ | ### file with all chromosomes. This is easier to do when they' | ||
+ | ### files | ||
+ | ### Example Command: | ||
for i in {1..22}; do | for i in {1..22}; do | ||
plink --bfile chr$i.1000G.biallelic.common.plink.nodups.LDpruned --recode-vcf --out tmp$i & | plink --bfile chr$i.1000G.biallelic.common.plink.nodups.LDpruned --recode-vcf --out tmp$i & | ||
Line 172: | Line 185: | ||
###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE | ###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE | ||
- | ### | + | ### |
Line 215: | Line 228: | ||
### average for each site in the score file. | ### average for each site in the score file. | ||
### | ### | ||
- | ###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING | + | ###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING (2 points) |
plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | ||
plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | ||
Line 241: | Line 254: | ||
R | R | ||
### Then in this R session run: ' | ### Then in this R session run: ' | ||
- | ### If you log out and back in, you'll have to again run this command: | + | ### If you log out and back in, you'll have to again run this command |
export R_LIBS=/ | export R_LIBS=/ | ||
Line 247: | Line 260: | ||
###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS | ###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS | ||
### | ### | ||
- | ### | + | ### |
+ | ### | ||
echo " | echo " | ||
Line 257: | Line 271: | ||
### ' | ### ' | ||
- | kg_sf <- read.table(' | + | kg_sf <- read.table(' |
sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, | sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, |