homework_6_ancestry
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| homework_6_ancestry [2015/11/13 21:00] – Created page with " <syntaxhighlight lang="bash"> # PSYC 7102 -- Statistical Genetics # Homework #6: Ancestry # Due: November 25th # PLEASE CONTACT DAVID TO GET YOUR NEW VCF. # Overview..." scott | homework_6_ancestry [2015/11/16 17:18] (current) – scott | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | |||
| - | # PSYC 7102 -- Statistical Genetics | ||
| - | # Homework #6: Ancestry | ||
| - | # Due: November 25th | ||
| - | |||
| - | # PLEASE CONTACT DAVID TO GET YOUR NEW VCF. | ||
| - | |||
| - | # Overview: Using your variant calls contained in your VCF files, | ||
| - | # combined with the existing 1000 Genomes variant calls from the 1000 | ||
| - | # Genomes project, you will conduct a principal components analysis | ||
| - | # of the 1000 Genomes samples using PLINK and estimate your position in | ||
| - | # the 1000G PCA space | ||
| - | |||
| - | # This homework is a little bit different because the commands are | ||
| - | # slightly complicated, | ||
| - | # to specific conceptual questions, I ask you to run the commands | ||
| - | # below (changing directories appropriately as needed) and then | ||
| - | # explain to me what the command did and why we did it. I don't need | ||
| - | # super technical details but simply a solid conceptual overview of | ||
| - | # the purpose of the command. | ||
| - | |||
| - | ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT | ||
| - | ### HAVE TO EXPLAIN EVERY COMMAND! | ||
| - | |||
| - | # For many questions you'll want to run analyses by chromosome. To do | ||
| - | # | ||
| - | |||
| - | qsub -I -l walltime=23: | ||
| - | |||
| - | |||
| - | |||
| - | ############### | ||
| - | ### STEP 1 ### | ||
| - | ############### | ||
| - | # Here I take a minute to describe the various files you'll use to get started | ||
| - | |||
| - | # | ||
| - | / | ||
| - | |||
| - | |||
| - | # | ||
| - | # | ||
| - | # | ||
| - | # | ||
| - | # | ||
| - | |||
| - | (zcat chr1.filtered.PASS.beagled.vcf.gz | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | zgrep -v '#' | ||
| - | |||
| - | ### Add in rsIDs from dbSNP. PLINK needs these to reconcile | ||
| - | ### positions/ | ||
| - | | ||
| - | ### The previous command keeps only variants with rsIDs, otherwise | ||
| - | ### plink throws an error that there are >1 variants with ID = " | ||
| - | ### | ||
| - | ### | ||
| - | zgrep -E '# | ||
| - | |||
| - | ############## | ||
| - | ### STEP 2 ### | ||
| - | ############## | ||
| - | ### Whole bunch of data cleaning. Retain in the 1000 Genomes | ||
| - | ### | ||
| - | ### | ||
| - | |||
| - | # | ||
| - | zcat chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | cut -f3 | grep ' | ||
| - | # | ||
| - | # | ||
| - | # | ||
| - | # | ||
| - | ### | ||
| - | for i in {1..22}; do | ||
| - | zgrep -E ' | ||
| - | done | ||
| - | |||
| - | ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common | ||
| - | ### | ||
| - | ### | ||
| - | ### | ||
| - | for i in {1..22}; do | ||
| - | plink --vcf chr$i.1000G.biallelic.vcf.gz \ | ||
| - | | ||
| - | --maf .05 \ | ||
| - | | ||
| - | --out chr$i.1000G.biallelic.common.plink & | ||
| - | done | ||
| - | |||
| - | ### Remove any remaining duplicates from the 1000G reference files | ||
| - | ### A) Identify duplicate sites | ||
| - | ### | ||
| - | ### | ||
| - | ### | ||
| - | for i in {1..22}; do | ||
| - | zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E ' | ||
| - | done | ||
| - | |||
| - | ### B) Remove the duplicate sites | ||
| - | for i in {1..22}; do | ||
| - | plink --bfile chr$i.1000G.biallelic.common.plink \ | ||
| - | | ||
| - | | ||
| - | --out chr$i.1000G.biallelic.common.plink.nodups & | ||
| - | done | ||
| - | |||
| - | rm chr*.1000G.biallelic.vcf.gz | ||
| - | rm chr*.1000G.biallelic.common.plink.bed | ||
| - | rm chr*.1000G.biallelic.common.plink.bim | ||
| - | rm chr*.1000G.biallelic.common.plink.fam | ||
| - | |||
| - | |||
| - | ## Identify variants in LD | ||
| - | for i in {1..22}; do | ||
| - | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
| - | | ||
| - | --maf .05 \ | ||
| - | --out chr$i.snpsToExtract & | ||
| - | done | ||
| - | |||
| - | |||
| - | ## Extract LD-independent variants | ||
| - | for i in {1..22}; do | ||
| - | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
| - | | ||
| - | | ||
| - | --out chr$i.1000G.biallelic.common.plink.nodups.LDpruned & | ||
| - | done | ||
| - | |||
| - | rm chr*.1000G.biallelic.common.plink.nodups.bed | ||
| - | rm chr*.1000G.biallelic.common.plink.nodups.bim | ||
| - | rm chr*.1000G.biallelic.common.plink.nodups.fam | ||
| - | rm dups* | ||
| - | |||
| - | |||
| - | ############## | ||
| - | ### STEP 3 ### | ||
| - | ############## | ||
| - | ### Merge the new and improved 1000 Genomes plink files into a single | ||
| - | ### file with all chromosomes. This is easier to do when they' | ||
| - | ### files | ||
| - | # Example Command: | + | ==== Notes ==== |
| + | |||
| + | |||
| + | |||
| + | ==== Code for Homework ==== | ||
| + | |||
| + | |||
| + | # PSYC 7102 -- Statistical Genetics | ||
| + | # Homework #6: Ancestry | ||
| + | # Due: November 25th | ||
| + | |||
| + | # PLEASE CONTACT DAVID TO GET YOUR NEW VCF. | ||
| + | |||
| + | # Overview: Using your variant calls contained in your VCF files, | ||
| + | # combined with the existing 1000 Genomes variant calls from the 1000 | ||
| + | # Genomes project, you will conduct a principal components analysis | ||
| + | # of the 1000 Genomes samples using PLINK and estimate your position in | ||
| + | # the 1000G PCA space | ||
| + | |||
| + | # This homework is a little bit different because the commands are | ||
| + | # slightly complicated, | ||
| + | # to specific conceptual questions, I ask you to run the commands | ||
| + | # below (changing directories appropriately as needed) and then | ||
| + | # explain to me what the command did and why we did it. I don't need | ||
| + | # super technical details but simply a solid conceptual overview of | ||
| + | # the purpose of the command. | ||
| + | |||
| + | ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT | ||
| + | ### HAVE TO EXPLAIN EVERY COMMAND! But please run all commands to produce | ||
| + | ### in the end a PCA plot containing yourself compared to all 1000 Genomes | ||
| + | ### samples. | ||
| + | |||
| + | # For many questions you'll want to run analyses by chromosome. To do | ||
| + | # | ||
| + | |||
| + | qsub -I -l walltime=23: | ||
| + | |||
| + | ### Load apigenome, plink | ||
| + | module load apigenome_0.0.2 | ||
| + | module load plink_latest | ||
| + | module load tabix_0.2.6 | ||
| + | |||
| + | |||
| + | ############### | ||
| + | ### STEP 1 ### | ||
| + | ############### | ||
| + | # Here I take a minute to describe the various files you'll use to get started | ||
| + | |||
| + | # The 1000 Genomes vcfs from the 1000 Genomes Consortium | ||
| + | / | ||
| + | |||
| + | |||
| + | # A vcf file with, depending on your level of comfort, your own | ||
| + | # genotypes or 1kg sample HG000096. All examples that follow use | ||
| + | # HG00096 for convenience, | ||
| + | # your own sample just replace the $10 with $NF, assuming your | ||
| + | # sample is in the last column of the vcf. | ||
| + | |||
| + | (zcat chr1.filtered.PASS.beagled.vcf.gz | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | zgrep -v '#' | ||
| + | |||
| + | ### Add in rsIDs from dbSNP. PLINK needs these to reconcile | ||
| + | ### positions/ | ||
| + | vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db / | ||
| + | ### The previous command keeps only variants with rsIDs, otherwise | ||
| + | ### plink throws an error that there are >1 variants with ID = " | ||
| + | ### | ||
| + | ###------ QUESTION 1: WHAT DOES THIS COMMAND DO? (1 point) | ||
| + | zgrep -E '# | ||
| + | |||
| + | ############## | ||
| + | ### STEP 2 ### | ||
| + | ############## | ||
| + | ### Whole bunch of data cleaning. Retain in the 1000 Genomes | ||
| + | ### | ||
| + | ### | ||
| + | |||
| + | # Get list of your SNPs | ||
| + | zcat chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | cut -f3 | grep ' | ||
| + | # | ||
| + | # Reduce the 1000 Genomes VCFs to bi-allelic SNPs (that' | ||
| + | # can handle anyway). BE SURE TO LOG INTO A COMPUTE NODE BEFORE | ||
| + | # RUNNING ANY OF THESE LOOPS!!!!! | ||
| + | ### | ||
| + | ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points) | ||
| + | for i in {1..22}; do | ||
| + | zgrep -E ' | ||
| + | done | ||
| + | |||
| + | ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common | ||
| + | ### because we're going to conduct PCA on these SNPs and only want common ones. | ||
| + | ### | ||
| + | ###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT | ||
| + | ### | ||
| + | for i in {1..22}; do | ||
| + | plink --vcf chr$i.1000G.biallelic.vcf.gz \ | ||
| + | --extract MyrsIDs.txt \ | ||
| + | --maf .05 \ | ||
| + | --make-bed \ | ||
| + | --out chr$i.1000G.biallelic.common.plink & | ||
| + | done | ||
| + | |||
| + | ### Remove any remaining duplicates from the 1000G reference files | ||
| + | ### A) Identify duplicate sites | ||
| + | ### | ||
| + | ###------ QUESTION 4: WALK ME THROUGH WHAT THIS COMMAND (THE ENTIRE | ||
| + | ### | ||
| + | for i in {1..22}; do | ||
| + | zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E ' | ||
| + | done | ||
| + | |||
| + | ### B) Remove the duplicate sites | ||
| + | for i in {1..22}; do | ||
| + | plink --bfile chr$i.1000G.biallelic.common.plink \ | ||
| + | --exclude dups$i.txt \ | ||
| + | --make-bed \ | ||
| + | --out chr$i.1000G.biallelic.common.plink.nodups & | ||
| + | done | ||
| + | |||
| + | rm chr*.1000G.biallelic.vcf.gz | ||
| + | rm chr*.1000G.biallelic.common.plink.bed | ||
| + | rm chr*.1000G.biallelic.common.plink.bim | ||
| + | rm chr*.1000G.biallelic.common.plink.fam | ||
| + | |||
| + | |||
| + | ## Identify variants in LD | ||
| + | for i in {1..22}; do | ||
| + | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
| + | --indep-pairwise 100k 5 .01 \ #------ QUESTION 5: EXPLAIN WHAT THIS ARGUMENT IS DOING (1 point) | ||
| + | --maf .05 \ | ||
| + | --out chr$i.snpsToExtract & | ||
| + | done | ||
| + | |||
| + | |||
| + | ## Extract LD-independent variants | ||
| + | for i in {1..22}; do | ||
| + | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
| + | --extract chr$i.snpsToExtract.prune.in \ | ||
| + | --make-bed \ | ||
| + | --out chr$i.1000G.biallelic.common.plink.nodups.LDpruned & | ||
| + | done | ||
| + | |||
| + | rm chr*.1000G.biallelic.common.plink.nodups.bed | ||
| + | rm chr*.1000G.biallelic.common.plink.nodups.bim | ||
| + | rm chr*.1000G.biallelic.common.plink.nodups.fam | ||
| + | rm dups* | ||
| + | |||
| + | |||
| + | ############## | ||
| + | ### STEP 3 ### | ||
| + | ############## | ||
| + | ### Merge the new and improved 1000 Genomes plink files into a single | ||
| + | ### file with all chromosomes. This is easier to do when they' | ||
| + | ### files | ||
| + | ### Example Command: | ||
| for i in {1..22}; do | for i in {1..22}; do | ||
| plink --bfile chr$i.1000G.biallelic.common.plink.nodups.LDpruned --recode-vcf --out tmp$i & | plink --bfile chr$i.1000G.biallelic.common.plink.nodups.LDpruned --recode-vcf --out tmp$i & | ||
| Line 172: | Line 186: | ||
| ###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE | ###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE | ||
| - | ### | + | ### |
| Line 215: | Line 229: | ||
| ### average for each site in the score file. | ### average for each site in the score file. | ||
| ### | ### | ||
| - | ###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING | + | ###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING (2 points) |
| plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | ||
| plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | ||
| Line 241: | Line 255: | ||
| R | R | ||
| ### Then in this R session run: ' | ### Then in this R session run: ' | ||
| - | ### If you log out and back in, you'll have to again run this command: | + | ### If you log out and back in, you'll have to again run this command |
| export R_LIBS=/ | export R_LIBS=/ | ||
| Line 247: | Line 261: | ||
| ###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS | ###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS | ||
| ### | ### | ||
| - | ### | + | ### |
| + | ### | ||
| echo " | echo " | ||
| Line 257: | Line 272: | ||
| ### ' | ### ' | ||
| - | kg_sf <- read.table(' | + | kg_sf <- read.table(' |
| sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, | sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, | ||
homework_6_ancestry.1447448424.txt.gz · Last modified: by scott
