User Tools

Site Tools


homework_6_ancestry

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
homework_6_ancestry [2015/11/13 15:16]
scott
homework_6_ancestry [2015/11/15 18:17]
scott /* Code for Homework */
Line 1: Line 1:
 +
 +==== Notes ====
 +
 +
 +
 +==== Code for Homework ====
 +
  
 # PSYC 7102 -- Statistical Genetics # PSYC 7102 -- Statistical Genetics
Line 21: Line 28:
  
 ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT
-### HAVE TO EXPLAIN EVERY COMMAND!+### HAVE TO EXPLAIN EVERY COMMAND! But please run all commands to produce  
 +### in the end a PCA plot containing yourself compared to all 1000 Genomes  
 +### samples.
  
 # For many questions you'll want to run analyses by chromosome. To do # For many questions you'll want to run analyses by chromosome. To do
Line 28: Line 37:
 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22
  
 +### Load apigenome, plink
 +module load apigenome_0.0.2
 +module load plink_latest
  
  
Line 70: Line 82:
 ### Add in rsIDs from dbSNP. PLINK needs these to reconcile ### Add in rsIDs from dbSNP. PLINK needs these to reconcile
 ### positions/alleles ### positions/alleles
-vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz+vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp_144/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz
 ### The previous command keeps only variants with rsIDs, otherwise ### The previous command keeps only variants with rsIDs, otherwise
 ### plink throws an error that there are >1 variants with ID = "." ### plink throws an error that there are >1 variants with ID = "."
 ### ###
-###------ QUESTION 1: WHAT DOES THIS COMMAND DO?+###------ QUESTION 1: WHAT DOES THIS COMMAND DO? (1 point)
 zgrep -E '#|\srs[0-9]' chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz zgrep -E '#|\srs[0-9]' chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz
  
Line 90: Line 102:
 #  can handle anyway). BE SURE TO LOG INTO A COMPUTE NODE BEFORE #  can handle anyway). BE SURE TO LOG INTO A COMPUTE NODE BEFORE
 #  RUNNING ANY OF THESE LOOPS!!!!! #  RUNNING ANY OF THESE LOOPS!!!!!
-###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO?+### 
 +###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points)
 for i in {1..22}; do for i in {1..22}; do
-    zgrep -E '^#|\s[ACTG]\s[ACTG]\s' /Users/scvr9332/1000Gp3/variant_calls /ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bgzip -c > chr$i.1000G.biallelic.vcf.gz &+    zgrep -E '^#|\s[ACTG]\s[ACTG]\s' /Users/scvr9332/1000Gp3/variant_calls/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bgzip -c > chr$i.1000G.biallelic.vcf.gz &
 done done
  
 ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common
-### +### because we're going to conduct PCA on these SNPs and only want common ones. 
-###------ QUESTION 3: WHY WOULD WE REMOVE COMMON SNPS, OTHER THAN IT +###  
-###------             MAKES EVERY COMMAND LATER FASTER?+###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT 
 +###------             MAKES EVERY COMMAND LATER FASTER? (2 points)
 for i in {1..22}; do for i in {1..22}; do
     plink --vcf chr$i.1000G.biallelic.vcf.gz \     plink --vcf chr$i.1000G.biallelic.vcf.gz \
Line 111: Line 125:
 ### ###
 ###------ QUESTION 4: WALK ME THROUGH WHAT THIS COMMAND (THE ENTIRE ###------ QUESTION 4: WALK ME THROUGH WHAT THIS COMMAND (THE ENTIRE
-###------             FOR LOOP) IS DOING.+###------             FOR LOOP) IS DOING. (3 points)
 for i in {1..22}; do for i in {1..22}; do
     zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E '\s2\s' | grep '^rs' | gawk '{print $2}' >  dups$i.txt &      zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E '\s2\s' | grep '^rs' | gawk '{print $2}' >  dups$i.txt & 
Line 133: Line 147:
 for i in {1..22}; do for i in {1..22}; do
     plink --bfile chr$i.1000G.biallelic.common.plink.nodups \     plink --bfile chr$i.1000G.biallelic.common.plink.nodups \
-        --indep-pairwise 100k 5 .01 \  #------ QUESTION 5: EXPLAIN WHAT THIS ARGUMENT IS DOING+        --indep-pairwise 100k 5 .01 \  #------ QUESTION 5: EXPLAIN WHAT THIS ARGUMENT IS DOING (1 point)
         --maf .05 \         --maf .05 \
         --out chr$i.snpsToExtract &         --out chr$i.snpsToExtract &
Line 171: Line 185:
  
 ###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE ###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE
-###-----             COMMAND ABOVE?+###-----             COMMAND ABOVE? (1 point)
  
  
Line 214: Line 228:
 ### average for each site in the score file. ### average for each site in the score file.
 ### ###
-###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING +###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING (2 points)
 plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz  --score PCA-score-file.txt 2 3 4 header --out PC1 plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz  --score PCA-score-file.txt 2 3 4 header --out PC1
 plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz  --score PCA-score-file.txt 2 3 5 header --out PC2 plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz  --score PCA-score-file.txt 2 3 5 header --out PC2
Line 240: Line 254:
 R R
 ### Then in this R session run: 'install.packages('car', repos='http://cran.r-project.org')' ### Then in this R session run: 'install.packages('car', repos='http://cran.r-project.org')'
-### If you log out and back in, you'll have to again run this command:+### If you log out and back in, you'll have to again run this command again to get the car library loaded:
 export R_LIBS=/Users/scvr9332/R export R_LIBS=/Users/scvr9332/R
  
Line 246: Line 260:
 ###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS ###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS
 ###------             BEING PLOTTED? WHAT DOES IT SAY ABOUT THE ###------             BEING PLOTTED? WHAT DOES IT SAY ABOUT THE
-###------             ANCESTRY OF YOUR SAMPLE (OR WHATEVER+###------             ANCESTRY OF YOUR SAMPLE (OR WHATEVER SAMPLE YOU 
 +###------             USED) (2 points)
 echo " echo "
  
Line 256: Line 271:
  
 ### 'Self-reported' ancestry of 1000g participants ### 'Self-reported' ancestry of 1000g participants
-kg_sf <- read.table('20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)+kg_sf <- read.table('/Users/scvr9332/PCA/20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)
  
 sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F)) sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F))
homework_6_ancestry.txt · Last modified: 2015/11/16 10:18 by scott