User Tools

Site Tools


homework_6_ancestry

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
homework_6_ancestry [2015/11/13 17:51]
scott
homework_6_ancestry [2015/11/15 18:17]
scott /* Code for Homework */
Line 1: Line 1:
 +
 +==== Notes ====
 +
 +
 +
 +==== Code for Homework ====
 +
  
 # PSYC 7102 -- Statistical Genetics # PSYC 7102 -- Statistical Genetics
Line 21: Line 28:
  
 ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT
-### HAVE TO EXPLAIN EVERY COMMAND!+### HAVE TO EXPLAIN EVERY COMMAND! But please run all commands to produce  
 +### in the end a PCA plot containing yourself compared to all 1000 Genomes  
 +### samples.
  
 # For many questions you'll want to run analyses by chromosome. To do # For many questions you'll want to run analyses by chromosome. To do
Line 28: Line 37:
 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22
  
-### Load apigenome, plink, and R+### Load apigenome, plink
 module load apigenome_0.0.2 module load apigenome_0.0.2
 module load plink_latest module load plink_latest
Line 73: Line 82:
 ### Add in rsIDs from dbSNP. PLINK needs these to reconcile ### Add in rsIDs from dbSNP. PLINK needs these to reconcile
 ### positions/alleles ### positions/alleles
-vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz+vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp_144/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz
 ### The previous command keeps only variants with rsIDs, otherwise ### The previous command keeps only variants with rsIDs, otherwise
 ### plink throws an error that there are >1 variants with ID = "." ### plink throws an error that there are >1 variants with ID = "."
Line 96: Line 105:
 ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points) ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points)
 for i in {1..22}; do for i in {1..22}; do
-    zgrep -E '^#|\s[ACTG]\s[ACTG]\s' /Users/scvr9332/1000Gp3/variant_calls /ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bgzip -c > chr$i.1000G.biallelic.vcf.gz &+    zgrep -E '^#|\s[ACTG]\s[ACTG]\s' /Users/scvr9332/1000Gp3/variant_calls/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bgzip -c > chr$i.1000G.biallelic.vcf.gz &
 done done
  
 ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common
-### +### because we're going to conduct PCA on these SNPs and only want common ones. 
-###------ QUESTION 3: WHY WOULD WE REMOVE COMMON SNPS, OTHER THAN IT+###  
 +###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT
 ###------             MAKES EVERY COMMAND LATER FASTER? (2 points) ###------             MAKES EVERY COMMAND LATER FASTER? (2 points)
 for i in {1..22}; do for i in {1..22}; do
Line 244: Line 254:
 R R
 ### Then in this R session run: 'install.packages('car', repos='http://cran.r-project.org')' ### Then in this R session run: 'install.packages('car', repos='http://cran.r-project.org')'
-### If you log out and back in, you'll have to again run this command:+### If you log out and back in, you'll have to again run this command again to get the car library loaded:
 export R_LIBS=/Users/scvr9332/R export R_LIBS=/Users/scvr9332/R
  
Line 261: Line 271:
  
 ### 'Self-reported' ancestry of 1000g participants ### 'Self-reported' ancestry of 1000g participants
-kg_sf <- read.table('20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)+kg_sf <- read.table('/Users/scvr9332/PCA/20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)
  
 sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F)) sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F))
homework_6_ancestry.txt · Last modified: 2015/11/16 10:18 by scott