Differences

This shows you the differences between two versions of the page.

--- homework_6_ancestry [2015/11/13 17:51]
scott
+++ homework_6_ancestry [2015/11/16 10:18] (current)
scott
@@ Line 1: / Line 1: @@
+==== Notes ====
+==== Code for Homework ====
 # PSYC 7102 -- Statistical Genetics
@@ Line 21: / Line 28: @@
 ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT
-### HAVE TO EXPLAIN EVERY COMMAND!
+### HAVE TO EXPLAIN EVERY COMMAND! But please run all commands to produce
+### in the end a PCA plot containing yourself compared to all 1000 Genomes
+### samples.
 # For many questions you'll want to run analyses by chromosome. To do
@@ Line 28: / Line 37: @@
 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22
-### Load apigenome, plink, and R
+### Load apigenome, plink
 module load apigenome_0.0.2
 module load plink_latest
+module load tabix_0.2.6
@@ Line 73: / Line 83: @@
 ### Add in rsIDs from dbSNP. PLINK needs these to reconcile
 ### positions/alleles
-vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz
+vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp_144/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz
 ### The previous command keeps only variants with rsIDs, otherwise
 ### plink throws an error that there are >1 variants with ID = "."
@@ Line 96: / Line 106: @@
 ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points)
 for i in {1..22}; do
-    zgrep -E '^#|\s[ACTG]\s[ACTG]\s' /Users/scvr9332/1000Gp3/variant_calls /ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bgzip -c > chr$i.1000G.biallelic.vcf.gz &
+    zgrep -E '^#|\s[ACTG]\s[ACTG]\s' /Users/scvr9332/1000Gp3/variant_calls/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bgzip -c > chr$i.1000G.biallelic.vcf.gz &
 done
 ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common
-###
+### because we're going to conduct PCA on these SNPs and only want common ones.
-###------ QUESTION 3: WHY WOULD WE REMOVE COMMON SNPS, OTHER THAN IT
+###
+###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT
 ###------             MAKES EVERY COMMAND LATER FASTER? (2 points)
 for i in {1..22}; do
@@ Line 244: / Line 255: @@
 R
 ### Then in this R session run: 'install.packages('car', repos='http://cran.r-project.org')'
-### If you log out and back in, you'll have to again run this command:
+### If you log out and back in, you'll have to again run this command again to get the car library loaded:
 export R_LIBS=/Users/scvr9332/R
@@ Line 261: / Line 272: @@
 ### 'Self-reported' ancestry of 1000g participants
-kg_sf <- read.table('20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)
+kg_sf <- read.table('/Users/scvr9332/PCA/20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)
 sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F))

IBG Wiki

User Tools

Site Tools

Differences

Page Tools