User Tools

Site Tools


homework_6_ancestry

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
homework_6_ancestry [2015/11/14 10:12]
scott
homework_6_ancestry [2015/11/16 10:18] (current)
scott
Line 1: Line 1:
-11/14/2015 N.B.: Chelsie noted that the <code>apigenome</code> module only loaded on the head node, and not during an interactive session. The <code>apigenome</code> module is only needed to run the <code>vcf-add-rsid<code> command, which can be run on the head node. Scott has requested BioFrontiers to make <code>apigenome</code> available everywhere but till then please load the module and run <code>vcf-add-rsid<code> on the head node. + 
-</code></code></code></code>+==== Notes ==== 
 + 
 + 
 + 
 +==== Code for Homework ==== 
  
 # PSYC 7102 -- Statistical Genetics # PSYC 7102 -- Statistical Genetics
Line 23: Line 28:
  
 ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT
-### HAVE TO EXPLAIN EVERY COMMAND!+### HAVE TO EXPLAIN EVERY COMMAND! But please run all commands to produce  
 +### in the end a PCA plot containing yourself compared to all 1000 Genomes  
 +### samples.
  
 # For many questions you'll want to run analyses by chromosome. To do # For many questions you'll want to run analyses by chromosome. To do
Line 30: Line 37:
 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22 qsub -I -l walltime=23:00:00 -q short -l mem=10gb -l nodes=1:ppn=22
  
-### Load apigenome, plink, and R+### Load apigenome, plink
 module load apigenome_0.0.2 module load apigenome_0.0.2
 module load plink_latest module load plink_latest
 +module load tabix_0.2.6
  
  
Line 75: Line 83:
 ### Add in rsIDs from dbSNP. PLINK needs these to reconcile ### Add in rsIDs from dbSNP. PLINK needs these to reconcile
 ### positions/alleles ### positions/alleles
-vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz+vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db /Users/scvr9332/reference_data/dbsnp_144/dbsnp.144.b37.vcf.gz | bgzip -c > chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz
 ### The previous command keeps only variants with rsIDs, otherwise ### The previous command keeps only variants with rsIDs, otherwise
 ### plink throws an error that there are >1 variants with ID = "." ### plink throws an error that there are >1 variants with ID = "."
Line 102: Line 110:
  
 ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common
-### +### because we're going to conduct PCA on these SNPs and only want common ones. 
-###------ QUESTION 3: WHY WOULD WE REMOVE COMMON SNPS, OTHER THAN IT+###  
 +###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT
 ###------             MAKES EVERY COMMAND LATER FASTER? (2 points) ###------             MAKES EVERY COMMAND LATER FASTER? (2 points)
 for i in {1..22}; do for i in {1..22}; do
Line 263: Line 272:
  
 ### 'Self-reported' ancestry of 1000g participants ### 'Self-reported' ancestry of 1000g participants
-kg_sf <- read.table('/Users/scvr9332/20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)+kg_sf <- read.table('/Users/scvr9332/PCA/20130502.sequence.index', header=T, sep='\t', fill=T, stringsAsFactors=F)
  
 sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F)) sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, POPULATION=kg_sf$POPULATION, stringsAsFactors=F))
Line 327: Line 336:
 " | R --vanilla " | R --vanilla
    
- 
homework_6_ancestry.1447521146.txt.gz · Last modified: 2015/11/14 10:12 by scott