This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Last revision Both sides next revision | ||
homework_6_ancestry [2015/11/13 15:16] scott |
homework_6_ancestry [2015/11/15 18:17] scott /* Code for Homework */ |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | |||
+ | ==== Notes ==== | ||
+ | |||
+ | |||
+ | |||
+ | ==== Code for Homework ==== | ||
+ | |||
# PSYC 7102 -- Statistical Genetics | # PSYC 7102 -- Statistical Genetics | ||
Line 21: | Line 28: | ||
### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT | ### ONLY EXPLAIN COMMANDS WHERE I SPECIFICALLY REQUEST IT! YOU DO NOT | ||
- | ### HAVE TO EXPLAIN EVERY COMMAND! | + | ### HAVE TO EXPLAIN EVERY COMMAND! |
+ | ### in the end a PCA plot containing yourself compared to all 1000 Genomes | ||
+ | ### samples. | ||
# For many questions you'll want to run analyses by chromosome. To do | # For many questions you'll want to run analyses by chromosome. To do | ||
Line 28: | Line 37: | ||
qsub -I -l walltime=23: | qsub -I -l walltime=23: | ||
+ | ### Load apigenome, plink | ||
+ | module load apigenome_0.0.2 | ||
+ | module load plink_latest | ||
Line 70: | Line 82: | ||
### Add in rsIDs from dbSNP. PLINK needs these to reconcile | ### Add in rsIDs from dbSNP. PLINK needs these to reconcile | ||
### positions/ | ### positions/ | ||
- | vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db / | + | vcf-add-rsid -vcf chrALL.filtered.PASS.beagled.HG00096.vcf.gz -db / |
### The previous command keeps only variants with rsIDs, otherwise | ### The previous command keeps only variants with rsIDs, otherwise | ||
### plink throws an error that there are >1 variants with ID = " | ### plink throws an error that there are >1 variants with ID = " | ||
### | ### | ||
- | ###------ QUESTION 1: WHAT DOES THIS COMMAND DO? | + | ###------ QUESTION 1: WHAT DOES THIS COMMAND DO? (1 point) |
zgrep -E '# | zgrep -E '# | ||
Line 90: | Line 102: | ||
# can handle anyway). BE SURE TO LOG INTO A COMPUTE NODE BEFORE | # can handle anyway). BE SURE TO LOG INTO A COMPUTE NODE BEFORE | ||
# RUNNING ANY OF THESE LOOPS!!!!! | # RUNNING ANY OF THESE LOOPS!!!!! | ||
- | ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? | + | ### |
+ | ###------ QUESTION 2: WHAT DOES THIS COMMAND (THE ENTIRE FOR LOOP) DO? (2 points) | ||
for i in {1..22}; do | for i in {1..22}; do | ||
- | zgrep -E ' | + | zgrep -E ' |
done | done | ||
### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common | ### Retain in the 1000 Genomes VCF only your SNPs that are also fairly common | ||
- | ### | + | ### because we're going to conduct PCA on these SNPs and only want common ones. |
- | ###------ QUESTION 3: WHY WOULD WE REMOVE | + | ### |
- | ### | + | ###------ QUESTION 3: WHY WOULD WE RETAIN ONLY COMMON SNPS, OTHER THAN IT |
+ | ### | ||
for i in {1..22}; do | for i in {1..22}; do | ||
plink --vcf chr$i.1000G.biallelic.vcf.gz \ | plink --vcf chr$i.1000G.biallelic.vcf.gz \ | ||
Line 111: | Line 125: | ||
### | ### | ||
###------ QUESTION 4: WALK ME THROUGH WHAT THIS COMMAND (THE ENTIRE | ###------ QUESTION 4: WALK ME THROUGH WHAT THIS COMMAND (THE ENTIRE | ||
- | ### | + | ### |
for i in {1..22}; do | for i in {1..22}; do | ||
zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E ' | zcat chr$i.1000G.biallelic.vcf.gz | cut -f3 | sort | uniq -c | grep -E ' | ||
Line 133: | Line 147: | ||
for i in {1..22}; do | for i in {1..22}; do | ||
plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | plink --bfile chr$i.1000G.biallelic.common.plink.nodups \ | ||
- | --indep-pairwise 100k 5 .01 \ #------ QUESTION 5: EXPLAIN WHAT THIS ARGUMENT IS DOING | + | --indep-pairwise 100k 5 .01 \ #------ QUESTION 5: EXPLAIN WHAT THIS ARGUMENT IS DOING (1 point) |
--maf .05 \ | --maf .05 \ | ||
--out chr$i.snpsToExtract & | --out chr$i.snpsToExtract & | ||
Line 171: | Line 185: | ||
###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE | ###----- QUESTION 6: HOW MANY VARIANTS ARE IN THE OUTPUT FILE FROM THE | ||
- | ### | + | ### |
Line 214: | Line 228: | ||
### average for each site in the score file. | ### average for each site in the score file. | ||
### | ### | ||
- | ###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING | + | ###------ QUESTION 7: EXPLAIN WHAT THE FOLLOWING COMMANDS ARE DOING (2 points) |
plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | ||
plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | plink --vcf chrALL.filtered.PASS.beagled.HG00096.rsIDsOnly.vcf.gz | ||
Line 240: | Line 254: | ||
R | R | ||
### Then in this R session run: ' | ### Then in this R session run: ' | ||
- | ### If you log out and back in, you'll have to again run this command: | + | ### If you log out and back in, you'll have to again run this command |
export R_LIBS=/ | export R_LIBS=/ | ||
Line 246: | Line 260: | ||
###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS | ###------ QUESTION 8: PLEASE INTERPRET THE RESULTING GRAPH. WHAT IS | ||
### | ### | ||
- | ### | + | ### |
+ | ### | ||
echo " | echo " | ||
Line 256: | Line 271: | ||
### ' | ### ' | ||
- | kg_sf <- read.table(' | + | kg_sf <- read.table(' |
sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, | sample_ids <- unique(data.frame(IID=kg_sf$SAMPLE_NAME, |