This is an old revision of the document!
An updated genotype file is here: https://drive.google.com/file/d/0B608ps4vtHUaeW9RSC1ZQVliR1k/view?usp=sharing To understand the annotation, read this: http://snpeff.sourceforge.net/SnpEff_manual.html#input
### Let's take a peak at the file less -S hu916767_20170324191934.1kgALTAllele.withHeader.snpEff.vcf
### Wow, that's a lot of information compared to our sparse 23andMe file. ### Let's do some counts
### How many variants are there? wc -l hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf
### Let's try to find variants that can lead to stop gains grep –color 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf
### How many did 23andMe genotype? grep 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | wc -l
### How many missense variants are there? grep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | wc -l
### Are there any missense variants in our alcohol metabolism genes? grep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | grep 'ADH1B\|ADH1C\|ALDH2'
### What about phenylketonuria? grep 'PAH' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | grep 'stop\|missense'
Pick either your favorite gene or favorite phenotype. If the latter, pick a gene associated with that phenotype. Search the 23andMe file to discover:
Feel free to ignore this section – it's only here to document what I've done for my own future reference, and for any interested student.
1. 23andMe format was converted to vcf format.
./bcftools/bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
2. I merged sites with 1000 genomes in order to get reference and alternate alleles at each site.
library(data.table) options(stringsAsFactors=F) kg ← fread(“chrALL.vcf”, header=F) kg$V1 ← as.character(kg$V1) head(kg) hu23andme ← fread(“zgrep -v '#' hu916767_20170324191934.vcf.gz”, header=F) dat ← merge(kg, hu23andme, by=c(“V1”, “V2”, “V3”, “V4”), all.y=T) dat$V5 ← ifelse(dat$V5.y == “.”, dat$V5.x, dat$V5.y) dat2 ← dat[,c(1,2,3,4,12,7:11)] dat2[is.na(dat2)] ← “.” write.table(dat2, file=“hu916767_20170324191934.1kgALTallele.vcf”, col.names=F, row.names=F, quote=F, sep=“\t”)
### Add the vcf header back in (zgrep '#' hu916767_20170324191934.vcf.gz ; cat hu916767_20170324191934.1kgALTallele.vcf ) | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz
3. I annotated the resulting file with snpEff (http://snpeff.sourceforge.net/SnpEff_manual.html#run)
java -jar snpEff/snpEff.jar -v GRCh37.75 hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz