An updated genotype file is here: https://drive.google.com/file/d/0B608ps4vtHUaTWxGeFlIc2JtYjQ/view?usp=sharing
To understand the annotation, read this: http://snpeff.sourceforge.net/SnpEff_manual.html#input
If you're using Safari on a Mac you will need to do the following before downloading the file:
To walk you through how to read a VCF, I ran the following command, which shows the contents of the file minus the '##' header lines and minus columns 6 to 8 (QUAL, FILTER, and INFO; the INFO column holds the annotation). This just cleans things up by removing parts of the file that we don't need.
zgrep -v '##' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | cut -f-5,9- | head
The output is here:
#CHROM  POS     ID          REF  ALT  FORMAT  hu916767_20170324191934
1       82154   rs4477212   A    .    GT      0/0
1       752566  rs3094315   G    A    GT      1/0
1       752721  rs3131972   A    G    GT      0/1
1       768448  rs12562034  G    A    GT      0/0
1       776546  rs12124819  A    G    GT      0/1
1       798959  rs11240777  G    A    GT      1/0
1       800007  rs6681049   T    C    GT      1/1
1       838555  rs4970383   C    A    GT      0/0
1       846808  rs4475691   C    T    GT      0/1
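The columns are: chromosome (CHROM), position (POS), variant ID (ID), reference allele (REF), alternate allele (ALT), the format of the sample field (FORMAT; here GT, for genotype), and the genotype of the individual (hu916767_20170324191934).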
To decode the genotype, you must combine the last column with the REF and ALT allele information. Take the last line, rs4475691. The REF allele is “C”, and the ALT allele is “T”. The genotype is 0/1, which tells you that one chromosome of this individual carries 0 ALT alleles (i.e., 1 reference allele) and the other chromosome carries 1 ALT allele. So the genotype is C/T.
For another example, take rs6681049. The REF allele is T and the ALT is C. The genotype is 1/1. That means that one chromosome of this individual carries 1 ALT allele (i.e., a C) and the other chromosome also carries 1 ALT allele (i.e., a C). So the genotype for this individual at that site is C/C.
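The decoding rule in the two examples above can be sketched as a small shell function. `decode_gt` is a hypothetical helper written for this page, not part of any VCF tool, and it assumes a site with a single ALT allele.

```shell
# decode_gt REF ALT GT
# Hypothetical helper: translate a GT value such as 0/1 into the alleles
# it refers to. Assumes a single ALT allele; any index other than 0 or 1
# (including a missing '.') is printed as '.'.
decode_gt() {
  awk -v ref="$1" -v alt="$2" -v gt="$3" 'BEGIN {
    n = split(gt, idx, "[/|]")      # GT may use "/" (unphased) or "|" (phased)
    out = ""
    for (i = 1; i <= n; i++) {
      if (idx[i] == "0") allele = ref
      else if (idx[i] == "1") allele = alt
      else allele = "."
      out = out (i > 1 ? "/" : "") allele
    }
    print out
  }'
}

decode_gt C T 0/1   # rs4475691, prints C/T
decode_gt T C 1/1   # rs6681049, prints C/C
decode_gt G A 1/0   # rs3094315, prints A/G
```

The function only looks at the three values you pass in, so you can copy REF, ALT, and the last column straight from any line of the output above.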
### The file is zipped, which means we have to use slightly different commands
### Let's take a peek at the file
zless -S hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
### Wow, that's a lot of information compared to our sparse 23andMe file.
### Let's do some counts
### How many variants are there?
zgrep -v '#' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
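One thing to keep in mind with the count above: `-v '#'` drops every line that contains a '#' anywhere, not only header lines. A slightly more careful sketch anchors the pattern to the start of the line. The function names and the tiny VCF below are made up for illustration and are not taken from the real file.

```shell
# count_records: count VCF data lines read from standard input.
# '^#' is anchored, so only lines that START with '#' are treated as header.
count_records() {
  grep -vc '^#'
}

# sample_vcf: a tiny made-up VCF (not the real file) in which one data
# line happens to contain a '#' inside its INFO column.
sample_vcf() {
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\trs1\tA\tG\t.\t.\tNOTE=a#b\n'
  printf '1\t200\trs2\tC\tT\t.\t.\t.\n'
}

sample_vcf | count_records     # prints 2: both data lines are counted
sample_vcf | grep -vc '#'      # prints 1: the line containing '#' is lost
```

On the real file, the same function could be fed with `gzip -dc hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | count_records`. Whether the two counts differ there depends on whether any annotation contains a '#'.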
### Let's try to find variants that can lead to stop gains
zgrep --color 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
### How many did 23andMe genotype?
zgrep 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
### How many missense variants are there?
zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
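Instead of running one grep per effect, you could tally every effect type in one pass. The sketch below assumes the annotation is stored in snpEff's `ANN=` INFO field, where each comma-separated annotation is split by '|' and its second subfield holds the effect (several effects may be joined with '&'). `tally_effects` and the sample data are hypothetical, and the sample annotations are shortened compared to real snpEff output.

```shell
# tally_effects: read a VCF from standard input and print, for each effect
# term found in the ANN field, the number of variants carrying that term.
# A variant is counted once per term, even if several transcripts repeat it.
tally_effects() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      n = split($8, info, ";")                     # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^ANN=/) continue
        m = split(substr(info[i], 5), ann, ",")    # one entry per annotation
        for (j = 1; j <= m; j++) {
          split(ann[j], f, "|")                    # f[2] is the effect subfield
          k = split(f[2], eff, "&")                # "&" joins multiple effects
          for (e = 1; e <= k; e++) {
            if (!((NR, eff[e]) in seen)) {
              seen[NR, eff[e]] = 1
              count[eff[e]]++
            }
          }
        }
      }
    }
    END { for (t in count) print count[t] "\t" t }
  ' | sort -k1,1nr -k2,2
}

# sample_annotated_vcf: made-up data with shortened ANN entries.
sample_annotated_vcf() {
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf '1\t100\trs1\tA\tG\t.\t.\tANN=G|missense_variant|MODERATE|GENE1|x,G|missense_variant|MODERATE|GENE1|y\tGT\t0/1\n'
  printf '1\t200\trs2\tC\tT\t.\t.\tANN=T|stop_gained&splice_region_variant|HIGH|GENE2|x\tGT\t1/1\n'
  printf '1\t300\trs3\tG\tA\t.\t.\tANN=A|missense_variant|MODERATE|GENE3|x\tGT\t0/0\n'
}

sample_annotated_vcf | tally_effects
```

Because each variant is counted once per term, the numbers should be comparable to the `zgrep ... | wc -l` counts above, which count lines.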
### Are there any missense variants in our alcohol metabolism genes?
zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'ADH1B\|ADH1C\|ALDH2'
### What about phenylketonuria?
zgrep 'PAH' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'stop\|missense'
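A plain `grep 'PAH'` matches those three letters anywhere on the line, including inside a longer gene name. As a rough sketch, assuming the gene name sits between two '|' characters in the annotation (as it does in snpEff's ANN field), you could match it as a whole field instead. `variants_in_genes` is a hypothetical helper, and `NOTPAH1` is a made-up gene name used only to show the difference.

```shell
# variants_in_genes GENE [GENE...]: read a VCF from standard input and keep
# the data lines where one of the genes appears as a whole "|"-delimited field.
variants_in_genes() {
  pattern=$(printf '%s|' "$@")      # e.g. "ADH1B|ADH1C|ALDH2|"
  pattern=${pattern%|}              # drop the trailing "|"
  grep -v '^#' | grep -E "\|(${pattern})\|"
}

# sample_gene_vcf: made-up data with shortened ANN entries.
sample_gene_vcf() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '12\t100\trs1\tA\tG\t.\t.\tANN=G|missense_variant|MODERATE|PAH|x\n'
  printf '12\t200\trs2\tC\tT\t.\t.\tANN=T|missense_variant|MODERATE|NOTPAH1|x\n'
  printf '4\t300\trs3\tG\tA\t.\t.\tANN=A|missense_variant|MODERATE|ADH1B|x\n'
}

sample_gene_vcf | grep -c 'PAH'                         # prints 2 (PAH and NOTPAH1)
sample_gene_vcf | variants_in_genes PAH                 # prints only the rs1 line
sample_gene_vcf | variants_in_genes ADH1B ADH1C ALDH2   # prints only the rs3 line
```

The output can be piped into `grep 'stop\|missense'` just like the commands above.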
Pick either your favorite gene or favorite phenotype. If the latter, pick a gene associated with that phenotype. Search the 23andMe file to discover:
Feel free to ignore this section – it's only here to document what I've done for my own future reference, and for any interested student.
1. 23andMe format was converted to vcf format.
./bcftools/bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
2. I merged sites with 1000 genomes in order to get reference and alternate alleles at each site.
library(data.table)
options(stringsAsFactors=F)
kg <- fread("chrALL.vcf", header=F)
kg$V1 <- as.character(kg$V1)
head(kg)
hu23andme <- fread("zgrep -v '#' hu916767_20170324191934.vcf.gz", header=F)
# Merge on chromosome, position, ID and REF; all.y=T keeps every 23andMe site
dat <- merge(kg, hu23andme, by=c("V1", "V2", "V3", "V4"), all.y=T)
# Where the 23andMe ALT is missing ("."), use the 1000 Genomes ALT instead
dat$V5 <- ifelse(dat$V5.y == ".", dat$V5.x, dat$V5.y)
dat2 <- dat[,c(1,2,3,4,12,7:11)]
dat2[is.na(dat2)] <- "."
write.table(dat2, file="hu916767_20170324191934.1kgALTallele.vcf", col.names=F, row.names=F, quote=F, sep="\t")
### Add the vcf header back in
(zgrep '#' hu916767_20170324191934.vcf.gz ; cat hu916767_20170324191934.1kgALTallele.vcf ) | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz
3. I annotated the resulting file with snpEff (http://snpeff.sourceforge.net/SnpEff_manual.html#run)
java -jar snpEff/snpEff.jar -v GRCh37.75 hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz