User Tools

Site Tools


lab_2

This is an old revision of the document!


Genotype file

An updated genotype file is here. To understand the annotation, read this: http://snpeff.sourceforge.net/SnpEff_manual.html#input

Example commands

### The file is zipped, which means we have to use slighlty different commands ### Let's take a peak at the file zless -S hu916767_20170324191934.1kgALTAllele.withHeader.snpEff.vcf.gz

### Wow, that's a lot of information compared to our sparse 23andMe file. ### Let's do some counts

### How many variants are there? zgrep -v '#' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l

### Let's try to find variants that can lead to stop gains zgrep –color 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz

### How many did 23andMe genotype? zgrep 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l

### How many missense variants are there? zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l

### Are there any missense variants in our alcohol metabolism genes? zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'ADH1B\|ADH1C\|ALDH2'

### What about phenylketonuria? zgrep 'PAH' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'stop\|missense'

Tuesday, Apr 18, in class exercise

Pick either your favorite gene or favorite phenotype. If the latter, pick a gene associated with that phenotype. Search the 23andMe file to discover:

  1. The number of variants annotated against that gene.
  2. The number of variants predicted to have an effect on gene function.
  3. Compare the number you find to the number in ExAC (http://exac.broadinstitute.org). Why is there a difference?
  4. For all variants in #2, search clinvar to see if any are known to affect the phenotype.

What I've done to prepare the vcf file

Feel free to ignore this section – it's only here to document what I've done for my own future reference, and for any interested student.

1. 23andMe format was converted to vcf format.

./bcftools/bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz

2. I merged sites with 1000 genomes in order to get reference and alternate alleles at each site.

library(data.table) options(stringsAsFactors=F) kg ← fread(“chrALL.vcf”, header=F) kg$V1 ← as.character(kg$V1) head(kg) hu23andme ← fread(“zgrep -v '#' hu916767_20170324191934.vcf.gz”, header=F) dat ← merge(kg, hu23andme, by=c(“V1”, “V2”, “V3”, “V4”), all.y=T) dat$V5 ← ifelse(dat$V5.y == “.”, dat$V5.x, dat$V5.y) dat2 ← dat[,c(1,2,3,4,12,7:11)] dat2[is.na(dat2)] ← “.” write.table(dat2, file=“hu916767_20170324191934.1kgALTallele.vcf”, col.names=F, row.names=F, quote=F, sep=“\t”)

### Add the vcf header back in (zgrep '#' hu916767_20170324191934.vcf.gz ; cat hu916767_20170324191934.1kgALTallele.vcf ) | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz

3. I annotated the resulting file with snpEff (http://snpeff.sourceforge.net/SnpEff_manual.html#run)

java -jar snpEff/snpEff.jar -v GRCh37.75 hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz

lab_2.1492544649.txt.gz · Last modified: 2017/04/18 13:44 by scott