====== Genotype file ======

An updated genotype file is here: https://drive.google.com/file/d/0B608ps4vtHUaTWxGeFlIc2JtYjQ/view?usp=sharing

To understand the annotation, read this: http://snpeff.sourceforge.net/SnpEff_manual.html#input

If you're using Safari on a Mac you will need to do the following before downloading the file:
  -  Open Safari
  -  Click the "General" tab
  -  Uncheck the box "Open safe files after downloading"


===== Reading a vcf file =====

To walk you through how to read a vcf I ran the following command, which will show the contents of the file minus the header, and minus the annotation column. This just cleans things up by removing parts of the file that we don't need.


zgrep -v '##' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | cut -f-5,9- | head


The output is here:
  #CHROM	POS	ID	REF	ALT	FORMAT	hu916767_20170324191934
  1	82154	rs4477212	A	.	GT	0/0
  1	752566	rs3094315	G	A	GT	1/0
  1	752721	rs3131972	A	G	GT	0/1
  1	768448	rs12562034	G	A	GT	0/0
  1	776546	rs12124819	A	G	GT	0/1
  1	798959	rs11240777	G	A	GT	1/0
  1	800007	rs6681049	T	C	GT	1/1
  1	838555	rs4970383	C	A	GT	0/0
  1	846808	rs4475691	C	T	GT	0/1

The the columns are 
  -  chromosome
  -  position
  -  unique identifier (rs ID)
  -  the reference allele (the allele found in the human reference genome)
  -  the alternate allele (an allele discovered in other individuals)
  -  the FORMAT the individuals genotypes are in (in this case they are coded in the "GT" format, which is 0/0, 0/1, 1/0, or 1/1; believe it or not there are other useful formats).  THIS COLUMN DOES NOT PROVIDE THE INDIVIDUAL'S GENOTYPES AND CAN BE SAFELY IGNORED! 
  -  the genotype of this 23andme individual

To decode the genotype, you must combine the last column with the REF and ALT allele information. Take the last line, rs4475691. The REF allele is "C", and the ALT allele is "T". The genotype is 0/1, which tells you that one chromosome of this individual carries 0 ALT alleles (i.e., 1 reference allele) and the other chromosome carries 1 ALT allele. So the genotype is C/T.

For another example, take rs6681049. The REF allele is T and the ALT is C. The genotype is 1/1. That means that one chromosome of this individual carries 1 ALT allele (i.e., a C) and the other chromosome also carries 1 ALT allele (i.e., a C). So the genotype for this individual at that site is C/C.


====== Lab assignment 2 ======


### Lab 2 assignment
### Assigned: 4/20/2017
### Due: 4/27/2017 at the beginning of class. Late assignments (even by 5 minutes)
###      will not be accepted!
###
### Note: all questions should be answered with respect to the
###       genotypes from hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz

### Question 1 (4 points)
### a) Extract a variant from the vcf and show me the command you used
### and the output of the command. Tell me what the individual's
### genotype is at this site.

### Question 2 (4 points)
### How many variants did 23andMe genotype in exons; that is, in protein coding sequences.
### Show me the commands you used to figure this out.

### Question 3 (8 points)
### a) Is this individual likely to be lactose intolerant? Show me the
### steps you used to figure this out.
### b) Pick one of the variants you used to determine lactose
### intolerance. What is the geographical distribution of this
### variant's allele frequency?


Example full credit answers

1. Most of you got this one right. The most common mistake was to include too much information and too many steps (although that generally did not cost you any points).

zgrep -w 'rs671' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz

 12	112241766	rs671	G	A	.	.	ANN=A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000261733|protein_coding|12/13|c.1510G>A|p.Glu504Lys|1571/2018|1510/1554|504/517||,A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000416293|protein_coding|11/12|c.1369G>A|p.Glu457Lys|1465/1572|1369/1413|457/470||,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000548536|nonsense_mediated_decay|13/14|c.*1386G>A|||||22035|,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000549106|nonsense_mediated_decay|3/4|c.*89G>A|||||89|WARNING_TRANSCRIPT_NO_START_CODON	GT	0/0

This individuals has 0 alternate alleles, so their genotype is G/G. Two reference alleles.

2. There are multiple ways to answer this. One of the most straightforward was as follows, although we could quibble over whether it should have included any splicing variants.

zgrep 'synonymous\|missense\|start_gain\|start_lost\|stop_gain\|stop_lost\|3_prime_UTR_variant\|5_prime_UTR_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
   52772


3. One good answer was:
"a) This person is likely to be lactose intolerant. One primary variant for lactose intolerance is rs4988235, while another one is rs182549. For both of these, the genotype to be lactose intolerant is C/C. This person had the C/C genotype for both sites.


zgrep rs182549 hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz 
2	136616754	rs182549	C	T	.	.	
ANN=T|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000264156
|protein_coding|9/16|c.1362+117G>A||||||,T|intron_variant|MODIFIER|MCM6
|ENSG00000076003|transcript|ENST00000492091|processed_transcript|2/5
|n.181+3423G>A||||||	GT	0/0


zgrep rs4988235 hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
2	136608646	rs4988235	G	A	.	.	
ANN=A|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000264156|protein_coding
|13/16|c.1917+326C>T||||||,A|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000492091
|processed_transcript|3/5|n.343+326C>T||||||,A|intron_variant|MODIFIER|MCM6|ENSG00000076003
|transcript|ENST00000483902|retained_intron|1/1|n.544+326C>T||||||	GT	0/0


Though it looks like at the second site the genotype is G/G, this is reading from the positive strand. The negative strand [which is used on SNPpedia, and is the transcribed strand], would be C/C.

b) Geographical distribution for the allele frequency of rs182549

The Minor allele frequency is 0 in Africa, and southern Europe and Asia, while the minor allele is more prevalent (even becomes major) in Eastern Europe and Western US. Minor allele is slightly prevalent in northern South America."


====== Example commands ======


### The file is zipped, which means we have to use slighlty different commands
### Let's take a peak at the file
zless -S hu916767_20170324191934.1kgALTAllele.withHeader.snpEff.vcf.gz

### Wow, that's a lot of information compared to our sparse 23andMe file.
### Let's do some counts

### How many variants are there?
zgrep -v '#' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l

### Let's try to find variants that can lead to stop gains
zgrep --color 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz

### How many did 23andMe genotype?
zgrep 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l

### How many missense variants are there?
zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l

### Are there any missense variants in our alcohol metabolism genes?
zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'ADH1B\|ADH1C\|ALDH2'

### What about phenylketonuria?
zgrep 'PAH' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'stop\|missense'


====== Tuesday, Apr 18, in class exercise ======

Pick either your favorite gene or favorite phenotype. If the latter, pick a gene associated with that phenotype. Search the 23andMe file to discover:
  -  The number of variants annotated against that gene.
  -  The number of variants predicted to have an effect on gene function.
  -  Compare the number you find to the number in ExAC (http://exac.broadinstitute.org). Why is there a difference?
  -  For all variants in #2, search clinvar to see if any are known to affect the phenotype.


====== What I've done to prepare the vcf file ======

Feel free to ignore this section -- it's only here to document what I've done for my own future reference, and for any interested student.

1. 23andMe format was converted to vcf format.

bcftools convert --tsv2vcf hu916767_20170324191934.txt -f human_g1k_v37.fasta.gz -s hu916767_20170324191934 -Ob -o hu916767_20170324191934.bcf
bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz


2. I merged sites with 1000 genomes in order to get reference and alternate alleles at each site.

library(data.table)
options(stringsAsFactors=F)
kg <- fread("chrALL.vcf", header=F)
kg$V1 <- as.character(kg$V1)
head(kg)
hu23andme <- fread("zgrep -v '#' hu916767_20170324191934.vcf.gz", header=F)
dat <- merge(kg, hu23andme, by=c("V1", "V2", "V3", "V4"), all.y=T)
dat$V5 <- ifelse(dat$V5.y == ".", dat$V5.x, dat$V5.y)
dat2 <- dat[,c(1,2,3,4,12,7:11)]
dat2[is.na(dat2)] <- "."
write.table(dat2, file="hu916767_20170324191934.1kgALTallele.vcf", col.names=F, row.names=F, quote=F, sep="\t")


### Add the vcf header back in
(zgrep '#' hu916767_20170324191934.vcf.gz ; cat hu916767_20170324191934.1kgALTallele.vcf ) | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz


3. I annotated the resulting file with snpEff (http://snpeff.sourceforge.net/SnpEff_manual.html#run)

java -jar snpEff/snpEff.jar -v GRCh37.75 hu916767_20170324191934.1kgALTallele.withHeader.vcf.gz | bgzip -c > hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz