lab_2

Differences

This shows you the differences between two versions of the page.

lab_2 [2017/04/19 08:34]
scott /* Reading a vcf file */
lab_2 [2017/05/02 09:09] (current)
scott /* Lab assignment 2 */
Line 44: Line 44:
  
 For another example, take rs6681049. The REF allele is T and the ALT is C. The genotype is 1/1. That means that one chromosome of this individual carries 1 ALT allele (i.e., a C) and the other chromosome also carries 1 ALT allele (i.e., a C). So the genotype for this individual at that site is C/C.
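The REF/ALT indexing described above can be sketched in shell. This is a minimal sketch, assuming a tab-separated, single-sample VCF record with REF in column 4, ALT in column 5, and a bare GT value in column 10; the `decode_gt` name is hypothetical, and the POS and INFO values in the sample record are placeholders rather than values from the lab file.

```shell
# decode_gt: hypothetical helper that turns the numeric GT of a
# single-sample VCF record into allele letters (0 = REF, 1.. = ALT).
decode_gt() {
  awk -F'\t' '{
    n = split($5, alt, ",")            # ALT may list several alleles
    allele[0] = $4                     # index 0 is the REF allele
    for (i = 1; i <= n; i++) allele[i] = alt[i]
    split($10, gt, "[/|]")             # accept unphased (/) and phased (|)
    print $3 "\t" allele[gt[1]] "/" allele[gt[2]]
  }'
}

# Illustrative record modeled on the rs6681049 example (POS is a placeholder)
printf '1\t1000\trs6681049\tT\tC\t.\t.\t.\tGT\t1/1\n' | decode_gt
```

To try it on a real record, the output of `zgrep -w` for a single rsID could be piped into `decode_gt` in the same way.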
 +
 +
 +====== Lab assignment 2 ======
 +
 +
 +### Lab 2 assignment
 +### Assigned: 4/20/2017
 +### Due: 4/27/2017 at the beginning of class. Late assignments (even by 5 minutes)
 +###      will not be accepted!
 +###
 +### Note: all questions should be answered with respect to the
 +###       genotypes from hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +
 +### Question 1 (4 points)
 +### a) Extract a variant from the vcf and show me the command you used
 +### and the output of the command. Tell me what the individual's
 +### genotype is at this site.
 +
 +### Question 2 (4 points)
 +### How many variants did 23andMe genotype in exons, that is, in protein-coding sequences?
 +### Show me the commands you used to figure this out.
 +
 +### Question 3 (8 points)
 +### a) Is this individual likely to be lactose intolerant? Show me the
 +### steps you used to figure this out.
 +### b) Pick one of the variants you used to determine lactose
 +### intolerance. What is the geographical distribution of this
 +### variant's allele frequency?
 +
 +
 +Example full-credit answers
 +
 +1. Most of you got this one right. The most common mistake was to include too much information and too many steps (although that generally did not cost you any points).
 +
 +zgrep -w 'rs671' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +
 + 12 112241766 rs671 G A . . ANN=A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000261733|protein_coding|12/13|c.1510G>A|p.Glu504Lys|1571/2018|1510/1554|504/517||,A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000416293|protein_coding|11/12|c.1369G>A|p.Glu457Lys|1465/1572|1369/1413|457/470||,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000548536|nonsense_mediated_decay|13/14|c.*1386G>A|||||22035|,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000549106|nonsense_mediated_decay|3/4|c.*89G>A|||||89|WARNING_TRANSCRIPT_NO_START_CODON GT 0/0
 +
 +This individual has 0 alternate alleles (GT is 0/0), so their genotype is G/G: two reference alleles.
 +
 +2. There are multiple ways to answer this. One of the most straightforward was as follows, although we could quibble over whether it should have included any splicing variants.
 +
 +zgrep 'synonymous\|missense\|start_gain\|start_lost\|stop_gain\|stop_lost\|3_prime_UTR_variant\|5_prime_UTR_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
 +   52772
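A variation on the command above is to match only the INFO column of data lines, so that `#` header lines can never be counted. This is a minimal sketch, assuming the standard tab-separated VCF layout with INFO in column 8; the `count_coding` name is hypothetical and the input records below are made-up toy data, not lines from the lab file.

```shell
# count_coding: hypothetical helper; counts data lines whose INFO column (8)
# mentions one of the effect terms used in the answer above.
count_coding() {
  awk -F'\t' '
    /^#/ { next }    # skip header lines
    $8 ~ /synonymous|missense|start_gain|start_lost|stop_gain|stop_lost|3_prime_UTR_variant|5_prime_UTR_variant/ { n++ }
    END { print n + 0 }
  '
}

# Toy input: one header that mentions an effect term, then three records
{
  printf '##INFO=<ID=ANN,Description="toy header mentioning missense_variant">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf '1\t100\trsA\tG\tA\t.\t.\tANN=A|missense_variant|MODERATE\tGT\t0/1\n'
  printf '1\t200\trsB\tC\tT\t.\t.\tANN=T|intron_variant|MODIFIER\tGT\t0/0\n'
  printf '1\t300\trsC\tT\tC\t.\t.\tANN=C|synonymous_variant|LOW\tGT\t1/1\n'
} | count_coding
```

On the real file, the decompressed VCF could be streamed in with `gzip -dc hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | count_coding`. Whether the result differs from the `zgrep` count depends on whether any header line in that file happens to contain one of the terms.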
 +
 +
 +3. One good answer was:
 +"a) This person is likely to be lactose intolerant. One primary variant for lactose intolerance is rs4988235, while another one is rs182549. For both of these, the genotype associated with lactose intolerance is C/C. This person had the C/C genotype for both sites.
 +
 +
 +zgrep rs182549 hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz 
 +2 136616754 rs182549 C T . .
 +ANN=T|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000264156
 +|protein_coding|9/16|c.1362+117G>A||||||,T|intron_variant|MODIFIER|MCM6
 +|ENSG00000076003|transcript|ENST00000492091|processed_transcript|2/5
 +|n.181+3423G>A|||||| GT 0/0
 +
 +
 +zgrep rs4988235 hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +2 136608646 rs4988235 G A . .
 +ANN=A|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000264156|protein_coding
 +|13/16|c.1917+326C>T||||||,A|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000492091
 +|processed_transcript|3/5|n.343+326C>T||||||,A|intron_variant|MODIFIER|MCM6|ENSG00000076003
 +|transcript|ENST00000483902|retained_intron|1/1|n.544+326C>T|||||| GT 0/0
 +
 +
 +Although the genotype at the second site (rs4988235) appears to be G/G, that is read from the positive strand. On the negative strand [which is used on SNPedia, and is the transcribed strand], it would be C/C.
 +
 +b) Geographical distribution for the allele frequency of rs182549
 +
 +The minor allele frequency is 0 in Africa, southern Europe, and Asia, while the minor allele is more prevalent (and even becomes the major allele) in Eastern Europe and the western US. The minor allele is slightly prevalent in northern South America."
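The strand conversion used in part a) of this answer (G/G on the positive strand read as C/C on the negative strand) can be sketched in shell. This is a minimal sketch, assuming single-base alleles, for which complementing each base is enough and no reversal is needed; the `complement_gt` name is hypothetical.

```shell
# complement_gt: hypothetical helper; complements each base in a genotype
# string such as G/G. Only suitable for single-base (SNP) alleles.
complement_gt() {
  printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

complement_gt 'G/G'   # rs4988235 as reported on the positive strand; prints C/C
```

The separator and any non-base characters pass through `tr` unchanged, so the same helper works for heterozygous calls such as `G/A`.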
  
  
Line 92: Line 163:
 1. 23andMe format was converted to vcf format.
  
-./bcftools/bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
 +bcftools convert --tsv2vcf hu916767_20170324191934.txt -f human_g1k_v37.fasta.gz -s hu916767_20170324191934 -Ob -o hu916767_20170324191934.bcf
 +bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
  
  
lab_2.1492612486.txt.gz · Last modified: 2017/04/19 08:34 by scott