lab_2

Differences

This shows you the differences between two versions of the page.

lab_2 [2017/04/19 08:32]
scott /* Genotype file */
lab_2 [2017/04/30 22:20]
scott /* Lab assignment 2 */
Line 14: Line 14:
 ===== Reading a vcf file ===== ===== Reading a vcf file =====
  
-This command will show the contents of the file minus the header, and minus the annotation column. +To walk you through how to read a vcf I ran the following command, which will show the contents of the file minus the header, and minus the annotation column. This just cleans things up by removing parts of the file that we don't need. 
  
 zgrep -v '##' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | cut -f-5,9- | head zgrep -v '##' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | cut -f-5,9- | head
Line 37: Line 38:
   -  the reference allele (the allele found in the human reference genome)   -  the reference allele (the allele found in the human reference genome)
   -  the alternate allele (an allele discovered in other individuals)   -  the alternate allele (an allele discovered in other individuals)
-  -  the FORMAT the individuals genotypes are in (in this case they are coded in the GT format, which is 0/0, 0/1, 1/0, or 1/1; believe it or not there are other useful formats).+  -  the FORMAT the individual's genotypes are in (in this case they are coded in the "GT" format, which is 0/0, 0/1, 1/0, or 1/1; believe it or not there are other useful formats).  THIS COLUMN DOES NOT PROVIDE THE INDIVIDUAL'S GENOTYPES AND CAN BE SAFELY IGNORED! 
   -  the genotype of this 23andme individual   -  the genotype of this 23andme individual
  
Line 43: Line 44:
  
 For another example, take rs6681049. The REF allele is T and the ALT is C. The genotype is 1/1. That means that one chromosome of this individual carries 1 ALT allele (i.e., a C) and the other chromosome also carries 1 ALT allele (i.e., a C). So the genotype for this individual at that site is C/C. For another example, take rs6681049. The REF allele is T and the ALT is C. The genotype is 1/1. That means that one chromosome of this individual carries 1 ALT allele (i.e., a C) and the other chromosome also carries 1 ALT allele (i.e., a C). So the genotype for this individual at that site is C/C.
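The decoding rule described above (0 means the REF allele, 1 means the ALT allele) can be sketched as a small shell function. This is only an illustration: `decode_gt` is a hypothetical helper name, not part of the lab materials, and it assumes a single ALT allele and a two-allele GT field separated by `/` or `|`.

```shell
# Hypothetical helper (not from the lab): translate a GT code such as 0/1
# into bases, given the REF and ALT alleles of the site.
decode_gt() {
    ref=$1; alt=$2; gt=$3
    # Split the GT field on "/" or "|", then map 0 -> REF and 1 -> ALT.
    # Any other code (for example a missing call ".") is printed as ".".
    printf '%s\n' "$gt" | awk -v ref="$ref" -v alt="$alt" -F'[/|]' '
        function base(code) { return (code == "0") ? ref : (code == "1") ? alt : "." }
        { print base($1) "/" base($2) }'
}

decode_gt T C 1/1   # the rs6681049 example: prints C/C
```

Called with the REF, ALT, and GT values from any row of the vcf, it prints the two alleles the individual carries at that site.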
 +
 +
 +====== Lab assignment 2 ======
 +
 +
 +### Lab 2 assignment
 +### Assigned: 4/20/2017
 +### Due: 4/27/2017 at the beginning of class. Late assignments (even by 5 minutes)
 +###      will not be accepted!
 +###
 +### Note: all questions should be answered with respect to the
 +###       genotypes from hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +
 +### Question 1 (4 points)
 +### a) Extract a variant from the vcf and show me the command you used
 +### and the output of the command. Tell me what the individual's
 +### genotype is at this site.
 +
 +### Question 2 (4 points)
 +### How many variants did 23andMe genotype in exons, that is, in protein-coding sequences?
 +### Show me the commands you used to figure this out.
 +
 +### Question 3 (8 points)
 +### a) Is this individual likely to be lactose intolerant? Show me the
 +### steps you used to figure this out.
 +### b) Pick one of the variants you used to determine lactose
 +### intolerance. What is the geographical distribution of this
 +### variant's allele frequency?
 +
 +
 +Example full credit answers
 +
 +1. Most of you got this one right. The most common mistake was to include too much information and too many steps (although that generally did not cost you any points).
 +
 +zgrep -w 'rs671' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +
 + 12 112241766 rs671 G A . . ANN=A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000261733|protein_coding|12/13|c.1510G>A|p.Glu504Lys|1571/2018|1510/1554|504/517||,A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000416293|protein_coding|11/12|c.1369G>A|p.Glu457Lys|1465/1572|1369/1413|457/470||,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000548536|nonsense_mediated_decay|13/14|c.*1386G>A|||||22035|,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000549106|nonsense_mediated_decay|3/4|c.*89G>A|||||89|WARNING_TRANSCRIPT_NO_START_CODON GT 0/0
 +
 +This individual has 0 alternate alleles, that is, two reference alleles, so their genotype is G/G.
 +
 +2. There are multiple ways to answer this. One of the most straightforward is as follows, although we could quibble over whether I should have included any splicing variants.
 +
 +zgrep 'synonymous\|missense\|start_gain\|start_lost\|stop_gain\|stop_lost\|3_prime_UTR_variant\|5_prime_UTR_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
 +   52772
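A slightly more defensive variant of the same count is sketched below. It searches for the same effect names but first drops every line starting with `#`, in case a header line happens to mention one of them. `count_exonic` is a hypothetical helper name, and whether the header filter changes the number for this particular file is not something this sketch establishes.

```shell
# Hypothetical helper (not from the lab): count variant lines whose
# annotation mentions one of the exonic effect names used above.
count_exonic() {
    # Decompress, skip header lines (those starting with "#"),
    # then count the lines matching any of the effect names.
    gzip -dc "$1" \
        | grep -v '^#' \
        | grep -c -E 'synonymous|missense|start_gain|start_lost|stop_gain|stop_lost|3_prime_UTR_variant|5_prime_UTR_variant'
}
```

Usage would be `count_exonic hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz`. Like the original command, this counts lines, so a variant with several matching annotations is still counted once.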
 +
 +
 +3. Coming soon
  
  
Line 91: Line 139:
 1. 23andMe format was converted to vcf format. 1. 23andMe format was converted to vcf format.
  
-./bcftools/bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz+bcftools convert --tsv2vcf hu916767_20170324191934.txt -f human_g1k_v37.fasta.gz -s hu916767_20170324191934 -Ob -o hu916767_20170324191934.bcf 
 +bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
  
  
lab_2.txt · Last modified: 2017/05/02 09:09 by scott