lab_2

Differences

This shows you the differences between two versions of the page.

lab_2 [2017/04/18 13:16]
scott /* Example commands */
lab_2 [2017/05/02 09:09] (current)
scott /* Lab assignment 2 */
Line 2: Line 2:
 ====== Genotype file ======
 
-An updated genotype file is here: https://drive.google.com/file/d/0B608ps4vtHUadlBkX25wTkdBaFE/view?usp=sharing
+An updated genotype file is here: https://drive.google.com/file/d/0B608ps4vtHUaTWxGeFlIc2JtYjQ/view?usp=sharing
 
 To understand the annotation, read this: http://snpeff.sourceforge.net/SnpEff_manual.html#input
 +
 +If you're using Safari on a Mac, do the following before downloading the file so that Safari does not automatically decompress it:
 +  -  Open Safari
 +  -  Open Safari's preferences (Safari > Preferences)
 +  -  Click the "General" tab
 +  -  Uncheck the box "Open safe files after downloading"
 +
 +
 +===== Reading a vcf file =====
 +
 +To walk you through how to read a VCF file, I ran the following command, which shows the contents of the file minus the '##' meta-information header lines and minus columns 6-8 (QUAL, FILTER, and INFO; the INFO column holds the annotation). This just cleans things up by removing parts of the file that we don't need here.
 +
 +
 +zgrep -v '##' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | cut -f-5,9- | head
 +
 +
 +The output is here:
 +  #CHROM POS ID REF ALT FORMAT hu916767_20170324191934
 +  1 82154 rs4477212 A . GT 0/0
 +  1 752566 rs3094315 G A GT 1/0
 +  1 752721 rs3131972 A G GT 0/1
 +  1 768448 rs12562034 G A GT 0/0
 +  1 776546 rs12124819 A G GT 0/1
 +  1 798959 rs11240777 G A GT 1/0
 +  1 800007 rs6681049 T C GT 1/1
 +  1 838555 rs4970383 C A GT 0/0
 +  1 846808 rs4475691 C T GT 0/1
 +
 +The columns are:
 +  -  chromosome
 +  -  position
 +  -  unique identifier (rs ID)
 +  -  the reference allele (the allele found in the human reference genome)
 +  -  the alternate allele (an allele discovered in other individuals)
 +  -  the FORMAT of the individual's genotype field (in this case "GT", meaning genotypes are coded as 0/0, 0/1, 1/0, or 1/1; believe it or not, there are other useful formats). THIS COLUMN DOES NOT PROVIDE THE INDIVIDUAL'S GENOTYPES AND CAN BE SAFELY IGNORED!
 +  -  the genotype of this 23andMe individual
 +
 +To decode the genotype, you must combine the last column with the REF and ALT allele information. Each of the two numbers describes one chromosome: 0 means it carries the REF allele and 1 means it carries the ALT allele. Take the last line, rs4475691. The REF allele is "C", and the ALT allele is "T". The genotype is 0/1, which tells you that one chromosome of this individual carries the REF allele (C) and the other chromosome carries the ALT allele (T). So the genotype is C/T.
 +
 +For another example, take rs6681049. The REF allele is T and the ALT is C. The genotype is 1/1. That means that one chromosome of this individual carries 1 ALT allele (i.e., a C) and the other chromosome also carries 1 ALT allele (i.e., a C). So the genotype for this individual at that site is C/C.
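The decoding rule in the two worked examples can be sketched as a small shell function. This is only an illustration: `decode_gt` is a hypothetical helper name, and it assumes a single ALT allele and a two-allele GT value such as 0/1 or 1/1.

```shell
# decode_gt REF ALT GT: print the genotype as alleles, e.g. "C/T".
# Hypothetical helper; assumes one ALT allele and a GT like 0/1 or 1|1.
decode_gt() {
    ref=$1
    alt=$2
    gt=$3
    printf '%s\n' "$gt" | awk -v ref="$ref" -v alt="$alt" '{
        n = split($0, idx, "[/|]")           # allele indices, e.g. "0" and "1"
        out = ""
        for (i = 1; i <= n; i++) {
            if (idx[i] == "0") a = ref       # 0 means the REF allele
            else if (idx[i] == "1") a = alt  # 1 means the ALT allele
            else a = "."                     # missing or unsupported index
            out = out (i > 1 ? "/" : "") a
        }
        print out
    }'
}

decode_gt C T 0/1   # rs4475691
decode_gt T C 1/1   # rs6681049
```

The two calls print C/T and C/C, matching the genotypes worked out by hand above.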
 +
 +
 +====== Lab assignment 2 ======
 +
 +
 +### Lab 2 assignment
 +### Assigned: 4/20/2017
 +### Due: 4/27/2017 at the beginning of class. Late assignments (even by 5 minutes)
 +###      will not be accepted!
 +###
 +### Note: all questions should be answered with respect to the
 +###       genotypes from hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +
 +### Question 1 (4 points)
 +### a) Extract a variant from the vcf and show me the command you used
 +### and the output of the command. Tell me what the individual's
 +### genotype is at this site.
 +
 +### Question 2 (4 points)
 +### How many variants did 23andMe genotype in exons (that is, in protein-coding sequences)?
 +### Show me the commands you used to figure this out.
 +
 +### Question 3 (8 points)
 +### a) Is this individual likely to be lactose intolerant? Show me the
 +### steps you used to figure this out.
 +### b) Pick one of the variants you used to determine lactose
 +### intolerance. What is the geographical distribution of this
 +### variant's allele frequency?
 +
 +
 +Example full credit answers
 +
 +1. Most of you got this one right. The most common mistake was to include too much information and too many steps (although that generally did not cost you any points).
 +
 +zgrep -w 'rs671' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +
 + 12 112241766 rs671 G A . . ANN=A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000261733|protein_coding|12/13|c.1510G>A|p.Glu504Lys|1571/2018|1510/1554|504/517||,A|missense_variant|MODERATE|ALDH2|ENSG00000111275|transcript|ENST00000416293|protein_coding|11/12|c.1369G>A|p.Glu457Lys|1465/1572|1369/1413|457/470||,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000548536|nonsense_mediated_decay|13/14|c.*1386G>A|||||22035|,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275|transcript|ENST00000549106|nonsense_mediated_decay|3/4|c.*89G>A|||||89|WARNING_TRANSCRIPT_NO_START_CODON GT 0/0
 +
 +This individual has 0 alternate alleles, so their genotype is G/G (two reference alleles).
 +
 +2. There are multiple ways to answer this. One of the most straightforward was as follows, although we could quibble over whether it should have included any splicing variants.
 +
 +zgrep 'synonymous\|missense\|start_gain\|start_lost\|stop_gain\|stop_lost\|3_prime_UTR_variant\|5_prime_UTR_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
 +   52772
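A plain `zgrep ... | wc -l` counts every line that matches anywhere, so a header line that happened to mention one of these terms would be counted too. As a hedged sketch of a stricter variant, the function below counts only data lines whose INFO column (column 8) matches. The names `make_sample` and `count_exonic` and the three sample records are made up for illustration; they are not taken from the real genotype file.

```shell
# make_sample: print a tiny, invented VCF (tab-separated) for illustration.
make_sample() {
    printf '##INFO=<ID=ANN,Description="Effects such as missense_variant">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
    printf '1\t100\trsA\tG\tA\t.\t.\tANN=A|missense_variant|MODERATE|GENE1\tGT\t0/1\n'
    printf '1\t200\trsB\tC\tT\t.\t.\tANN=T|intron_variant|MODIFIER|GENE1\tGT\t0/0\n'
    printf '1\t300\trsC\tT\tC\t.\t.\tANN=C|synonymous_variant|LOW|GENE2\tGT\t1/1\n'
}

# count_exonic: read VCF text on stdin and count data lines (not headers)
# whose INFO column mentions one of the effect terms used in the answer.
count_exonic() {
    awk -F'\t' '
        !/^#/ && $8 ~ /synonymous|missense|start_gain|start_lost|stop_gain|stop_lost|3_prime_UTR_variant|5_prime_UTR_variant/ { n++ }
        END { print n + 0 }'
}

make_sample | count_exonic
```

On the invented sample this prints 2: the header line that mentions missense_variant and the intron variant are both skipped. On the real file the same function could be fed with `gzip -dc FILE.vcf.gz | count_exonic`.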
 +
 +
 +3. One good answer was:
 +"a) This person is likely to be lactose intolerant. One primary variant for lactose intolerance is rs4988235, while another one is rs182549. For both of these, the genotype to be lactose intolerant is C/C. This person had the C/C genotype for both sites.
 +
 +
 +zgrep rs182549 hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz 
 +2 136616754 rs182549 C T . .
 +ANN=T|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000264156
 +|protein_coding|9/16|c.1362+117G>A||||||,T|intron_variant|MODIFIER|MCM6
 +|ENSG00000076003|transcript|ENST00000492091|processed_transcript|2/5
 +|n.181+3423G>A|||||| GT 0/0
 +
 +
 +zgrep rs4988235 hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 +2 136608646 rs4988235 G A . .
 +ANN=A|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000264156|protein_coding
 +|13/16|c.1917+326C>T||||||,A|intron_variant|MODIFIER|MCM6|ENSG00000076003|transcript|ENST00000492091
 +|processed_transcript|3/5|n.343+326C>T||||||,A|intron_variant|MODIFIER|MCM6|ENSG00000076003
 +|transcript|ENST00000483902|retained_intron|1/1|n.544+326C>T|||||| GT 0/0
 +
 +
 +Though it looks like at the second site the genotype is G/G, this is reading from the positive strand. The negative strand [which is used on SNPedia, and is the transcribed strand], would be C/C.
 +
 +b) Geographical distribution for the allele frequency of rs182549
 +
 +The Minor allele frequency is 0 in Africa, and southern Europe and Asia, while the minor allele is more prevalent (even becomes major) in Eastern Europe and Western US. Minor allele is slightly prevalent in northern South America."
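The strand conversion mentioned in answer 3a (G/G on the positive strand reads as C/C on the negative strand) amounts to complementing each base. A minimal sketch, assuming single-base alleles only (longer alleles would also need to be reversed); `complement` is a hypothetical helper name:

```shell
# complement GENOTYPE: complement each base, leaving the "/" separator alone.
# Hypothetical helper; valid for single-base alleles only.
complement() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

complement G/G   # rs4988235 as read from the VCF (positive strand)
```

This prints C/C, the genotype as it would be reported on the negative strand.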
  
  
Line 10: Line 120:
  
  
 +### The file is gzipped, which means we have to use slightly different commands
 ### Let's take a peek at the file
-less -S hu916767_20170324191934.1kgALTAllele.withHeader.snpEff.vcf
+zless -S hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
  
 ### Wow, that's a lot of information compared to our sparse 23andMe file.
Line 17: Line 128:
  
 ### How many variants are there?
-wc -hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf
+zgrep -v '#' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
 
 ### Let's try to find variants that can lead to stop gains
-grep --color 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf
+zgrep --color 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz
 
 ### How many did 23andMe genotype?
-grep 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | wc -l
+zgrep 'stop_gained' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
 
 ### How many missense variants are there?
-grep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | wc -l
+zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | wc -l
 
 ### Are there any missense variants in our alcohol metabolism genes?
-grep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | grep 'ADH1B\|ADH1C\|ALDH2'
+zgrep 'missense_variant' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'ADH1B\|ADH1C\|ALDH2'
 
 ### What about phenylketonuria?
-grep 'PAH' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf | grep 'stop\|missense'
+zgrep 'PAH' hu916767_20170324191934.1kgALTallele.withHeader.snpEff.vcf.gz | grep 'stop\|missense'
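The grep commands above match a gene name or effect anywhere on the line. As a rough sketch of a more targeted approach, the function below splits the ANN annotation into its comma-separated entries and prints the effect (second "|" field) and gene name (fourth "|" field) of each, following the field order visible in the rs671 output. `ann_effects` is a hypothetical helper name, and the sample line is an abbreviated version of the rs671 record, not the full annotation.

```shell
# ann_effects: read VCF text on stdin; for each data line with an ANN
# annotation, print ID, effect, and gene name for every ANN entry.
ann_effects() {
    awk -F'\t' '!/^#/ && $8 ~ /ANN=/ {
        info = $8
        sub(/^.*ANN=/, "", info)       # drop everything up to the ANN payload
        sub(/;.*$/, "", info)          # drop any later INFO keys
        n = split(info, entries, ",")  # one entry per annotated transcript
        for (i = 1; i <= n; i++) {
            split(entries[i], f, "|")
            print $3 "\t" f[2] "\t" f[4]
        }
    }'
}

# Abbreviated rs671 record (only two ANN entries, truncated after the gene ID).
printf '12\t112241766\trs671\tG\tA\t.\t.\tANN=A|missense_variant|MODERATE|ALDH2|ENSG00000111275,A|3_prime_UTR_variant|MODIFIER|ALDH2|ENSG00000111275\tGT\t0/0\n' | ann_effects
```

For the abbreviated record this prints two tab-separated lines, one for the missense_variant entry and one for the 3_prime_UTR_variant entry, both in ALDH2. The output could then be filtered by column with `awk` or `grep` instead of matching the whole line.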
  
  
Line 52: Line 163:
 1. 23andMe format was converted to vcf format.
 
-./bcftools/bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
+bcftools convert --tsv2vcf hu916767_20170324191934.txt -f human_g1k_v37.fasta.gz -s hu916767_20170324191934 -Ob -o hu916767_20170324191934.bcf
 +bcftools view hu916767_20170324191934.bcf -O vcf | bgzip -c > hu916767_20170324191934.vcf.gz
  
  
lab_2.1492542975.txt.gz · Last modified: 2017/04/18 13:16 by scott