====== Instructions ======
PSYC 7102 -- Statistical Genetics. Final Exam
**Due:** December 17, 2015 @ 5pm.
There are 5 questions, each worth 8 points, for a total of 40 possible points. The exam is "open book". Use whatever online sources are helpful to you. All questions contain multiple parts -- please read the questions carefully and answer all components.
Complete this exam on your own, without help from others.
For this final exam, please conduct all requested analyses on the following files, which contain data for a single individual (HG00096) from the 1000 Genomes Project. There's no need to copy any of these files; use them directly from my directory.
Aligned reads
/Users/scvr9332/final_exam_files/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
/Users/scvr9332/final_exam_files/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
Annotated VCFs
/Users/scvr9332/final_exam_files/chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz
/Users/scvr9332/final_exam_files/chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz.tbi
====== Question 1 ======
Interpret the phenotypic effects of variant rs16969968.
- Tell me the location, alleles, genotype, dosage, and number of reads at this site for HG00096 (1 point)
- What is the functional effect of the alternate allele at this site? (1 point)
- How does this variant impact HG00096's risk for smoking? Be specific and provide a citation to support your claim. (2 points)
- Use samtools to interactively visualize the reads at this site for HG00096 and describe the output for rs16969968. Include a screenshot. You will have to do some googling to understand the output! (4 points)
Here is an example samtools visualization command:
samtools tview -p chr:pos -d C in.bam /Users/scvr9332/reference_data/gotcloud.ref/hs37d5.fa
====== Question 2 ======
This section will draw from the quality control plots of a bad Illumina run here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
- Look at the per tile sequence quality plot. Tell me what this type of plot represents, generally, and what this specific plot tells us about the sequencing run. (1 point)
- Do the same for the Kmer Content plot and table. (1 point)
- Looking at all the plots, what do you think caused the problems in this run? Why? (2 points)
- Give me the command to print out lines in a gzipped BAM file that correspond to reads where at least one base has the lowest possible alignment quality score. (Think regex) (4 points)
====== Question 3 ======
- What do genetic ancestry PCA estimates, such as those you generated in this course, represent? (3 points)
- In your opinion how is genetic ancestry different from/related to race/ethnicity? (3 points)
- List and discuss two reasons why ancestry is crucial to consider in genetic association studies. (2 points)
====== Question 4 ======
Describe the difference between coding and regulatory variation and summarize their relative roles in complex traits and diseases. (8 points)
====== Question 5 ======
- On a conceptual level, how does imputation work? (2 points)
- List and discuss three advantages of conducting imputation in genetic association studies. (6 points)