Instructions

PSYC 7102 – Statistical Genetics. Final Exam

Due: December 17, 2015 at 5 pm.

There are 5 questions, each worth 8 points, for a total of 40 possible points. The exam is “open book”. Use whatever online sources are helpful to you. All questions contain multiple parts – please read the questions carefully and answer all components.

Complete this exam on your own, without help from others.

For this final exam, conduct all requested analyses on the following files, which contain data for a single individual, HG00096, from the 1000 Genomes Project. There is no need to copy any of these files; use them directly from my directory.

Aligned reads
/Users/scvr9332/final_exam_files/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
/Users/scvr9332/final_exam_files/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
Annotated VCFs
/Users/scvr9332/final_exam_files/chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz
/Users/scvr9332/final_exam_files/chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz.tbi

Question 1

Interpret the phenotypic effects of variant rs16969968.

  1. Tell me the location, alleles, genotype, dosage, and number of reads at this site for HG00096. (1 point)
  2. What is the functional effect of the alternate allele at this site? (1 point)
  3. How does this variant impact HG00096's risk for smoking? Be specific and provide a citation to support your claim. (2 points)
  4. Use samtools to interactively visualize the reads at this site for HG00096 and describe the output for rs16969968. Include a screenshot. You will have to do some googling to understand the output! (4 points)

Here is an example samtools visualization command:

samtools tview -p chr:pos -d C in.bam /Users/scvr9332/reference_data/gotcloud.ref/hs37d5.fa
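
To adapt the example, replace chr:pos with the contig name and position of the site, and in.bam with the aligned-reads file listed in the instructions. The following is a minimal sketch, assuming placeholder coordinates: the values of chrom and pos are hypothetical and are not the answer to part 1. Note that hs37d5 names its contigs without a "chr" prefix (1, 2, ..., X, Y, MT).

```shell
# Hypothetical placeholder coordinates; substitute the values you find for the variant.
chrom="1"
pos="1000000"

# Paths taken from the exam instructions.
bam="/Users/scvr9332/final_exam_files/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam"
ref="/Users/scvr9332/reference_data/gotcloud.ref/hs37d5.fa"

# Assemble the command and print it for inspection before running it.
cmd="samtools tview -p ${chrom}:${pos} -d C ${bam} ${ref}"
echo "$cmd"
```

Once the printed command looks right, run it in a terminal. tview is interactive; pressing ? inside the viewer brings up its help screen and q quits.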

Question 2

This section will draw from the quality control plots of a bad Illumina run here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

  1. Look at the per tile sequence quality plot. Tell me what this type of plot represents, generally, and what this specific plot tells us about the sequencing run. (1 point)
  2. Do the same for the Kmer Content plot and table. (1 point)
  3. Looking at all the plots, what do you think caused the problems in this run? Why? (2 points)
  4. Give me the command to print out lines in a gzipped BAM file that correspond to reads where at least one base has the lowest possible alignment quality score. (Think regex) (4 points)

Question 3

What do genetic ancestry PCA estimates, such as those you generated in this course, represent (3 points)? In your opinion, how is genetic ancestry different from, or related to, race/ethnicity (3 points)? List and discuss two reasons why ancestry is crucial to consider in genetic association studies (2 points).

Question 4

Describe the difference between coding and regulatory variation and summarize their relative roles in complex traits and diseases. (8 points)

Question 5

On a conceptual level, how does imputation work (2 points)? List and discuss three advantages of conducting imputation in genetic association studies (6 points).