Practical 5: Understanding GQ and DP in sequence data

The cells of this practical can be entered (by cut and paste) into the IPython console.

Before entering the first cell, make sure you have changed to the directory hail-practical. Skip the first cell if you haven't closed the IPython console since running the last practical.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan
from pprint import pprint
import matplotlib.patches as mpatches

from hail import *

%matplotlib inline

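def qqplot(pvals):
    """Draw a QQ plot of observed against expected p-values on a -log10 scale."""
    # drop missing, zero and NaN p-values, then sort ascending
    spvals = sorted([x for x in pvals if x and not(isnan(x))])
    # expected quantiles of a uniform distribution on (0, 1]
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    # reference line y = x
    plt.plot(np.arange(0, max(max(exp), max(obs))), c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)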

hc = HailContext()

vds = hc.read('1kg.vds').cache()

Puzzle: Investigating the GQ distribution

GQ refresher

GQ is "genotype quality", which is roughly the phred-scaled (-10 * log10) probability that your genotype is called wrong.

GQ 10 means 90% confidence.

GQ 20 means 99% confidence.

GQ 30 means 99.9% confidence.

GQ is truncated at 99, which corresponds to a ~0.0000000001 chance that your call is wrong (if it's calibrated!)

If we plot a histogram of GQ values, what will it look like?
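The GQ-to-confidence conversions in the refresher above can be checked with a few lines of plain Python. This is a minimal sketch that assumes GQ is exactly -10 * log10 of the error probability; the helper names `gq_to_error_prob` and `gq_to_confidence` are made up for illustration and are not part of hail.

```python
# Sketch only: assumes GQ = -10 * log10(probability the call is wrong).
# gq_to_error_prob and gq_to_confidence are hypothetical helpers, not hail functions.

def gq_to_error_prob(gq):
    """Invert the phred scale: error probability implied by a GQ value."""
    return 10 ** (-gq / 10.0)

def gq_to_confidence(gq):
    """Confidence is one minus the error probability."""
    return 1.0 - gq_to_error_prob(gq)

for gq in [10, 20, 30, 99]:
    print('GQ %d: error probability %.3g' % (gq, gq_to_error_prob(gq)))
```

Each extra 10 points of GQ divides the error probability by 10, which is why the cap at 99 sits close to 1 in 10 billion.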

DP refresher

DP is "depth", which is the total number of reads for a given sample at a given variant.

Produce a histogram of GQ and DP values for every genotype in our dataset

We'll use the `hist` function again here. You definitely can't collect all the GQ values for a full dataset: with 1K Genomes, you'd get 200 billion values back (and you probably don't have 1600 gigabytes of RAM).
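The memory figure above is simple arithmetic. The sketch below reproduces it under the assumption that each collected value takes 8 bytes (a 64-bit number); that per-value size is inferred from the two figures quoted, not stated by hail.

```python
# Sketch only: back-of-the-envelope memory estimate for collecting every GQ value.
n_values = 200 * 10**9      # 200 billion genotypes, as quoted above
bytes_per_value = 8         # assumption: one 64-bit number per value

total_bytes = n_values * bytes_per_value
gigabytes = total_bytes / 1e9

print('%.0f GB needed to hold every value' % gigabytes)
```

A histogram with 100 bins, by contrast, only needs 100 counts to come back to the console.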

[gq_hist, dp_hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)', 
                                          'gs.map(g => g.dp).hist(0, 30, 30)'])

plt.xlim(0, 100)
plt.ylim(0, 2500000)
plt.xlabel('GQ')
plt.ylabel('Count')
plt.title('GQ Histogram')
plt.bar(gq_hist.binEdges[:-1], gq_hist.binFrequencies, width=1, label='GQ')
plt.legend()
plt.show()

plt.xlim(0, 30)
plt.ylim(0, 3500000)
plt.xlabel('DP')
plt.ylabel('Count')
plt.title('DP Histogram')
plt.bar(dp_hist.binEdges[:-1], dp_hist.binFrequencies, width=1, label='DP')
plt.legend()
plt.show()

This GQ histogram is pretty strange. We're going to learn why.

By eye, we can identify at least 4 superimposed distributions. We need to pull these apart. Separating heterozygotes from homozygotes is a good place to start.

[het_gq_hist, hom_gq_hist] = vds.query_genotypes(['gs.filter(g => g.isHet).map(g => g.gq).hist(0, 100, 100)', 
                                                  'gs.filter(g => !g.isHet).map(g => g.gq).hist(0, 100, 100)'])

plt.xlim(0, 100)
plt.ylim(0, 2000000)
plt.xlabel('GQ')
plt.ylabel('Count')
plt.title('GQ Histogram')
plt.bar(het_gq_hist.binEdges[:-1], het_gq_hist.binFrequencies, width=1, color='red', label='heterozygotes')
plt.legend()
plt.show()

plt.xlim(0, 100)
plt.ylim(0, 2500000)
plt.xlabel('GQ')
plt.ylabel('Count')
plt.title('GQ Histogram')
plt.bar(hom_gq_hist.binEdges[:-1], hom_gq_hist.binFrequencies, width=1, label='homozygotes')
plt.legend()
plt.show()

Separating the heterozygotes helped a bit, but doesn't address why we see three superimposed histograms with very different magnitudes.

Let's go back to depth to investigate -- remember that depth is a more 'primitive' piece of metadata, and is used to produce GQ.

# argmax is the index of the largest value in a list
from numpy import argmax
gq_mode = int(hom_gq_hist.binEdges[argmax(hom_gq_hist.binFrequencies)])
dp_mode = int(dp_hist.binEdges[argmax(dp_hist.binFrequencies)])

print('GQ mode is %d' % gq_mode)
print('DP mode is %d' % dp_mode)

There are 3 superimposed GQ distributions for homozygotes.

The ratio between the mode GQ and the mode DP is 3.

We can visually assess correlation by looking at a histogram of the ratio between the two.

gq_dp_hist = vds.query_genotypes('gs.filter(g => !g.isHet).map(g => g.gq / g.dp).hist(1, 7, 12)')

plt.xlim(0, 8)
plt.ylim(0, 20000000)
plt.xlabel('GQ/DP')
plt.ylabel('Frequency')
plt.title('GQ / DP Histogram')
plt.bar(gq_dp_hist.binEdges[:-1], gq_dp_hist.binFrequencies, width=0.5, label='homozygote GQ/DP')
plt.legend()
plt.show()

This ratio is extremely consistent! Remember also that DP is inherently quantized due to the process of short-read sequencing.

To learn more about where this ratio comes from, continue to the code below.

Since GQ is roughly the phred-scaled probability that the second-most-likely genotype is actually correct, for homozygotes it reflects the probability that the genotype was actually heterozygous.

One simple model is a binomial: if we have r reads, then the probability we call a homozygote wrong is the probability of flipping r heads in a row from a fair coin (each flip models one read from a true heterozygote, and heads models seeing the reference allele on that read).

from scipy.stats import binom
from numpy import log10

def binomial_p(num_reads):
    return binom(num_reads, 0.5).pmf(0)

# phred scaling is -10 * log10(x)
def phred_scale(x):
    return -10 * log10(x)
                   

xs = range(30)
ys = [phred_scale(binomial_p(x)) for x in xs]
# the observed GQ/DP ratio of 3, drawn as a reference line
observed_ratio = [3 * x for x in xs]

plt.scatter(xs, ys, label='binomial probability of 0 reads')
plt.plot(xs, observed_ratio, color='k', label='slope=3')
plt.xlim(0, 30)
plt.ylim(0, 100)
plt.xlabel('Reads')
plt.ylabel('phred-scaled GQ')
plt.legend()
plt.show()

Why is the ratio 3? One more read (+1 DP) forces us to flip one more coin, multiplying our probability by 0.5. When we phred scale this...

-10 * log10(0.5)
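To see that the per-read increment adds up linearly, here is a small sketch that phred-scales 0.5 ** r for a few read counts and divides by r. It uses only the standard library; the helper name `phred` is chosen here to avoid clashing with `phred_scale` above.

```python
import math

# Sketch only: under the fair-coin model, r reads give an error probability of 0.5 ** r.
def phred(p):
    """Phred-scale a probability: -10 * log10(p)."""
    return -10 * math.log10(p)

per_read = phred(0.5)
print('each extra read adds %.4f to GQ' % per_read)

for reads in [1, 5, 10, 20]:
    gq = phred(0.5 ** reads)
    print('reads=%d  GQ=%.2f  GQ/reads=%.4f' % (reads, gq, gq / reads))
```

Because the logarithm turns the repeated multiplication by 0.5 into repeated addition, the GQ/DP ratio stays the same at every depth under this model.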

Bonus question (hard): Why do heterozygotes have such high GQ values?

GQ is truncated at 99, but you can see the true value by looking at the 2nd-smallest PL.

true_het_gq = vds.query_genotypes('gs.filter(g => g.isHet).map(g => min(g.pl[0], g.pl[2])).hist(0, 400, 100)')
plt.xlim(0, 400)
plt.ylim(0, 500000)
plt.xlabel('True GQ')
plt.ylabel('Count')
plt.title('True GQ Histogram')
plt.bar(true_het_gq.binEdges[:-1], true_het_gq.binFrequencies, width=4, color='red', label='heterozygotes')
plt.legend()
plt.show()
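The second-smallest-PL idea used in the query above can also be tried on a single genotype in plain Python. This sketch assumes PLs are normalized so that the called genotype has PL 0 and that the reported GQ is the second-smallest PL capped at 99; the helper names and the example PL values are made up for illustration.

```python
# Sketch only: hypothetical helpers, not part of hail.
def uncapped_gq(pl):
    """Second-smallest PL; assumes the called genotype has PL 0."""
    return sorted(pl)[1]

def reported_gq(pl, cap=99):
    """GQ as it would be reported once truncated at the cap."""
    return min(uncapped_gq(pl), cap)

# made-up heterozygous call: PLs for hom-ref, het, hom-alt
example_pl = [255, 0, 180]
print('uncapped GQ = %d, reported GQ = %d' % (uncapped_gq(example_pl), reported_gq(example_pl)))
```

For a heterozygote the het PL is the 0, so the second-smallest value is min(pl[0], pl[2]), which is exactly what the hail expression computes.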

Is the binomial model appropriate? What other sources of uncertainty are there?

Exercise 1: calculate the mean depth and mean GQ of the entire dataset.

Fill in the <?> with code.

[mean_gq, mean_dp] = vds.query_genotypes(['<?>', '<?>'])
print("mean GQ is %f, mean DP is %f" % (mean_gq, mean_dp))

Exercise 2: what fraction of genotypes have GQ 20 or above? What fraction of homozygotes? What fraction of heterozygotes?