In [ ]:
from hail import *
from hail.representation import *
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
from pprint import pprint
In [ ]:
hc = HailContext()
In [ ]:
vds = hc.read('hail-practical/1kg.vds').variant_qc().persist()

Plotting variant frequency distribution

From the documentation:

collect(): Array[T]

Return an array with all of the elements in the aggregable. Missing values are removed.

It is not a good idea to run the code below on the full Thousand Genomes dataset. Why? With 80 million variants, collect() would pull 80 million allele frequencies back to a single machine!

In [ ]:
af = vds.query_variants('variants.map(v => va.qc.AF).collect()')
plt.hist(af, bins=100)
plt.show()

The code below will work at any scale: the hist aggregator computes the histogram out on the cluster, so only the bin edges and counts come back. Given the choice, always use a built-in Hail aggregator function!

In [ ]:
# This will always work!
af_hist = vds.query_variants('variants.map(v => va.qc.AF).hist(0, 1, 100)')
plt.bar(af_hist.binEdges[:-1], af_hist.binFrequencies, width=0.01)
plt.xlim(0, 1)
plt.show()

Mutation rate refresher

Not all base pairs are created equal. A "CG" motif (see Wikipedia on CpG sites) has a high probability of mutating to "TG": the methylated C deaminates to a T.

On the opposite strand, the same event appears as a "G" mutating to an "A": the complementary "CG" becomes "CA".

We expect to see many C/T and G/A SNPs in our dataset. Do we?

In [ ]:
alleles = vds.query_variants('variants.map(v => v.altAllele).counter()')
from collections import Counter
pprint(Counter(alleles).most_common())

Counts from the full Thousand Genomes dataset:

Ref/Alt | Count
------- | --------
G/A     | 15689986
C/T     | 15685250
A/G     | 10779869
T/C     | 10736805
G/T     |  3735538
C/A     |  3727525
C/G     |  3436122
G/C     |  3425025
T/G     |  2861302
A/C     |  2860428
A/T     |  2715606
T/A     |  2713458
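
The top four rows are transitions (A↔G and C↔T), and they dominate, consistent with the CpG story above. As a quick check, here is a small sketch (not part of the original practical) that computes the transition/transversion (Ti/Tv) ratio from the hardcoded counts in the table; whole-genome Ti/Tv is typically quoted at around 2.

In [ ]:
# Sketch: Ti/Tv ratio from the full-dataset counts in the table above.
counts = {
    ('G', 'A'): 15689986, ('C', 'T'): 15685250,
    ('A', 'G'): 10779869, ('T', 'C'): 10736805,
    ('G', 'T'): 3735538,  ('C', 'A'): 3727525,
    ('C', 'G'): 3436122,  ('G', 'C'): 3425025,
    ('T', 'G'): 2861302,  ('A', 'C'): 2860428,
    ('A', 'T'): 2715606,  ('T', 'A'): 2713458,
}
transitions = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}
ti = sum(n for k, n in counts.items() if k in transitions)
tv = sum(n for k, n in counts.items() if k not in transitions)
print('Ti/Tv = %.2f' % (float(ti) / tv))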
In [ ]:
contigs = vds.query_variants('variants.map(v => v.contig).counter()')

pprint(Counter(contigs).most_common())

Puzzle: Investigating the GQ distribution

GQ refresher

GQ is "genotype quality", which is roughly the log-scaled probability that your genotype is called wrong.

GQ 10 means 90% confidence.

GQ 20 means 99% confidence.

GQ 30 means 99.9% confidence.

GQ is truncated at 99, which corresponds to a ~0.0000000001 chance that your call is wrong (if it's calibrated!).

If we plot a histogram of GQ values, what will it look like?
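
Before we find out, here is a minimal sketch (not part of the original practical) that inverts the phred scaling to reproduce the confidence numbers above: P(wrong) = 10^(-GQ/10).

In [ ]:
# Sketch: convert a phred-scaled GQ to the implied probability
# that the genotype call is wrong.
def gq_to_error_prob(gq):
    return 10 ** (-gq / 10.0)

for gq in [10, 20, 30, 99]:
    print('GQ %2d => P(wrong) ~ %.2g' % (gq, gq_to_error_prob(gq)))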

DP refresher

DP is "depth", which is the total number of reads for a given sample at a given variant.

Produce a histogram of GQ and DP values for every genotype in our dataset

We'll use the hist function again here. You definitely can't collect all the GQ values for a full dataset: with 1K Genomes, you'd get 200 billion values back!

In [ ]:
[gq_hist, dp_hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)', 
                                          'gs.map(g => g.dp).hist(0, 30, 30)'])
In [ ]:
plt.xlim(0, 100)
plt.ylim(0, 2500000)
plt.xlabel('GQ')
plt.ylabel('Count')
plt.title('GQ Histogram')
plt.bar(gq_hist.binEdges[:-1], gq_hist.binFrequencies, width=1, label='GQ')
plt.legend()
plt.show()
In [ ]:
plt.xlim(0, 30)
plt.ylim(0, 3500000)
plt.xlabel('DP')
plt.ylabel('Count')
plt.title('DP Histogram')
plt.bar(dp_hist.binEdges[:-1], dp_hist.binFrequencies, width=1, label='DP')
plt.legend()
plt.show()

This GQ histogram is pretty strange. We're going to learn why.

By eye, we can identify at least 4 superimposed distributions. We need to pull these apart. Separating heterozygotes from homozygotes is a good place to start.

In [ ]:
[het_gq_hist, hom_gq_hist] = vds.query_genotypes(['gs.filter(g => g.isHet).map(g => g.gq).hist(0, 100, 100)',
                                                  'gs.filter(g => !g.isHet).map(g => g.gq).hist(0, 100, 100)'])
In [ ]:
plt.xlim(0, 100)
plt.ylim(0, 2000000)
plt.xlabel('GQ')
plt.ylabel('Count')
plt.title('GQ Histogram')
plt.bar(het_gq_hist.binEdges[:-1], het_gq_hist.binFrequencies, width=1, color='red', label='heterozygotes')
plt.legend()
plt.show()
In [ ]:
plt.xlim(0, 100)
plt.ylim(0, 2500000)
plt.xlabel('GQ')
plt.ylabel('Count')
plt.title('GQ Histogram')
plt.bar(hom_gq_hist.binEdges[:-1], hom_gq_hist.binFrequencies, width=1, label='homozygotes')
plt.legend()
plt.show()

Separating the heterozygotes helped a bit, but it doesn't explain why adjacent GQ values have very different frequencies.

Let's go back to depth to investigate: remember that depth is a more 'primitive' piece of metadata, and it is used to compute GQ.

In [ ]:
# argmax is the index of the largest value in a list
from numpy import argmax
gq_mode = int(hom_gq_hist.binEdges[argmax(hom_gq_hist.binFrequencies)])
dp_mode = int(dp_hist.binEdges[argmax(dp_hist.binFrequencies)])

print('GQ mode is %d' % gq_mode)
print('DP mode is %d' % dp_mode)

There are 3 superimposed GQ distributions for homozygotes.

The ratio between the modal GQ and the modal DP is 3.

We can visually assess how tight this relationship is by plotting a histogram of the per-genotype GQ/DP ratio.

In [ ]:
gq_dp_hist = vds.query_genotypes('gs.filter(g => !g.isHet).map(g => g.gq / g.dp).hist(1, 7, 12)')

plt.xlim(0, 8)
plt.ylim(0, 20000000)
plt.xlabel('GQ/DP')
plt.ylabel('Frequency')
plt.title('GQ / DP Histogram')
plt.bar(gq_dp_hist.binEdges[:-1], gq_dp_hist.binFrequencies, width=0.5, label='homozygote GQ/DP')
plt.legend()
plt.show()

This ratio is extremely consistent! Remember also that DP is inherently quantized due to the process of short-read sequencing.

To learn more about where this ratio comes from, continue to the code below!

Since GQ is roughly the phred-scaled probability that the second-most-likely genotype is actually correct, for homozygotes it's the probability that the genotype was actually heterozygous.

One simple model is a binomial: if we have r reads, then the probability we call a homozygote wrong is the probability of flipping r heads in a row with a fair coin (this is how we model a true heterozygote producing a reference read at every one of the r reads).

In [ ]:
from scipy.stats import binom
from numpy import log10

def binomial_p(num_reads):
    # Probability that a true het shows zero alternate reads out of
    # num_reads reads: (1/2) ** num_reads, i.e. num_reads heads in a row.
    return binom(num_reads, 0.5).pmf(0)

# phred scaling is -10 * log10(x)
def phred_scale(x):
    return -10 * log10(x)

xs = range(30)
ys = [phred_scale(binomial_p(x)) for x in xs]
observed_ratio = [3 * x for x in xs]

plt.scatter(xs, ys, label='binomial probability of 0 reads')
plt.plot(xs, observed_ratio, color='k', label='slope=3')
plt.xlim(0, 30)
plt.ylim(0, 100)
plt.xlabel('Reads')
plt.ylabel('phred-scaled GQ')
plt.legend()
plt.show()

Bonus question (hard): Why do heterozygotes have such high GQ values?

GQ is truncated at 99, but you can see the true value by looking at the 2nd-smallest PL.

In [ ]:
true_het_gq = vds.query_genotypes('gs.filter(g => g.isHet).map(g => min(g.pl[0], g.pl[2])).hist(0, 400, 100)')
plt.xlim(0, 400)
plt.ylim(0, 500000)
plt.xlabel('True GQ')
plt.ylabel('Count')
plt.title('True GQ Histogram')
plt.bar(true_het_gq.binEdges[:-1], true_het_gq.binFrequencies, width=4, color='red', label='heterozygotes')
plt.legend()
plt.show()

Is the binomial model appropriate? What other sources of uncertainty are there?

Exercise 1: calculate the mean depth and mean GQ of the entire dataset.

Fill in the <?> with code.

In [ ]:
[mean_gq, mean_dp] = vds.query_genotypes(['<?>', '<?>'])
print("mean GQ is %f, mean DP is %f" % (mean_gq, mean_dp))

Exercise 2: what fraction of genotypes have GQ 20 or above? What fraction of homozygotes? What fraction of heterozygotes?

Fill in the <?> with code.

In [ ]:
[gq20, hom_gq20, het_gq20] = vds.query_genotypes(['<?>', '<?>', '<?>'])
print("Total fraction is %f, among homozygotes %f, and heterozygotes %f" % (gq20, hom_gq20, het_gq20))