Practical 3: Annotation, query and plotting

The cells of this practical can be entered (by cut and paste) into the IPython console.

Before entering the first cell, make sure you have changed to the hail-practical directory. Skip the first cell if you haven't closed the IPython console since running the last practical.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan
from pprint import pprint
import matplotlib.patches as mpatches

from hail import *

%matplotlib inline

def qqplot(pvals):
    # drop missing values and NaNs, then sort the observed p-values
    spvals = sorted([x for x in pvals if x and not isnan(x)])
    # under the null, the i-th smallest of n p-values is expected to be about i / n
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    # y = x reference line
    plt.plot(np.arange(0, max(max(exp), max(obs))), c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)
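As a quick self-contained check of the expected quantiles that qqplot computes (plain Python, no Hail needed): under the uniform null, the i-th smallest of n p-values is expected to be about i / n, so the expected -log10 values are -log10(i / n). Toy p-values only, chosen to show the missing-value filtering.

```python
from math import log, isnan

# Toy p-values, including a missing value and a NaN that qqplot would drop.
pvals = [0.8, 0.02, None, 0.5, float('nan'), 0.1]
spvals = sorted(x for x in pvals if x and not isnan(x))
n = len(spvals)

# Expected -log10 quantiles under the uniform null: -log10(i / n), i = 1..n.
exp = [-log(float(i) / n, 10) for i in range(1, n + 1)]
obs = [-log(p, 10) for p in spvals]
```

The last expected value is -log10(n / n) = 0, which is why the expected axis always starts at the origin.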

hc = HailContext()

Start by importing and writing a larger dataset, then reading the result.

# dataset already imported to VDS in hail-practical
# hc.import_vcf('1kg.vcf.bgz').filter_multi().variant_qc().sample_qc().write('1kg.vds', overwrite=True)

vds = hc.read('1kg.vds')

Let's print the schemas: we've seen this before, but we should always know what's in our data.

Let's add QC statistics. Note that although we're going to run a method called variant_qc, it doesn't actually filter any variants from our dataset.

vds = vds.variant_qc()
vds = vds.sample_qc()
vds.persist() # We'll use these results several times, so checkpoint our computations

These methods added quite a few annotations!

print(vds.variant_schema)

There's a lot here, and it can certainly be confusing. However, our documentation explains what these values are: see sample_qc and variant_qc for more information.

Let's use matplotlib to look at some of these values. Both sa.qc.callRate and sa.qc.dpMean may be useful metrics for filtering.

[call_rate, dp_mean] = vds.query_samples(['samples.map(s => sa.qc.callRate).collect()',
                                          'samples.map(s => sa.qc.dpMean).collect()'])
plt.scatter(dp_mean, call_rate, alpha=0.25)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.title('Sample call rate vs mean DP')
plt.show()

[call_rate, dp_mean] = vds.query_variants(['variants.map(v => va.qc.callRate).collect()',
                                           'variants.map(v => va.qc.dpMean).collect()'])
plt.scatter(dp_mean, call_rate, alpha=0.2)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.title('Variant call rate vs mean DP')
plt.show()

It's a good idea to remove samples and variants with a low call rate. Let's see how a cutoff of 95% affects the number of included variants and samples.

original_counts = vds.count()
filtered_vds = (vds.filter_variants_expr('va.qc.callRate > 0.95')
                   .filter_samples_expr('sa.qc.callRate > 0.95'))
new_counts = filtered_vds.count()
print(original_counts)
print(new_counts)
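To make the callRate definition concrete, here is a small NumPy sketch (not Hail's implementation; the -1 missing-genotype encoding and the 50% cutoff are just for illustration): call rate is the fraction of non-missing calls, per variant or per sample.

```python
import numpy as np

# Toy 4-variant x 3-sample genotype matrix; -1 marks a missing call.
gt = np.array([[ 0,  1, -1],
               [ 2,  1,  0],
               [-1, -1,  0],
               [ 1,  0,  0]])
called = gt >= 0

variant_call_rate = called.mean(axis=1)  # one value per variant (row)
sample_call_rate = called.mean(axis=0)   # one value per sample (column)

# Filtering at a 50% cutoff keeps variants 0, 1 and 3.
kept_variants = np.where(variant_call_rate > 0.5)[0]
```

Note that filtering variants changes every sample's call rate and vice versa, which is why the order of the two filters matters in principle.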

Annotations can be added from files as well.

The next cell uses something called an "IPython magic". These magics begin with one or two percent signs and can do various helpful things. The %%sh magic runs the contents of the cell as Unix shell commands.

%%sh
head sample_annotations.txt | column -t

%%sh
head variant_annotations.txt | column -t

filtered_vds = filtered_vds.annotate_samples_table('sample_annotations.txt',
                                                   sample_expr='Sample',
                                                   root='sa.metadata',
                                                   config=TextTableConfig(impute=True))
filtered_vds = filtered_vds.annotate_variants_table('variant_annotations.txt',
                                                    variant_expr='Variant',
                                                    code='va.consequence = table.Consequence',
                                                    config=TextTableConfig(impute=True))
print('Variant schema:')
print(filtered_vds.variant_schema)
print('Sample schema:')
print(filtered_vds.sample_schema)
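Conceptually, annotating from a keyed text file is a left join against the dataset's samples (or variants). A rough pandas analogy, with toy sample IDs and values rather than the practical's actual files:

```python
import pandas as pd

# Per-sample records keyed by sample ID.
samples = pd.DataFrame({'Sample': ['s1', 's2', 's3']})

# Annotation table, as if read from a file (toy values; 's2' has no row).
meta = pd.DataFrame({'Sample': ['s1', 's3'],
                     'SuperPopulation': ['EUR', 'EAS']})

# Left join: samples with no matching annotation get a missing value.
annotated = samples.merge(meta, on='Sample', how='left')
```

The left-join semantics matter: samples absent from the annotation file are kept, with missing annotation values, rather than dropped.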

Query methods are a good way to explore datasets.

pprint(filtered_vds.query_variants('variants.map(v => va.consequence).counter()'))
pprint(filtered_vds.query_samples('samples.map(s => sa.metadata.SuperPopulation).counter()'))
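The counter() aggregator behaves like Python's collections.Counter: it maps each distinct value to its number of occurrences. A toy sketch with made-up consequence labels:

```python
from collections import Counter

# Counting distinct values, as counter() does over variants or samples.
consequences = ['missense', 'synonymous', 'missense', 'intronic', 'missense']
counts = Counter(consequences)
```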

Mutation rate refresher

Not all base pairs are created equal. A "CG" motif (a CpG site; see Wikipedia) has a high probability of mutating to "TG".

On the opposite strand, a "GC" motif is likely to mutate to "AC".

We expect to see many C/T and G/A SNPs in our dataset. Do we?

alleles = vds.query_variants('variants.map(v => v.altAllele).counter()')
from collections import Counter
pprint(Counter(alleles).most_common())

Counts from the full 1000 Genomes dataset:

G/A    | 15689986
C/T    | 15685250
A/G    | 10779869
T/C    | 10736805
G/T    | 3735538
C/A    | 3727525
C/G    | 3436122
G/C    | 3425025
T/G    | 2861302
A/C    | 2860428
A/T    | 2715606
T/A    | 2713458
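The pairing of counts in the table (G/A with C/T, A/G with T/C, and so on) is strand symmetry: the same mutation read on the opposite strand is its complement. A small sketch of that mapping (hypothetical helper, plain Python):

```python
# Watson-Crick base-pairing complements.
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def complement_change(ref, alt):
    """The same ref/alt change as seen on the opposite strand."""
    return COMPLEMENT[ref], COMPLEMENT[alt]
```

So a C/T SNP on one strand is a G/A SNP on the other, and the two rows of the table should (and do) have nearly identical counts.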

Plot variant frequency distribution

af = vds.query_variants('variants.map(v => va.qc.AF).collect()')
plt.hist(af, bins=250)
plt.xlabel('Alternate allele frequency')
plt.ylabel('Count')
plt.show()

The code below works at any scale, because the aggregation happens inside Hail rather than collecting every value into Python. Given the choice, always use a built-in Hail aggregator function!

# This will always work!
af_hist = vds.query_variants('variants.map(v => va.qc.AF).hist(0, 1, 250)')
plt.bar(af_hist.binEdges[:-1], af_hist.binFrequencies, width=0.004)
plt.xlim(0, 1)
plt.xlabel('Alternate allele frequency')
plt.ylabel('Count')
plt.show()
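Why binEdges[:-1]? A histogram with n bins has n + 1 edges, so dropping the last edge leaves exactly one left edge per bar. The same convention can be seen with numpy.histogram:

```python
import numpy as np

# 4 bins over [0, 1] produce 5 edges but only 4 frequencies.
freqs, edges = np.histogram([0.1, 0.2, 0.2, 0.9], bins=4, range=(0, 1))
left_edges = edges[:-1]  # one left edge per bar
```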

Warning! Don't exit the IPython console. You will need filtered_vds for the next practical.