In [ ]:
from hail import *
import matplotlib.pyplot as plt
import numpy as np
import seaborn
%matplotlib inline
In [ ]:
hc = HailContext()

Start by importing and writing a larger dataset, then reading the result

In [ ]:
hc.import_vcf('hail-practical/1kg.vcf.bgz').filter_multi().write('hail-practical/1kg.vds', overwrite=True)
vds = hc.read('hail-practical/1kg.vds')

Let's print the schemas: we've seen this before, but should always know what's in our data

In [ ]:
print(vds.variant_schema)
In [ ]:
print(vds.sample_schema)

Let's add QC statistics. Note that although we're going to be running methods called variant_qc and sample_qc, these don't actually filter variants or samples.

In [ ]:
vds = vds.variant_qc().sample_qc().persist()

These methods added quite a few annotations!

In [ ]:
print(vds.variant_schema)
In [ ]:
print(vds.sample_schema)

There's a lot here, and it can certainly be confusing. Our documentation explains what these values are, however: see sample_qc and variant_qc for more information.

Let's use matplotlib to look at some of these values. Both sa.qc.callRate and sa.qc.gqMean may be useful metrics for filtering.

In [ ]:
[call_rate, gq_mean] = vds.query_samples(['samples.map(s => sa.qc.callRate).collect()',
                                          'samples.map(s => sa.qc.dpMean).collect()'])
plt.scatter(gq_mean, call_rate, alpha=0.25)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.title('Sample call rate vs mean DP')
plt.show()
In [ ]:
[call_rate, gq_mean] = vds.query_variants(['variants.map(v => va.qc.callRate).collect()',
                                          'variants.map(v => va.qc.dpMean).collect()'])
plt.scatter(gq_mean, call_rate, alpha=0.2)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.title('Variant call rate vs mean DP')
plt.show()

It's a good idea to get rid of samples and variant with a low call rate. Let's see how a cutoff of 95% affects the included variants and samples.

In [ ]:
original_counts = vds.count()
filtered_vds = (vds.filter_variants_expr('va.qc.callRate > 0.95')
                   .filter_samples_expr('sa.qc.callRate > 0.95'))
new_counts = filtered_vds.count()
print(original_counts)
print(new_counts)

Annotations can be added from files as well.

In [ ]:
%%sh
head hail-practical/sample_annotations.txt | column -t
In [ ]:
filtered_vds = filtered_vds.annotate_samples_table('hail-practical/sample_annotations.txt', 
                                                   sample_expr='Sample',
                                                   root='sa.metadata',
                                                   config=TextTableConfig(impute=True))
print(filtered_vds.sample_schema)
In [ ]:
filtered_vds = filtered_vds.annotate_variants_table('hail-practical/variant_annotations.txt', 
                                                   variant_expr='Variant',
                                                   code='va.consequence = table.Consequence',
                                                   config=TextTableConfig(impute=True))
print(filtered_vds.variant_schema)

Query methods are always good to explore what we've done

In [ ]:
filtered_vds.query_variants('variants.map(v => va.consequence).counter()')
In [ ]:
filtered_vds.query_samples('samples.map(s => sa.metadata.SuperPopulation).counter()')

Exercises

Number 1: count the number of variants that are LOF and have a call rate > 99%

Fill in the <?> with code.

In [ ]:
filtered_vds.query_variants('<?>')

Number 2: What fraction of British samples (sa.metadata.Population == "GBR") have purple hair?

Fill in the <?> with code.

Hint 1: the fraction of samples that are British is samples.fraction(s => sa.metadata.Population == "GBR")

Hint 2: you need to use ".filter"

In [ ]:
filtered_vds.query_samples('<?>')

Number 3: What is the mean caffeine consumption among South Asian samples (sa.metadata.SuperPopulation == "SAS")?

Fill in the <?> with code.

Hint: Remember that .stats() will compute a variety of useful metrics from a numeric Aggregable.

In [ ]:
filtered_vds.query_samples('<?>')
In [ ]: