from hail import *
import matplotlib.pyplot as plt
import numpy as np
import seaborn
%matplotlib inline

hc = HailContext()

Start by importing and writing a larger dataset, then reading the result¶

hc.import_vcf('hail-practical/1kg.vcf.bgz').filter_multi().write('hail-practical/1kg.vds', overwrite=True)
vds = hc.read('hail-practical/1kg.vds')

Let's print the schemas: we've seen this before, but should always know what's in our data¶

print(vds.variant_schema)

print(vds.sample_schema)

Let's add QC statistics. Note that although we're going to be running methods called `variant_qc` and `sample_qc`, these don't actually filter variants or samples.¶

vds = vds.variant_qc().sample_qc().persist()

These methods added quite a few annotations!¶

print(vds.variant_schema)

print(vds.sample_schema)

There's a lot here, and it can certainly be confusing. Our documentation explains what these values are, however: see sample_qc and variant_qc for more information.¶

Let's use `matplotlib` to look at some of these values. Both `sa.qc.callRate` and `sa.qc.gqMean` may be useful metrics for filtering.¶

[call_rate, gq_mean] = vds.query_samples(['samples.map(s => sa.qc.callRate).collect()',
                                          'samples.map(s => sa.qc.dpMean).collect()'])
plt.scatter(gq_mean, call_rate, alpha=0.25)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.title('Sample call rate vs mean DP')
plt.show()

[call_rate, gq_mean] = vds.query_variants(['variants.map(v => va.qc.callRate).collect()',
                                          'variants.map(v => va.qc.dpMean).collect()'])
plt.scatter(gq_mean, call_rate, alpha=0.2)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.title('Variant call rate vs mean DP')
plt.show()

It's a good idea to get rid of samples and variant with a low call rate. Let's see how a cutoff of 95% affects the included variants and samples.¶

original_counts = vds.count()
filtered_vds = (vds.filter_variants_expr('va.qc.callRate > 0.95')
                   .filter_samples_expr('sa.qc.callRate > 0.95'))
new_counts = filtered_vds.count()
print(original_counts)
print(new_counts)

Annotations can be added from files as well.¶

%%sh
head hail-practical/sample_annotations.txt | column -t

filtered_vds = filtered_vds.annotate_samples_table('hail-practical/sample_annotations.txt', 
                                                   sample_expr='Sample',
                                                   root='sa.metadata',
                                                   config=TextTableConfig(impute=True))
print(filtered_vds.sample_schema)

filtered_vds = filtered_vds.annotate_variants_table('hail-practical/variant_annotations.txt', 
                                                   variant_expr='Variant',
                                                   code='va.consequence = table.Consequence',
                                                   config=TextTableConfig(impute=True))
print(filtered_vds.variant_schema)

Query methods are always good to explore what we've done¶

filtered_vds.query_variants('variants.map(v => va.consequence).counter()')

filtered_vds.query_samples('samples.map(s => sa.metadata.SuperPopulation).counter()')

Exercises¶

Number 1: count the number of variants that are LOF and have a call rate > 99%¶

Fill in the <?> with code.

filtered_vds.query_variants('<?>')

Number 2: What fraction of British samples (`sa.metadata.Population == "GBR"`) have purple hair?¶

Fill in the <?> with code.

Hint 1: the fraction of samples that are British is samples.fraction(s => sa.metadata.Population == "GBR")

Hint 2: you need to use ".filter"

filtered_vds.query_samples('<?>')

Number 3: What is the mean caffeine consumption among South Asian samples (`sa.metadata.SuperPopulation == "SAS"`)?¶

Fill in the <?> with code.

Hint: Remember that .stats() will compute a variety of useful metrics from a numeric Aggregable.

filtered_vds.query_samples('<?>')

Start by importing and writing a larger dataset, then reading the result¶

Let's print the schemas: we've seen this before, but should always know what's in our data¶

Let's add QC statistics. Note that although we're going to be running methods called variant_qc and sample_qc, these don't actually filter variants or samples.¶

These methods added quite a few annotations!¶

There's a lot here, and it can certainly be confusing. Our documentation explains what these values are, however: see sample_qc and variant_qc for more information.¶

Let's use matplotlib to look at some of these values. Both sa.qc.callRate and sa.qc.gqMean may be useful metrics for filtering.¶

It's a good idea to get rid of samples and variant with a low call rate. Let's see how a cutoff of 95% affects the included variants and samples.¶

Annotations can be added from files as well.¶

Query methods are always good to explore what we've done¶

Exercises¶

Number 1: count the number of variants that are LOF and have a call rate > 99%¶

Number 2: What fraction of British samples (sa.metadata.Population == "GBR") have purple hair?¶

Number 3: What is the mean caffeine consumption among South Asian samples (sa.metadata.SuperPopulation == "SAS")?¶

Let's add QC statistics. Note that although we're going to be running methods called `variant_qc` and `sample_qc`, these don't actually filter variants or samples.¶

Let's use `matplotlib` to look at some of these values. Both `sa.qc.callRate` and `sa.qc.gqMean` may be useful metrics for filtering.¶

Number 2: What fraction of British samples (`sa.metadata.Population == "GBR"`) have purple hair?¶

Number 3: What is the mean caffeine consumption among South Asian samples (`sa.metadata.SuperPopulation == "SAS"`)?¶