In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan

%matplotlib inline
In [ ]:
def qqplot(pvals):
    spvals = sorted([x for x in pvals if x and not(isnan(x))])
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    plt.plot(np.arange(0, max(max(exp), max(obs))), c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)
In [ ]:
import hail

hc = hail.HailContext()

Aggregables

Terrible name, we know.

In the practical on the Hail expression language, you learned about arrays, filter, map, and aggregating operations like sum, max, etc.

Hail is designed to work with very large datasets: datasets with 100s of millions (billions?) of variants and 10s or 100s of trillions of genotypes, more data than one could hope to store on a single computer. Therefore, Hail stores and computes on data in a distributed fashion. But we still want a simple programming model that allows us to query and transform such distributed data. That's where Aggregables come in.

An Aggregable[T] is a distributed collection of elements of type T. The interface is modeled on Array[T], but aggregables can be arbitrarily large and they are unordered, so they don't support operations like indexing.

Aggregables support map and filter. Like sum, max, etc. on arrays, aggregables support operations which we call "aggregators" that operate on the entire aggregable collection and produce a summary or derived statistic. See the documentation for a complete list of aggregators.
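
To make the interface concrete, here is a toy, in-memory model of an aggregable in plain Python. This is just a sketch of the semantics, not Hail's implementation: a real Aggregable is distributed and lazy, and the class and element values below are invented for illustration.

```python
class ToyAggregable:
    """An in-memory sketch of the Aggregable interface (not Hail's implementation)."""

    def __init__(self, elements):
        self._elements = list(elements)

    # map and filter transform the elements, producing a new collection
    def map(self, f):
        return ToyAggregable(f(x) for x in self._elements)

    def filter(self, pred):
        return ToyAggregable(x for x in self._elements if pred(x))

    # "aggregators" consume the whole collection and produce a summary
    def sum(self):
        return sum(self._elements)

    def count(self):
        return len(self._elements)

# toy non-ref allele counts for five genotypes
gs = ToyAggregable([0, 1, 2, 1, 0])
print(gs.filter(lambda n: n > 0).count())  # 3
print(gs.map(lambda n: n * 2).sum())       # 8
```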

Aggregables are available in expressions on various methods on VariantDataset. For example, query_samples has an aggregable samples: Aggregable[String], query_variants has an aggregable variants: Aggregable[Variant] and query_genotypes has an aggregable gs: Aggregable[Genotype] which is the collection of all the genotypes in the dataset.

Let's load a small VCF to play with.

In [ ]:
vds = hc.import_vcf('hail-practical/sample.vcf')
print(vds.variant_schema)

There are several aggregators for obtaining elements of an aggregable. .take(n) returns n elements of an aggregable sampled randomly. takeBy(x => <expr>, n) returns the n elements with the highest value of <expr>; <expr> must be a type that can be compared: numeric or String. Let's grab a few variants and genotypes.
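
Outside Hail, the semantics of takeBy can be sketched with heapq.nlargest keyed by the expression. The genotype dicts and gq values below are made up for illustration:

```python
import heapq

# toy genotypes with a gq field (invented values)
genotypes = [{'gq': 30}, {'gq': 99}, {'gq': 10}, {'gq': 60},
             {'gq': 85}, {'gq': 99}, {'gq': 5}]

# analogous to gs.takeBy(g => g.gq, 5): the 5 elements with the highest gq
top5 = heapq.nlargest(5, genotypes, key=lambda g: g['gq'])
print([g['gq'] for g in top5])  # [99, 99, 85, 60, 30]
```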

In [ ]:
# five random variants
vds.query_variants('variants.take(5)')
In [ ]:
# return five genotypes with the largest gq
vds.query_genotypes('gs.takeBy(g => g.gq, 5)')

collect() returns an array of all the elements of an aggregable.

Warning: This is very dangerous for large data! If you collect 100 trillion genotypes, it will fail, probably with an out of memory error (or something more obscure).

In these practicals, we routinely use collect to return values for plotting. Here's an example using collect to return the allele number (number of called alternate alleles) per variant.

Note: We're doing two aggregations here. In annotate_variants_expr, gs: Aggregable[Genotype] is the collection of all genotypes for the given variant (ranging over sample), and variants: Aggregable[Variant] in query_variants is the collection of all variants in the dataset.

g.nNonRefAlleles returns the number of called non-reference alleles (0/0 = 0, 0/1 = 1, 1/1 = 2). Note that we're using map to transform aggregables.
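
In miniature, the per-variant sum above works like this sketch, where the call strings and the helper n_non_ref_alleles are invented stand-ins for Hail's genotypes and g.nNonRefAlleles (and only handle biallelic calls):

```python
# toy diploid calls for one variant; None stands in for a missing genotype
calls = ['0/0', '0/1', '1/1', '0/1', None]

def n_non_ref_alleles(call):
    """Count alternate alleles in a biallelic diploid call, like g.nNonRefAlleles."""
    if call is None:
        return None
    return call.count('1')

# analogous to gs.map(g => g.nNonRefAlleles).sum(): missing values are skipped
an = sum(n for n in map(n_non_ref_alleles, calls) if n is not None)
print(an)  # 4  (0 + 1 + 2 + 1)
```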

In [ ]:
ANs = (vds.annotate_variants_expr('va.AN = gs.map(g => g.nNonRefAlleles).sum()')
       .query_variants('variants.map(v => va.AN).collect()'))
ANs[:5] # print the first 5

You might have noticed there is some tricky magic here! In variants.map(v => va.AN), we're mapping over variants but we're accessing the variant annotations va. How is that possible?

Aggregables differ from arrays in one other important respect: in addition to the elements of the aggregable, each element has an associated context. map and filter transform the elements of the collection, but do not change the context. The documentation for each method that provides an aggregable describes that aggregable's context. For example, see the documentation for query_genotypes_typed. In query_genotypes, gs: Aggregable[Genotype] is the collection of all genotypes in the dataset. Each genotype is associated with a variant v and a sample s, so those are in the context of gs. In addition, each variant carries its variant annotations va and each sample s its sample annotations sa, so those are in the context, too. Finally, the genotype g itself is in the context, so if you map away the genotype, it is still accessible as g! Here's an example that takes 8 DP fields from het genotypes.
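
The context idea can be sketched in plain Python by carrying a context dict alongside each element: map replaces the element but leaves the context untouched. Everything here (the class, the 'het' flag, the dp values) is invented for illustration:

```python
class ContextAggregable:
    """Sketch of an aggregable whose elements carry a context (not Hail's implementation)."""

    def __init__(self, pairs):
        # pairs of (element, context); context is a dict like {'g': ...}
        self._pairs = list(pairs)

    def map(self, f):
        # map transforms the element but the context rides along unchanged
        return ContextAggregable((f(x, ctx), ctx) for x, ctx in self._pairs)

    def filter(self, pred):
        return ContextAggregable((x, ctx) for x, ctx in self._pairs
                                 if pred(x, ctx))

    def take(self, n):
        return [x for x, _ in self._pairs[:n]]

# toy genotypes with a depth and a het flag
genotypes = [{'dp': 10, 'het': True}, {'dp': 25, 'het': False},
             {'dp': 7, 'het': True}]
gs = ContextAggregable((g, {'g': g}) for g in genotypes)

# analogous to gs.map(g => g.dp).filter(dp => g.isHet()).take(8):
# after mapping to dp, the original genotype is still reachable via the context
result = (gs.map(lambda g, ctx: g['dp'])
            .filter(lambda dp, ctx: ctx['g']['het'])
            .take(8))
print(result)  # [10, 7]
```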

In [ ]:
# you can still access g since it is in the context even though we've mapped the gs aggregable to the genotype depth
vds.query_genotypes('gs.map(g => g.dp).filter(dp => g.isHet()).take(8)')

Hard Exercise

Compute, for each variant, the top five non-reference genotypes by depth, and then collect a few variants to show the output.

A few more useful aggregators

See the documentation for the full list.

You can use count to count the number of elements of an aggregable.

In [ ]:
# get number of variants
vds.query_variants('variants.count()')

You can compute a collection of simple statistics with stats.
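
The fields that stats() reports can be sketched in plain Python. This is an assumption-laden analogy, not Hail's code: the field names mirror Hail's output, and it assumes a population (not sample) standard deviation; the input values are invented.

```python
import math

def toy_stats(xs):
    """Summary statistics in the style of Hail's stats() aggregator (a sketch)."""
    n = len(xs)
    mean = sum(xs) / float(n)
    # assuming population standard deviation (divide by n)
    stdev = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return {'mean': mean, 'stdev': stdev, 'min': min(xs), 'max': max(xs),
            'nNotMissing': n, 'sum': sum(xs)}

print(toy_stats([10, 20, 30, 40]))
# mean 25.0, stdev sqrt(125), min 10, max 40, nNotMissing 4, sum 100
```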

In [ ]:
# depth statistics for all genotypes
vds.query_genotypes('gs.map(g => g.dp).stats()')

You can use hist(min, max, bins) (see the documentation) to compute a histogram of a numeric aggregable. Let's use hist to plot the histogram of depth (DP) over all genotypes.
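
The binning that hist(0, 450, 45) performs can be sketched with numpy.histogram: 45 equal-width bins on [0, 450], returning frequencies and bin edges. The depth values below are invented:

```python
import numpy as np

# toy depth values (invented)
dps = [3, 12, 12, 47, 101, 250, 449]

# analogous to gs.map(g => g.dp).hist(0, 450, 45)
frequencies, edges = np.histogram(dps, bins=45, range=(0, 450))
print(edges[:3])        # first edges: 0, 10, 20 (bin width 10)
print(frequencies[:3])  # counts in [0,10), [10,20), [20,30)
```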

In [ ]:
dp_hist = vds.query_genotypes('gs.map(g => g.dp).hist(0, 450, 45)')
dp_hist
In [ ]:
# plot the histogram
plt.bar(dp_hist.binEdges[:-1], dp_hist.binFrequencies, width=10)
plt.show()

We'll explore the distributions of DP and GQ more in a later practical.

The counter aggregator counts the number of times each element occurs in an aggregable. When applied to an Aggregable[T], it returns a Dict[T, Long] which maps each element to its count. Let's count the distribution of genotypes (missing, 0, 1, and 2) in the dataset.
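
In plain Python, counter() behaves like collections.Counter over the mapped values. The non-ref allele counts below are invented, with None standing in for a missing genotype:

```python
from collections import Counter

# toy non-ref allele counts; None stands in for a missing genotype
n_alt = [0, 1, 2, 1, 0, 0, None, 2, 1]

# analogous to gs.map(g => g.nNonRefAlleles).counter()
counts = Counter(n_alt)
print(dict(counts))  # {0: 3, 1: 3, 2: 2, None: 1}
```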

In [ ]:
vds.query_genotypes('gs.map(g => g.nNonRefAlleles).counter()')
In [ ]: