In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan

%matplotlib inline
In [ ]:
def qqplot(pvals):
    # Drop missing (None/NaN) and zero p-values, which can't be log-transformed.
    spvals = sorted([x for x in pvals if x and not isnan(x)])
    # Under the null, the expected p-values are the uniform quantiles i / n.
    exp = [-log(float(i) / len(spvals), 10) for i in range(1, len(spvals) + 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    # Red y = x reference line: where points fall when there is no inflation.
    m = max(max(exp), max(obs))
    plt.plot([0, m], [0, m], c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)

Create HailContext

The HailContext hc is the main entry point into Hail functionality.

In [ ]:
import hail

hc = hail.HailContext()
In [ ]:
hc

Association Practical

On Tuesday, you went through Jeff's GWAS practical using PLINK. In this practical, we're going to learn how to use Hail to perform the same analysis. The first step is to load the GWAS example PLINK file.

In [ ]:
gwas_example = 'gwas-practical/gwas-example'
example_vds = hc.import_plink(gwas_example + '.bed', gwas_example + '.bim', gwas_example + '.fam')

Next, let's use the count command to summarize the dataset. How many samples and how many variants are there? How many genotypes are missing?

In [ ]:
example_vds.count(genotypes=True)

Print out the schemas to see the structure of the annotations.

In [ ]:
print('sample schema:')
print(example_vds.sample_schema)
print('\nvariant schema:')
print(example_vds.variant_schema)
print('\nglobal schema:')
print(example_vds.global_schema)

Next, let's use the expression language, annotations and aggregators to explore the data.

First, how many cases and controls are there? To do this, we can use the counter aggregator (scroll down a bit to find counter) over samples.

In [ ]:
example_vds.query_samples('samples.map(s => sa.isCase).counter()')

Next, let's find the SNP with the highest missingness. First, we annotate variants with their missingness using the fraction aggregator. Note that when working over variants, gs is an Aggregable[Genotype] over the genotypes for that variant. We then use the takeBy aggregator, which returns the n elements with the smallest values of the given key, so we negate the missingness to get the most missing variant. We return a struct containing the variant, the rsID and the missingness. What if we wanted the 5 most missing variants? Which sample has the highest missingness? (One possible answer to both is sketched after the next cell.)

In [ ]:
most_missing = (example_vds.annotate_variants_expr('va.missing = gs.fraction(g => !g.isCalled)')
                .query_variants('variants.map(v => {v: v, rsid: va.rsid, missing: va.missing}).takeBy(x => -x.missing, 1)'))
most_missing
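
Here is one possible sketch for both questions: the same takeBy pattern with n=5 gives the five most missing variants, and the analogous pattern over samples (annotate_samples_expr, then query_samples) finds the sample with the highest missingness. As above, the key is negated because takeBy takes the smallest values.

In [ ]:
top5_missing = (example_vds.annotate_variants_expr('va.missing = gs.fraction(g => !g.isCalled)')
                .query_variants('variants.map(v => {v: v, rsid: va.rsid, missing: va.missing}).takeBy(x => -x.missing, 5)'))

# Over samples, gs aggregates the genotypes for each sample instead.
most_missing_sample = (example_vds.annotate_samples_expr('sa.missing = gs.fraction(g => !g.isCalled)')
                       .query_samples('samples.map(s => {s: s, missing: sa.missing}).takeBy(x => -x.missing, 1)'))

top5_missing, most_missing_sample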

Next, let's look at the distribution of allele frequencies. To do this, let's use the callStats aggregator. The callStats aggregator needs to know the variant corresponding to the genotype, so you need to pass it g => v. callStats computes a number of useful statistics: allele count (AC: Array[Int]), allele frequency (AF: Array[Double]), allele number (AN: Int) and genotype count (GC: Array[Int]). AC and AF are indexed by the allele number, while GC is indexed by the genotype number (for biallelic variants, 0/0 = 0, 0/1 = 1, 1/1 = 2). See the docs for more details. Since we want to plot the alternate allele frequency, we use AF[1] below.

In [ ]:
mafs = (example_vds.annotate_variants_expr('va.maf = gs.callStats(g => v).AF[1]')
        .query_variants('variants.map(v => va.maf).collect()'))
In [ ]:
plt.hist(mafs)

Hard Exercise

If you want a challenge, try computing AF directly. A few things that might be useful: g.nNonRefAlleles returns the number of non-reference alleles in a genotype, that is, 0/0 = 0, 0/1 = 1, and 1/1 = 2; it returns missing for a missing genotype. You can use .sum() to find the sum of a numeric aggregable and .count() to count the number of elements in an aggregable. Finally, does your method handle missingness correctly? Note that sum ignores missing values (it sums the non-missing values), but count counts all values, missing or not. In particular, gs.count() for a variant always returns the number of samples, whether or not any genotypes are missing.

You can check that you got the correct answer by collecting the list of AFs and comparing it to the mafs array from the example above. One possible solution is sketched below.
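
A sketch of one possible solution (the annotation name va.myAF is just illustrative):

In [ ]:
# sum() skips missing genotypes; dividing by 2 * the number of *called*
# genotypes (rather than gs.count(), which also counts missing ones)
# handles missingness correctly.
my_afs = (example_vds
          .annotate_variants_expr('va.myAF = gs.map(g => g.nNonRefAlleles).sum() / (2.0 * gs.filter(g => g.isCalled).count())')
          .query_variants('variants.map(v => va.myAF).collect()'))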

Now let's create the cleaned dataset. We use the variant_qc and sample_qc methods of VariantDataset to compute statistics (like callRate, AF and the Hardy-Weinberg p-value) to use when filtering. The QC methods compute many more statistics, see the documentation for more details. We write the cleaned dataset as a VDS and read it.

In [ ]:
(example_vds
 .sample_qc()
 .variant_qc()
 .filter_variants_expr('va.qc.callRate >= 0.95 && va.qc.pHWE >= 1e-6 && va.qc.AF >= 0.01', keep=True)
 .filter_samples_expr('sa.qc.callRate >= 0.95', keep=True)
 .write('clean.vds', overwrite=True))

clean_vds = hc.read('clean.vds')

Now we're ready to run an association! This is done with the logreg logistic regression command on VariantDataset. logreg supports several tests, here we use the Wald test. Then we collect the resulting p-values and make a QQ plot.

Note, logreg adds its results to the variant annotations at va.logreg by default. We also print out the variant schema to show what's been added. See the documentation for a description of each field.

We collect both the p-values and the variant base-pair start positions so we can make a Manhattan plot below. Since some p-values can be missing (for example, if the test fails to converge), we filter those out.

In [ ]:
logreg_vds = example_vds.logreg('wald', 'sa.isCase')
print(logreg_vds.variant_schema)
In [ ]:
pval_pos = logreg_vds.query_variants('variants.map(v => {start: v.start, pval: va.logreg.pval}).collect()')
# Use a list (not a lazy filter) since pval_pos is plotted twice below.
pval_pos = [x for x in pval_pos if x.pval is not None]
In [ ]:
qqplot([x.pval for x in pval_pos])

Let's make a Manhattan plot of the p-values.

In [ ]:
plt.scatter([x.start for x in pval_pos],
            [-log(x.pval, 10) for x in pval_pos])
plt.xlabel("Base-pair position")
plt.ylabel("Observed p-value (-log10 scale)")

Our QQ plot was inflated. Now let's look at the clean dataset.

In [ ]:
clean_pvals = (clean_vds.logreg('wald', 'sa.isCase')
               .query_variants('variants.map(v => va.logreg.pval).collect()'))
In [ ]:
qqplot(clean_pvals)

Better, but there's still some inflation. Next, let's run PCA and plot the case/control status against the first two PCs. We use the pca method; its first argument is the location in the sample annotations to store the PCs.

In [ ]:
pca_vds = clean_vds.pca('sa.scores', k=2)
In [ ]:
print(pca_vds.sample_schema)
In [ ]:
[pc1, pc2, case] = pca_vds.query_samples(['samples.map(s => sa.scores.PC1).collect()',
                                          'samples.map(s => sa.scores.PC2).collect()',
                                          'samples.map(s => sa.isCase).collect()'])
In [ ]:
plt.scatter(pc1, pc2, c=case)

As you can see, the phenotype is stratified along the PCs, especially PC1. Let's re-run the association using the PCs as covariates and plot the result.

In [ ]:
pca_pvals = (pca_vds.logreg('wald', 'sa.isCase', ['sa.scores.PC1', 'sa.scores.PC2'])
             .query_variants('variants.map(v => va.logreg.pval).collect()'))

qqplot(pca_pvals)

Beautiful! How would you find the SNPs corresponding to the top two hits? One possible approach is sketched below.
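
One possible approach: re-run the regression, keep only variants with a defined p-value, and use takeBy to pull the two variants with the smallest p-values.

In [ ]:
top_hits = (pca_vds.logreg('wald', 'sa.isCase', ['sa.scores.PC1', 'sa.scores.PC2'])
            .query_variants('variants.filter(v => isDefined(va.logreg.pval))'
                            '.map(v => {v: v, rsid: va.rsid, pval: va.logreg.pval})'
                            '.takeBy(x => x.pval, 2)'))
top_hits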

Hard Exercise

There is also a linear regression command, linreg, for running association on quantitative phenotypes. The GWAS practical has quantitative phenotypes in the gwas-example.qt file. Try running a linear regression. You'll need the annotate_samples_table method of VariantDataset to load the gwas-example.qt data into the sample annotations. One possible starting point is sketched after the empty cell below.

In [ ]:
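
One possible starting point, with caveats: the exact annotate_samples_table arguments depend on the format of gwas-example.qt, which this practical doesn't show. The sample-ID column ('IID'), the phenotype column ('qt'), the whitespace delimiter and the annotation root sa.pheno below are all assumptions to adapt, not confirmed details.

In [ ]:
# Load the quantitative phenotypes into the sample annotations, then run
# linear regression and QQ-plot the p-values. Column names, delimiter and
# annotation paths are assumptions; adjust them to match the actual file.
qt_vds = clean_vds.annotate_samples_table(gwas_example + '.qt', 'IID', root='sa.pheno',
                                          config=hail.TextTableConfig(impute=True, delimiter='\\s+'))
linreg_pvals = (qt_vds.linreg('sa.pheno.qt')
                .query_variants('variants.map(v => va.linreg.pval).collect()'))
qqplot([p for p in linreg_pvals if p is not None])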