In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan

%matplotlib inline
In [ ]:
def qqplot(pvals):
    # Drop missing and zero p-values (log10 of 0 is undefined), then sort ascending.
    spvals = sorted([x for x in pvals if x and not isnan(x)])
    # Under the null, the sorted p-values are expected to be uniform: i / n for i = 1..n.
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    # Reference line y = x: points lie on it when the observed p-values follow the null.
    upper = max(max(exp), max(obs))
    plt.plot([0, upper], [0, upper], c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)

Create HailContext

The HailContext hc is the main entry point into Hail functionality.

In [ ]:
import hail

hc = hail.HailContext()
In [ ]:
hc

Import VCF, BED files into Hail

In this section, we'll import sequencing and genotype data files and examine the resulting VariantDatasets. Since analyzing sequence data is part of the motivation for building Hail, let's start by importing a VCF file with the HailContext import_vcf method.

In [ ]:
vcf = hc.import_vcf('hail-practical/sample.vcf')
vcf

How many samples does vcf have? How many variants? Use the count method to summarize basic properties of the dataset. What happens when you pass True for the genotypes argument?

In [ ]:
vcf.count()
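
To answer the last question above, try passing genotypes=True. A sketch (in Hail 0.1 this should also compute genotype-level summaries such as the overall call rate, at the cost of a full pass over the genotypes):

In [ ]:
# Sketch: genotypes=True additionally summarizes the genotypes (e.g. call rate),
# so it is slower than the default count.
vcf.count(genotypes=True)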

Now display the variant, sample and global schemas. If you're familiar with the VCF format, compare how the variant schema relates to the ##INFO header lines of the VCF.

In [ ]:
print('sample schema:')
print(vcf.sample_schema)
print('\nvariant schema:')
print(vcf.variant_schema)
print('\nglobal schema:')
print(vcf.global_schema)
In [ ]:
# show the first 5 sample IDs
vcf.sample_ids[:5]

It is easy to access the sample IDs because they can be stored on the driver node. A sequencing dataset, however, can contain hundreds of millions (or even billions) of variants, so the variants are stored in a distributed way. Showing a few variants is more involved because it requires a query against the distributed dataset.

In [ ]:
vcf.query_variants('variants.take(5)')

Although samples and variants are stored differently in Hail, the interfaces for accessing them are largely symmetric (sample_ids is an exception). Use query_samples to show a few samples.

In [ ]:
# Exercise: Query a few sample IDs with query_samples
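
If you get stuck, one possible answer is below: a sketch that simply mirrors the query_variants call above and relies on the symmetric interface just described:

In [ ]:
# Sketch: samples supports the same take(n) used for variants above.
vcf.query_samples('samples.take(5)')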

Finally, let's look at a genotype. In PLINK, genotypes take the values 0, 1, 2, or missing. In VCF and Hail, a genotype carries more metadata. Let's grab 5 genotypes using query_genotypes. Again, if you're familiar with the VCF format, compare Hail genotypes with the ##FORMAT header lines.

In [ ]:
vcf.query_genotypes('gs.take(5)')

Import a BED file

Now let's import a BED file. On Tuesday, you went through Jeff's GWAS practical. Let's load the same data. Later, we'll see how to do many of the same analyses inside Hail. We'll use the import_plink method.

Note, Hail supports a number of other formats, including GEN and BGEN (used by the UK Biobank dataset), although we won't be using them in this practical.

In [ ]:
gwas_example = 'gwas-practical/gwas-example'

plink = hc.import_plink(gwas_example + '.bed', gwas_example + '.bim', gwas_example + '.fam')
plink

plink is also a VariantDataset object, just like vcf above. Hail doesn't care what file format was loaded or where the data came from; it always provides a consistent interface. Let's use the same VariantDataset methods to explore the plink dataset. First, let's count the number of variants and samples.

In [ ]:
plink.count()

Next, let's examine the sample, variant and global schemas:

In [ ]:
print('sample schema:')
print(plink.sample_schema)
print('\nvariant schema:')
print(plink.variant_schema)
print('\nglobal schema:')
print(plink.global_schema)

Notice that although they are both VariantDatasets, the schemas of vcf and plink differ. In particular, plink carries sample metadata coming from the FAM file, while vcf has variant-level metadata coming from the FILTER, QUAL and INFO fields. Later, we'll see how to access the sample and variant annotations, for example when filtering or running a regression.

You can query a few variants and samples from plink just as you did for vcf; a sketch of the variant query follows. After that, let's look at a couple of genotypes. How do you think they will differ from the VCF genotypes?
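
A minimal sketch (query_samples works the same way on plink):

In [ ]:
# Sketch: the same query used on vcf above, now against the PLINK-derived dataset.
plink.query_variants('variants.take(5)')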

In [ ]:
plink.query_genotypes('gs.take(5)')

VDS files

Hail has its own file format, VDS (short for VariantDataset; we realize this can be confusing, since VariantDataset refers both to the file format and to the Python object representing a dataset inside Hail). You can read and write VDS files using, unsurprisingly, the read and write methods. Note that write is a method on VariantDataset and read is a method on HailContext.

What's the advantage of VDS files? VDS files are similar in size to compressed VCF files but are generally much faster to access. If you plan to iterate on the analysis of a dataset stored as a VCF, you're better off importing it once, writing it out as a VDS, and analyzing that.

Let's write out vcf as a VDS file and load it.

In [ ]:
vcf.write('small.vds', overwrite=True)

vds = hc.read('small.vds')

You can explore vds as you did with the other datasets, but it is identical to vcf. In fact, Hail has a method for comparing two VariantDatasets, so you can verify that vds and vcf are exactly the same:

In [ ]:
vds.same(vcf)

Basic Association

To finish this practical, let's run a very basic association. Rather than running on one of the datasets above, let's generate a synthetic one! Hail has a method to generate genetic data with population structure according to the Balding-Nichols model. After generating the genetic data, we will randomly generate two covariates and then construct a quantitative phenotype linearly from the covariates and the Balding-Nichols population number. We'll write the synthetic dataset out as a VDS and read it back in.

In [ ]:
(hc.balding_nichols_model(3, 1000, 1000, pop_dist = [0.2, 0.3, 0.5], fst = [.05, .09, .11])
 .annotate_samples_expr(['sa.cov1 = rnorm(0, 1)', 'sa.cov2 = rnorm(0, 1)'])
 .annotate_samples_expr('sa.pheno = rnorm(1, 1) + 2 * sa.cov1 - sa.cov2 + .1 * sa.pop')
 .write('synth.vds', overwrite=True))

synth_vds = hc.read('synth.vds')

You can explore the synthetic dataset as before. Definitely look at the various schemas:

In [ ]:
print('sample schema:')
print(synth_vds.sample_schema)
print('\nvariant schema:')
print(synth_vds.variant_schema)
print('\nglobal schema:')
print(synth_vds.global_schema)

We now have a global schema that contains metadata about how the dataset was generated. In addition, cov1, cov2 and pheno have been added to the sample schema by the annotate_samples_expr calls. We'll learn more about annotations later. The global annotations can be accessed directly in Python:

In [ ]:
synth_vds.globals._attrs # adding ._attrs makes it print out more nicely

Finally, let's run a linear regression with covariates against the phenotype and visualize the p-values in a QQ plot.

In [ ]:
[pvals] = (synth_vds.linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2'])
           .query_variants(['variants.map(v => va.linreg.pval).collect()']))

qqplot(pvals)

The p-values are inflated! Why? We'll come back to this in a later practical.
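
If you'd like to quantify the inflation in the meantime, one common summary is the genomic inflation factor, lambda GC. This isn't part of the original practical; the sketch below assumes scipy is available and converts each p-value to a 1-degree-of-freedom chi-square statistic:

In [ ]:
# Optional sketch (assumes scipy): lambda GC is the median observed chi-square
# statistic divided by the expected median under the null (~0.456 for 1 df).
from scipy.stats import chi2

clean = [p for p in pvals if p and not isnan(p)]
lambda_gc = np.median(chi2.isf(clean, df=1)) / chi2.isf(0.5, df=1)
print('lambda GC:', lambda_gc)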

In [ ]: