Hail Practical 1: Importing, schemas, simulated data

The cells of this practical can be entered (by cut and paste) into the IPython console.

Before entering the first cell, make sure you have changed to the directory hail-practical.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan
from pprint import pprint
import matplotlib.patches as mpatches

from hail import *

%matplotlib inline

def qqplot(pvals):
    # Drop missing (None), zero, and NaN p-values (log10 is undefined there),
    # then sort ascending.
    spvals = sorted([x for x in pvals if x and not isnan(x)])
    # Expected p-values: under the null, the i-th smallest of n p-values is
    # roughly i / n.
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    # Identity line y = x: points fall on it when observed matches expected.
    plt.plot(np.arange(0, max(max(exp), max(obs))), c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)

hc = HailContext()

Create HailContext

Note, the prelude above imported the hail module and created the HailContext with the line hc = HailContext(). We will access Hail functionality by calling methods on the hc object.

Import VCF, BED files into Hail

In this section, we'll import sequencing and genotype data files and examine the resulting VariantDatasets. Since analyzing sequence data is part of the motivation for building Hail, let's start by importing a VCF file with the HailContext import_vcf method.

vcf = hc.import_vcf('sample.vcf')
vcf

How many samples does vcf have? How many variants? Use the count method to summarize basic properties of the dataset. What happens when you pass True for the genotypes argument?

vcf.count()
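
Try it: with genotypes=True, count should additionally report genotype-level summaries such as the call rate.

vcf.count(genotypes=True)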

Now display the variant, sample and global schemas.

print('sample schema:')
print(vcf.sample_schema)
print('\nvariant schema:')
print(vcf.variant_schema)
print('\nglobal schema:')
print(vcf.global_schema)

The sample schema is empty. Is that what you expected? Look at the VCF file. You can do this in the terminal with the command less -S sample.vcf. How does the variant schema compare to the ##INFO header lines and the #CHROM line?

The variants are stored in a distributed way. To peek at a few of them, we need to make a query against the dataset. We'll learn more about how this works later.

vcf.query_variants('variants.take(5)')

Exercise: query_samples

Although samples and variants are stored differently in Hail, the interfaces to access them are largely symmetric. Use query_samples to show a few samples. query_samples works like query_variants but the collection of samples is called samples. You can use the take method as above.

Fill in the <?> to query 5 samples.

vcf.query_samples('<?>')

Finally, let's look at some genotypes. In PLINK, genotypes take the values 0, 1, 2, or missing. In VCF and Hail, genotypes carry more metadata. Let's grab 5 genotypes using query_genotypes. Again, if you're familiar with the VCF format, compare Hail genotypes with the ##FORMAT header lines.

vcf.query_genotypes('gs.take(5)')

Again, look at the VCF file. How do the ##FORMAT header fields compare to the fields of Hail's genotype?
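
Once you know the field names, you can aggregate over them. As a quick sketch, this computes summary statistics of genotype quality using the gq field (Hail's counterpart of VCF's GQ):

vcf.query_genotypes('gs.map(g => g.gq).stats()')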

Import a BED file

Now let's import a BED file. On Tuesday, you went through Jeff's GWAS practical. Let's load the same data. Later, we'll see how to do many of the same analyses inside Hail. We'll use the import_plink method.

Note, Hail supports a number of other formats, including GEN and BGEN (used by the UK Biobank dataset), although we won't be using them in this practical.

gwas_example = 'gwas-practical/gwas-example'

plink = hc.import_plink(gwas_example + '.bed', gwas_example + '.bim', gwas_example + '.fam')

plink is also a VariantDataset object, just like vcf above. Hail doesn't care what file format was loaded or where the data came from; it always provides a consistent interface. Let's use the same VariantDataset methods to explore the plink dataset. First, let's count the number of variants and samples.

plink.count()

Next, let's examine the sample, variant and global schemas:

print('sample schema:')
print(plink.sample_schema)
print('\nvariant schema:')
print(plink.variant_schema)
print('\nglobal schema:')
print(plink.global_schema)

Is that what you expected? It's quite different from the VCF example. This time there is a sample schema. How does the sample schema compare to the structure of the FAM file? Although they are both VariantDatasets, the schemas differ between vcf and plink. In particular, plink carries sample metadata coming from the FAM file while vcf has variant-level metadata coming from the FILTER, QUAL and INFO fields. Later, we'll see how to access the sample and variant annotations, for example, when filtering or running a regression.
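
As a quick peek at those annotations, the sketch below maps over the samples and takes a few of their annotation structs (in Hail's expression language, sa refers to the current sample's annotations):

plink.query_samples('samples.map(s => sa).take(3)')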

You can query a few variants and samples from plink as you did for vcf. Let's also look at a couple of genotypes. How do you think they will differ from the VCF genotypes?

plink.query_genotypes('gs.take(5)')

VDS files

Hail has its own file format, VDS (short for VariantDataset; we realize this can be confusing, since VariantDataset refers both to a file format and to the Python object representing a dataset inside Hail). You can read and write VDS files using, unsurprisingly, the read and write methods. Note, write is a method on VariantDataset and read is a method on HailContext.

What's the advantage of VDS files? VDS files are similar in size to compressed VCF files but are generally much faster to access. If you plan to iterate many times on the analysis of a dataset stored as VCF, you're better off importing it once, writing it out as a VDS, and analyzing that.

Let's write out vcf as a VDS file and load it.

vcf.write('small.vds', overwrite=True)

vds = hc.read('small.vds')

You can explore vds as you did with the other datasets, but it is identical to vcf. In fact, Hail has a method for comparing two VariantDatasets, and you can verify that vds and vcf are exactly the same:

vds.same(vcf)

Basic Association

To finish this practical, let's run a very basic association. Rather than running on one of the datasets above, let's generate a synthetic one! Hail has a method to generate genetic data with population structure according to the Balding-Nichols model. After we generate the genetic data, we will randomly generate two covariates and then construct a quantitative phenotype as a linear function of the covariates and the Balding-Nichols population number. We'll write out the synthetic dataset as a VDS and read it back in.

hc.balding_nichols_model is a HailContext method that simulates a dataset and returns a VariantDataset object. Notice how we make a series of chained method calls separated by '.'. This is a bit like a UNIX pipeline where we use '.' instead of '|'. Each method returns an object which is consumed by the next one.

(hc.balding_nichols_model(3, 2000, 2000, 
                          pop_dist = [0.2, 0.3, 0.5], 
                          fst = [0.07, 0.11, 0.13])
 .annotate_samples_expr(['sa.cov1 = rnorm(0, 1)', 
                         'sa.cov2 = rnorm(0, 1)'])
 .annotate_samples_expr(
        'sa.pheno = rnorm(1, 1) + 2 * sa.cov1 - sa.cov2 + 0.1 * sa.pop')
 .write('synth.vds', overwrite=True))

synth_vds = hc.read('synth.vds')

You can explore the synthetic dataset as before. Definitely look at the various schemas:

print('sample schema:')
print(synth_vds.sample_schema)
print('\nvariant schema:')
print(synth_vds.variant_schema)
print('\nglobal schema:')
print(synth_vds.global_schema)

synth_vds.count()

We have a global schema that contains metadata about how the dataset was generated. In addition, cov1, cov2 and pheno have been added to the sample schema by the annotate_samples_expr calls. We'll learn more about annotations later. The global annotations can be accessed directly in Python:

synth_vds.globals._attrs # adding ._attrs makes it print out more nicely
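
The prelude also imported pprint, which can make nested annotations like these easier to read:

pprint(synth_vds.globals._attrs)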

Finally, let's run a linear regression with covariates against the phenotype and visualize the p-values in a QQ plot.

[pvals] = (synth_vds.linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2'])
           .query_variants(['variants.map(v => va.linreg.pval).collect()']))

qqplot(pvals)

The p-values are inflated! Why? We'll come back to this in a later practical.
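
One quick way to quantify the inflation is the genomic control factor lambda: convert each p-value to its 1-df chi-square statistic and compare the observed median to the null median. A minimal sketch in Python (this assumes scipy is available; the prelude above does not import it):

from scipy.stats import chi2

# Drop missing and NaN p-values, as in qqplot above.
clean = [p for p in pvals if p and not isnan(p)]
# chi2.ppf(0.5, 1) is the null median, about 0.455; lambda near 1 means no
# inflation, while values well above 1 indicate inflation.
lambda_gc = chi2.ppf(1 - np.median(clean), 1) / chi2.ppf(0.5, 1)
lambda_gc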