{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Hail Practical 1: Importing, schemas, simulated data\n", "\n", "## The cells of this practical can be entered (by cut and paste) into the IPython console.\n", "\n", "## Before entering the first cell, make sure you have changed to the directory hail-practical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn\n", "from math import log, isnan\n", "from pprint import pprint\n", "import matplotlib.patches as mpatches\n", "\n", "from hail import *\n", "\n", "%matplotlib inline\n", "\n", "def qqplot(pvals):\n", " spvals = sorted([x for x in pvals if x and not(isnan(x))])\n", " exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]\n", " obs = [-log(p, 10) for p in spvals]\n", " plt.scatter(exp, obs)\n", " plt.plot(np.arange(0, max(max(exp), max(obs))), c=\"red\")\n", " plt.xlabel(\"Expected p-value (-log10 scale)\")\n", " plt.ylabel(\"Observed p-value (-log10 scale)\")\n", " plt.xlim(xmin=0)\n", " plt.ylim(ymin=0)\n", "\n", "hc = HailContext()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create HailContext\n", "\n", "Note, the prelude above imported the `hail` module and crated the HailContext with the line `hc = hail.HailContext()`. We will access the Hail functionality by calling methods on the `hc` object. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import VCF, BED files into Hail\n", "\n", "In this section, we'll import sequencing and genotype data files and examine the resulting `VariantDataset`s. Since analyzing sequence data is part of the motivation for building Hail, let's start by importing a VCF file with the HailContext [import_vcf](https://hail.is/hail/hail.HailContext.html#hail.HailContext.import_vcf) method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf = hc.import_vcf('sample.vcf')\n", "vcf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many samples does `vcf` have? How many variants? Use the [count](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.count) to summarize basic properties of the dataset. What happens when you pass the `True` for the `genotypes` argument?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now display the variant, sample and global schemas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('sample schema:')\n", "print(vcf.sample_schema)\n", "print('\\nvariant schema:')\n", "print(vcf.variant_schema)\n", "print('\\nglobal schema:')\n", "print(vcf.global_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample schema is empty. Is that what you expected? Look at the VCF file. You can do this in the terminal with a command line `less -S`. How does the variant schema compare to the the ##INFO header lines and the ##CHROM line?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variants are stored in a distributed way. To sample the variants, we need to make a query against the dataset. We'll learn more details about how this works later." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now display the variant, sample and global schemas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('sample schema:')\n", "print(vcf.sample_schema)\n", "print('\\nvariant schema:')\n", "print(vcf.variant_schema)\n", "print('\\nglobal schema:')\n", "print(vcf.global_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample schema is empty. Is that what you expected? Look at the VCF file. You can do this in the terminal with a command like `less -S sample.vcf`. How does the variant schema compare to the ##INFO header lines and the #CHROM line?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variants are stored in a distributed way. To sample the variants, we need to make a query against the dataset. We'll learn more details about how this works later." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf.query_variants('variants.take(5)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise query_samples\n", "\n", "Although samples and variants are stored differently in Hail, the interfaces to access them are largely symmetric. Use [query_samples](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.query_samples) to show a few samples. `query_samples` works like `query_variants`, but the collection of samples is called `samples`. You can use the `take` method as above.\n", "\n", "Fill in the empty query string below to query 5 samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf.query_samples('')" ] },
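{ "cell_type": "markdown", "metadata": {}, "source": [ "If you get stuck, one possible answer is below (spoiler!). Since the collection is called `samples`, `take` works just as it did for variants:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf.query_samples('samples.take(5)')" ] },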
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's look at some genotypes. In PLINK, genotypes had values 0, 1, 2 and missing. In VCF and Hail, the genotypes carry more metadata. Let's grab 5 genotypes using [query_genotypes](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.query_genotypes). Again, if you're familiar with the VCF format, compare Hail genotypes with the ##FORMAT header lines." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf.query_genotypes('gs.take(5)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, look at the VCF file. How do the ##FORMAT header fields compare to the fields of Hail's genotype?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import a BED file\n", "\n", "Now let's import a BED file. On Tuesday, you went through Jeff's GWAS practical. Let's load the same data. Later, we'll see how to do many of the same analyses inside Hail. We'll use the [import_plink](https://hail.is/hail/hail.HailContext.html#hail.HailContext.import_plink) method.\n", "\n", "Note, Hail supports a number of other formats, including GEN and BGEN (used by the UK Biobank dataset), although we won't be using them in this practical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gwas_example = 'gwas-practical/gwas-example'\n", "\n", "plink = hc.import_plink(gwas_example + '.bed', gwas_example + '.bim', gwas_example + '.fam')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`plink` is also a `VariantDataset` object, just like the `vcf` above. Hail doesn't care what file format was loaded or where the data came from; it always provides a consistent interface. Let's use the same `VariantDataset` methods to explore the `plink` dataset. First, let's count the number of variants and samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plink.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's examine the sample, variant and global schemas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('sample schema:')\n", "print(plink.sample_schema)\n", "print('\\nvariant schema:')\n", "print(plink.variant_schema)\n", "print('\\nglobal schema:')\n", "print(plink.global_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is that what you expected? It's quite different from the VCF example. This time there is a sample schema. How does the sample schema compare to the structure of the FAM file?\n", "\n", "Although they are both `VariantDataset`s, the schemas differ between `vcf` and `plink`. In particular, `plink` carries sample metadata coming from the FAM file, while `vcf` has variant-level metadata coming from the FILTER, QUAL and INFO fields. Later, we'll see how to access the sample and variant annotations, for example, when filtering or running a regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can query a few variants and samples from `plink` just as you did for `vcf` (a short example follows the genotype query below). Let's also look at a couple of genotypes. How do you think they will differ from the VCF genotypes?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plink.query_genotypes('gs.take(5)')" ] },
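{ "cell_type": "markdown", "metadata": {}, "source": [ "And here is the promised sketch of querying variants and samples from `plink`; the expressions mirror the ones used for `vcf`, since the interface doesn't depend on the input format:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# the same query expressions work regardless of the original file format\n", "print(plink.query_variants('variants.take(5)'))\n", "print(plink.query_samples('samples.take(5)'))" ] },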
{ "cell_type": "markdown", "metadata": {}, "source": [ "## VDS files\n", "\n", "Hail has its own file format, VDS (short for VariantDataset; we realize this can be confusing, since VariantDataset can refer both to the file format and to the Python object representing a dataset inside Hail). You can read and write VDS files using, unsurprisingly, the [read](https://hail.is/hail/hail.HailContext.html#hail.HailContext.read) and [write](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.write) methods. Note, `write` is a method on `VariantDataset` and `read` is a method on `HailContext`.\n", "\n", "What's the advantage of VDS files? VDS files are similar in size to compressed VCF files but are generally much faster to access. If you're planning to iterate many times on the analysis of a dataset stored as VCF, you're better off importing it once, writing it out as a VDS, and analyzing that.\n", "\n", "Let's write out `vcf` as a VDS file and load it back." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vcf.write('small.vds', overwrite=True)\n", "\n", "vds = hc.read('small.vds')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can explore `vds` as you did with the other datasets, but it is identical to `vcf`. In fact, Hail has a command for comparing two `VariantDataset`s, and you can verify that `vds` and `vcf` are exactly the same:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.same(vcf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Association\n", "\n", "To finish this practical, let's run a very basic association. Rather than running on one of the datasets above, let's generate a synthetic one! Hail has a method to generate genetic data with population structure according to the Balding-Nichols model. After we generate genetic data, we will randomly generate two covariates and then construct a quantitative phenotype linearly from the covariates and the Balding-Nichols population number. We'll write out the synthetic dataset as a VDS and read it back in.\n", "\n", "`hc.balding_nichols_model` is a HailContext method that simulates a dataset and returns a `VariantDataset` object. Notice how we make a series of chained method calls separated by `.`. This is a bit like a UNIX pipeline where we use `.` instead of `|`: each method returns an object which is consumed by the next one." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "(hc.balding_nichols_model(3, 2000, 2000,\n", "                          pop_dist=[0.2, 0.3, 0.5],\n", "                          fst=[0.07, 0.11, 0.13])\n", "   .annotate_samples_expr(['sa.cov1 = rnorm(0, 1)',\n", "                           'sa.cov2 = rnorm(0, 1)'])\n", "   .annotate_samples_expr(\n", "       'sa.pheno = rnorm(1, 1) + 2 * sa.cov1 - sa.cov2 + 0.1 * sa.pop')\n", "   .write('synth.vds', overwrite=True))\n", "\n", "synth_vds = hc.read('synth.vds')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can explore the synthetic dataset as before. Definitely look at the various schemas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('sample schema:')\n", "print(synth_vds.sample_schema)\n", "print('\\nvariant schema:')\n", "print(synth_vds.variant_schema)\n", "print('\\nglobal schema:')\n", "print(synth_vds.global_schema)\n", "\n", "synth_vds.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a global schema that contains metadata about how the dataset was generated. In addition, `cov1`, `cov2` and `pheno` have been added to the sample schema by the `annotate_samples_expr` calls. We'll learn more about annotations later. The global annotations can be accessed directly in Python:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "synth_vds.globals._attrs  # adding ._attrs makes it print out more nicely" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's run a linear regression of the phenotype against each variant, adjusting for the covariates, and visualize the p-values in a QQ plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "[pvals] = (synth_vds.linreg('sa.pheno', covariates=['sa.cov1', 'sa.cov2'])\n", "           .query_variants(['variants.map(v => va.linreg.pval).collect()']))\n", "\n", "qqplot(pvals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The p-values are inflated! Why? We'll come back to this in a later practical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }