{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from hail import *\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "hc = HailContext()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start by importing and writing a larger dataset, then reading the result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hc.import_vcf('hail-practical/1kg.vcf.bgz').filter_multi().write('hail-practical/1kg.vds', overwrite=True)\n", "vds = hc.read('hail-practical/1kg.vds')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's print the schemas: we've seen this before, but should always know what's in our data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(vds.variant_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's add QC statistics. Note that although we're going to be running methods called `variant_qc` and `sample_qc`, these don't actually filter variants or samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "vds = vds.variant_qc().sample_qc().persist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## These methods added quite a few annotations!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(vds.variant_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### There's a lot here, and it can certainly be confusing. Our documentation explains what these values are, however: see [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) and [variant_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variant_qc) for more information.\n", "\n", "### Let's use `matplotlib` to look at some of these values. Both `sa.qc.callRate` and `sa.qc.gqMean` may be useful metrics for filtering." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "[call_rate, gq_mean] = vds.query_samples(['samples.map(s => sa.qc.callRate).collect()',\n", " 'samples.map(s => sa.qc.dpMean).collect()'])\n", "plt.scatter(gq_mean, call_rate, alpha=0.25)\n", "plt.xlabel('Mean DP')\n", "plt.ylabel('Call Rate')\n", "plt.title('Sample call rate vs mean DP')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "[call_rate, gq_mean] = vds.query_variants(['variants.map(v => va.qc.callRate).collect()',\n", " 'variants.map(v => va.qc.dpMean).collect()'])\n", "plt.scatter(gq_mean, call_rate, alpha=0.2)\n", "plt.xlabel('Mean DP')\n", "plt.ylabel('Call Rate')\n", "plt.title('Variant call rate vs mean DP')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## It's a good idea to get rid of samples and variant with a low call rate. Let's see how a cutoff of 95% affects the included variants and samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "original_counts = vds.count()\n", "filtered_vds = (vds.filter_variants_expr('va.qc.callRate > 0.95')\n", " .filter_samples_expr('sa.qc.callRate > 0.95'))\n", "new_counts = filtered_vds.count()\n", "print(original_counts)\n", "print(new_counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Annotations can be added from files as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%sh\n", "head hail-practical/sample_annotations.txt | column -t" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds = filtered_vds.annotate_samples_table('hail-practical/sample_annotations.txt', \n", " sample_expr='Sample',\n", " root='sa.metadata',\n", " config=TextTableConfig(impute=True))\n", "print(filtered_vds.sample_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds = filtered_vds.annotate_variants_table('hail-practical/variant_annotations.txt', \n", " variant_expr='Variant',\n", " code='va.consequence = table.Consequence',\n", " config=TextTableConfig(impute=True))\n", "print(filtered_vds.variant_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query methods are always good to explore what we've done" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds.query_variants('variants.map(v => va.consequence).counter()')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds.query_samples('samples.map(s => sa.metadata.SuperPopulation).counter()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number 1: count the number of variants that are LOF and have a call rate > 99%\n", "\n", "Fill in the `` with code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds.query_variants('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number 2: What fraction of British samples (`sa.metadata.Population == \"GBR\"`) have purple hair?\n", "\n", "Fill in the `` with code.\n", "\n", "Hint 1: the fraction of samples that are British is `samples.fraction(s => sa.metadata.Population == \"GBR\")`\n", "\n", "Hint 2: you need to use \".filter\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds.query_samples('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number 3: What is the mean caffeine consumption among South Asian samples (`sa.metadata.SuperPopulation == \"SAS\"`)?\n", "\n", "Fill in the `` with code.\n", "\n", "Hint: Remember that `.stats()` will compute a variety of useful metrics from a numeric Aggregable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered_vds.query_samples('')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }