{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from hail import *\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import seaborn\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "hc = HailContext()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start by importing and writing a larger dataset, then reading the result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "hc.import_vcf('hail-practical/1kg.vcf.bgz').filter_multi().write('hail-practical/1kg.vds', overwrite=True)\n",
    "vds = hc.read('hail-practical/1kg.vds')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's print the schemas: we've seen this before, but should always know what's in our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(vds.variant_schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(vds.sample_schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's add QC statistics.  Note that although we're going to be running methods called `variant_qc` and `sample_qc`, these don't actually filter variants or samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "vds = vds.variant_qc().sample_qc().persist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## These methods added quite a few annotations!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(vds.variant_schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(vds.sample_schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### There's a lot here, and it can certainly be confusing.  Our documentation explains what these values are, however: see [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) and [variant_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variant_qc) for more information.\n",
    "\n",
    "### Let's use `matplotlib` to look at some of these values.  Both `sa.qc.callRate` and `sa.qc.gqMean` may be useful metrics for filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "[call_rate, gq_mean] = vds.query_samples(['samples.map(s => sa.qc.callRate).collect()',\n",
    "                                          'samples.map(s => sa.qc.dpMean).collect()'])\n",
    "plt.scatter(gq_mean, call_rate, alpha=0.25)\n",
    "plt.xlabel('Mean DP')\n",
    "plt.ylabel('Call Rate')\n",
    "plt.title('Sample call rate vs mean DP')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "[call_rate, gq_mean] = vds.query_variants(['variants.map(v => va.qc.callRate).collect()',\n",
    "                                          'variants.map(v => va.qc.dpMean).collect()'])\n",
    "plt.scatter(gq_mean, call_rate, alpha=0.2)\n",
    "plt.xlabel('Mean DP')\n",
    "plt.ylabel('Call Rate')\n",
    "plt.title('Variant call rate vs mean DP')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## It's a good idea to get rid of samples and variant with a low call rate.  Let's see how a cutoff of 95% affects the included variants and samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "original_counts = vds.count()\n",
    "filtered_vds = (vds.filter_variants_expr('va.qc.callRate > 0.95')\n",
    "                   .filter_samples_expr('sa.qc.callRate > 0.95'))\n",
    "new_counts = filtered_vds.count()\n",
    "print(original_counts)\n",
    "print(new_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annotations can be added from files as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%sh\n",
    "head hail-practical/sample_annotations.txt | column -t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds = filtered_vds.annotate_samples_table('hail-practical/sample_annotations.txt', \n",
    "                                                   sample_expr='Sample',\n",
    "                                                   root='sa.metadata',\n",
    "                                                   config=TextTableConfig(impute=True))\n",
    "print(filtered_vds.sample_schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds = filtered_vds.annotate_variants_table('hail-practical/variant_annotations.txt', \n",
    "                                                   variant_expr='Variant',\n",
    "                                                   code='va.consequence = table.Consequence',\n",
    "                                                   config=TextTableConfig(impute=True))\n",
    "print(filtered_vds.variant_schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query methods are always good to explore what we've done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds.query_variants('variants.map(v => va.consequence).counter()')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds.query_samples('samples.map(s => sa.metadata.SuperPopulation).counter()')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number 1: count the number of variants that are LOF and have a call rate > 99%\n",
    "\n",
    "Fill in the `<?>` with code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds.query_variants('<?>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number 2: What fraction of British samples (`sa.metadata.Population == \"GBR\"`) have purple hair?\n",
    "\n",
    "Fill in the `<?>` with code.\n",
    "\n",
    "Hint 1: the fraction of samples that are British is `samples.fraction(s => sa.metadata.Population == \"GBR\")`\n",
    "\n",
    "Hint 2: you need to use \".filter\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds.query_samples('<?>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number 3: What is the mean caffeine consumption among South Asian samples (`sa.metadata.SuperPopulation == \"SAS\"`)?\n",
    "\n",
    "Fill in the `<?>` with code.\n",
    "\n",
    "Hint: Remember that `.stats()` will compute a variety of useful metrics from a numeric Aggregable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_vds.query_samples('<?>')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}