{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical 7: Basic association analysis\n", "\n", "## The cells of this practical can be entered (by cut and paste) into the IPython console.\n", "\n", "## Before entering the first cell, make sure you have changed to the directory hail-practical. Skip the first cell if you haven't closed the IPython console since running the last practical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn\n", "from math import log, isnan\n", "from pprint import pprint\n", "import matplotlib.patches as mpatches\n", "\n", "from hail import *\n", "\n", "%matplotlib inline\n", "\n", "def qqplot(pvals):\n", " spvals = sorted([x for x in pvals if x and not(isnan(x))])\n", " exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]\n", " obs = [-log(p, 10) for p in spvals]\n", " plt.scatter(exp, obs)\n", " plt.plot(np.arange(0, max(max(exp), max(obs))), c=\"red\")\n", " plt.xlabel(\"Expected p-value (-log10 scale)\")\n", " plt.ylabel(\"Observed p-value (-log10 scale)\")\n", " plt.xlim(xmin=0)\n", " plt.ylim(ymin=0)\n", "\n", "hc = HailContext()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Association Practical\n", "\n", "On Tuesday, you went through Jeff's GWAS practical using PLINK. In this practical, we're going to learn how to use Hail to perform the same analysis. The first step is to load the GWAS example PLINK file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gwas_example = 'gwas-practical/gwas-example'\n", "example_vds = hc.import_plink(gwas_example + '.bed', gwas_example + '.bim', gwas_example + '.fam')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's use the [count](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.count) command to summarize the dataset. How many samples and how many variants are there? How many genotypes are missing?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "example_vds.count(genotypes=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print out the schemas to see the structure of the annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('sample schema:')\n", "print(example_vds.sample_schema)\n", "print('\\nvariant schema:')\n", "print(example_vds.variant_schema)\n", "print('\\nglobal schema:')\n", "print(example_vds.global_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's use the expression language, annotations and aggregators to explore the data.\n", "\n", "First, how many cases and controls are there? To do this, we can use the counter [aggregator](https://hail.is/hail/types.html#aggregable-t) (scroll down a bit to find counter) over samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "example_vds.query_samples('samples.map(s => sa.isCase).counter()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's find the SNP with the highest missingness. First, we annotate variants with missingness using the fraction [aggregator](https://hail.is/hail/types.html#aggregable-t). 
Note, when working over variants, `gs` is an `Aggregable[Genotype]` over the genotypes for that variant. We use the `takeBy` [aggregator](https://hail.is/hail/types.html#aggregable-t) to find the most missing variant. We return a struct containing the variant, the rsID and the missingness. What if we wanted the 5 most missing variants? What sample has the highest missingness?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "most_missing = (example_vds.annotate_variants_expr('va.missing = gs.fraction(g => !g.isCalled)')\n", " .query_variants('variants.map(v => {v: v, rsid: va.rsid, missing: va.missing}).takeBy(x => va.missing, 1)'))\n", "most_missing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's look at the distribution of allele frequencies. To do this, let's use the `callStats` aggregator. The `callStats` aggregator needs to know the variant corresponding to the genotype, so you need to pass it `g => v`. `callStats` computes a number of useful statistics: allele count (`AC: Array[Int]`), allele frequency (`AF: Array[Double]`), allele number (`AN: Int`) and genotype count (`GC: Array[Int]`). AC and AF are indexed by the allele number, while GC is indexed by the genotype number (for biallelic variants, 0/0 = 0, 0/1 = 1, 1/1 = 2). See the [docs](https://hail.is/hail/types.html#aggregable-genotype) for more details. Since we want to plot the alternate allele frequency, we use `AF[1]` below." 
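, "\n", "As a plain-Python illustration (not Hail code) of the indexing described above: with genotypes coded as the number of alternate alleles (0/0 = 0, 0/1 = 1, 1/1 = 2) and `None` for missing calls, the statistics fall out of simple counting. The helper name `call_stats` below is made up for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Plain-Python sketch of AC/AF/AN for one biallelic variant; genotypes are\n", "# alternate-allele counts (0, 1, 2), with None marking a missing call.\n", "def call_stats(genotypes):\n", "    called = [g for g in genotypes if g is not None]\n", "    an = 2 * len(called)   # allele number: two alleles per called sample\n", "    ac_alt = sum(called)   # alternate allele count, i.e. AC[1]\n", "    return {'AC': [an - ac_alt, ac_alt],\n", "            'AF': [float(an - ac_alt) / an, float(ac_alt) / an],\n", "            'AN': an}\n", "\n", "call_stats([0, 1, 2, None, 1])   # AF[1] = 4 / 8 = 0.5"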
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mafs = (example_vds.annotate_variants_expr('va.maf = gs.callStats(g => v).AF[1]')\n", " .query_variants('variants.map(v => va.maf).collect()'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.hist(mafs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hard Exercise\n", "\n", "If you want a challenge, try computing AF directly. A few things that might be useful for this: `g.nNonRefAlleles` returns a count of the number of non-reference alleles in a genotype, that is, 0/0 = 0, 0/1 = 1, and 1/1 = 2. It returns missing for a missing genotype. You can use `.sum()` to find the sum of a numeric aggregable and `.count()` to count the number of elements in an aggregable. Finally, does your method handle missingness correctly? Note, sum ignores missing values (it sums the non-missing values) but count counts all the values, missing or non-missing. In particular, `gs.count()` for a variant will always return the number of samples, whether or not genotypes are missing.\n", "\n", "You can check you got the correct answer by collecting the list of AFs and comparing to the `mafs` array from the example above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create the cleaned dataset. We use the [variant_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variant_qc) and [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) methods of VariantDataset to compute statistics (like callRate, AF and the Hardy-Weinberg p-value) to use when filtering. The QC methods compute many more statistics, see the documentation for more details. We write the cleaned dataset as a VDS and read it." 
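, "\n", "As a rough illustration of the Hardy-Weinberg statistic behind `va.qc.pHWE`, here is the simple chi-square version in plain Python. Hail actually computes `pHWE` with an exact test, and the helper name `hwe_chisq` is made up, so treat this as a sketch of the idea rather than Hail's method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Chi-square Hardy-Weinberg statistic from observed genotype counts.\n", "# Illustration only: Hail's va.qc.pHWE uses an exact test instead.\n", "def hwe_chisq(n_hom_ref, n_het, n_hom_alt):\n", "    n = n_hom_ref + n_het + n_hom_alt\n", "    p = (2.0 * n_hom_ref + n_het) / (2 * n)   # reference allele frequency\n", "    q = 1 - p\n", "    expected = [p * p * n, 2 * p * q * n, q * q * n]\n", "    observed = [n_hom_ref, n_het, n_hom_alt]\n", "    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))\n", "\n", "hwe_chisq(25, 50, 25)   # 0.0: counts exactly match HWE proportions at p = 0.5"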
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "(example_vds\n", " .sample_qc()\n", " .variant_qc()\n", " .filter_variants_expr('va.qc.callRate >= 0.95 && va.qc.pHWE >= 1e-6 && va.qc.AF >= 0.01', keep=True)\n", " .filter_samples_expr('sa.qc.callRate >= 0.95', keep=True)\n", " .write('clean.vds', overwrite=True))\n", "\n", "clean_vds = hc.read('clean.vds')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to run an association! This is done with the [logreg](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.logreg) logistic regression command on VariantDataset. `logreg` supports several tests; here we use the Wald test. Then we collect the resulting p-values and make a QQ plot.\n", "\n", "Note, `logreg` adds its results to the variant annotations at `va.logreg` by default. We also print out the variant annotations to show what's been added. See the documentation for a description of each field.\n", "\n", "We collect both the p-values and the variant base-pair start position so we can make a Manhattan plot below. Since some p-values can be missing (for example, if the test failed to converge), we filter those out." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "logreg_vds = example_vds.logreg('wald', 'sa.isCase')\n", "print(logreg_vds.variant_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pval_pos = logreg_vds.query_variants('variants.map(v => {start: v.start, pval: va.logreg.pval}).collect()')\n", "pval_pos = filter(lambda x: x.pval is not None, pval_pos)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "qqplot([x.pval for x in pval_pos])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a Manhattan plot of the p-values." 
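, "\n", "First, though, the inflation visible in the QQ plot above can be summarized numerically by the genomic-control lambda: the median observed 1-df chi-square statistic divided by its null expectation (about 0.4549). A value near 1 means no inflation. The sketch below is plain Python, not part of Hail's API: `lambda_gc` is a made-up name, and it assumes `scipy` is available." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import chi2\n", "\n", "# Genomic-control lambda: median observed chi-square statistic divided by\n", "# the null median, chi2.isf(0.5, 1) ~= 0.4549. Illustration only.\n", "def lambda_gc(pvals):\n", "    stats = chi2.isf(np.array(pvals), df=1)   # p-values back to 1-df chi-square\n", "    return np.median(stats) / chi2.isf(0.5, df=1)\n", "\n", "lambda_gc([0.5])   # 1.0: a median p-value of 0.5 means no inflation"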
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter([x.start for x in pval_pos], \n", " [-log(x.pval, 10) for x in pval_pos])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our QQ plot was inflated. Now let's look at the clean dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clean_pvals = (clean_vds.logreg('wald', 'sa.isCase')\n", " .query_variants('variants.map(v => va.logreg.pval).collect()'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "qqplot(clean_pvals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better, but there's still some inflation. Next, let's compute PCA and plot the case/control status against the first two PCs. We use the [pca](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.pca) method. The first argument is the location in sample annotations to store the PCs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pca_vds = clean_vds.pca('sa.scores', k=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(pca_vds.sample_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "[pc1, pc2, case] = pca_vds.query_samples(['samples.map(s => sa.scores.PC1).collect()',\n", " 'samples.map(s => sa.scores.PC2).collect()',\n", " 'samples.map(s => sa.isCase).collect()'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(pc1, pc2, c=case)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the phenotype is stratified with the PCs, especially PC1. 
Let's re-run the association using the PCs as covariates and plot the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pca_pvals = (pca_vds.logreg('wald', 'sa.isCase', ['sa.scores.PC1', 'sa.scores.PC2'])\n", " .query_variants('variants.map(v => va.logreg.pval).collect()'))\n", "\n", "qqplot(pca_pvals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful! How would you find the SNPs corresponding to the top two hits?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "There is also a linear regression command for running association on quantitative phenotypes. The GWAS practical has quantitative phenotypes in the `gwas-practical/gwas-example.qt` file. Try running a linear regression. You'll need the [annotate_samples_table](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.annotate_samples_table) method of VariantDataset to load `gwas-practical/gwas-example.qt` data into the sample annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }