{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Institute for Behavioral Genetics International Statistical Genetics 2021 Workshop \n",
    "\n",
    "# Rare Variant Analysis of Sequencing Data with Hail\n",
    "\n",
    "You will learn more details on the analysis of rare variant signals using Hail in the SAIGE session. In this notebook, your learning objective is to :\n",
    "\n",
    "- Understand basic principles behind simple variant aggregation and burden tests.\n",
    "\n",
    "GWAS is a great tool for finding associations between **common variants** and disease, but is underpowered to detect rare-variant associations, because rare variants by definition have small sample sizes.\n",
    "\n",
    "It is possible to find associations between rare variants and disease by **grouping variants of similar effect**, and testing each group.\n",
    "\n",
    "One possible solution is to sum variant counts according to some genomic interval (for instance, gene), and then association with these intervals. This is often called a gene burden test.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Same as in the last practical, these steps initialize our Hail session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hail as hl\n",
    "from hail.plot import output_notebook, show"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hl.init()\n",
    "output_notebook()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 1: Rapid-fire import, QC, sample annotation\n",
    "\n",
    "The last notebook covered these steps in detail. We'll do them quickly here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read matrix from disk; this was written from the imported VCF in the common variant practical\n",
    "mt = hl.read_matrix_table('resources/hgdp.mt')\n",
    "\n",
    "# import annotations\n",
    "sd = hl.import_table('resources/HGDP_sample_data.tsv',\n",
    "                     key='sample_id',\n",
    "                     impute=True)\n",
    "\n",
    "# annotate columns\n",
    "mt = mt.annotate_cols(sample_data = sd[mt.s])\n",
    "\n",
    "# remove non-PASS variants\n",
    "mt = mt.filter_rows(hl.len(mt.filters) == 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discard common variants"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will keep variants with an allele frequency of under 1%. Including common variants will only reduce the power of a burden test.\n",
    "\n",
    "We could rerun `hl.variant_qc` here, or use an aggregator designed to compute allele frequencies and counts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = mt.filter_rows(hl.agg.call_stats(mt.GT, mt.alleles).AF[1] < 0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2: Group by gene\n",
    "\n",
    "We have variant annotations in a text file in the `resources/` folder. We will use these to annotate our matrix table with gene and consequence information.\n",
    "\n",
    "Additionally, you can also use the `VEP` annotation tool which provides a *huge* number of potentially useful variant annotations. If you are running Hail on Google Cloud Platform (GCP), the Hail team has done the work of installing and configuring VEP. The team is also working on a new resource called the \"annotation database\": see [here](https://hail.is/docs/0.2/annotation_database_ui.html) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "annotation_ht = hl.import_table('resources/hgdp_gene_annotations.tsv', impute=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "annotation_ht.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "Use the `aggregate` method and the familiar `hl.agg.counter` aggregator to compute the number of appearances of each \"csq\" value.\n",
    "\n",
    "Which of these would you expect to have no effect on a protein? Which would you expect to have a large effect?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annotate variants with genes\n",
    "\n",
    "In order join our two tables (the QC-ed data and gene table), we need to create fields of type `locus` and `array<str>` (alleles) to match the row key of our matrix.\n",
    "\n",
    "We can use the `hl.parse_variant` function to parse the `variant` field of this table of `CHR:POS:REF:ALT` form to a locus and alleles array. Then we assign these new fields to be the key:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "parsed = hl.parse_variant(annotation_ht.variant, reference_genome='GRCh38')\n",
    "annotation_ht = annotation_ht.key_by(locus = parsed.locus, alleles=parsed.alleles).drop('variant')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall how we annotated sample phenotypes earlier in the common variant tutorial -- this join looks very similar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = mt.annotate_rows(vep_info = annotation_ht[mt.locus, mt.alleles])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's `show` the resulting annotations on the matrix table to make sure everything worked:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mt.vep_info.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 3: Aggregate by gene\n",
    "\n",
    "Hail's modularity makes it easy to perform non-kernel-based burden tests.\n",
    "\n",
    "We'll compose two general tools:\n",
    " - [group_rows_by](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.group_rows_by) / [aggregate](https://hail.is/docs/0.2/hail.GroupedMatrixTable.html#hail.GroupedMatrixTable.aggregate)\n",
    " - [hl.linear_regression_rows](https://hail.is/docs/0.2/methods/stats.html#hail.methods.linear_regression_rows).\n",
    " \n",
    "This means that you can flexibly specify the way genotypes are summarized per gene. Using other tools, you may have a few ways to aggregate, but if you want to do something different you are out of luck!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "burden_mt = (\n",
    "    mt\n",
    "    .group_rows_by(mt.vep_info.gene_symbol)\n",
    "    .aggregate(n_variants = hl.agg.count_where(mt.GT.n_alt_alleles() > 0))\n",
    ")\n",
    "\n",
    "# filter to genes with at least one rare variant!\n",
    "burden_mt = burden_mt.filter_rows(hl.agg.sum(burden_mt.n_variants) > 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "burden_mt.describe(widget=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "burden_mt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "Is this a dense (mostly non-zero) or sparse (mostly zero) matrix? Is this expected? How many variants are in our dataset, and how many genes are there?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <font color=\"#1a0dab\">Step 4:</font> Run linear regression per gene\n",
    "\n",
    "This should look familiar! We can reuse the same modular components (like `linear_regression_rows`) for many different purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "burden_mt = burden_mt.annotate_cols(pca = pca_scores[burden_mt.s])\n",
    "\n",
    "burden_results = hl.linear_regression_rows(\n",
    "    y=burden_mt.sample_data.sleep_duration, \n",
    "    x=burden_mt.n_variants,\n",
    "    covariates=[1.0, \n",
    "                burden_mt.pca.scores[0], \n",
    "                burden_mt.pca.scores[1], \n",
    "                burden_mt.pca.scores[2]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sorry, no `hl.plot.manhattan` for genes!\n",
    "\n",
    "Manhattan plots are really only useful for standard GWAS. Instead, we can simply sort by p-value using [order_by](https://hail.is/docs/0.2/hail.Table.html#hail.Table.order_by), and print:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "burden_results.order_by(burden_results.p_value).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A Q-Q plot is still meaningful on genes, though! Let's plot one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = hl.plot.qq(burden_results.p_value)\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With fewer tests performed (one per gene, instead of one per variant), the X and Y range of the Q-Q plot is much smaller than in the common variant association practical.\n",
    "\n",
    "This plot is showing us that although our study is relatively well-controlled, it's also very underpowered!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The end\n",
    "\n",
    "You've reached the end of the prepared Hail practical materials! Congratulations -- we didn't really expect anyone to make it this far in the allotted time!\n",
    "\n",
    "If you have questions, please ask the faculty! We are eager to discuss Hail and how it might be of assistance in your science."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# When the workshop ends and you return to your life\n",
    "\n",
    "The hosted notebook service that is running this notebook will be turned off in a few hours, but you can continue using Hail!\n",
    "\n",
    "The [Hail website](https://www.hail.is) has a page with information about [getting started](https://hail.is/docs/0.2/getting_started.html). If you have a MacOS or Linux computer, or have access to a Linux server, you can run Hail. \n",
    "\n",
    "It is also possible to run Hail on Google Cloud. See the video lectures for guidance on how to do that, or reach out to the team for help!\n",
    "\n",
    "## The Hail community\n",
    "\n",
    "Although Hail has a steeper learning curve than many command-line tools, you won't be learning it alone! Hail has a [forum](https://discuss.hail.is) and [Zulip chatroom](https://hail.zulipchat.com) full of like-minded users of all experience levels. Please stop by to say hello!\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}