{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Institute for Behavioral Genetics International Statistical Genetics 2021 Workshop \n", "\n", "## Common Variant Analysis of Sequencing Data with Hail\n", "\n", "You have reviewed the [lecture videos](https://youtu.be/2N_VqmX22Xg), and you are ready to get hands-on with Hail to analyze real sequencing data. \n", "\n", "In this practical, we will learn how to:\n", "\n", "1) Use simple python code and Jupyter notebooks.\n", "\n", "2) Use Hail to import a VCF and run basic queries over sequencing data.\n", "\n", "3) Use Hail to perform basic quality control on sequencing data.\n", "\n", "4) Use Hail to run a basic genome-wide association study on common variants in sequencing data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "\n", "### It doesn't all need to \"stick\" today\n", "\n", "This practical contains a lot of new material, and the goal of this workbook is not for you to be able to reproduce from memory all the various capabilities demonstrated here. Instead, the goal is for you to get a sense for the kind of analysis tasks that sequencing data requires, and to gain some exposure to what these analyses look like in Hail. \n", "\n", "There is no one-size-fits-all sequencing analysis pipeline, because each sequencing dataset will have unique properties that need to be understood and accounted for in QC and analysis. 
Hail can empower you to interrogate sequencing data, but it cannot give you all the questions to ask!\n", "\n", "Some of the questions and exercises in this notebook might seem unrelated to the specific task of analyzing sequencing data, but that is intentional -- Hail is a computational tool that hopes to help you indulge your scientific curiosity, and asking and answering a variety of questions about many aspects of your data is the best way to learn *how to Hail*.\n", "\n", "We don't expect you to be able to run a full GWAS on your own data in Hail tomorrow. If this is something you want to do, there are **lots more** resources available -- documentation, cookbooks, tutorials, and most importantly, the Hail community on the [forum](https://discuss.hail.is) and [zulip chatroom](https://hail.zulipchat.com).\n", "\n", "### We encourage you to play\n", "\n", "Hail is a highly expressive library with lots of functionality -- you'll see just a small fraction of it today. Throughout this notebook and especially in the denoted **exercises**, we encourage you to experiment with the code being run to see what happens! Sometimes it will be an error, but sometimes you will encounter new pieces of functionality. If you're curious about how to use Hail to ask slightly different questions than the code or exercises here, please ask the faculty! We are eager to help.\n", "\n", "### Interactive analysis on the cloud\n", "\n", "Part of what we think is so exciting about Hail is that Hail development has coincided with other technological shifts in the data science community.\n", "\n", "Five years ago, most computational biologists analyzed sequencing data using command-line tools, and took advantage of research compute clusters by explicitly using scheduling frameworks like LSF or Sun Grid Engine. 
These clusters are powerful resources, but it requires a great deal of extra thought and effort to manage pipelines running on them.\n", "\n", "Today, most Hail users run Hail from within interactive Python notebooks (like this one!) backed by thousands of cores on public compute clouds like [Google Cloud](https://cloud.google.com/), [Amazon Web Services](https://aws.amazon.com/), or [Microsoft Azure](https://azure.microsoft.com/). You don't need to share a cluster with hundreds of other scientists, and you only need to pay for the resources that you use.\n", "\n", "You won't get hands-on experience with this kind of usage today, but there are lots of resources to help you get started if you're interested in that. Please stay in touch with us after the workshop ends!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Using Jupyter\n", "\n", "The notebook software that you are using right now is called [Jupyter](https://jupyter.org/), which came from a combination of the languages **Ju**lia, **Pyt**hon, and **R**.\n", "\n", "**Learning objectives**\n", "\n", " - be comfortable running, editing, adding, and deleting code cells.\n", " - learn techniques for unblocking yourself if Jupyter acts up.\n", "\n", "### Running cells\n", "Evaluate cells using SHIFT + ENTER. Select the next cell and run it. If you prefer clicking, you can select the cell and click the \"Run\" button in the toolbar above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Hello, world')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modes\n", "\n", "Jupyter has two modes, a **navigation mode** and an **editor mode**.\n", "\n", "#### Navigation mode:\n", "\n", " - BLUE cell borders\n", " - `UP` / `DOWN` move between cells\n", " - `ENTER` while a cell is selected will move to **editing mode**.\n", " - Many letters are keyboard shortcuts! 
This is a common trap.\n", " \n", "#### Editor mode:\n", "\n", " - GREEN cell borders\n", " - `UP` / `DOWN` move within cells before moving between cells.\n", " - `ESC` will return to **navigation mode**.\n", " - `SHIFT + ENTER` will evaluate a cell and return to **navigation mode**.\n", " \n", "Try editing this markdown cell by double clicking, then re-rendering it by \"running\" the cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell types\n", "\n", "There are several types of cells in Jupyter notebooks. The two you will see in this notebook are **Markdown** (text) and **Code**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:52:03.914467Z", "start_time": "2021-06-16T00:52:03.911988Z" } }, "outputs": [], "source": [ "# This is a code cell\n", "my_variable = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**This is a markdown cell**, so even if something looks like code (as below), it won't get executed!\n", "\n", "my_variable += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shell commands\n", "\n", "It is possible to call command-line utilities from Jupyter by prefixing a line with a `!`. For instance, we can print the current directory:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:56:59.757110Z", "start_time": "2021-06-16T00:56:59.630749Z" } }, "outputs": [], "source": [ "! 
pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tips and tricks\n", "\n", "Keyboard shortcuts:\n", "\n", " - `SHIFT + ENTER` to evaluate a cell\n", " - `ESC` to return to navigation mode\n", " - `y` to turn a markdown cell into code\n", " - `m` to turn a code cell into markdown\n", " - `a` to add a new cell **above** the currently selected cell\n", " - `b` to add a new cell **below** the currently selected cell\n", " - `d, d` (repeated) to delete the currently selected cell\n", " - `TAB` to activate code completion\n", " \n", "To try this out, create a new cell below this one using `b`, and print `my_variable` by starting with `print(my` and pressing `TAB`!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resetting Jupyter if you get stuck\n", "\n", "If at any point during this practical, you are unable to successfully run cells, it is possible that your Python interpreter is in a bad state due to cells being run in an incorrect order. If this happens, you can recover a working session by doing the following:\n", "\n", "1. Navigate to the \"Kernel\" menu at the top, and select \"Restart and clear output\".\n", "\n", "2. Select the cell you were working on, then select \"Run all above\" from the \"Cell\" menu at the top.\n", "\n", "3. If the problem persists, reach out to the faculty for help!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Import and initialize Hail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to Hail, we import a few methods from the Hail plotting library. We'll see examples soon!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:53:12.301526Z", "start_time": "2021-06-16T00:53:10.754725Z" } }, "outputs": [], "source": [ "import hail as hl\n", "from hail.plot import output_notebook, show" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we initialize Hail and set up plotting to display inline in the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:53:16.391741Z", "start_time": "2021-06-16T00:53:12.303399Z" } }, "outputs": [], "source": [ "hl.init()\n", "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook works on a small (~16MB) downsampled chunk of the publically available Human Genome Diversity Project (HGDP) dataset. HGDP is a super-set of the well-known [1000 genomes](https://www.internationalgenome.org/) dataset, with a broader group of represented populations.\n", "\n", "We can see the files used using `ls` below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:57:27.835145Z", "start_time": "2021-06-16T00:57:27.701051Z" } }, "outputs": [], "source": [ "! ls -lh resources/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Explore genetic data with Hail\n", "\n", "#### Learning Objectives:\n", "\n", "- To be comfortable exploring Hail data structures, especially the `MatrixTable`.\n", "- To understand categories of functionality for performing QC." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import data from VCF\n", "\n", "The [Variant Call Format (VCF)](https://en.wikipedia.org/wiki/Variant_Call_Format) is a common file format for representing genetic data collected on multiple individuals (samples).\n", "\n", "Hail has an [import_vcf](https://hail.is/docs/0.2/methods/impex.html#hail.methods.import_vcf) function that reads this file to a Hail `MatrixTable`, which is a general-purpose data structure that is often used to represent a matrix of genetic data.\n", "\n", "Why not work directly on the VCF? While VCF is a text format that is easy for humans to read, it is inefficient to process on a computer. \n", "\n", "The first thing we do is import (`import_vcf`) and convert the `VCF` file into a Hail native file format. This is done by using the `write` method below. Any queries that follow will now run much more quickly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:59:22.096923Z", "start_time": "2021-06-16T00:59:17.291718Z" } }, "outputs": [], "source": [ "hl.import_vcf('resources/hgdp_subset.vcf.bgz', min_partitions=4, reference_genome='GRCh38')\\\n", ".write('resources/hgdp.mt', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HGDP as a Hail `MatrixTable`\n", "\n", "We represent genetic data as a Hail [`MatrixTable`](https://hail.is/docs/0.2/overview/matrix_table.html), and name our variable `mt` to indicate this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:00:40.993969Z", "start_time": "2021-06-16T01:00:40.826452Z" } }, "outputs": [], "source": [ "mt = hl.read_matrix_table('resources/hgdp.mt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is a `MatrixTable`?\n", "\n", "Let's explore it!\n", "\n", "You can see:\n", " - **numeric** types:\n", " - integers (`int32`, `int64`), e.g. 
`5`\n", " - floating point numbers (`float32`, `float64`), e.g. `5.5` or `3e-8`\n", " - **strings** (`str`), e.g. `\"Foo\"`\n", " - **boolean** values (`bool`) e.g. `True`\n", " - **collections**:\n", " - arrays (`array`), e.g. `[1,1,2,3]`\n", " - sets (`set`), e.g. `{1,3}`\n", " - dictionaries (`dict`), e.g. `{'Foo': 5, 'Bar': 10}`\n", " - **genetic data types**:\n", " - loci (`locus`), e.g. `[GRCh37] 1:10000` or `[GRCh38] chr1:10024`\n", " - genotype calls (`call`), e.g. `0/2` or `1|0`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:00:42.382461Z", "start_time": "2021-06-16T01:00:41.951181Z" } }, "outputs": [], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "Take a few moments to explore the interactive representation of the matrix table above.\n", "\n", "* Where is the variant information (`locus` and `alleles`)? \n", "* Where is the sample identifier (`s`)?\n", "* Where is the genotype quality `GQ`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `show`\n", "\n", "Hail has a variety of functionality to help you quickly interrogate a dataset. The `show()` method prints the first few values of any field, and even prints in pretty HTML output in a Jupyter notebook! 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:01:33.148911Z", "start_time": "2021-06-16T01:01:32.839716Z" } }, "outputs": [], "source": [ "mt.s.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to show() the matrix table itself, which prints a portion of the top-left corner of the variant-by-sample matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:01:32.791317Z", "start_time": "2021-06-16T01:01:31.052206Z" } }, "outputs": [], "source": [ "# show() works fine with no arguments, but can print too little data by default on small screens!\n", "mt.show(n_cols=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above output is visually noisy because the matrix table has as lot of information in it. `show`ing just the called genotype (`GT`) is a bit more friendly.\n", "\n", "The printed representation of GT is explained below, where `a` is the reference allele and `A` is the alternate allele:\n", "\n", "`0/0` : homozygous reference or `aa`\n", "\n", "`0/1` : heterozygous or `Aa`\n", "\n", "`1/1` : homozygous alternate or `AA` \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:02:27.710797Z", "start_time": "2021-06-16T01:02:26.672662Z" } }, "outputs": [], "source": [ "mt.GT.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "There is a fourth value seen above, other than `0/0`, `0/1`, `1/1`. What is it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `summarize`\n", "`summarize` Prints (potentially) useful information about any field or object:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DP` is the read depth (number of short reads spanning a position for a given sample). 
Let's summarize all values of DP:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:03:25.537680Z", "start_time": "2021-06-16T01:03:24.790327Z" } }, "outputs": [], "source": [ "mt.DP.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`AD` is the array of allelic depth per allele at a called genotype. Note especially the missingness properties:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:03:32.473835Z", "start_time": "2021-06-16T01:03:31.705798Z" } }, "outputs": [], "source": [ "mt.AD.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "In the empty cell below, summarize some of the other fields on the matrix table. You can use the interactive widget above to find the names of some of the other fields.\n", "\n", "Share any interesting findings with your colleagues!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `count`\n", "\n", "`MatrixTable.count` returns a tuple with the number of rows (variants) and number of columns (samples)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:03:52.322494Z", "start_time": "2021-06-16T01:03:52.286311Z" } }, "outputs": [], "source": [ "mt.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The count above tells us that we have 10,441 variants and 392 samples. This is just a tiny slice of a real sequencing dataset. The largest sequencing datasets today comprise hundreds of thousands of samples and more than a billion variants." 
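, "\n", "\n", "Since `count` returns a tuple, it can also be unpacked into two variables (a small convenience, not required for anything that follows):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# count() returns a (rows, cols) tuple; unpack it for readable printing\n", "n_variants, n_samples = mt.count()\n", "print(f'{n_variants} variants and {n_samples} samples')"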
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hail has a large library of genetics functionality\n", "\n", "Hail can be used to analyze any kind of data (Hail team members have used Hail to analyze household financial data, USA election polling data, and even to build a bot that posts real-time updates about the Euro 2020 tournament to Slack). However, Hail does not have *only* general-purpose analysis functionality. Hail has a large set of functionality built for genetics and genomics.\n", "\n", "For example, `hl.summarize_variants` prints useful statistics about the variants in the dataset. These are not part of the generic `summarize()` function, which must support all kinds of data, not just variant data!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:11:30.328467Z", "start_time": "2021-06-16T01:11:27.813580Z" } }, "outputs": [], "source": [ "hl.summarize_variants(mt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Annotation and quality control" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrate sample information\n", "\n", "Our dataset currently only has sample IDs and genetic data. In order to run a toy GWAS, we need phenotype information.\n", "\n", "We can find it in the following file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:22:48.116268Z", "start_time": "2021-06-16T01:22:47.945999Z" } }, "outputs": [], "source": [ "! head resources/HGDP_sample_data.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can import it as a [Hail Table](https://hail.is/docs/0.2/overview/table.html) with [hl.import_table](https://hail.is/docs/0.2/methods/impex.html?highlight=import_table#hail.methods.import_table).\n", "\n", "We call it `sd` for \"sample data\"." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:22:53.640423Z", "start_time": "2021-06-16T01:22:53.344048Z" } }, "outputs": [], "source": [ "sd = hl.import_table('resources/HGDP_sample_data.tsv',\n", " key='sample_id',\n", " impute=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"key\" argument tells Hail to use the `sample_id` field as the table key, which is used to find matching values in joins. In a moment, we will be joining the `sd` table onto our matrix table so that we can use the sample data fields in our QC and analysis. It is also possible to specify a new key for an existing table using the `.key_by(...)` method.\n", "\n", "The \"impute\" argument tells Hail to impute the data types of the fields on the table. What does this mean? It means that you can ask Hail to figure out what is the data type in each column field such as `str` (string or just characters), `bool` (boolean or just true and false), `float64` (float or numbers with decimals), or `int32` (integer or numbers without decimals/whole numbers). If you don't use the `impute` flag or specify types manually with the `types` argument, each field will be imported as a string." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we can see the names and types of fields in the logging messages and in the `head` output above, we can also `show` this table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:23:16.711218Z", "start_time": "2021-06-16T01:23:16.113034Z" } }, "outputs": [], "source": [ "sd.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can `summarize` each field in `sd`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:23:23.819124Z", "start_time": "2021-06-16T01:23:22.675895Z" } }, "outputs": [], "source": [ "sd.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add sample data to our HGDP `MatrixTable`\n", "\n", "Let's now merge our genetic data (`mt`) with our sample data (`sd`).\n", "\n", "This is a join between the `sd` table and the columns of our matrix table. It just takes one line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:27:41.949211Z", "start_time": "2021-06-16T01:27:41.933565Z" } }, "outputs": [], "source": [ "mt = mt.annotate_cols(sample_data = sd[mt.s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What's going on here?\n", "\n", "Understanding what's going on here is a bit more difficult. To understand, we need to understand a few pieces:\n", "\n", "#### 1. `annotate` methods\n", "\n", "In Hail, `annotate` methods refer to **adding new fields**. \n", "\n", " - `MatrixTable`'s `annotate_cols` adds new column (**sample**) fields.\n", " - `MatrixTable`'s `annotate_rows` adds new row (**variant**) fields.\n", " - `MatrixTable`'s `annotate_entries` adds new entry (**genotype**) fields.\n", " - `Table`'s `annotate` adds new row fields.\n", "\n", "In the above cell, we are adding a new column (**sample**) field called \"sample_data\". 
This field should be the values in our table `sd` associated with the sample ID `s` in our `MatrixTable` - that is, this is performing a **join**.\n", "\n", "Python uses square brackets to look up values in dictionaries:\n", "\n", " >>> d = {'foo': 5, 'bar': 10}\n", " \n", " >>> d['foo']\n", " 5\n", "\n", "You should think of this in much the same way - for each column of `mt`, we are looking up the fields in `sd` using the sample ID `s`.\n", "\n", "Let's see how the matrix table has changed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:27:42.962564Z", "start_time": "2021-06-16T01:27:42.443875Z" } }, "outputs": [], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cheat sheets\n", "\n", "More information about matrix tables and tables can be found in a graphical representation as Hail cheat sheets:\n", "\n", " - [MatrixTable](https://hail.is/docs/0.2/_static/cheatsheets/hail_matrix_tables_cheat_sheet.pdf)\n", " - [Table](https://hail.is/docs/0.2/_static/cheatsheets/hail_tables_cheat_sheet.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query the sample data\n", "\n", "We will use some of the general-purpose query functionality to interrogate the sample data we have imported.\n", "\n", "The code below uses the `aggregate_cols` method on our matrix table, which computes aggregate statistics about column (sample) data. There are also methods for `aggregate_rows` (aggregate over row data) and `aggregate_entries` (aggregate over all of the entries in our matrix, one per variant per sample).\n", "\n", "Hail **aggregators** can be recognized by the `hl.agg` prefix. 
Some examples:\n", "\n", " - `hl.agg.fraction(CONDITION)` - compute the fraction of values at which `CONDITION` is true.\n", " - `hl.agg.count_where(CONDITION)` - compute the number of values at which `CONDITION` is true.\n", " - `hl.agg.stats(X)` - compute a few useful statistics about `X`.\n", " - `hl.agg.counter(X)` - compute the number of occurrences of each unique value of `X`. Useful for categorical fields like strings, not as useful for numbers!\n", " - `hl.agg.corr(X, Y)` - compute the Pearson correlation coefficient between X and Y values.\n", " - For more adventurous students, see the [full list of aggregators](https://hail.is/docs/0.2/aggregators.html#sec-aggregators).\n", "\n", "\n", "### Sex karyotype\n", "\n", "To start, we will compute the occurrences of each value of `sex_karyotype`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:28:14.204263Z", "start_time": "2021-06-16T01:28:13.670269Z" } }, "outputs": [], "source": [ "mt.aggregate_cols(hl.agg.counter(mt.sample_data.sex_karyotype))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above result tells us that slightly more than half of our samples are XY, and the rest are XX. What should you do if some of your samples are neither XX nor XY? That depends on the analysis you are trying to do, but you should be ready to think about this case!\n", "\n", "### Ancestry\n", "\n", "How many people are in each self-reported major continental ancestry group?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:28:14.605478Z", "start_time": "2021-06-16T01:28:14.207441Z" } }, "outputs": [], "source": [ "mt.aggregate_cols(hl.agg.counter(mt.sample_data.continental_pop))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Try changing `continental_pop` to `pop` and rerunning the cell above. 
Most of the populations are abbreviated, but see if you can find an ancestral population from each continent among the non-abbreviated ones!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numeric aggregations\n", "\n", "The numeric aggregators are used the same way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:28:14.948873Z", "start_time": "2021-06-16T01:28:14.607861Z" } }, "outputs": [], "source": [ "mt.aggregate_cols(hl.agg.stats(mt.sample_data.sleep_duration))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Use the `fraction` and `count_where` aggregators to answer the following questions in the cells below:\n", "\n", "1. How many samples drink more than 8 cups of tea per day? *Hint: the CONDITION will take the form `SOMETHING > 8`.*\n", "\n", "2. What fraction of samples sleep less than 4 hours per day?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checkpoint #1. \n", "\n", "You've finished the annotation section. Now is a good time to take a bio break or ask any pertinent questions to the faculty!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Interrogating SNP distributions\n", "\n", "You might have noticed in the above `hl.summarize_variants()` printout that the only \"allele type\" present in our data is SNP. In a real sequencing dataset, you'll see insertions, deletions, \"star\" alleles (upstream spanning deletions), and complex variation (none of the above).\n", "\n", "We selected only SNPs from the HGDP dataset for simplicity. However, since we've got a few thousand SNPs, we might as well look at the frequency of the twelve SNP polymorphisms. 
\n", "\n", "We can do this with an aggregator!\n", "\n", "The `collections.Counter` class we use below is a builtin Python data structure that helps print out our query result sorted by number of occurrences." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:18:39.911149Z", "start_time": "2021-06-16T01:18:39.570472Z" } }, "outputs": [], "source": [ "from collections import Counter\n", "Counter(mt.aggregate_rows(hl.agg.counter((mt.alleles[0], mt.alleles[1])))).most_common()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Discuss these two questions as a group:\n", "\n", "**Question 1** - Why do the C/T and G/A SNPs have roughly the same frequency? Why do the T/C and A/G SNPs have roughly the same frequency?\n", "\n", "**Question 2** - Why does the C/T SNP occur so much more frequently than a T/A SNP?\n", "\n", "*Hint: The answer to these questions doesn't involve statistics, but basic biology!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More complicated aggregators\n", "\n", "You've seen and tried a taste of aggregator functionality, but aggregators in Hail can be very expressive. As an example, we're going to compute the sample IDs of the five happiest XX individuals. 
Don't worry about being able to reproduce this on your own right now!\n", "\n", "The answer involves using the [filter](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.filter) aggregator in combination with the [take](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.take) aggregator, and using the `take` aggregator's optional `ordering` argument to indicate that we want 5 sample IDs sorted by happiness.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:35:50.507269Z", "start_time": "2021-06-16T01:35:50.140069Z" } }, "outputs": [], "source": [ "mt.aggregate_cols(\n", " hl.agg.filter(\n", " mt.sample_data.sex_karyotype == 'XX',\n", " hl.agg.take(hl.struct(sample_id=mt.s, happiness=mt.sample_data.general_happiness),\n", " 5,\n", " ordering=-mt.sample_data.general_happiness)\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checkpoint #2. \n", "\n", "You've finished the basic aggregation materials!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Sample QC\n", "\n", "In past workshop sessions you have learned the important adage: _garbage in, garbage out_. Good QC takes time and thoughtfulness but is necessary for robust results. \n", "\n", "Here, we run through some simple sample qc steps, but **these steps are not a one-size-fits-all solution for QC on your own data!**\n", "\n", "Hail has the function [hl.sample_qc](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc) to compute a list of useful statistics about samples from sequencing data. This function adds a new column field, `sample_qc`, with the computed statistics. Note that this doesn't actually remove samples for you -- those decisions are up to you. 
The `sample_qc` method gives you some data you can use as a starting point.\n", "\n", "**Click the sample_qc link** above to see the documentation, which lists the computed fields and their descriptions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt = hl.sample_qc(mt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hail includes a plotting library built on [bokeh](https://bokeh.pydata.org/en/latest/index.html) that makes it easy to visualize fields of Hail tables and matrix tables.\n", "\n", "Let's visualize the pairwise distribution of `Mean DP` (Read Depth) and `Call Rate`.\n", "\n", "Note that you can **hover over points with your mouse to see the sample IDs!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = hl.plot.scatter(x=mt.sample_qc.dp_stats.mean,\n", " y=mt.sample_qc.call_rate,\n", " xlabel='Mean DP',\n", " ylabel='Call Rate',\n", " hover_fields={'ID': mt.s},\n", " size=8)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Try adding the following argument to the plot function call above, just below the `hover_fields=...` line:\n", "\n", " label=mt.sample_data.sex_karyotype," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter columns using generated QC statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filtering samples, we should compute a raw sample count:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.count_cols()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`filter_cols` removes entire columns from the matrix table. 
Here, we keep columns (samples) where the `call_rate` is at least 92%:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.92)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute a final sample count:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.count_cols()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many samples did not meet your QC criteria?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Variant QC\n", "\n", "Now that we have successfully gone through basic sample QC using the function `sample_qc` and general-purpose filtering methods, let's do variant QC.\n", "\n", "\n", "Hail has the function [hl.variant_qc](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc) to compute a list of useful statistics about **variants** from sequencing data.\n", "\n", "Once again, **Click the link** above to see the documentation!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt = hl.variant_qc(mt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the `variant_qc` output!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can `show()` the computed information:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.variant_qc.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metrics like `call_rate` are important for QC. 
Let's plot the cumulative distribution function of call rate per variant:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "show(hl.plot.cdf(mt.variant_qc.call_rate))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filtering variants, we should compute a raw variant count:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pre-qc variant count\n", "mt.count_rows()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`filter_rows` removes entire rows of the matrix table. Here, we keep rows where the `call_rate` is over 95%:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hardy-Weinberg filter\n", "\n", "In past sessions, you have seen filters on Hardy-Weinberg equilibrium using tools like PLINK.\n", "\n", "Hail can do the same. First, let's plot the log-scaled HWE p-values per variant:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = hl.plot.histogram(hl.log10(mt.variant_qc.p_value_hwe))\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are some **extremely** bad variants here, with p-values smaller than 1e-100. Let's look at some of these variants:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows = mt.rows()\n", "rows.order_by(rows.variant_qc.p_value_hwe).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "There's a lot of information in the above output. Take a moment to look through, and remember, these are **bad-quality variants**. Can you see why these variants had such low HWE p-values? *Hint: scroll all the way to the right, to the variant_qc output*."
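] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To build intuition for the HWE test: under Hardy-Weinberg equilibrium, a variant with alternate-allele frequency q has expected genotype frequencies (1-q)^2, 2q(1-q), and q^2, so the expected heterozygote fraction is 2q(1-q), at most 0.5. A plain-Python sketch of that expectation (illustrative only, not Hail code):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: expected heterozygote fraction under Hardy-Weinberg equilibrium.\n", "def expected_het_freq(alt_af):\n", "    # genotype frequencies under HWE: (1-q)^2, 2q(1-q), q^2\n", "    return 2 * alt_af * (1 - alt_af)\n", "\n", "# A variant where nearly every sample is heterozygous is far from any possible\n", "# HWE expectation, so its HWE p-value will be tiny.\n", "expected_het_freq(0.5)  # -> 0.5" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compare the `het_freq_hwe` and `n_het` fields in the `variant_qc` output above with this expectation."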
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The VCF `FILTER` field\n", "\n", "You're not entirely on your own for variant QC -- most variant calling software generates its own filtering annotations that are present in the `FILTER` field of the VCF. Lines of the VCF might have reasons to be filtered, or might be `PASS`. Above, almost every one of these bad variants has the `\"InbreedingCoeff\"` filter listed in its `mt.filters` field!\n", "\n", "We can remove all of these pre-filtered variants by keeping only variants which have no filters applied." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt = mt.filter_rows(hl.len(mt.filters) == 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then compute the final sample and variant count:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many variants did you lose from your variant QC? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checkpoint #3.\n", "\n", "You've finished the QC materials!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8. Association Testing and PCA\n", "\n", "The goal of a GWAS is to find associations between genotypes and a trait of interest.\n", "\n", "First, we filter to common variants (those with a minor allele frequency over 1%). GWAS is not well-powered to detect signal from extremely rare variants, like those only observed in one individual.\n", "\n", "Our filter below removes variants where the minimum value of AF is 1% or lower. The AF array computed by `hl.variant_qc` contains one value for each allele, **including the reference**, and sums to 1.0.\n", "\n", "A variant with 5% MAF might have AF values `[0.95, 0.05]` or `[0.05, 0.95]`. A variant with 0.1% MAF might have AF values `[0.999, 0.001]` or `[0.001, 0.999]`. 
This filter ensures that we account for both cases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = mt\n", "mt = base.filter_rows(hl.min(base.variant_qc.AF) > 0.01)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many variants did you lose from your common variant filter? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that in a GWAS we're running independent association tests on each variant.\n", "\n", "In Hail, the method we use is [hl.linear_regression_rows](https://hail.is/docs/0.2/methods/stats.html#hail.methods.linear_regression_rows). Why isn't this called `hl.gwas`? Modularity! There are applications for this statistical method other than genome wide association studies.\n", "\n", "We will start by using the `sleep_duration` phenotype. At the end, you will have a chance to try the others (you can do that by editing the string below to one of the other phenotypes and rerunning the rest of the notebook). The other phenotypes are:\n", "\n", " - 'tea_intake_daily'\n", " - 'general_happiness'\n", " - 'screen_time_per_day'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "phenotype = 'sleep_duration'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gwas = hl.linear_regression_rows(y=mt.sample_data[phenotype], \n", " x=mt.GT.n_alt_alleles(), \n", " covariates=[1.0] # intercept term\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting output from the line above is a `table`. Let's look at what the table looks like." 
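] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For intuition about what each row of that table contains: per variant, `linear_regression_rows` fits an ordinary least squares regression of the phenotype on alt-allele counts, and with an intercept the slope (the `beta` field) is cov(x, y) / var(x). A stdlib-only sketch of that slope for a single variant (illustrative only, not Hail's implementation):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: the OLS slope for one variant, beta = cov(x, y) / var(x),\n", "# where x is each sample's alt-allele count and y is the phenotype.\n", "def ols_beta(alt_counts, phenotype):\n", "    n = len(alt_counts)\n", "    mean_x = sum(alt_counts) / n\n", "    mean_y = sum(phenotype) / n\n", "    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(alt_counts, phenotype))\n", "    var = sum((x - mean_x) ** 2 for x in alt_counts)\n", "    return cov / var\n", "\n", "ols_beta([0, 1, 2, 0, 1], [1.0, 2.0, 3.0, 1.0, 2.0])  # y = x + 1 exactly -> 1.0" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Hail computes this slope (plus standard errors, t-statistics, and p-values) for every variant at once."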
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of all the fields, we recommend focusing on the p-value and the beta (effect size) when determining whether you have a GWAS signal.\n", "\n", "**However**, keep in mind that these results may indicate that your model needs adjustment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gwas.show()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 9. Visualization\n", "\n", "Let’s visualize our association test results from the linear regression. We can do so by creating two common plots: a [Manhattan plot](https://en.wikipedia.org/wiki/Manhattan_plot) and a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot).\n", "\n", "We'll start with the Manhattan plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "p = hl.plot.manhattan(gwas.p_value)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Manhattan plot interpretation\n", "\n", "The dashed red line is the line for genome-wide significance with a typical Bonferroni correction: 5e-8.\n", "\n", "We have several points above the line -- mouse over to see the loci they refer to. This means we're ready to publish our sleep GWAS in *Nature*, right?\n", "\n", "...right?" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Q-Q plot\n", "The plot that accompanies a Manhattan plot is the Q-Q (quantile-quantile) plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "p = hl.plot.qq(gwas.p_value)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The Q-Q plot can clearly indicate widespread inflation or deflation of p-values. 
Is either of those properties present here?\n", "\n", "### Exercise\n", "\n", "**Is this GWAS well controlled? Discuss with your group.**\n", "\n", "Wikipedia has a good description of [genomic control estimation](https://en.wikipedia.org/wiki/Genomic_control) (lambda GC) to read later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checkpoint #4\n", "\n", "You've used Hail to run a basic GWAS! It's more complicated than running in PLINK, but hopefully you can see that each part of the process is flexible and configurable!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 10. Confounded! PCA to the rescue.\n", "\n", "If you've done a GWAS before, you've probably included a few other covariates that might be confounders -- age, sex, and principal components.\n", "\n", "Principal components are a measure of genetic ancestry, and can be used to control for [population stratification](https://en.wikipedia.org/wiki/Population_stratification).\n", "\n", "Principal component analysis (PCA) is a very general statistical method for reducing high-dimensional data to a small number of dimensions which capture most of the variation in the data. Hail has the function [pca](https://hail.is/docs/0.2/methods/stats.html#hail.methods.pca) for performing generic PCA.\n", "\n", "PCA typically works best on normalized data (e.g. mean centered). Hail provides the specialized function [hwe_normalized_pca](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca) which first normalizes the genotypes according to the Hardy-Weinberg Equilibrium model."
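] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concretely, if a variant's alternate-allele frequency is p, each genotype g (0, 1, or 2 alternate alleles) is centered and scaled as (g - 2p) / sqrt(2p(1 - p)), giving mean 0 and, under HWE, unit variance. A plain-Python sketch for one variant (illustrative only, not Hail's exact implementation):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: HWE-normalize one variant's genotypes (alt-allele counts).\n", "from math import sqrt\n", "\n", "def hwe_normalize(genotypes):\n", "    p = sum(genotypes) / (2 * len(genotypes))  # empirical alt-allele frequency\n", "    scale = sqrt(2 * p * (1 - p))              # sd of one genotype under HWE\n", "    return [(g - 2 * p) / scale for g in genotypes]\n", "\n", "hwe_normalize([0, 1, 2, 1])  # centered: the normalized values sum to zero" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that a monomorphic variant (p = 0 or p = 1) would divide by zero here, which is why monomorphic variants cannot contribute to PCA."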
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_scores = pca_scores.checkpoint('resources/pca_scores.ht', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pca function returns three things:\n", "* **eigenvalues**: an array of doubles\n", "* **scores**: a table keyed by sample\n", "* **loadings**: a table keyed by variant\n", "\n", "The **loadings** are the *principal directions*, each of which is a weighted sum of variants (a \"virtual variant\"). By default, `hwe_normalized_pca` gives us 10 principal directions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joining computed PCs onto our matrix table\n", "\n", "Using the same syntax as we used to join the sample data table `sd` above, we can join the computed scores onto our matrix table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mt = mt.annotate_cols(pca = pca_scores[mt.s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize principal components\n", "\n", "Let's make plots!" 
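] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, a quick aside on the **eigenvalues**: the fraction of the captured variance explained by PC k is its eigenvalue divided by the sum of all ten. A plain-Python sketch (illustrative only; it assumes `pca_eigenvalues` from above is a list of floats):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: fraction of captured variance explained by each PC.\n", "def variance_fractions(eigenvalues):\n", "    total = sum(eigenvalues)\n", "    return [ev / total for ev in eigenvalues]\n", "\n", "variance_fractions([4.0, 3.0, 2.0, 1.0])  # -> [0.4, 0.3, 0.2, 0.1]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If the first two fractions dominate, a PC1-versus-PC2 scatter plot like the one below captures most of the structure."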
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p1 = hl.plot.scatter(x=mt.pca.scores[0],\n", " y=mt.pca.scores[1],\n", " xlabel='PC1',\n", " ylabel='PC2',\n", " hover_fields={'ID': mt.s, 'pop': mt.sample_data.pop},\n", " label=mt.sample_data.continental_pop,\n", " size=8)\n", "\n", "p2 = hl.plot.scatter(x=mt.pca.scores[2],\n", " y=mt.pca.scores[3],\n", " xlabel='PC3',\n", " ylabel='PC4',\n", " hover_fields={'ID': mt.s, 'pop': mt.sample_data.pop},\n", " label=mt.sample_data.continental_pop,\n", " size=8)\n", "\n", "show(p1)\n", "show(p2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Compare your plots with your group. **Besides the randomly-chosen coloration**, do they look the same? If not, how do they differ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checkpoint #5\n", "\n", "You've finished the PCA materials!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 11. Control confounders and run another GWAS\n", "\n", "Now that we have computed principal components and saved them into our `mt`, let’s run another GWAS that includes PCs as covariates to account for confounding due to population stratification.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gwas2 = hl.linear_regression_rows(\n", " y=mt.sample_data[phenotype],\n", " x=mt.GT.n_alt_alleles(),\n", " covariates=[1.0, mt.pca.scores[0], mt.pca.scores[1], mt.pca.scores[2]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = hl.plot.qq(gwas2.p_value)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above Q-Q plot indicates a much better-controlled GWAS. 
Let's try the Manhattan plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = hl.plot.manhattan(gwas2.p_value)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Are there any significant hits? Do you trust these results? Why or why not?\n", "\n", "Discuss with your group - roughly how many samples would you need to discover rare variant signal? polygenic signal?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Go back to the \"Association Testing and PCA\" section above and rerun the rest of the notebook with a different phenotype. You will need to edit the cell where `phenotype = 'sleep_duration'`, but rerun all the cells starting from the \"Association testing and PCA\" heading beginning with `base = mt...`.\n", "\n", "### Exercise\n", "\n", "Change the \"gwas2\" cell to experiment with how many principal components are needed as covariates to properly control this GWAS. How many are needed here in our tiny simulated example? How many are needed in a typical GWAS?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The end! \n", "\n", "If you still have time and desire, there is another practical: `02-Hail-rare-variant-analysis`. You can access that notebook from the Jupyter home page. But please don't rush straight to that notebook -- you might learn more reviewing the code you ran here, digesting it, experimenting with the code, and asking questions!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# When the workshop ends and you return to your life\n", "\n", "The hosted notebook service that is running this notebook will be turned off in a few hours, but you can continue using Hail!\n", "\n", "The [Hail website](https://www.hail.is) has a page with information about [getting started](https://hail.is/docs/0.2/getting_started.html). 
If you have a MacOS or Linux computer, or have access to a Linux server, you can run Hail. \n", "\n", "It is also possible to run Hail on Google Cloud. See the video lectures for guidance on how to do that, or reach out to the team for help!\n", "\n", "## The Hail community\n", "\n", "Although Hail has a steeper learning curve than many command-line tools, you won't be learning it alone! Hail has a [forum](https://discuss.hail.is) and [Zulip chatroom](https://hail.zulipchat.com) full of like-minded users of all experience levels. Please stop by to say hello!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "242px", "width": "283px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }