{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Institute for Behavioral Genetics International Statistical Genetics 2023 Workshop \n", "\n", "## Common Variant Analysis of Sequencing Data with Hail\n", "\n", "In this practical, we will learn how to:\n", "\n", "1) Use simple python code and Jupyter notebooks.\n", "\n", "2) Use Hail to import a VCF and run basic queries over sequencing data.\n", "\n", "3) Use Hail to perform basic quality control on sequencing data.\n", "\n", "4) Use Hail to run a basic genome-wide association study on common variants in sequencing data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "\n", "### It doesn't all need to \"stick\" today\n", "\n", "This practical contains a lot of new material, and the goal of this workbook is not for you to be able to reproduce from memory all the various capabilities demonstrated here. Instead, the goal is for you to get a sense for the kind of analysis tasks that sequencing data requires, and to gain some exposure to what these analyses look like in Hail. \n", "\n", "There is no one-size-fits-all sequencing analysis pipeline, because each sequencing dataset will have unique properties that need to be understood and accounted for in QC and analysis. Hail can empower you to interrogate sequencing data, but it cannot give you all the questions to ask!\n", "\n", "Some of the questions and exercises in this notebook might seem unrelated to the specific task of analyzing sequencing data, but that is intentional -- Hail is a computational tool that hopes to help you indulge your scientific curiosity, and asking and answering a variety of questions about many aspects of your data is the best way to learn *how to Hail*.\n", "\n", "We don't expect you to be able to run a full GWAS on your own data in Hail tomorrow. 
If this is something you want to do, there are **lots more** resources available -- documentation, cookbooks, tutorials, and most importantly, the Hail community on the [forum](https://discuss.hail.is) and [zulip chatroom](https://hail.zulipchat.com).\n", "\n", "### We encourage you to play\n", "\n", "Hail is a highly expressive library with lots of functionality -- you'll see just a small fraction of it today. Throughout this notebook and especially in the denoted **exercises**, we encourage you to experiment with the code being run to see what happens! Sometimes it will be an error, but sometimes you will encounter new pieces of functionality. If you're curious about how to use Hail to ask slightly different questions than the code or exercises here, please ask the faculty! We are eager to help.\n", "\n", "### Interactive analysis on the cloud\n", "\n", "Part of what we think is so exciting about Hail is that Hail development has coincided with other technological shifts in the data science community.\n", "\n", "Five years ago, most computational biologists analyzed sequencing data using command-line tools, and took advantage of research compute clusters by explicitly using scheduling frameworks like LSF or Sun Grid Engine. These clusters are powerful resources, but it requires a great deal of extra thought and effort to manage pipelines running on them.\n", "\n", "Today, most Hail users run Hail from within interactive Python notebooks (like this one!) backed by thousands of cores on public compute clouds like [Google Cloud](https://cloud.google.com/), [Amazon Web Services](https://aws.amazon.com/), or [Microsoft Azure](https://azure.microsoft.com/). You don't need to share a cluster with hundreds of other scientists, and you only need to pay for the resources that you use.\n", "\n", "You won't get hands-on experience with this kind of usage today, but there are lots of resources to help you get started if you're interested in that. 
Please stay in touch with us after the workshop ends!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Using Jupyter\n", "\n", "The notebook software that you are using right now is called [Jupyter](https://jupyter.org/), whose name combines the languages **Ju**lia, **Pyt**hon, and **R**.\n", "\n", "**Learning objectives**\n", "\n", " - be comfortable running, editing, adding, and deleting code cells.\n", " - learn techniques for unblocking yourself if Jupyter acts up.\n", "\n", "### Running cells\n", "Evaluate cells using SHIFT + ENTER. Select the next cell and run it. If you prefer clicking, you can select the cell and click the \"Run\" button in the toolbar above." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, world\n" ] } ], "source": [ "print('Hello, world')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modes\n", "\n", "Jupyter has two modes, a **navigation mode** and an **editor mode**.\n", "\n", "#### Navigation mode:\n", "\n", " - BLUE cell borders\n", " - `UP` / `DOWN` move between cells\n", " - `ENTER` while a cell is selected will move to **editor mode**.\n", " - Many letters are keyboard shortcuts! This is a common trap.\n", " \n", "#### Editor mode:\n", "\n", " - GREEN cell borders\n", " - `UP` / `DOWN` move within cells before moving between cells.\n", " - `ESC` will return to **navigation mode**.\n", " - `SHIFT + ENTER` will evaluate a cell and return to **navigation mode**.\n", " \n", "Try editing this markdown cell by double clicking, then re-rendering it by \"running\" the cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell types\n", "\n", "There are several types of cells in Jupyter notebooks. The two you will see in this notebook are **Markdown** (text) and **Code**." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:52:03.914467Z", "start_time": "2021-06-16T00:52:03.911988Z" } }, "outputs": [], "source": [ "# This is a code cell\n", "my_variable = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**This is a markdown cell**, so even if something looks like code (as below), it won't get executed!\n", "\n", "my_variable += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shell commands\n", "\n", "It is possible to call command-line utilities from Jupyter by prefixing a line with a `!`. For instance, we can print the current directory:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:56:59.757110Z", "start_time": "2021-06-16T00:56:59.630749Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/timp/2023_IBG_HAIL/2021_IBG_Hail\n" ] } ], "source": [ "! pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tips and tricks\n", "\n", "Keyboard shortcuts:\n", "\n", " - `SHIFT + ENTER` to evaluate a cell\n", " - `ESC` to return to navigation mode\n", " - `y` to turn a markdown cell into code\n", " - `m` to turn a code cell into markdown\n", " - `a` to add a new cell **above** the currently selected cell\n", " - `b` to add a new cell **below** the currently selected cell\n", " - `d, d` (repeated) to delete the currently selected cell\n", " - `TAB` to activate code completion\n", " \n", "To try this out, create a new cell below this one using `b`, and print `my_variable` by starting with `print(my` and pressing `TAB`!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resetting Jupyter if you get stuck\n", "\n", "If at any point during this practical, you are unable to successfully run cells, it is possible that your Python interpreter is in a bad state due to cells being run in an incorrect order. 
If this happens, you can recover a working session by doing the following:\n", "\n", "1. Navigate to the \"Kernel\" menu at the top, and select \"Restart and clear output\".\n", "\n", "2. Select the cell you were working on, then select \"Run all above\" from the \"Cell\" menu at the top.\n", "\n", "3. If the problem persists, reach out to the faculty for help!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Import and initialize Hail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to Hail, we import a few methods from the Hail plotting library. We'll see examples soon!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:53:12.301526Z", "start_time": "2021-06-16T00:53:10.754725Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23/03/10 09:37:38 WARN Utils: Your hostname, ip-10-0-201-172 resolves to a loopback address: 127.0.1.1; using 10.0.201.172 instead (on interface ens5)\n", "23/03/10 09:37:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\n", "23/03/10 09:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "23/03/10 09:37:39 WARN Hail: This Hail JAR was compiled for Spark 3.3.0, running with Spark 3.3.2.\n", " Compatibility is not guaranteed.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Running on Apache Spark version 3.3.2\n", "SparkUI available at http://10.0.201.172:4040\n", "Welcome to\n", " __ __ <>__\n", " / /_/ /__ __/ /\n", " / __ / _ `/ / /\n", " /_/ /_/\\_,_/_/_/ version 0.2.109-b123465fc0bf\n", "LOGGING: writing to /home/timp/2023_IBG_HAIL/2021_IBG_Hail/hail-20230310-0937-0.2.109-b123465fc0bf.log\n" ] } ], "source": [ "import os\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 6G pyspark-shell'\n", "import hail as hl\n", "hl.init()\n", "\n", "from hail.ggplot import *\n", "\n", "import plotly\n", "import plotly.io as pio\n", "pio.renderers.default='iframe'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook works on a small (~16MB) downsampled chunk of the publicly available Human Genome Diversity Project (HGDP) dataset. 
HGDP is a superset of the well-known [1000 genomes](https://www.internationalgenome.org/) dataset, with a broader group of represented populations.\n", "\n", "We can list the files used with `ls` below:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:57:27.835145Z", "start_time": "2021-06-16T00:57:27.701051Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 19M\n", "-rwxr-xr-x 1 timp workshop 15K Mar 10 07:52 HGDP_sample_data.tsv\n", "-rwxr-xr-x 1 timp workshop 2.5M Mar 10 07:52 ensembl_gene_annotations.txt\n", "drwxr-xr-x 5 timp workshop 6.0K Mar 10 07:52 genes.ht\n", "drwxr-xr-x 7 timp workshop 6.0K Mar 10 07:52 hgdp-tgp-rare-variants.mt\n", "drwxr-xr-x 8 timp workshop 6.0K Mar 10 08:59 hgdp.mt\n", "-rwxr-xr-x 1 timp workshop 382K Mar 10 07:52 hgdp_gene_annotations.tsv\n", "-rwxr-xr-x 1 timp workshop 16M Mar 10 07:52 hgdp_subset.vcf.bgz\n", "drwxr-xr-x 6 timp workshop 6.0K Mar 10 09:01 pca_scores.ht\n", "drwxr-xr-x 5 timp workshop 6.0K Mar 10 07:52 rare-variant-phenotypes.ht\n" ] } ], "source": [ "! ls -lh resources/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Explore genetic data with Hail\n", "\n", "#### Learning Objectives:\n", "\n", "- To be comfortable exploring Hail data structures, especially the `MatrixTable`.\n", "- To understand categories of functionality for performing QC." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import data from VCF\n", "\n", "The [Variant Call Format (VCF)](https://en.wikipedia.org/wiki/Variant_Call_Format) is a common file format for representing genetic data collected on multiple individuals (samples).\n", "\n", "Hail has an [import_vcf](https://hail.is/docs/0.2/methods/impex.html#hail.methods.import_vcf) function that reads this file to a Hail `MatrixTable`, which is a general-purpose data structure that is often used to represent a matrix of genetic data.\n", "\n", "Why not work directly on the VCF? While VCF is a text format that is easy for humans to read, it is inefficient to process on a computer. \n", "\n", "The first thing we do is import (`import_vcf`) and convert the `VCF` file into a Hail native file format. This is done by using the `write` method below. Any queries that follow will now run much more quickly." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T00:59:22.096923Z", "start_time": "2021-06-16T00:59:17.291718Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 1:=============================> (2 + 2) / 4]\r" ] } ], "source": [ "hl.import_vcf('resources/hgdp_subset.vcf.bgz', min_partitions=4, reference_genome='GRCh38')\\\n", ".write('resources/hgdp.mt', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HGDP as a Hail `MatrixTable`\n", "\n", "We represent genetic data as a Hail [`MatrixTable`](https://hail.is/docs/0.2/overview/matrix_table.html), and name our variable `mt` to indicate this." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:00:40.993969Z", "start_time": "2021-06-16T01:00:40.826452Z" } }, "outputs": [], "source": [ "mt = hl.read_matrix_table('resources/hgdp.mt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is a `MatrixTable`?\n", "\n", "Let's explore it!\n", "\n", "You can see:\n", " - **numeric** types:\n", " - integers (`int32`, `int64`), e.g. `5`\n", " - floating point numbers (`float32`, `float64`), e.g. `5.5` or `3e-8`\n", " - **strings** (`str`), e.g. `\"Foo\"`\n", " - **boolean** values (`bool`) e.g. `True`\n", " - **collections**:\n", " - arrays (`array`), e.g. `[1,1,2,3]`\n", " - sets (`set`), e.g. `{1,3}`\n", " - dictionaries (`dict`), e.g. `{'Foo': 5, 'Bar': 10}`\n", " - **genetic data types**:\n", " - loci (`locus`), e.g. `[GRCh37] 1:10000` or `[GRCh38] chr1:10024`\n", " - genotype calls (`call`), e.g. `0/2` or `1|0`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:00:42.382461Z", "start_time": "2021-06-16T01:00:41.951181Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "61ccc7e23b9f4b64aa503ec4ca8d2b77", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "10b683ed70de440891bbe7272f5a2040", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Tab(children=(VBox(children=(HTML(value='

Global fields, with one value in the dataset.

\\n

C…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "Take a few moments to explore the interactive representation of the matrix table above.\n", "\n", "* Where is the variant information (`locus` and `alleles`)? \n", "* Where is the sample identifier (`s`)?\n", "* Where is the genotype quality `GQ`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `show`\n", "\n", "Hail has a variety of functionality to help you quickly interrogate a dataset. The `show()` method prints the first few values of any field, and even prints in pretty HTML output in a Jupyter notebook! " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:01:33.148911Z", "start_time": "2021-06-16T01:01:32.839716Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
s
str
"LP6005441-DNA_F08"
"LP6005441-DNA_C05"
"HGDP00961"
"HGDP00804"
"HGDP00926"
"HGDP00716"
"HGDP01269"
"HGDP00241"
"HGDP00110"
"LP6005441-DNA_F02"

showing top 10 rows

\n" ], "text/plain": [ "+---------------------+\n", "| s |\n", "+---------------------+\n", "| str |\n", "+---------------------+\n", "| \"LP6005441-DNA_F08\" |\n", "| \"LP6005441-DNA_C05\" |\n", "| \"HGDP00961\" |\n", "| \"HGDP00804\" |\n", "| \"HGDP00926\" |\n", "| \"HGDP00716\" |\n", "| \"HGDP01269\" |\n", "| \"HGDP00241\" |\n", "| \"HGDP00110\" |\n", "| \"LP6005441-DNA_F02\" |\n", "+---------------------+\n", "showing top 10 rows" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.s.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to show() the matrix table itself, which prints a portion of the top-left corner of the variant-by-sample matrix:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:01:32.791317Z", "start_time": "2021-06-16T01:01:31.052206Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
'LP6005441-DNA_F08'
'LP6005441-DNA_C05'
'HGDP00961'
locus
alleles
GT
DP
GQ
AD
PL
GT
DP
GQ
AD
PL
GT
DP
GQ
AD
PL
locus<GRCh38>array<str>callint32int32array<int32>array<int32>callint32int32array<int32>array<int32>callint32int32array<int32>array<int32>
chr1:17379["G","A"]0/01120NANA0/01530NANA0/03760NANA
chr1:95068["G","A"]0/01520NANA0/02730NANANANANANANA
chr1:111735["C","A"]0/01420NANA0/01520NANA0/11599[8,7][153,0,153]
chr1:134610["G","A"]0/0820NANANANANANANANANANANANA
chr1:414783["T","C"]NANANANANA0/0920NANANANANANANA
chr1:1130877["C","G"]0/02440NANA0/02730NANA0/03320NANA
chr1:1226707["C","G"]0/02620NANA0/02520NANA0/02920NANA
chr1:1491494["G","A"]0/02740NANA0/03020NANA0/02720NANA
chr1:1618118["G","A"]0/02720NANA0/03040NANA0/14499[22,22][621,0,622]
chr1:2078529["G","A"]0/03050NANA0/03760NANA0/02940NANA

showing top 10 rows

\n", "

showing the first 3 of 392 columns

\n" ], "text/plain": [ "+---------------+------------+------------------------+------------------------+\n", "| locus | alleles | 'LP6005441-DNA_F08'.GT | 'LP6005441-DNA_F08'.DP |\n", "+---------------+------------+------------------------+------------------------+\n", "| locus | array | call | int32 |\n", "+---------------+------------+------------------------+------------------------+\n", "| chr1:17379 | [\"G\",\"A\"] | 0/0 | 11 |\n", "| chr1:95068 | [\"G\",\"A\"] | 0/0 | 15 |\n", "| chr1:111735 | [\"C\",\"A\"] | 0/0 | 14 |\n", "| chr1:134610 | [\"G\",\"A\"] | 0/0 | 8 |\n", "| chr1:414783 | [\"T\",\"C\"] | NA | NA |\n", "| chr1:1130877 | [\"C\",\"G\"] | 0/0 | 24 |\n", "| chr1:1226707 | [\"C\",\"G\"] | 0/0 | 26 |\n", "| chr1:1491494 | [\"G\",\"A\"] | 0/0 | 27 |\n", "| chr1:1618118 | [\"G\",\"A\"] | 0/0 | 27 |\n", "| chr1:2078529 | [\"G\",\"A\"] | 0/0 | 30 |\n", "+---------------+------------+------------------------+------------------------+\n", "\n", "+------------------------+------------------------+------------------------+\n", "| 'LP6005441-DNA_F08'.GQ | 'LP6005441-DNA_F08'.AD | 'LP6005441-DNA_F08'.PL |\n", "+------------------------+------------------------+------------------------+\n", "| int32 | array | array |\n", "+------------------------+------------------------+------------------------+\n", "| 20 | NA | NA |\n", "| 20 | NA | NA |\n", "| 20 | NA | NA |\n", "| 20 | NA | NA |\n", "| NA | NA | NA |\n", "| 40 | NA | NA |\n", "| 20 | NA | NA |\n", "| 40 | NA | NA |\n", "| 20 | NA | NA |\n", "| 50 | NA | NA |\n", "+------------------------+------------------------+------------------------+\n", "\n", "+------------------------+------------------------+------------------------+\n", "| 'LP6005441-DNA_C05'.GT | 'LP6005441-DNA_C05'.DP | 'LP6005441-DNA_C05'.GQ |\n", "+------------------------+------------------------+------------------------+\n", "| call | int32 | int32 |\n", "+------------------------+------------------------+------------------------+\n", "| 0/0 
| 15 | 30 |\n", "| 0/0 | 27 | 30 |\n", "| 0/0 | 15 | 20 |\n", "| NA | NA | NA |\n", "| 0/0 | 9 | 20 |\n", "| 0/0 | 27 | 30 |\n", "| 0/0 | 25 | 20 |\n", "| 0/0 | 30 | 20 |\n", "| 0/0 | 30 | 40 |\n", "| 0/0 | 37 | 60 |\n", "+------------------------+------------------------+------------------------+\n", "\n", "+------------------------+------------------------+----------------+\n", "| 'LP6005441-DNA_C05'.AD | 'LP6005441-DNA_C05'.PL | 'HGDP00961'.GT |\n", "+------------------------+------------------------+----------------+\n", "| array | array | call |\n", "+------------------------+------------------------+----------------+\n", "| NA | NA | 0/0 |\n", "| NA | NA | NA |\n", "| NA | NA | 0/1 |\n", "| NA | NA | NA |\n", "| NA | NA | NA |\n", "| NA | NA | 0/0 |\n", "| NA | NA | 0/0 |\n", "| NA | NA | 0/0 |\n", "| NA | NA | 0/1 |\n", "| NA | NA | 0/0 |\n", "+------------------------+------------------------+----------------+\n", "\n", "+----------------+----------------+----------------+----------------+\n", "| 'HGDP00961'.DP | 'HGDP00961'.GQ | 'HGDP00961'.AD | 'HGDP00961'.PL |\n", "+----------------+----------------+----------------+----------------+\n", "| int32 | int32 | array | array |\n", "+----------------+----------------+----------------+----------------+\n", "| 37 | 60 | NA | NA |\n", "| NA | NA | NA | NA |\n", "| 15 | 99 | [8,7] | [153,0,153] |\n", "| NA | NA | NA | NA |\n", "| NA | NA | NA | NA |\n", "| 33 | 20 | NA | NA |\n", "| 29 | 20 | NA | NA |\n", "| 27 | 20 | NA | NA |\n", "| 44 | 99 | [22,22] | [621,0,622] |\n", "| 29 | 40 | NA | NA |\n", "+----------------+----------------+----------------+----------------+\n", "showing top 10 rows\n", "showing the first 3 of 392 columns" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# show() works fine with no arguments, but can print too little data by default on small screens!\n", "mt.show(n_cols=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above output is visually 
noisy because the matrix table has a lot of information in it. `show`ing just the called genotype (`GT`) is a bit more friendly.\n", "\n", "The printed representation of GT is explained below, where `a` is the reference allele and `A` is the alternate allele:\n", "\n", "`0/0` : homozygous reference or `aa`\n", "\n", "`0/1` : heterozygous or `Aa`\n", "\n", "`1/1` : homozygous alternate or `AA` \n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:02:27.710797Z", "start_time": "2021-06-16T01:02:26.672662Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
'LP6005441-DNA_F08'
'LP6005441-DNA_C05'
'HGDP00961'
'HGDP00804'
locus
alleles
GT
GT
GT
GT
locus<GRCh38>array<str>callcallcallcall
chr1:17379["G","A"]0/00/00/00/0
chr1:95068["G","A"]0/00/0NA0/0
chr1:111735["C","A"]0/00/00/10/0
chr1:134610["G","A"]0/0NANA0/0
chr1:414783["T","C"]NA0/0NANA
chr1:1130877["C","G"]0/00/00/00/0
chr1:1226707["C","G"]0/00/00/00/0
chr1:1491494["G","A"]0/00/00/00/0
chr1:1618118["G","A"]0/00/00/10/1
chr1:2078529["G","A"]0/00/00/00/0

showing top 10 rows

\n", "

showing the first 4 of 392 columns

\n" ], "text/plain": [ "+---------------+------------+------------------------+------------------------+\n", "| locus | alleles | 'LP6005441-DNA_F08'.GT | 'LP6005441-DNA_C05'.GT |\n", "+---------------+------------+------------------------+------------------------+\n", "| locus | array | call | call |\n", "+---------------+------------+------------------------+------------------------+\n", "| chr1:17379 | [\"G\",\"A\"] | 0/0 | 0/0 |\n", "| chr1:95068 | [\"G\",\"A\"] | 0/0 | 0/0 |\n", "| chr1:111735 | [\"C\",\"A\"] | 0/0 | 0/0 |\n", "| chr1:134610 | [\"G\",\"A\"] | 0/0 | NA |\n", "| chr1:414783 | [\"T\",\"C\"] | NA | 0/0 |\n", "| chr1:1130877 | [\"C\",\"G\"] | 0/0 | 0/0 |\n", "| chr1:1226707 | [\"C\",\"G\"] | 0/0 | 0/0 |\n", "| chr1:1491494 | [\"G\",\"A\"] | 0/0 | 0/0 |\n", "| chr1:1618118 | [\"G\",\"A\"] | 0/0 | 0/0 |\n", "| chr1:2078529 | [\"G\",\"A\"] | 0/0 | 0/0 |\n", "+---------------+------------+------------------------+------------------------+\n", "\n", "+----------------+----------------+\n", "| 'HGDP00961'.GT | 'HGDP00804'.GT |\n", "+----------------+----------------+\n", "| call | call |\n", "+----------------+----------------+\n", "| 0/0 | 0/0 |\n", "| NA | 0/0 |\n", "| 0/1 | 0/0 |\n", "| NA | 0/0 |\n", "| NA | NA |\n", "| 0/0 | 0/0 |\n", "| 0/0 | 0/0 |\n", "| 0/0 | 0/0 |\n", "| 0/1 | 0/1 |\n", "| 0/0 | 0/0 |\n", "+----------------+----------------+\n", "showing top 10 rows\n", "showing the first 4 of 392 columns" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.GT.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "There is a fourth value seen above, other than `0/0`, `0/1`, `1/1`. What is it?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `summarize`\n", "`summarize` Prints (potentially) useful information about any field or object:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DP` is the read depth (number of short reads spanning a position for a given sample). Let's summarize all values of DP:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:03:25.537680Z", "start_time": "2021-06-16T01:03:24.790327Z" } }, "outputs": [ { "data": { "text/html": [ "

4092872 records.

DP (int32):
    Non-missing3851673 (94.11%)
    Missing241199 (5.89%)
    Minimum0
    Maximum5057
    Mean33.02
    Std Dev30.20
" ], "text/plain": [ "\n", "4092872 records.\n", "\n", "- DP (int32):\n", " Non-missing: 3851673 (94.11%)\n", " Missing: 241199 (5.89%)\n", " Minimum: 0\n", " Maximum: 5057\n", " Mean: 33.02\n", " Std Dev: 30.20" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.DP.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`AD` is the array of allelic depth per allele at a called genotype. Note especially the missingness properties:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:03:32.473835Z", "start_time": "2021-06-16T01:03:31.705798Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 9:=============================> (2 + 2) / 4]\r" ] }, { "data": { "text/html": [ "

4092872 records.

AD (array<int32>):
    Non-missing1164892 (28.46%)
    Missing2927980 (71.54%)
    Min Size2
    Max Size2
    Mean Size2.00
  • AD[<elements>] (int32):
      Non-missing2329784 (100.00%)
      Missing0
      Minimum0
      Maximum1299
      Mean17.09
      Std Dev15.13
" ], "text/plain": [ "\n", "4092872 records.\n", "\n", "- AD (array):\n", " Non-missing: 1164892 (28.46%)\n", " Missing: 2927980 (71.54%)\n", " Min Size: 2\n", " Max Size: 2\n", " Mean Size: 2.00\n", "\n", " - AD[] (int32):\n", " Non-missing: 2329784 (100.00%)\n", " Missing: 0\n", " Minimum: 0\n", " Maximum: 1299\n", " Mean: 17.09\n", " Std Dev: 15.13" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.AD.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "In the empty cell below, summarize some of the other fields on the matrix table. You can use the interactive widget above to find the names of some of the other fields.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `count`\n", "\n", "`MatrixTable.count` returns a tuple with the number of rows (variants) and number of columns (samples)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:03:52.322494Z", "start_time": "2021-06-16T01:03:52.286311Z" } }, "outputs": [ { "data": { "text/plain": [ "(10441, 392)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mt.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The count above tells us that we have 10,441 variants and 392 samples. This is just a tiny slice of a real sequencing dataset. The largest sequencing datasets today comprise hundreds of thousands of samples and more than a billion variants." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hail has a large library of genetics functionality\n", "\n", "Hail can be used to analyze any kind of data (Hail team members have used Hail to analyze household financial data, USA election polling data, and even to build a bot that posts real-time updates about the Euro 2020 tournament to Slack). 
However, Hail does not have *only* general-purpose analysis functionality. Hail has a large set of functionality built for genetics and genomics.\n", "\n", "For example, `hl.summarize_variants` prints useful statistics about the variants in the dataset. These are not part of the generic `summarize()` function, which must support all kinds of data, not just variant data!" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:11:30.328467Z", "start_time": "2021-06-16T01:11:27.813580Z" } }, "outputs": [ { "data": { "text/html": [ "

Variant summary:

  • Total variants: 10441

  • Alleles per variant:

    Number of allelesCount
    210441
  • Counts by allele type:

    Allele typeCount
    SNP10441
  • Transitions/Transversions:

    MetricValue
    Transitions6602
    Transversions3839
    Ratio1.72
  • Variants per contig:

    ContigCount
    chr1881
    chr2799
    chr3728
    chr4659
    chr5618
    chr6572
    chr7576
    chr8525
    chr9476
    chr10516
    chr11483
    chr12411
    chr13397
    chr14332
    chr15316
    chr16319
    chr17312
    chr18270
    chr19252
    chr20263
    chr21194
    chr22170
    chrX361
    chrY11
" ], "text/plain": [ "==============================\n", "Number of variants: 10441\n", "==============================\n", "Alleles per variant\n", "-------------------\n", " 2 alleles: 10441 variants\n", "==============================\n", "Variants per contig\n", "-------------------\n", " chr1: 881 variants\n", " chr2: 799 variants\n", " chr3: 728 variants\n", " chr4: 659 variants\n", " chr5: 618 variants\n", " chr6: 572 variants\n", " chr7: 576 variants\n", " chr8: 525 variants\n", " chr9: 476 variants\n", " chr10: 516 variants\n", " chr11: 483 variants\n", " chr12: 411 variants\n", " chr13: 397 variants\n", " chr14: 332 variants\n", " chr15: 316 variants\n", " chr16: 319 variants\n", " chr17: 312 variants\n", " chr18: 270 variants\n", " chr19: 252 variants\n", " chr20: 263 variants\n", " chr21: 194 variants\n", " chr22: 170 variants\n", " chrX: 361 variants\n", " chrY: 11 variants\n", "==============================\n", "Allele type distribution\n", "------------------------\n", " SNP: 10441 alternate alleles (Ti: 6602, Tv: 3839, ratio: 1.72)\n", "==============================" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hl.summarize_variants(mt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Annotation and quality control" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrate sample information\n", "\n", "Our dataset currently only has sample IDs and genetic data. 
In order to run a toy GWAS, we need phenotype information.\n", "\n", "We can find it in the following file:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:22:48.116268Z", "start_time": "2021-06-16T01:22:47.945999Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample_id\tpop\tcontinental_pop\tsex_karyotype\tsleep_duration\ttea_intake_daily\tgeneral_happiness\tscreen_time_per_day\n", "HG00107\tgbr\tnfe\tXY\t6\t3\t3.2895e+00\t11\n", "HG00114\tgbr\tnfe\tXY\t5\t3\t3.5099e+00\t10\n", "HG00121\tgbr\tnfe\tXX\t6\t6\t2.0851e+00\t6\n", "HG00127\tgbr\tnfe\tXX\t6\t3\t2.7580e+00\t6\n", "HG00132\tgbr\tnfe\tXX\t5\t6\t2.2454e+00\t5\n", "HG00149\tgbr\tnfe\tXY\t5\t6\t2.8159e+00\t9\n", "HG00177\tfin\tfin\tXX\t7\t9\t3.3661e+00\t8\n", "HG00190\tfin\tfin\tXY\t5\t6\t2.9159e+00\t6\n", "HG00233\tgbr\tnfe\tXX\t8\t3\t3.9002e+00\t10\n" ] } ], "source": [ "! head resources/HGDP_sample_data.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can import it as a [Hail Table](https://hail.is/docs/0.2/overview/table.html) with [hl.import_table](https://hail.is/docs/0.2/methods/impex.html?highlight=import_table#hail.methods.import_table).\n", "\n", "We call it `sd` for \"sample data\"." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:22:53.640423Z", "start_time": "2021-06-16T01:22:53.344048Z" } }, "outputs": [], "source": [ "sd = hl.import_table('resources/HGDP_sample_data.tsv',\n", " key='sample_id',\n", " impute=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"key\" argument tells Hail to use the `sample_id` field as the table key, which is used to find matching values in joins. In a moment, we will be joining the `sd` table onto our matrix table so that we can use the sample data fields in our QC and analysis. 
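Conceptually, a keyed join behaves like a dictionary lookup: for each sample in the genetic data, Hail finds the row of `sd` whose key matches the sample ID. A rough plain-Python analogy (not Hail code), using two rows copied from the `head` output above:

```python
# A table keyed by sample_id, with two rows taken from HGDP_sample_data.tsv.
sample_data = {
    "HG00107": {"pop": "gbr", "sleep_duration": 6},
    "HG00114": {"pop": "gbr", "sleep_duration": 5},
}

# "Joining" annotates each genetic sample with its matching table row;
# a sample absent from the table would get None (missing) instead.
genetic_samples = ["HG00114", "HG00107"]
joined = {s: sample_data.get(s) for s in genetic_samples}

print(joined["HG00114"]["sleep_duration"])  # 5
```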
It is also possible to specify a new key for an existing table using the `.key_by(...)` method.\n", "\n", "The \"impute\" argument tells Hail to impute the data types of the fields on the table. What does this mean? It means that you can ask Hail to figure out what is the data type in each column field such as `str` (string or just characters), `bool` (boolean or just true and false), `float64` (float or numbers with decimals), or `int32` (integer or numbers without decimals/whole numbers). If you don't use the `impute` flag or specify types manually with the `types` argument, each field will be imported as a string." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we can see the names and types of fields in the logging messages and in the `head` output above, we can also `show` this table:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:23:16.711218Z", "start_time": "2021-06-16T01:23:16.113034Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sample_id
pop
continental_pop
sex_karyotype
sleep_duration
tea_intake_daily
general_happiness
screen_time_per_day
strstrstrstrint32int32float64int32
"HG00107""gbr""nfe""XY"633.29e+0011
"HG00114""gbr""nfe""XY"533.51e+0010
"HG00121""gbr""nfe""XX"662.09e+006
"HG00127""gbr""nfe""XX"632.76e+006
"HG00132""gbr""nfe""XX"562.25e+005
"HG00149""gbr""nfe""XY"562.82e+009
"HG00177""fin""fin""XX"793.37e+008
"HG00190""fin""fin""XY"562.92e+006
"HG00233""gbr""nfe""XX"833.90e+0010
"HG00252""gbr""nfe""XY"503.27e+0010

showing top 10 rows

\n" ], "text/plain": [ "+-----------+-------+-----------------+---------------+----------------+\n", "| sample_id | pop | continental_pop | sex_karyotype | sleep_duration |\n", "+-----------+-------+-----------------+---------------+----------------+\n", "| str | str | str | str | int32 |\n", "+-----------+-------+-----------------+---------------+----------------+\n", "| \"HG00107\" | \"gbr\" | \"nfe\" | \"XY\" | 6 |\n", "| \"HG00114\" | \"gbr\" | \"nfe\" | \"XY\" | 5 |\n", "| \"HG00121\" | \"gbr\" | \"nfe\" | \"XX\" | 6 |\n", "| \"HG00127\" | \"gbr\" | \"nfe\" | \"XX\" | 6 |\n", "| \"HG00132\" | \"gbr\" | \"nfe\" | \"XX\" | 5 |\n", "| \"HG00149\" | \"gbr\" | \"nfe\" | \"XY\" | 5 |\n", "| \"HG00177\" | \"fin\" | \"fin\" | \"XX\" | 7 |\n", "| \"HG00190\" | \"fin\" | \"fin\" | \"XY\" | 5 |\n", "| \"HG00233\" | \"gbr\" | \"nfe\" | \"XX\" | 8 |\n", "| \"HG00252\" | \"gbr\" | \"nfe\" | \"XY\" | 5 |\n", "+-----------+-------+-----------------+---------------+----------------+\n", "\n", "+------------------+-------------------+---------------------+\n", "| tea_intake_daily | general_happiness | screen_time_per_day |\n", "+------------------+-------------------+---------------------+\n", "| int32 | float64 | int32 |\n", "+------------------+-------------------+---------------------+\n", "| 3 | 3.29e+00 | 11 |\n", "| 3 | 3.51e+00 | 10 |\n", "| 6 | 2.09e+00 | 6 |\n", "| 3 | 2.76e+00 | 6 |\n", "| 6 | 2.25e+00 | 5 |\n", "| 6 | 2.82e+00 | 9 |\n", "| 9 | 3.37e+00 | 8 |\n", "| 6 | 2.92e+00 | 6 |\n", "| 3 | 3.90e+00 | 10 |\n", "| 0 | 3.27e+00 | 10 |\n", "+------------------+-------------------+---------------------+\n", "showing top 10 rows" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sd.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can `summarize` each field in `sd`:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:23:23.819124Z", "start_time": 
"2021-06-16T01:23:22.675895Z" } }, "outputs": [ { "data": { "text/html": [ "

392 records.

  • sample_id (str):
      Non-missing392 (100.00%)
      Missing0
      Min Size7
      Max Size17
      Mean Size7.92
      Sample Values['HG00107', 'HG00114', 'HG00121', 'HG00127', 'HG00132']
  • pop (str):
      Non-missing392 (100.00%)
      Missing0
      Min Size2
      Max Size11
      Mean Size3.84
      Sample Values['gbr', 'gbr', 'gbr', 'gbr', 'gbr']
  • continental_pop (str):
      Non-missing392 (100.00%)
      Missing0
      Min Size3
      Max Size3
      Mean Size3.00
      Sample Values['nfe', 'nfe', 'nfe', 'nfe', 'nfe']
  • sex_karyotype (str):
      Non-missing392 (100.00%)
      Missing0
      Min Size2
      Max Size2
      Mean Size2.00
      Sample Values['XY', 'XY', 'XX', 'XX', 'XX']
  • sleep_duration (int32):
      Non-missing392 (100.00%)
      Missing0
      Minimum2
      Maximum10
      Mean6.03
      Std Dev1.44
  • tea_intake_daily (int32):
      Non-missing392 (100.00%)
      Missing0
      Minimum0
      Maximum12
      Mean5.37
      Std Dev1.97
  • general_happiness (float64):
      Non-missing392 (100.00%)
      Missing0
      Minimum1.81
      Maximum4.50
      Mean2.94
      Std Dev0.51
  • screen_time_per_day (int32):
      Non-missing392 (100.00%)
      Missing0
      Minimum0
      Maximum13
      Mean6.57
      Std Dev2.50
" ], "text/plain": [ "\n", "392 records.\n", "\n", " - sample_id (str):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Min Size: 7\n", " Max Size: 17\n", " Mean Size: 7.92\n", " Sample Values: ['HG00107', 'HG00114', 'HG00121', 'HG00127', 'HG00132']\n", "\n", " - pop (str):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Min Size: 2\n", " Max Size: 11\n", " Mean Size: 3.84\n", " Sample Values: ['gbr', 'gbr', 'gbr', 'gbr', 'gbr']\n", "\n", " - continental_pop (str):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Min Size: 3\n", " Max Size: 3\n", " Mean Size: 3.00\n", " Sample Values: ['nfe', 'nfe', 'nfe', 'nfe', 'nfe']\n", "\n", " - sex_karyotype (str):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Min Size: 2\n", " Max Size: 2\n", " Mean Size: 2.00\n", " Sample Values: ['XY', 'XY', 'XX', 'XX', 'XX']\n", "\n", " - sleep_duration (int32):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Minimum: 2\n", " Maximum: 10\n", " Mean: 6.03\n", " Std Dev: 1.44\n", "\n", " - tea_intake_daily (int32):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Minimum: 0\n", " Maximum: 12\n", " Mean: 5.37\n", " Std Dev: 1.97\n", "\n", " - general_happiness (float64):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Minimum: 1.81\n", " Maximum: 4.50\n", " Mean: 2.94\n", " Std Dev: 0.51\n", "\n", " - screen_time_per_day (int32):\n", " Non-missing: 392 (100.00%)\n", " Missing: 0\n", " Minimum: 0\n", " Maximum: 13\n", " Mean: 6.57\n", " Std Dev: 2.50" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sd.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add sample data to our HGDP `MatrixTable`\n", "\n", "Let's now merge our genetic data (`mt`) with our sample data (`sd`).\n", "\n", "This is a join between the `sd` table and the columns of our matrix table. 
It just takes one line:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:27:41.949211Z", "start_time": "2021-06-16T01:27:41.933565Z" } }, "outputs": [], "source": [ "mt = mt.annotate_cols(sample_data = sd[mt.s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What's going on here?\n", "\n", "There is a lot packed into this one line. To unpack it, we need a few pieces:\n", "\n", "#### 1. `annotate` methods\n", "\n", "In Hail, `annotate` methods refer to **adding new fields**. \n", "\n", " - `MatrixTable`'s `annotate_cols` adds new column (**sample**) fields.\n", " - `MatrixTable`'s `annotate_rows` adds new row (**variant**) fields.\n", " - `MatrixTable`'s `annotate_entries` adds new entry (**genotype**) fields.\n", " - `Table`'s `annotate` adds new row fields.\n", "\n", "In the above cell, we are adding a new column (**sample**) field called \"sample_data\". This field should be the values in our table `sd` associated with the sample ID `s` in our `MatrixTable` - that is, this is performing a **join**.\n", "\n", "Python uses square brackets to look up values in dictionaries:\n", "\n", " >>> d = {'foo': 5, 'bar': 10}\n", " \n", " >>> d['foo']\n", " 5\n", "\n", "You should think of this in much the same way - for each column of `mt`, we are looking up the fields in `sd` using the sample ID `s`.\n", "\n", "Let's see how the matrix table has changed:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-06-16T01:27:42.962564Z", "start_time": "2021-06-16T01:27:42.443875Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------------------------------------\n", "Global fields:\n", " None\n", "----------------------------------------\n", "Column fields:\n", " 's': str\n", " 'sample_data': struct {\n", " pop: str, \n", " continental_pop: str, \n", " sex_karyotype: str, \n", " 
sleep_duration: int32, \n", " tea_intake_daily: int32, \n", " general_happiness: float64, \n", " screen_time_per_day: int32\n", " }\n", "----------------------------------------\n", "Row fields:\n", " 'locus': locus\n", " 'alleles': array\n", " 'rsid': str\n", " 'qual': float64\n", " 'filters': set\n", " 'info': struct {\n", " QUALapprox: int32, \n", " SB: array, \n", " MQ: float64, \n", " MQRankSum: float64, \n", " VarDP: int32, \n", " AS_ReadPosRankSum: float64, \n", " AS_pab_max: float64, \n", " AS_QD: float64, \n", " AS_MQ: float64, \n", " QD: float64, \n", " AS_MQRankSum: float64, \n", " FS: float64, \n", " AS_FS: float64, \n", " ReadPosRankSum: float64, \n", " AS_QUALapprox: int32, \n", " AS_SB_TABLE: array, \n", " AS_VarDP: int32, \n", " AS_SOR: float64, \n", " SOR: float64, \n", " transmitted_singleton: bool, \n", " omni: bool, \n", " mills: bool, \n", " monoallelic: bool, \n", " AS_VQSLOD: float64, \n", " InbreedingCoeff: float64\n", " }\n", "----------------------------------------\n", "Entry fields:\n", " 'GT': call\n", " 'DP': int32\n", " 'GQ': int32\n", " 'AD': array\n", " 'PL': array\n", "----------------------------------------\n", "Column key: ['s']\n", "Row key: ['locus', 'alleles']\n", "----------------------------------------\n" ] } ], "source": [ "mt.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cheat sheets\n", "\n", "More information about matrix tables and tables can be found in a graphical representation as Hail cheat sheets:\n", "\n", " - [MatrixTable](https://hail.is/docs/0.2/_static/cheatsheets/hail_matrix_tables_cheat_sheet.pdf)\n", " - [Table](https://hail.is/docs/0.2/_static/cheatsheets/hail_tables_cheat_sheet.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Sample QC\n", "\n", "In past workshop sessions you have learned the important adage: _garbage in, garbage out_. Good QC takes time and thoughtfulness but is necessary for robust results. 
\n", "\n", "Here, we run through some simple sample qc steps, but **these steps are not a one-size-fits-all solution for QC on your own data!**\n", "\n", "Hail has the function [hl.sample_qc](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc) to compute a list of useful statistics about samples from sequencing data. This function adds a new column field, `sample_qc`, with the computed statistics. Note that this doesn't actually remove samples for you -- those decisions are up to you. The `sample_qc` method gives you some data you can use as a starting point.\n", "\n", "**Click the sample_qc link** above to see the documentation, which lists the computed fields and their descriptions." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "mt = hl.sample_qc(mt)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7a58b639d653401fbcd7199b87322f3e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ee619b2a056c4da39ea62db964204fe9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Tab(children=(VBox(children=(HTML(value='

Global fields, with one value in the dataset.

\\n

C…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hail includes a built-in plotting library that makes it easy to visualize fields of Hail tables and matrix tables.\n", "\n", "Let's visualize the pairwise distribution of `Mean DP` (Read Depth) and `Call Rate`.\n", "\n", "Note that you can **hover over points with your mouse to see the sample IDs!**" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(ggplot(mt, aes(x=mt.sample_qc.dp_stats.mean, y=mt.sample_qc.call_rate)) \n", " + geom_point(aes(color=mt.sample_data.sex_karyotype))\n", " + xlab(\"Mean DP\")\n", " + ylab(\"Call Rate\"))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter columns using generated QC statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filtering samples, we should compute a raw sample count:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "392" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mt.count_cols()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`filter_cols` removes entire columns from the matrix table. 
Here, we keep columns (samples) where the `call_rate` is at least 92%:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.92)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute a final sample count:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "385" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mt.count_cols()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many samples did not meet your QC criteria?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Variant QC\n", "\n", "Now that we have successfully gone through basic sample QC using the function `sample_qc` and general-purpose filtering methods, let's do variant QC.\n", "\n", "\n", "Hail has the function [hl.variant_qc](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc) to compute a list of useful statistics about **variants** from sequencing data.\n", "\n", "Once again, **Click the link** above to see the documentation!" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "mt = hl.variant_qc(mt)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a0d4954b4eb1468b8eaf4f2382c4acd2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "85ce579ded12444388fd255fa034a687", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Tab(children=(VBox(children=(HTML(value='

Global fields, with one value in the dataset.

\\n

C…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.describe(widget=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the `variant_qc` output!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can `show()` the computed information:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 22:=============================> (2 + 2) / 4]\r" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
variant_qc
dp_stats
gq_stats
locus
alleles
mean
stdev
min
max
mean
stdev
min
max
AC
AF
AN
homozygote_count
call_rate
n_called
n_not_called
n_filtered
n_het
n_non_ref
het_freq_hwe
p_value_hwe
p_value_excess_het
locus<GRCh38>array<str>float64float64float64float64float64float64float64float64array<int32>array<float64>int32array<int32>float64int64int64int64int64int64float64float64float64
chr1:17379["G","A"]4.95e+012.17e+017.00e+001.41e+026.44e+012.84e+012.00e+019.90e+01[752,10][9.87e-01,1.31e-02]762[371,0]9.90e-013814010102.59e-025.29e-014.71e-01
chr1:95068["G","A"]2.28e+011.19e+014.00e+001.06e+024.36e+013.19e+011.20e+019.90e+01[556,122][8.20e-01,1.80e-01]678[219,2]8.81e-013394601181202.96e-012.55e-041.36e-04
chr1:111735["C","A"]1.42e+016.02e+002.00e+002.80e+012.73e+011.87e+011.00e+009.90e+01[538,84][8.65e-01,1.35e-01]622[235,8]8.08e-0131174068762.34e-011.85e-018.70e-01
chr1:134610["G","A"]1.36e+015.46e+002.00e+002.80e+012.28e+016.94e+006.00e+005.60e+01[439,15][9.67e-01,3.30e-02]454[213,1]5.90e-01227158013146.40e-021.16e-018.84e-01
chr1:414783["T","C"]6.27e+002.65e+001.00e+001.40e+011.50e+014.73e+003.00e+002.60e+01[110,8][9.32e-01,6.78e-02]118[54,3]1.53e-01593260251.27e-011.37e-041.00e+00
chr1:1130877["C","G"]3.75e+017.77e+001.90e+016.50e+015.50e+012.82e+012.00e+019.90e+01[758,12][9.84e-01,1.56e-02]770[373,0]1.00e+003850012123.07e-025.42e-014.58e-01
chr1:1226707["C","G"]3.61e+016.28e+002.30e+017.70e+015.26e+012.16e+012.00e+019.90e+01[768,2][9.97e-01,2.60e-03]770[383,0]1.00e+0038500225.19e-035.01e-014.99e-01
chr1:1491494["G","A"]3.41e+015.39e+001.90e+017.30e+013.97e+011.92e+012.00e+019.90e+01[761,9][9.88e-01,1.17e-02]770[376,0]1.00e+0038500992.31e-025.23e-014.77e-01
chr1:1618118["G","A"]3.63e+016.23e+002.10e+017.60e+016.50e+012.67e+012.00e+019.90e+01[639,129][8.32e-01,1.68e-01]768[277,22]9.97e-0138410851072.80e-011.15e-041.00e+00
chr1:2078529["G","A"]3.41e+014.28e+002.20e+015.80e+014.13e+011.60e+012.00e+019.90e+01[770,0][1.00e+00,0.00e+00]770[385,0]1.00e+0038500000.00e+005.00e-015.00e-01

showing top 10 rows

\n" ], "text/plain": [ "+---------------+------------+--------------------------+\n", "| locus | alleles | variant_qc.dp_stats.mean |\n", "+---------------+------------+--------------------------+\n", "| locus | array | float64 |\n", "+---------------+------------+--------------------------+\n", "| chr1:17379 | [\"G\",\"A\"] | 4.95e+01 |\n", "| chr1:95068 | [\"G\",\"A\"] | 2.28e+01 |\n", "| chr1:111735 | [\"C\",\"A\"] | 1.42e+01 |\n", "| chr1:134610 | [\"G\",\"A\"] | 1.36e+01 |\n", "| chr1:414783 | [\"T\",\"C\"] | 6.27e+00 |\n", "| chr1:1130877 | [\"C\",\"G\"] | 3.75e+01 |\n", "| chr1:1226707 | [\"C\",\"G\"] | 3.61e+01 |\n", "| chr1:1491494 | [\"G\",\"A\"] | 3.41e+01 |\n", "| chr1:1618118 | [\"G\",\"A\"] | 3.63e+01 |\n", "| chr1:2078529 | [\"G\",\"A\"] | 3.41e+01 |\n", "+---------------+------------+--------------------------+\n", "\n", "+---------------------------+-------------------------+\n", "| variant_qc.dp_stats.stdev | variant_qc.dp_stats.min |\n", "+---------------------------+-------------------------+\n", "| float64 | float64 |\n", "+---------------------------+-------------------------+\n", "| 2.17e+01 | 7.00e+00 |\n", "| 1.19e+01 | 4.00e+00 |\n", "| 6.02e+00 | 2.00e+00 |\n", "| 5.46e+00 | 2.00e+00 |\n", "| 2.65e+00 | 1.00e+00 |\n", "| 7.77e+00 | 1.90e+01 |\n", "| 6.28e+00 | 2.30e+01 |\n", "| 5.39e+00 | 1.90e+01 |\n", "| 6.23e+00 | 2.10e+01 |\n", "| 4.28e+00 | 2.20e+01 |\n", "+---------------------------+-------------------------+\n", "\n", "+-------------------------+--------------------------+\n", "| variant_qc.dp_stats.max | variant_qc.gq_stats.mean |\n", "+-------------------------+--------------------------+\n", "| float64 | float64 |\n", "+-------------------------+--------------------------+\n", "| 1.41e+02 | 6.44e+01 |\n", "| 1.06e+02 | 4.36e+01 |\n", "| 2.80e+01 | 2.73e+01 |\n", "| 2.80e+01 | 2.28e+01 |\n", "| 1.40e+01 | 1.50e+01 |\n", "| 6.50e+01 | 5.50e+01 |\n", "| 7.70e+01 | 5.26e+01 |\n", "| 7.30e+01 | 3.97e+01 |\n", "| 7.60e+01 | 6.50e+01 
|\n", "| 5.80e+01 | 4.13e+01 |\n", "+-------------------------+--------------------------+\n", "\n", "+---------------------------+-------------------------+\n", "| variant_qc.gq_stats.stdev | variant_qc.gq_stats.min |\n", "+---------------------------+-------------------------+\n", "| float64 | float64 |\n", "+---------------------------+-------------------------+\n", "| 2.84e+01 | 2.00e+01 |\n", "| 3.19e+01 | 1.20e+01 |\n", "| 1.87e+01 | 1.00e+00 |\n", "| 6.94e+00 | 6.00e+00 |\n", "| 4.73e+00 | 3.00e+00 |\n", "| 2.82e+01 | 2.00e+01 |\n", "| 2.16e+01 | 2.00e+01 |\n", "| 1.92e+01 | 2.00e+01 |\n", "| 2.67e+01 | 2.00e+01 |\n", "| 1.60e+01 | 2.00e+01 |\n", "+---------------------------+-------------------------+\n", "\n", "+-------------------------+---------------+---------------------+\n", "| variant_qc.gq_stats.max | variant_qc.AC | variant_qc.AF |\n", "+-------------------------+---------------+---------------------+\n", "| float64 | array | array |\n", "+-------------------------+---------------+---------------------+\n", "| 9.90e+01 | [752,10] | [9.87e-01,1.31e-02] |\n", "| 9.90e+01 | [556,122] | [8.20e-01,1.80e-01] |\n", "| 9.90e+01 | [538,84] | [8.65e-01,1.35e-01] |\n", "| 5.60e+01 | [439,15] | [9.67e-01,3.30e-02] |\n", "| 2.60e+01 | [110,8] | [9.32e-01,6.78e-02] |\n", "| 9.90e+01 | [758,12] | [9.84e-01,1.56e-02] |\n", "| 9.90e+01 | [768,2] | [9.97e-01,2.60e-03] |\n", "| 9.90e+01 | [761,9] | [9.88e-01,1.17e-02] |\n", "| 9.90e+01 | [639,129] | [8.32e-01,1.68e-01] |\n", "| 9.90e+01 | [770,0] | [1.00e+00,0.00e+00] |\n", "+-------------------------+---------------+---------------------+\n", "\n", "+---------------+-----------------------------+----------------------+\n", "| variant_qc.AN | variant_qc.homozygote_count | variant_qc.call_rate |\n", "+---------------+-----------------------------+----------------------+\n", "| int32 | array | float64 |\n", "+---------------+-----------------------------+----------------------+\n", "| 762 | [371,0] | 9.90e-01 |\n", "| 
678 | [219,2] | 8.81e-01 |\n", "| 622 | [235,8] | 8.08e-01 |\n", "| 454 | [213,1] | 5.90e-01 |\n", "| 118 | [54,3] | 1.53e-01 |\n", "| 770 | [373,0] | 1.00e+00 |\n", "| 770 | [383,0] | 1.00e+00 |\n", "| 770 | [376,0] | 1.00e+00 |\n", "| 768 | [277,22] | 9.97e-01 |\n", "| 770 | [385,0] | 1.00e+00 |\n", "+---------------+-----------------------------+----------------------+\n", "\n", "+---------------------+-------------------------+-----------------------+\n", "| variant_qc.n_called | variant_qc.n_not_called | variant_qc.n_filtered |\n", "+---------------------+-------------------------+-----------------------+\n", "| int64 | int64 | int64 |\n", "+---------------------+-------------------------+-----------------------+\n", "| 381 | 4 | 0 |\n", "| 339 | 46 | 0 |\n", "| 311 | 74 | 0 |\n", "| 227 | 158 | 0 |\n", "| 59 | 326 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 384 | 1 | 0 |\n", "| 385 | 0 | 0 |\n", "+---------------------+-------------------------+-----------------------+\n", "\n", "+------------------+----------------------+-------------------------+\n", "| variant_qc.n_het | variant_qc.n_non_ref | variant_qc.het_freq_hwe |\n", "+------------------+----------------------+-------------------------+\n", "| int64 | int64 | float64 |\n", "+------------------+----------------------+-------------------------+\n", "| 10 | 10 | 2.59e-02 |\n", "| 118 | 120 | 2.96e-01 |\n", "| 68 | 76 | 2.34e-01 |\n", "| 13 | 14 | 6.40e-02 |\n", "| 2 | 5 | 1.27e-01 |\n", "| 12 | 12 | 3.07e-02 |\n", "| 2 | 2 | 5.19e-03 |\n", "| 9 | 9 | 2.31e-02 |\n", "| 85 | 107 | 2.80e-01 |\n", "| 0 | 0 | 0.00e+00 |\n", "+------------------+----------------------+-------------------------+\n", "\n", "+------------------------+-------------------------------+\n", "| variant_qc.p_value_hwe | variant_qc.p_value_excess_het |\n", "+------------------------+-------------------------------+\n", "| float64 | float64 |\n", 
"+------------------------+-------------------------------+\n", "| 5.29e-01 | 4.71e-01 |\n", "| 2.55e-04 | 1.36e-04 |\n", "| 1.85e-01 | 8.70e-01 |\n", "| 1.16e-01 | 8.84e-01 |\n", "| 1.37e-04 | 1.00e+00 |\n", "| 5.42e-01 | 4.58e-01 |\n", "| 5.01e-01 | 4.99e-01 |\n", "| 5.23e-01 | 4.77e-01 |\n", "| 1.15e-04 | 1.00e+00 |\n", "| 5.00e-01 | 5.00e-01 |\n", "+------------------------+-------------------------------+\n", "showing top 10 rows" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mt.variant_qc.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metrics like `call_rate` are important for QC. Let's plot the density of call rate per variant:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ggplot(mt, aes(x=mt.variant_qc.call_rate)) + geom_density() + xlab('Variant Call Rate') + ylab('Density')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filtering variants, we should compute a raw variant count:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10441" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pre-qc variant count\n", "mt.count_rows()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`filter_rows` removes entire rows of the matrix table. Here, we keep rows where the `call_rate` is over 95%:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hardy-Weinberg filter\n", "\n", "In past sessions, you have seen filters on Hardy-Weinberg equilibrium using tools like PLINK.\n", "\n", "Hail can do the same. 
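A typical Hardy-Weinberg filter might look like this (a sketch only; the `1e-6` cutoff is purely illustrative, and an appropriate threshold depends on your data):\n", "\n", "    mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)\n", "\n", "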
First, let's plot the log-scaled HWE p-values per variant:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 30:===========================================> (3 + 1) / 4]\r" ] }, { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(ggplot(mt, aes(x=-hl.log10(mt.variant_qc.p_value_hwe)))\n", " + geom_histogram()\n", " + xlab(\"-log10 pHWE\")\n", " + ylab(\"density\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are some **extremely** bad variants here, with p-values smaller than 1e-100. Let's look at some of these variants:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 33:=============================> (2 + 2) / 4]\r" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
variant_qc
info
dp_stats
gq_stats
locus
alleles
rsid
qual
filters
QUALapprox
SB
MQ
MQRankSum
VarDP
AS_ReadPosRankSum
AS_pab_max
AS_QD
AS_MQ
QD
AS_MQRankSum
FS
AS_FS
ReadPosRankSum
AS_QUALapprox
AS_SB_TABLE
AS_VarDP
AS_SOR
SOR
transmitted_singleton
omni
mills
monoallelic
AS_VQSLOD
InbreedingCoeff
mean
stdev
min
max
mean
stdev
min
max
AC
AF
AN
homozygote_count
call_rate
n_called
n_not_called
n_filtered
n_het
n_non_ref
het_freq_hwe
p_value_hwe
p_value_excess_het
locus<GRCh38>array<str>strfloat64set<str>int32array<int32>float64float64int32float64float64float64float64float64float64float64float64float64int32array<int32>int32float64float64boolboolboolboolfloat64float64float64float64float64float64float64float64float64float64array<int32>array<float64>int32array<int32>float64int64int64int64int64int64float64float64float64
chr1:125165544["G","T"]NA-1.00e+01{"InbreedingCoeff"}332680306[2530347,2658319,4674032,5445216]5.92e+01-1.72e-01153079145.29e-011.00e+002.17e+015.92e+012.17e+01-1.77e-011.17e+001.17e+005.15e-01332680306[2530759,2658686,4674032,5445216]153079148.02e-018.02e-01FalseFalseFalseFalse-1.62e+00-9.98e-019.05e+012.26e+014.40e+013.13e+029.90e+010.00e+009.90e+019.90e+01[385,385][5.00e-01,5.00e-01]770[0,0]1.00e+00385003853855.01e-012.21e-1152.21e-115
chr2:94565722["T","C"]"rs1273701572"-1.00e+01{"InbreedingCoeff"}133145705[2744392,2448688,2632227,938989]5.38e+01-5.59e+0087642811.06e+001.00e+001.52e+015.38e+011.52e+01-5.61e+002.52e+012.52e+011.05e+00133145705[2744392,2448688,2632227,938989]87642811.98e+001.98e+00FalseFalseFalseFalse-2.50e+00-9.99e-015.63e+011.36e+012.80e+011.29e+029.90e+010.00e+009.90e+019.90e+01[385,385][5.00e-01,5.00e-01]770[0,0]1.00e+00385003853855.01e-012.21e-1152.21e-115
chr2:113591601["G","A"]"rs6419550"-1.00e+01{"InbreedingCoeff"}124592291[3447869,3213349,2778556,2338468]4.69e+01-2.27e+00117782409.33e-011.00e+001.06e+014.69e+011.06e+01-2.29e+001.11e+001.11e+009.38e-01124592291[3447869,3213349,2778556,2338468]117782408.00e-018.00e-01FalseFalseFalseFalse-3.95e-01-1.00e+009.17e+011.96e+014.70e+011.76e+029.90e+010.00e+009.90e+019.90e+01[385,385][5.00e-01,5.00e-01]770[0,0]1.00e+00385003853855.01e-012.21e-1152.21e-115
chr3:77780883["A","T"]"rs75079699"-1.00e+01{"InbreedingCoeff"}179360814[2431382,2598153,2418400,2106424]6.00e+010.00e+009554359-1.42e-011.00e+001.88e+016.00e+011.88e+01-6.00e-033.18e+003.18e+00-1.27e-01179360814[2431382,2598153,2418400,2106424]95543597.86e-017.86e-01FalseFalseFalseFalse1.22e+00-1.00e+005.81e+011.15e+013.20e+011.47e+029.90e+010.00e+009.90e+019.90e+01[385,385][5.00e-01,5.00e-01]770[0,0]1.00e+00385003853855.01e-012.21e-1152.21e-115

showing top 10 rows

\n" ], "text/plain": [ "+----------------+------------+----------------+-----------+\n", "| locus | alleles | rsid | qual |\n", "+----------------+------------+----------------+-----------+\n", "| locus | array | str | float64 |\n", "+----------------+------------+----------------+-----------+\n", "| chr1:125165544 | [\"G\",\"T\"] | NA | -1.00e+01 |\n", "| chr2:94565722 | [\"T\",\"C\"] | \"rs1273701572\" | -1.00e+01 |\n", "| chr2:113591601 | [\"G\",\"A\"] | \"rs6419550\" | -1.00e+01 |\n", "| chr3:77780883 | [\"A\",\"T\"] | \"rs75079699\" | -1.00e+01 |\n", "| chr5:46433521 | [\"G\",\"C\"] | NA | -1.00e+01 |\n", "| chr5:170833888 | [\"C\",\"T\"] | \"rs79126011\" | -1.00e+01 |\n", "| chr9:63771196 | [\"C\",\"T\"] | \"rs4310313\" | -1.00e+01 |\n", "| chr10:46546502 | [\"G\",\"A\"] | \"rs3127692\" | -1.00e+01 |\n", "| chr17:21291540 | [\"G\",\"A\"] | \"rs72840058\" | -1.00e+01 |\n", "| chr17:21302646 | [\"A\",\"G\"] | \"rs62057725\" | -1.00e+01 |\n", "+----------------+------------+----------------+-----------+\n", "\n", "+---------------------+-----------------+-----------------------------------+\n", "| filters | info.QUALapprox | info.SB |\n", "+---------------------+-----------------+-----------------------------------+\n", "| set | int32 | array |\n", "+---------------------+-----------------+-----------------------------------+\n", "| {\"InbreedingCoeff\"} | 332680306 | [2530347,2658319,4674032,5445216] |\n", "| {\"InbreedingCoeff\"} | 133145705 | [2744392,2448688,2632227,938989] |\n", "| {\"InbreedingCoeff\"} | 124592291 | [3447869,3213349,2778556,2338468] |\n", "| {\"InbreedingCoeff\"} | 179360814 | [2431382,2598153,2418400,2106424] |\n", "| {\"InbreedingCoeff\"} | 115906887 | [2362879,2344419,2204239,1144094] |\n", "| {\"InbreedingCoeff\"} | 157871544 | [1038041,2758564,1388629,2553316] |\n", "| {\"InbreedingCoeff\"} | 113903241 | [9577761,6255003,2256377,1811696] |\n", "| {\"InbreedingCoeff\"} | 152605413 | [2875916,2474502,3025177,2601488] |\n", "| 
{\"InbreedingCoeff\"} | 204637824 | [2818670,2658428,2716858,2427104] |\n", "| {\"InbreedingCoeff\"} | 166814776 | [2745828,2529213,2725172,2431983] |\n", "+---------------------+-----------------+-----------------------------------+\n", "\n", "+----------+----------------+------------+------------------------+\n", "| info.MQ | info.MQRankSum | info.VarDP | info.AS_ReadPosRankSum |\n", "+----------+----------------+------------+------------------------+\n", "| float64 | float64 | int32 | float64 |\n", "+----------+----------------+------------+------------------------+\n", "| 5.92e+01 | -1.72e-01 | 15307914 | 5.29e-01 |\n", "| 5.38e+01 | -5.59e+00 | 8764281 | 1.06e+00 |\n", "| 4.69e+01 | -2.27e+00 | 11778240 | 9.33e-01 |\n", "| 6.00e+01 | 0.00e+00 | 9554359 | -1.42e-01 |\n", "| 5.32e+01 | -2.23e+00 | 8055631 | 9.49e-01 |\n", "| 5.99e+01 | -5.50e-02 | 7738550 | -7.80e-02 |\n", "| 4.26e+01 | -6.49e-01 | 19900836 | 5.55e-01 |\n", "| 6.00e+01 | -4.00e-03 | 10977076 | 4.96e-01 |\n", "| 6.00e+01 | -1.40e-02 | 10621059 | 3.76e-01 |\n", "| 6.00e+01 | -8.00e-03 | 10432195 | 7.45e-01 |\n", "+----------+----------------+------------+------------------------+\n", "\n", "+-----------------+------------+------------+----------+-------------------+\n", "| info.AS_pab_max | info.AS_QD | info.AS_MQ | info.QD | info.AS_MQRankSum |\n", "+-----------------+------------+------------+----------+-------------------+\n", "| float64 | float64 | float64 | float64 | float64 |\n", "+-----------------+------------+------------+----------+-------------------+\n", "| 1.00e+00 | 2.17e+01 | 5.92e+01 | 2.17e+01 | -1.77e-01 |\n", "| 1.00e+00 | 1.52e+01 | 5.38e+01 | 1.52e+01 | -5.61e+00 |\n", "| 1.00e+00 | 1.06e+01 | 4.69e+01 | 1.06e+01 | -2.29e+00 |\n", "| 1.00e+00 | 1.88e+01 | 6.00e+01 | 1.88e+01 | -6.00e-03 |\n", "| 1.00e+00 | 1.44e+01 | 5.32e+01 | 1.44e+01 | -2.19e+00 |\n", "| 1.00e+00 | 2.04e+01 | 5.99e+01 | 2.04e+01 | -1.60e-02 |\n", "| 1.00e+00 | 5.72e+00 | 4.26e+01 | 5.72e+00 | -6.74e-01 
|\n", "| 1.00e+00 | 1.39e+01 | 6.00e+01 | 1.39e+01 | 0.00e+00 |\n", "| 1.00e+00 | 1.93e+01 | 6.00e+01 | 1.93e+01 | 0.00e+00 |\n", "| 1.00e+00 | 1.60e+01 | 6.00e+01 | 1.60e+01 | 1.10e-02 |\n", "+-----------------+------------+------------+----------+-------------------+\n", "\n", "+----------+------------+---------------------+--------------------+\n", "| info.FS | info.AS_FS | info.ReadPosRankSum | info.AS_QUALapprox |\n", "+----------+------------+---------------------+--------------------+\n", "| float64 | float64 | float64 | int32 |\n", "+----------+------------+---------------------+--------------------+\n", "| 1.17e+00 | 1.17e+00 | 5.15e-01 | 332680306 |\n", "| 2.52e+01 | 2.52e+01 | 1.05e+00 | 133145705 |\n", "| 1.11e+00 | 1.11e+00 | 9.38e-01 | 124592291 |\n", "| 3.18e+00 | 3.18e+00 | -1.27e-01 | 179360814 |\n", "| 1.53e+01 | 1.53e+01 | 9.54e-01 | 115906389 |\n", "| 6.53e+00 | 6.53e+00 | -6.50e-02 | 157871544 |\n", "| 2.30e+00 | 2.30e+00 | 5.50e-01 | 113903241 |\n", "| 0.00e+00 | 0.00e+00 | 4.96e-01 | 152604571 |\n", "| 5.20e-01 | 5.20e-01 | 3.64e-01 | 204637824 |\n", "| 5.16e-01 | 5.16e-01 | 7.40e-01 | 166814776 |\n", "+----------+------------+---------------------+--------------------+\n", "\n", "+-----------------------------------+---------------+-------------+----------+\n", "| info.AS_SB_TABLE | info.AS_VarDP | info.AS_SOR | info.SOR |\n", "+-----------------------------------+---------------+-------------+----------+\n", "| array | int32 | float64 | float64 |\n", "+-----------------------------------+---------------+-------------+----------+\n", "| [2530759,2658686,4674032,5445216] | 15307914 | 8.02e-01 | 8.02e-01 |\n", "| [2744392,2448688,2632227,938989] | 8764281 | 1.98e+00 | 1.98e+00 |\n", "| [3447869,3213349,2778556,2338468] | 11778240 | 8.00e-01 | 8.00e-01 |\n", "| [2431382,2598153,2418400,2106424] | 9554359 | 7.86e-01 | 7.86e-01 |\n", "| [2362886,2344431,2204225,1144085] | 8055567 | 1.54e+00 | 1.54e+00 |\n", "| [1038041,2758564,1388629,2553316] | 
7738550 | 3.91e-01 | 3.91e-01 |\n", "| [9577761,6255003,2256377,1811696] | 19900836 | 5.08e-01 | 5.08e-01 |\n", "| [2875919,2474509,3025151,2601452] | 10976952 | 6.94e-01 | 6.94e-01 |\n", "| [2818670,2658428,2716858,2427104] | 10621059 | 7.49e-01 | 7.49e-01 |\n", "| [2745828,2529213,2725172,2431983] | 10432195 | 7.25e-01 | 7.25e-01 |\n", "+-----------------------------------+---------------+-------------+----------+\n", "\n", "+----------------------------+-----------+------------+------------------+\n", "| info.transmitted_singleton | info.omni | info.mills | info.monoallelic |\n", "+----------------------------+-----------+------------+------------------+\n", "| bool | bool | bool | bool |\n", "+----------------------------+-----------+------------+------------------+\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "| False | False | False | False |\n", "+----------------------------+-----------+------------+------------------+\n", "\n", "+----------------+----------------------+--------------------------+\n", "| info.AS_VQSLOD | info.InbreedingCoeff | variant_qc.dp_stats.mean |\n", "+----------------+----------------------+--------------------------+\n", "| float64 | float64 | float64 |\n", "+----------------+----------------------+--------------------------+\n", "| -1.62e+00 | -9.98e-01 | 9.05e+01 |\n", "| -2.50e+00 | -9.99e-01 | 5.63e+01 |\n", "| -3.95e-01 | -1.00e+00 | 9.17e+01 |\n", "| 1.22e+00 | -1.00e+00 | 5.81e+01 |\n", "| -1.18e+00 | -9.97e-01 | 4.87e+01 |\n", "| -1.80e-03 | -1.00e+00 | 4.53e+01 |\n", "| -1.05e+00 | -9.98e-01 | 1.08e+02 |\n", "| 1.87e+01 | -9.91e-01 | 7.18e+01 |\n", "| 3.95e+00 | -1.00e+00 | 6.83e+01 |\n", "| 2.60e+00 | -1.00e+00 | 
7.04e+01 |\n", "+----------------+----------------------+--------------------------+\n", "\n", "+---------------------------+-------------------------+\n", "| variant_qc.dp_stats.stdev | variant_qc.dp_stats.min |\n", "+---------------------------+-------------------------+\n", "| float64 | float64 |\n", "+---------------------------+-------------------------+\n", "| 2.26e+01 | 4.40e+01 |\n", "| 1.36e+01 | 2.80e+01 |\n", "| 1.96e+01 | 4.70e+01 |\n", "| 1.15e+01 | 3.20e+01 |\n", "| 1.02e+01 | 2.60e+01 |\n", "| 9.72e+00 | 2.50e+01 |\n", "| 2.95e+01 | 6.50e+01 |\n", "| 1.52e+01 | 4.10e+01 |\n", "| 1.17e+01 | 3.90e+01 |\n", "| 1.26e+01 | 4.10e+01 |\n", "+---------------------------+-------------------------+\n", "\n", "+-------------------------+--------------------------+\n", "| variant_qc.dp_stats.max | variant_qc.gq_stats.mean |\n", "+-------------------------+--------------------------+\n", "| float64 | float64 |\n", "+-------------------------+--------------------------+\n", "| 3.13e+02 | 9.90e+01 |\n", "| 1.29e+02 | 9.90e+01 |\n", "| 1.76e+02 | 9.90e+01 |\n", "| 1.47e+02 | 9.90e+01 |\n", "| 1.16e+02 | 9.90e+01 |\n", "| 1.23e+02 | 9.90e+01 |\n", "| 4.78e+02 | 9.90e+01 |\n", "| 1.52e+02 | 9.90e+01 |\n", "| 1.35e+02 | 9.90e+01 |\n", "| 1.34e+02 | 9.90e+01 |\n", "+-------------------------+--------------------------+\n", "\n", "+---------------------------+-------------------------+\n", "| variant_qc.gq_stats.stdev | variant_qc.gq_stats.min |\n", "+---------------------------+-------------------------+\n", "| float64 | float64 |\n", "+---------------------------+-------------------------+\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 3.56e-01 | 9.20e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "| 0.00e+00 | 9.90e+01 |\n", "+---------------------------+-------------------------+\n", "\n", 
"+-------------------------+---------------+---------------------+\n", "| variant_qc.gq_stats.max | variant_qc.AC | variant_qc.AF |\n", "+-------------------------+---------------+---------------------+\n", "| float64 | array | array |\n", "+-------------------------+---------------+---------------------+\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "| 9.90e+01 | [385,385] | [5.00e-01,5.00e-01] |\n", "+-------------------------+---------------+---------------------+\n", "\n", "+---------------+-----------------------------+----------------------+\n", "| variant_qc.AN | variant_qc.homozygote_count | variant_qc.call_rate |\n", "+---------------+-----------------------------+----------------------+\n", "| int32 | array | float64 |\n", "+---------------+-----------------------------+----------------------+\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "| 770 | [0,0] | 1.00e+00 |\n", "+---------------+-----------------------------+----------------------+\n", "\n", "+---------------------+-------------------------+-----------------------+\n", "| variant_qc.n_called | variant_qc.n_not_called | variant_qc.n_filtered |\n", "+---------------------+-------------------------+-----------------------+\n", "| int64 | int64 | int64 |\n", "+---------------------+-------------------------+-----------------------+\n", "| 
385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "| 385 | 0 | 0 |\n", "+---------------------+-------------------------+-----------------------+\n", "\n", "+------------------+----------------------+-------------------------+\n", "| variant_qc.n_het | variant_qc.n_non_ref | variant_qc.het_freq_hwe |\n", "+------------------+----------------------+-------------------------+\n", "| int64 | int64 | float64 |\n", "+------------------+----------------------+-------------------------+\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "| 385 | 385 | 5.01e-01 |\n", "+------------------+----------------------+-------------------------+\n", "\n", "+------------------------+-------------------------------+\n", "| variant_qc.p_value_hwe | variant_qc.p_value_excess_het |\n", "+------------------------+-------------------------------+\n", "| float64 | float64 |\n", "+------------------------+-------------------------------+\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "| 2.21e-115 | 2.21e-115 |\n", "+------------------------+-------------------------------+\n", "showing top 10 rows" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rows = mt.rows()\n", "rows.order_by(rows.variant_qc.p_value_hwe).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "\n", "There's a lot of information in the above output. 
Take a moment to look through, and remember, these are **bad-quality variants**. Why do these variants have such low HWE p-values? *Hint: scroll all the way to the right to the variant_qc output*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The VCF `FILTER` field\n", "\n", "You're not entirely on your own for variant QC -- most variant calling software generates its own filtering annotations, which appear in the `FILTER` field of the VCF. Each VCF record either lists the reasons it should be filtered, or is marked `PASS`. Above, almost every one of these bad variants has the `\"InbreedingCoeff\"` filter listed in its `mt.filters` field!\n", "\n", "We can remove all of these pre-filtered variants by keeping only variants which have no filters applied." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "mt = mt.filter_rows(hl.len(mt.filters) == 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then compute the final sample and variant count:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 37:=============================> (2 + 2) / 4]\r" ] }, { "data": { "text/plain": [ "(8538, 385)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mt.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many variants did you lose from your variant QC? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Association Testing and PCA\n", "\n", "The goal of a GWAS is to find associations between genotypes and a trait of interest.\n", "\n", "First, we filter to common variants (those with a minor allele frequency over 1%). GWAS is not well-powered to detect signal from extremely rare variants, like those only observed in one individual.\n", "\n", "Our filter below removes variants where the minimum value of AF is below 1%. 
The AF array computed by `hl.variant_qc` contains one value for each allele, **including the reference**, and sums to 1.0.\n", "\n", "A variant with 5% MAF might have AF values `[0.95, 0.05]` or `[0.05, 0.95]`. A variant with 0.1% MAF might have AF values `[0.999, 0.001]` or `[0.001, 0.999]`. This filter ensures that we account for both cases." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "base = mt\n", "mt = base.filter_rows(hl.min(base.variant_qc.AF) > 0.01)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 39:=============================> (2 + 2) / 4]\r" ] }, { "data": { "text/plain": [ "(5862, 385)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mt.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many variants did you lose from your common variant filter? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that in a GWAS we're running independent association tests on each variant.\n", "\n", "In Hail, the method we use is [hl.linear_regression_rows](https://hail.is/docs/0.2/methods/stats.html#hail.methods.linear_regression_rows). Why isn't this called `hl.gwas`? Modularity! There are applications for this statistical method other than genome wide association studies.\n", "\n", "We will start by using the `sleep_duration` phenotype. At the end, you will have a chance to try the others (you can do that by editing the string below to one of the other phenotypes and rerunning the rest of the notebook). 
The other phenotypes are:\n", "\n", " - 'tea_intake_daily'\n", " - 'general_happiness'\n", " - 'screen_time_per_day'" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "phenotype = 'sleep_duration'" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 44:=============================> (2 + 2) / 4]\r" ] } ], "source": [ "gwas = hl.linear_regression_rows(y=mt.sample_data[phenotype], \n", " x=mt.GT.n_alt_alleles(), \n", " covariates=[1.0] # intercept term\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of the line above is a Hail `Table`. Let's take a look at it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of all the fields in the output, we recommend focusing on the p-value and the beta (effect size) when determining whether you have a GWAS signal.\n", "\n", "**However**, keep in mind that these results may indicate that your model still needs adjustment." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ], "text/plain": [ "+---------------+------------+-------+----------+---------------+-----------+\n", "| locus | alleles | n | sum_x | y_transpose_x | beta |\n", "+---------------+------------+-------+----------+---------------+-----------+\n", "| locus | array | int32 | float64 | float64 | float64 |\n", "+---------------+------------+-------+----------+---------------+-----------+\n", "| chr1:17379 | [\"G\",\"A\"] | 385 | 1.01e+01 | 5.18e+01 | -9.35e-01 |\n", "| chr1:1130877 | [\"C\",\"G\"] | 385 | 1.20e+01 | 7.20e+01 | -2.41e-02 |\n", "| chr1:1491494 | [\"G\",\"A\"] | 385 | 9.00e+00 | 5.30e+01 | -1.38e-01 |\n", "| chr1:1618118 | [\"G\",\"A\"] | 385 | 1.29e+02 | 7.67e+02 | -9.27e-02 |\n", "| chr1:3008858 | [\"A\",\"G\"] | 385 | 2.33e+02 | 1.44e+03 | 1.52e-01 |\n", "| chr1:3254545 | [\"T\",\"C\"] | 385 | 2.83e+02 | 1.76e+03 | 2.53e-01 |\n", "| chr1:3779436 | [\"T\",\"C\"] | 385 | 3.72e+02 | 2.17e+03 | -2.60e-01 |\n", "| chr1:4156721 | [\"C\",\"T\"] | 385 | 2.31e+01 | 1.46e+02 | 2.76e-01 |\n", "| chr1:4510922 | [\"C\",\"T\"] | 385 | 9.30e+01 | 5.67e+02 | 7.22e-02 |\n", "| chr1:4748394 | [\"G\",\"A\"] | 385 | 3.18e+02 | 1.85e+03 | -3.37e-01 |\n", "+---------------+------------+-------+----------+---------------+-----------+\n", "\n", "+----------------+-----------+----------+\n", "| standard_error | t_stat | p_value |\n", "+----------------+-----------+----------+\n", "| float64 | float64 | float64 |\n", "+----------------+-----------+----------+\n", "| 4.64e-01 | -2.02e+00 | 4.45e-02 |\n", "| 4.27e-01 | -5.65e-02 | 9.55e-01 |\n", "| 4.91e-01 | -2.81e-01 | 7.79e-01 |\n", "| 1.28e-01 | -7.26e-01 | 4.68e-01 |\n", "| 1.01e-01 | 1.51e+00 | 1.32e-01 |\n", "| 9.75e-02 | 2.59e+00 | 9.83e-03 |\n", "| 8.92e-02 | -2.91e+00 | 3.78e-03 |\n", "| 2.99e-01 | 9.23e-01 | 3.57e-01 |\n", "| 1.50e-01 | 4.83e-01 | 6.30e-01 |\n", "| 1.02e-01 | -3.30e+00 | 1.07e-03 |\n", "+----------------+-----------+----------+\n", "showing top 10 rows" ] }, "metadata": {}, "output_type": 
"display_data" } ], "source": [ "gwas.show()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 8. Visualization\n", "\n", "Let’s visualize our association test results from the linear regression. We can do so by creating 2 common plots: a [Manhattan plot](https://en.wikipedia.org/wiki/Manhattan_plot) and a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot).\n", "\n", "We'll start with the Manhattan plot:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "manhattan_plot(gwas.p_value)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Manhattan plot interpretation\n", "\n", "The dashed red line is the line for genome-wide significance with a typical Bonferroni correction: 5e-8.\n", "\n", "We have several points above the line -- mouse over to see the loci they refer to. This means we're ready to publish our sleep GWAS in *Nature*, right?\n", "\n", "...right?" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Q-Q plot\n", "The plot that accompanies a Manhattan plot is the Q-Q (quantile-quantile) plot." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "qq_plot(gwas.p_value)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The Q-Q plot can clearly indicate widespread inflation or deflation of p-values. Is either of those properties present here?\n", "\n", "### Exercise 5\n", "\n", "**Is this GWAS well controlled? Discuss with your group.**\n", "\n", "Wikipedia has a good description of [genomic control estimation](https://en.wikipedia.org/wiki/Genomic_control) (lambda GC) to read later." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 9. Confounded! PCA to the rescue.\n", "\n", "If you've done a GWAS before, you've probably included a few other covariates that might be confounders -- age, sex, and principal components.\n", "\n", "Principal components are a measure of genetic ancestry, and can be used to control for [population stratification](https://en.wikipedia.org/wiki/Population_stratification).\n", "\n", "Principal component analysis (PCA) is a very general statistical method for reducing high dimensional data to a small number of dimensions which capture most of the variation in the data. Hail has the function [pca](https://hail.is/docs/0.2/methods/stats.html#hail.methods.pca) for performing generic PCA.\n", "\n", "PCA typically works best on normalized data (e.g. mean centered). Hail provides the specialized function [hwe_normalized_pca](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca) which first normalizes the genotypes according to the Hardy-Weinberg Equilibium model." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 143:=========================================> (12 + 2) / 16]\r" ] } ], "source": [ "pca_scores = pca_scores.checkpoint('resources/pca_scores.ht', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pca function returns three things:\n", "* **eigenvalues**: an array of doubles\n", "* **scores**: a table keyed by sample\n", "* **loadings**: a table keyed by variant\n", "\n", "The **loadings** are the *principal directions*, each of which is a weighted sum of variants (a \"virtual variant\"). By default, `hwe_normalized_pca` gives us 10 principal directions." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joining computed PCs onto our matrix table\n", "\n", "Using the same syntax as we used to join the sample data table `sd` above, we can join the computed scores onto our matrix table:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "mt = mt.annotate_cols(pca = pca_scores[mt.s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize principal components\n", "\n", "Let's make plots!" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(ggplot(mt,\n", " aes(x=mt.pca.scores[0],\n", " y=mt.pca.scores[1],\n", " color=mt.sample_data.continental_pop.upper(),\n", " tooltip=(mt.sample_data.pop, mt.sample_data.continental_pop.upper())))\n", " + geom_point()\n", " + xlab('PC1')\n", " + ylab('PC2'))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(ggplot(mt,\n", " aes(x=mt.pca.scores[2],\n", " y=mt.pca.scores[3],\n", " color=mt.sample_data.continental_pop.upper(),\n", " tooltip=(mt.sample_data.pop, mt.sample_data.continental_pop.upper())))\n", " + geom_point()\n", " + xlab('PC3')\n", " + ylab('PC4'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 10. 
Control confounders and run another GWAS\n", "\n", "Now that we have computed principal components and saved them into our `mt`, let’s run another GWAS that includes PCs as covariates to account for confounding due to population stratification.\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 158:============================> (2 + 2) / 4]\r" ] } ], "source": [ "gwas2 = hl.linear_regression_rows(\n", " y=mt.sample_data[phenotype],\n", " x=mt.GT.n_alt_alleles(),\n", " covariates=[1.0, mt.pca.scores[0], mt.pca.scores[1], mt.pca.scores[2]])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "qq_plot(gwas2.p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above Q-Q plot indicates a much better-controlled GWAS. Let's try the Manhattan plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "manhattan_plot(gwas2.p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Are there any significant hits? Do you trust these results? Why or why not?\n", "\n", "Discuss with your group: roughly how many samples would you need to discover rare-variant signal? Polygenic signal?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6\n", "\n", "Change the \"gwas2\" cell to experiment with how many principal components are needed as covariates to properly control this GWAS. How many are needed here in our tiny simulated example? How many are needed in a typical GWAS?" 
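As a closing sketch of the statistics behind `hl.linear_regression_rows`: at each variant it fits an ordinary least squares model of phenotype on genotype plus the covariates you supply. Here is a toy numpy/scipy version for a single variant on simulated data. The names and data are hypothetical, and Hail's real implementation is distributed and heavily optimized, but the per-variant statistics come from the same OLS machinery:

```python
import numpy as np
from scipy import stats


def single_variant_test(genotype, phenotype, covariates):
    """OLS of phenotype on [intercept, covariates, genotype].

    Returns (beta, t_stat, p_value) for the genotype term -- the same trio
    of per-variant statistics a GWAS linear regression reports.
    """
    X = np.column_stack([np.ones(len(genotype)), covariates, genotype])
    beta_hat, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ beta_hat
    dof = len(phenotype) - X.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
    t_stat = beta_hat[-1] / se
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return beta_hat[-1], t_stat, p_value


# Simulate one variant with a true effect of 0.5 and one confounding
# covariate, playing the role of covariates=[1.0, mt.pca.scores[0], ...].
rng = np.random.default_rng(1)
genotype = rng.binomial(2, 0.3, size=500).astype(float)
covariate = rng.normal(size=500)
phenotype = 0.5 * genotype + 0.2 * covariate + rng.normal(scale=0.5, size=500)
beta, t, p = single_variant_test(genotype, phenotype, covariate[:, None])
```

Adding more principal components to this model simply appends more columns to `X`; each extra column absorbs one more axis of ancestry variation at the cost of one degree of freedom.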
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "242px", "width": "283px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }