{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dab0261a",
   "metadata": {},
   "source": [
    "# Institute for Behavioral Genetics International Statistical Genetics 2023 Workshop \n",
    "\n",
    "## Advanced Hail Functionality\n",
    "\n",
    "This notebook is a grab bag of more advanced Hail functionality."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d8010a4",
   "metadata": {},
   "source": [
    "### Approximate CDF\n",
    "\n",
    "Normally computing quantiles or the median requires sorting an entire dataset. Hail uses a sophisticated data structure to get provably good approximations of all quantiles without sorting the data, providing buckets, or using unbounded memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b271722d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 6G pyspark-shell'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b494f2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import hail as hl\n",
    "hl.plot.output_notebook()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1865757",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/hgdp-tgp-rare-variants.mt')\n",
    "mt = hl.variant_qc(mt)\n",
    "call_rate_cdf = mt.aggregate_rows(hl.agg.approx_cdf(mt.variant_qc.call_rate))\n",
    "call_rate_cdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f3a0875",
   "metadata": {},
   "outputs": [],
   "source": [
    "cdf_plot = hl.plot.cdf(call_rate_cdf, title='Approximate CDF of Call Rate', legend='call rate')\n",
    "hl.plot.show(cdf_plot)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ff134bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/hgdp-tgp-rare-variants.mt')\n",
    "mt = hl.variant_qc(mt)\n",
    "ref_af_cdf = mt.aggregate_rows(hl.agg.approx_cdf(mt.variant_qc.AF[0]))\n",
    "cdf_plot = hl.plot.cdf(ref_af_cdf, title='Approximate CDF of Reference AF', legend='af')\n",
    "hl.plot.show(cdf_plot)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f4b1b1c",
   "metadata": {},
   "source": [
    "You can also ask directly for the median:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "806a373b",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt.aggregate_rows(hl.agg.approx_median(mt.variant_qc.AF[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03740e4d",
   "metadata": {},
   "source": [
    "### PCA on Unusual Values\n",
    "\n",
    "Flexible, general-purpose methods enable analysts to explore data sets with novel statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18aa97e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/hgdp-subset-3.mt')\n",
    "mt = mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)))\n",
    "mt = mt.annotate_entries(\n",
    "    is_missing = hl.is_missing(mt.GT)\n",
    ")\n",
    "mt = mt.annotate_rows(\n",
    "    is_missing_stats = hl.agg.stats(mt.is_missing)\n",
    ")\n",
    "mt = mt.annotate_entries(\n",
    "    normed_is_missing = (mt.is_missing - mt.is_missing_stats.mean) / mt.is_missing_stats.stdev\n",
    ")\n",
    "_, scores, _ = hl.pca(mt.normed_is_missing, k=2)\n",
    "hl.plot.show(hl.plot.scatter(scores.scores[0], scores.scores[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0dfc344a",
   "metadata": {},
   "source": [
    "### LD Prune"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2379a82a",
   "metadata": {},
   "outputs": [],
   "source": [
    "?hl.ld_prune"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ee30742",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')\n",
    "print(f'Before pruning we have: {mt.count_rows()}')\n",
    "pruned_variants = hl.ld_prune(mt.GT)\n",
    "pruned_variants.write('output/pruned_variants.ht', overwrite=True)\n",
    "pruned_variants = hl.read_table('output/pruned_variants.ht')\n",
    "\n",
    "mt = mt.filter_rows(hl.is_defined(pruned_variants[mt.row_key]))\n",
    "print(f'After pruning we have: {mt.count_rows()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12d4d72d",
   "metadata": {},
   "source": [
    "### Kinship Estimators\n",
    "\n",
    "Hail supports a number of different kinship estimators."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dd89211",
   "metadata": {},
   "source": [
    "Getting PC Relate to produce good-looking results is tricky! Here we see what happens when you don't quality control the variants well enough."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e25c569b",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')\n",
    "\n",
    "pc_kin = hl.pc_relate(mt.GT, 0.1, k=4, statistics='kin20', min_kinship=0.1)\n",
    "pc_kin.write('output/pc_kin.ht', overwrite=True)\n",
    "pc_kin = hl.read_table('output/pc_kin.ht')\n",
    "\n",
    "hl.plot.show(\n",
    "    hl.plot.scatter(\n",
    "        pc_kin.kin,\n",
    "        pc_kin.ibd0,\n",
    "        width=400,\n",
    "        height=400,\n",
    "        size=3\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "afa1f340",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')\n",
    "\n",
    "king_kin = hl.king(mt.GT)\n",
    "king_kin = king_kin.filter_entries(king_kin.phi > 0.1).entries()\n",
    "king_kin.write('output/king_kin.ht', overwrite=True)\n",
    "king_kin = hl.read_table('output/king_kin.ht')\n",
    "\n",
    "hl.plot.show(\n",
    "    hl.plot.histogram(\n",
    "        king_kin.phi\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8fa8d52",
   "metadata": {},
   "outputs": [],
   "source": [
    "king_kin.filter(king_kin.phi < 0.45).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd17e988",
   "metadata": {},
   "source": [
    "Hail also supports identity-by-descent calculation but it's currently broken for the new Apple M1 chips because it uses some fast native code that hasn't been compiled for M1 yet. Expect a fix soon!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97401517",
   "metadata": {},
   "source": [
    "### Polygenic Score Calculation\n",
    "\n",
    "In this section, I import a height polygenic score from the [PGS Catalog](https://www.pgscatalog.org/score/PGS000297/), and use it to calculate the polygenic score in our toy dataset. Our toy dataset does not have enough shared variants with the score to produce useful estimates, but the code below could be effectively applied to a larger, quality-controlled dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1da42e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "ht = hl.import_table('resources/height-polygenic-score.txt', comment='#', impute=True)\n",
    "ht = ht.key_by(\n",
    "    locus = hl.locus(hl.str(ht.chr_name), ht.chr_position)\n",
    ")\n",
    "ht.write('output/height-polygenic-score.ht', overwrite=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fcfdfe2",
   "metadata": {},
   "outputs": [],
   "source": [
    "ht = hl.read_table('output/height-polygenic-score.ht')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2064f4ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "ht.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18d7ee6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/1kg.mt')\n",
    "mt = hl.variant_qc(mt)\n",
    "mt = mt.annotate_rows(score=ht[mt.locus])\n",
    "\n",
    "mt = mt.annotate_rows(is_flipped = (\n",
    "    hl.case()\n",
    "    .when(mt.score.effect_allele == mt.alleles[0], True)\n",
    "    .when(mt.score.effect_allele == mt.alleles[1], False)\n",
    "    .or_missing()\n",
    "))\n",
    "mt = mt.annotate_rows(\n",
    "    mean_gt=2 * hl.if_else(mt.is_flipped, mt.variant_qc.AF[0], mt.variant_qc.AF[1])\n",
    ")\n",
    "mt = mt.annotate_entries(\n",
    "    n_effect_alleles = hl.if_else(\n",
    "        mt.is_flipped,\n",
    "        2 - mt.GT.n_alt_alleles(),\n",
    "        mt.GT.n_alt_alleles()\n",
    "    )\n",
    ")\n",
    "mt = mt.annotate_cols(\n",
    "    prs = hl.agg.sum(mt.score.effect_weight * hl.coalesce(mt.n_effect_alleles, mt.mean_gt)),\n",
    "    n_useful_variants = hl.agg.sum(hl.is_defined(mt.score.effect_weight))\n",
    ")\n",
    "mt.cols().show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6d2c029",
   "metadata": {},
   "source": [
    "### LD Score\n",
    "\n",
    "Hail also has utilities for simulating phenotypes, calculating LD Scores, and running LD Score regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa1899de",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')\n",
    "mt = hl.experimental.ldscsim.simulate_phenotypes(mt, mt.GT, h2=0.5)\n",
    "mt.y.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b08e71b",
   "metadata": {},
   "outputs": [],
   "source": [
    "betas = hl.linear_regression_rows(y=mt.y, x=mt.GT.n_alt_alleles(), covariates=[1.0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae41da93",
   "metadata": {},
   "outputs": [],
   "source": [
    "betas.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43676851",
   "metadata": {},
   "outputs": [],
   "source": [
    "?hl.experimental.ld_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "058d2574",
   "metadata": {},
   "outputs": [],
   "source": [
    "ht_scores = hl.experimental.ld_score(entry_expr=mt.GT.n_alt_alleles(),\n",
    "                                     locus_expr=mt.locus,\n",
    "                                     radius=1e6)\n",
    "\n",
    "\n",
    "betas = betas.annotate(z_score = betas.beta / betas.standard_error)\n",
    "betas = betas.annotate(chi_sq_statistic = betas.z_score ** 2)\n",
    "\n",
    "ht = mt.rows()\n",
    "\n",
    "ht_results = hl.experimental.ld_score_regression(\n",
    "    weight_expr=ht_scores[ht.locus].univariate,\n",
    "    ld_score_expr=ht_scores[ht.locus].univariate,\n",
    "    chi_sq_exprs=betas[ht.key].chi_sq_statistic,\n",
    "    n_samples_exprs=betas[ht.key].n\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7eff852e",
   "metadata": {},
   "outputs": [],
   "source": [
    "ht_results.write('output/ldsr.ht', overwrite=True)\n",
    "ldsr = hl.read_table('output/ldsr.ht')\n",
    "ldsr.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15cda7b8",
   "metadata": {},
   "source": [
    "### Annotation Database\n",
    "\n",
    "The Hail team maintains a database of common variant annotations in Google Cloud Storage and S3. These commands will only work when executed inside a cluster with access to Google Cloud Storage or S3. They will not work on your laptop.\n",
    "\n",
    "A full list of available annotations can be found [in the Hail docs](https://hail.is/docs/0.2/annotation_database_ui.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76ffaa47",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/1kg.mt')\n",
    "\n",
    "db = hl.experimental.DB(region='us', cloud='aws')\n",
    "mt = db.annotate_rows_db(\n",
    "    mt,\n",
    "    'CADD', 'GTEx_eQTL_Adipose_Subcutaneous_all_snp_gene_associations', 'gnomad_ld_scores_afr'\n",
    ")\n",
    "mt.rows().show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53f56f01",
   "metadata": {},
   "source": [
    "### VEP\n",
    "\n",
    "Hail also supports VEP annotation. This requires a specially configured cluster. The code below will not work on your laptop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1f6de50",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/1kg.mt')\n",
    "mt = hl.vep(mt)\n",
    "mt.vep.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9444c931",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')\n",
    "mt.vep.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0473393",
   "metadata": {},
   "outputs": [],
   "source": [
    "mt = mt.annotate_rows(\n",
    "    interesting_cnsq = mt.vep.transcript_consequences.find(lambda x: x.consequence_terms.contains(\"stop_gained\"))\n",
    ")\n",
    "mt = mt.filter_rows(hl.is_defined(mt.interesting_cnsq))\n",
    "mt.interesting_cnsq.show(n=30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2363e8e9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec26dbcc",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}