Practical 2: The Hail expression language

The cells of this practical can be entered (by cut and paste) into the IPython console.

Before entering the first cell, make sure you have changed to the directory hail-practical. Skip the first cell if you haven't closed the IPython console since running the last practical.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from math import log, isnan
from pprint import pprint
import matplotlib.patches as mpatches

from hail import *

%matplotlib inline

def qqplot(pvals):
    spvals = sorted([x for x in pvals if x and not(isnan(x))])
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.scatter(exp, obs)
    plt.plot(np.arange(0, max(max(exp), max(obs))), c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(xmin=0)
    plt.ylim(ymin=0)

hc = HailContext()

Hail Expression Language

The Hail expression language is used everywhere in Hail: filtering conditions, describing covariates and phenotypes, generating synthetic data, plotting, exporting, etc. You can evaluate a Hail expression with the HailContext method eval_expr_typed. eval_expr_typed returns a tuple with the result of evaluating the expression and the type of the expression. Use eval_expr if you just want the value. We'll use eval_expr_typed throughout so you can become more comfortable with types in Hail.

Primitive Types

Let's start with simple primitive types: Boolean, Int, Double, String. Hail expressions are passed as Python strings to Hail methods.

hc.eval_expr_typed('true') # the Boolean literals are true and false

The return value is True, not true. Why? When values are returned by Hail methods, they are automatically converted to the corresponding Python value.

String literals are denoted with double-quotes.

Note, we use variables a, b, ... so you don't have to cut and paste quite so many cells.

a = hc.eval_expr_typed('123')
b = hc.eval_expr_typed('123.45')
c = hc.eval_expr_typed('"Hello, world"')

print a
print b
print c

Exercise

Primitive types support all the usual operations you'd expect. For details, refer to the documentation on functions, operators and types. What's the difference between operators and functions? Operators are symbols like + and * that are written infix and functions have names and are called with parens like f(5).

Try a few simple expressions with operators on primitives. Divide two integers. Can you compare strings? You can concatenate strings with +. What's the log base 10 of 1024? (Hint: it's a function.)

Experiment with some expressions by filling in <?>.

hc.eval_expr_typed('<?>')

Missingness

As in R, any value in Hail can be missing. Most operations, like addition, return missing if any of their inputs is missing. There are a few special operations for manipulating missing values. There is also a missing literal, NA, but you have to specify its type. Remember, e: Int just means that e has type Int. Missing Hail values are converted to None in Python. You can test missingness with isDefined and isMissing.

Before you evaluate these, guess what the result will be.

Here are some examples:

a = hc.eval_expr_typed('NA: Int') # missing Int

b = hc.eval_expr_typed('1 + NA: Int')

c = hc.eval_expr_typed('isDefined(1)')

d = hc.eval_expr_typed('isDefined(NA: Int)')

e = hc.eval_expr_typed('isMissing(NA: Double)')

print a
print b
print c
print d
print e
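
Once you have made your guesses, the rules above can be sketched in plain Python. This is only an illustration, not Hail code: None stands in for a missing value, and add, is_defined and is_missing are hypothetical helpers that mimic +, isDefined and isMissing.

```python
# Plain-Python model of Hail missingness (an illustration, not Hail code).
# None stands in for a missing Hail value.
NA = None  # hypothetical stand-in for 'NA: Int' or 'NA: Double'

def add(x, y):
    """Addition that returns missing (None) if either input is missing."""
    if x is None or y is None:
        return None
    return x + y

def is_defined(x):
    """True when the value is not missing."""
    return x is not None

def is_missing(x):
    """True when the value is missing."""
    return x is None

results = [
    NA,              # models NA: Int
    add(1, NA),      # models 1 + NA: Int
    is_defined(1),   # models isDefined(1)
    is_defined(NA),  # models isDefined(NA: Int)
    is_missing(NA),  # models isMissing(NA: Double)
]
print(results)  # [None, None, True, False, True]
```

The list models only the value half of each tuple returned by eval_expr_typed; the type half has no counterpart in this sketch.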

Let

You can assign a value to a variable with a let expression. Here is an example.

hc.eval_expr_typed('let a = 5 in a + 1')

The variable, here a, is only visible in the body of the let, that is, the expression following in. You can assign multiple variables. Variable assignments are separated by and. Each variable is visible in the right-hand side of the following variables as well as in the body of the let.

Python triple quote strings can span multiple lines. This can be useful for writing long Hail expressions.

For example:

hc.eval_expr_typed('''
let a = 5
and b = a + 1
 in a * b
''')
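
The scoping rule can be mimicked in plain Python with nested functions. This is a rough analogy rather than Hail code: each lambda parameter plays the role of one let variable and is visible only inside the expression that follows it.

```python
# Plain-Python analogy for 'let a = 5 and b = a + 1 in a * b'.
# The outer lambda binds a; the inner lambda binds b, whose right-hand
# side (a + 1) can already see a. The body a * b can see both.
result = (lambda a: (lambda b: a * b)(a + 1))(5)
print(result)  # 30

# Outside the expression neither name exists, just as a let variable
# is not visible outside the body of its let.
leaked = "a" in dir() or "b" in dir()
print(leaked)  # False
```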

Conditionals

Unlike in many other languages, conditionals in Hail are expressions: they return a value. The arms of the conditional must have the same type. The predicate must be of type Boolean. If the predicate is missing, the value of the entire conditional is missing. This differs from R, where a missing condition is an error. Here are some simple examples.

a = hc.eval_expr_typed('if (true) 1 else 2')

b = hc.eval_expr_typed('if (false) 1 else 2')

c = hc.eval_expr_typed('if (NA: Boolean) 1 else 2')

print a
print b
print c

hc.eval_expr_typed('if (true) 1 else "two"') # type error, Int and String incompatible
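
The three rules for conditionals can be sketched in plain Python. The helper hail_if below is hypothetical and only models the behavior described above, with None standing in for a missing value. Two differences are worth keeping in mind: Hail reports a type mismatch when the expression is type checked, while this sketch checks at run time, and a Python function evaluates both arms before choosing one.

```python
def hail_if(predicate, then_value, else_value):
    """Hypothetical model of a Hail conditional; None models a missing value."""
    # The predicate must be Boolean (or missing).
    if predicate is not None and not isinstance(predicate, bool):
        raise TypeError("predicate must be Boolean")
    # Both arms must have the same type.
    if type(then_value) is not type(else_value):
        raise TypeError("arms of the conditional must have the same type")
    # A missing predicate makes the whole conditional missing.
    if predicate is None:
        return None
    return then_value if predicate else else_value

a = hail_if(True, 1, 2)   # models if (true) 1 else 2
b = hail_if(False, 1, 2)  # models if (false) 1 else 2
c = hail_if(None, 1, 2)   # models if (NA: Boolean) 1 else 2
print([a, b, c])  # [1, 2, None]

# Mixing an Int arm with a String arm is rejected.
try:
    hail_if(True, 1, "two")
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```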

Arrays

Hail has several compound types: Array[T], Set[T], Dict[K, V], Structs and Aggregable[T]. T, K and V can be any type, including other compound types. Values of type Array[T] are similar to Python's lists, except they must be homogeneous: that is, each element must be of the same type. Arrays are 0-indexed. Here are some examples of simple array expressions.

Array literals are constructed with square brackets.

Arrays are indexed with square brackets and support Python's slice syntax.

a = hc.eval_expr_typed('[1, 2, 3, 4, 5]')

b = hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[0]')

c = hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:3]')

d = hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:]') # slice to the end, a[:4] to slice from the beginning

e = hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a.length')

print a
print b
print c
print d
print e
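
For comparison, here are the same operations on a Python list. This is a sketch under the assumption that Hail slices follow Python's half-open convention (the start index is included, the end index is excluded), which is what sharing Python's slice syntax suggests; compare the values with the Hail results you just printed.

```python
# The same indexing and slicing operations on a plain Python list.
a = [1, 2, 3, 4, 5]

first = a[0]     # indexing is 0-based
middle = a[1:3]  # elements at positions 1 and 2
tail = a[1:]     # from position 1 to the end
head = a[:4]     # from the beginning up to, but not including, position 4
n = len(a)       # Python's counterpart of a.length

print(first)   # 1
print(middle)  # [2, 3]
print(tail)    # [2, 3, 4, 5]
print(head)    # [1, 2, 3, 4]
print(n)       # 5
```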

Arrays can be transformed with the functional operators filter and map. These operations return a new array; they never modify the original.

# keep the elements that are less than 10
a = hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10)')

# square the elements of an array
b = hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x * x)')

print a
print b
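
If the x => ... notation is unfamiliar, it may help to see the same two transformations as Python list comprehensions. This is plain Python rather than Hail; the point is that each one builds a new list and leaves the input untouched.

```python
original = [1, 2, 22, 7, 10, 11]

# Counterpart of a.filter(x => x < 10): keep the elements less than 10.
small = [x for x in original if x < 10]

# Counterpart of a.map(x => x * x): square every element.
squares = [x * x for x in original]

print(small)     # [1, 2, 7]
print(squares)   # [1, 4, 484, 49, 100, 121]
print(original)  # [1, 2, 22, 7, 10, 11], unchanged
```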

The full list of methods on arrays can be found in the Hail documentation on types.

Numeric Arrays

Numeric arrays, like Array[Int] and Array[Double], have additional operations like max, mean, median and sort. For a full list, see, for example, Array[Int]. Here are a few examples.

a = hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].sum()')

b = hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].max()')

print a
print b

Structs

Structs are collections of named values known as fields. Hail does not have tuples like Python. Unlike arrays, the values can be heterogeneous. Unlike Dicts, the set of names is part of the type and must be known statically. Structs are constructed with a syntax similar to Python's dict syntax. Struct fields are accessed using the . syntax.

x, t = hc.eval_expr_typed('{gene: "ACBD", function: "LOF", nHet: 12}')
print x
print t

hc.eval_expr_typed('let s = {gene: "ACBD", function: "LOF", nHet: 12} in s.gene')
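
A rough Python analogy for a struct is a record type whose field names are fixed when the type is defined. The sketch below uses collections.namedtuple purely for its named fields; the type name Annotation is made up for this illustration. One difference: Hail rejects an unknown field when the expression is type checked, while Python only notices at run time.

```python
from collections import namedtuple

# Hypothetical record type: the field names are part of the type,
# and the field values may have different types (two strings and an int).
Annotation = namedtuple("Annotation", ["gene", "function", "nHet"])

s = Annotation(gene="ACBD", function="LOF", nHet=12)
gene = s.gene  # fields are accessed with the . syntax, as in Hail
print(gene)    # ACBD

# A field that is not part of the type cannot be accessed.
try:
    s.nHom
    unknown_field_rejected = False
except AttributeError:
    unknown_field_rejected = True
print(unknown_field_rejected)  # True
```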

Exercises

Let's do a series of exercises to transform an array. First, fill in the <?> below to compute the mean of a. The mean is often denoted by the Greek letter mu.

hc.eval_expr_typed('''
let a = [1, -2, 11, 3, -2] 
and mu = <?>
 in mu''')

Second, let's compute the variance of a. Remember, the variance is the mean of the squared differences from the mean, so each squared difference in the sum below needs to be divided by the number of elements, a.length. You'll need the mean you computed above. Note, there is currently no square operation in Hail, so you can use multiplication or the pow function (for example, pow(2, 3) == 8).

hc.eval_expr_typed('''
let a = [1, -2, 11, 3, -2]
and mu = <?>
and var = a.map(x => <?>).sum()
 in var
''')

Finally, put it all together to return a struct that contains the mean, the variance and the array a after mean-centering and variance-normalizing (Z-scores).

hc.eval_expr_typed('''
let a = [1, -2, 11, 3, -2]
and mu = <?>
and var = a.map(x => <?>).sum()
and norm = <?>
 in {mean: <?>, variance: <?>, normalized: <?>}
''')
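
To check your answers, here is a reference calculation in plain Python rather than Hail. It is a sketch under two assumptions: the variance is the population variance (the mean of the squared deviations), and a Z-score divides each centered value by the standard deviation, the square root of the variance.

```python
from math import sqrt

a = [1, -2, 11, 3, -2]
n = float(len(a))  # float so that division is not truncated in Python 2

mu = sum(a) / n                          # mean
var = sum((x - mu) ** 2 for x in a) / n  # population variance
sd = sqrt(var)                           # standard deviation
normalized = [(x - mu) / sd for x in a]  # Z-scores

print(round(mu, 2))   # 2.2
print(round(var, 2))  # 22.96
print([round(z, 3) for z in normalized])
```

A quick sanity check on any answer: the normalized array should have mean 0 and variance 1, up to floating-point rounding.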

What if a contains a missing value NA: Int? Will your code still work?