Slack Export - #day08-sequencing-introduction-to-hail

Jeff Lessem (he/him) (jeff.lessem@colorado.edu)

2021-04-26 10:18:47

@Jeff Lessem (he/him) has joined the channel

Kumar Veerapen (veerapen@broadinstitute.org)

2021-04-30 09:56:14

@Kumar Veerapen has joined the channel

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-04-30 09:56:14

@Tim Poterba (he/him) has joined the channel

John Compitello (johnc@broadinstitute.org)

2021-04-30 09:56:14

@John Compitello has joined the channel

Daniel Goldstein (dgoldste@broadinstitute.org)

2021-04-30 09:56:14

@Daniel Goldstein has joined the channel

Gunn-Helen Moen (g.moen@uq.edu.au)

2021-05-03 13:13:41

@Gunn-Helen Moen has joined the channel

Mark Adams (mark.adams@ed.ac.uk)

2021-05-04 02:16:41

@Mark Adams has joined the channel

Carolin Diaz (carolin@broadinstitute.org)

2021-05-06 07:23:53

@Carolin Diaz has joined the channel

Test Student (test-student@ibg.colorado.edu)

2021-05-06 11:38:57

@Test Student has joined the channel

Bridget Joyner (bnj13@my.fsu.edu)

2021-05-10 13:00:14

@Bridget Joyner has joined the channel

Sally Kuo (ickuo@vcu.edu)

2021-05-10 13:30:20

@Sally Kuo has joined the channel

Aislinn Bowler (aislinnbowler@gmail.com)

2021-05-10 13:30:27

@Aislinn Bowler has joined the channel

Morgan Driver (driverm@vcu.edu)

2021-05-10 13:31:03

@Morgan Driver has joined the channel

Sarah Brislin (she/her) (sarah.brislin@gmail.com)

2021-05-10 13:31:37

@Sarah Brislin (she/her) has joined the channel

Lisa Dinkler (lisa.dinkler@gu.se)

2021-05-10 13:31:42

@Lisa Dinkler has joined the channel

Katie Bountress (kaitlin.bountress@vcuhealth.org)

2021-05-10 13:32:21

@Katie Bountress has joined the channel

Peter Tanksley (peter.tanksley@austin.utexas.edu)

2021-05-10 13:32:32

@Peter Tanksley has joined the channel

Tong Chen (tuc548@psu.edu)

2021-05-10 13:34:04

@Tong Chen has joined the channel

Charlotte Viktorsson (viktorsson.charlotte@gmail.com)

2021-05-10 13:34:34

@Charlotte Viktorsson has joined the channel

Jacob Kunkel (kunke104@umn.edu)

2021-05-10 13:35:31

@Jacob Kunkel has joined the channel

Matthieu de Hemptinne (matthieu.dehemptinne@gmail.com)

2021-05-10 13:36:00

@Matthieu de Hemptinne has joined the channel

Jay Ross (jay.ross@mail.mcgill.ca)

2021-05-10 13:38:33

@Jay Ross has joined the channel

Sam Freis (she/her) (Samantha.Freis@colorado.edu)

2021-05-10 13:38:41

@Sam Freis (she/her) has joined the channel

Jeremy Elman (jaelman@health.ucsd.edu)

2021-05-10 13:38:55

@Jeremy Elman has joined the channel

Spencer Moore (spmo3925@colorado.edu)

2021-05-10 13:39:52

@Spencer Moore has joined the channel

Maizy Brasher (mabr7162@colorado.edu)

2021-05-10 13:39:52

@Maizy Brasher has joined the channel

Jenny Phan (jphan5@wisc.edu)

2021-05-10 13:39:58

@Jenny Phan has joined the channel

Meng Huang (meng.huang.cn@gmail.com)

2021-05-10 13:41:18

@Meng Huang has joined the channel

Jung Chen (jchen378@ucmerced.edu)

2021-05-10 13:41:58

@Jung Chen has joined the channel

Stephanie Zellers (she/her/hers) (zelle063@umn.edu)

2021-05-10 13:42:17

@Stephanie Zellers (she/her/hers) has joined the channel

Grace Wu (yakew@email.unc.edu)

2021-05-10 13:42:31

@Grace Wu has joined the channel

Gladi Thng (s2124928@ed.ac.uk)

2021-05-10 13:43:47

@Gladi Thng has joined the channel

Zoe Schmilovich (zoe.schmilovich@mail.mcgill.ca)

2021-05-10 13:43:50

@Zoe Schmilovich has joined the channel

Olivia Rennie (olivia.rennie@alum.utoronto.ca)

2021-05-10 13:43:57

@Olivia Rennie has joined the channel

Christina Sheerin (Christina.sheerin@vcuhealth.org)

2021-05-10 13:43:59

@Christina Sheerin has joined the channel

William McAuliffe (williamhbmcauliffe@gmail.com)

2021-05-10 13:44:17

@William McAuliffe has joined the channel

Chloe Myers (cmyer011@ucr.edu)

2021-05-10 13:44:20

@Chloe Myers has joined the channel

Francis Vergunst (he/him) (francis.vergunst@umontreal.ca)

2021-05-10 13:44:33

@Francis Vergunst (he/him) has joined the channel

Ravi Bhatt (ravibot93@gmail.com)

2021-05-10 13:44:48

@Ravi Bhatt has joined the channel

Nathan Bell (n.y.bell@student.vu.nl)

2021-05-10 14:46:29

@Nathan Bell has joined the channel

Emil Uffelmann (e.uffelmann@vu.nl)

2021-05-10 14:46:47

@Emil Uffelmann has joined the channel

Kristen Kelly (k.m.kelly@vu.nl)

2021-05-10 14:47:49

@Kristen Kelly has joined the channel

Jeff Lessem (he/him) (jeff.lessem@colorado.edu)

2021-05-11 09:46:10

The web page with the video lectures for Sequencing and Introduction to Hail is at https://www.colorado.edu/ibg/international-workshop/2021-international-statistical-genetics-workshop/syllabus/day-8-wednesday

Institute for Behavioral Genetics

Day 8 Wednesday June 16, 2021 (GMT)

Topic: Sequencing and Introduction to Hail Lead Fa

Original URL: https://www.colorado.edu/ibg/international-workshop/2021-international-statistical-genetics-workshop/syllabus/day-8-wednesday

Jeff Lessem (he/him) (jeff.lessem@colorado.edu)

2021-06-08 15:28:32

@Jeff Lessem (he/him) has renamed the channel from "sequencing-introduction-to-hail" to "day08-sequencing-introduction-to-hail"

🙌 Kumar Veerapen

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-15 15:18:49

@channel excited to see y’all tomorrow for our workshop session on sequencing and Hail. A few notes: remember to review the lecture on why sequencing analysis is important and why to use Hail (https://hail.is/):

https://www.youtube.com/watch?v=2N_VqmX22Xg&list=PL-A34BVyxWtXn9nxuj8Gk1yRfxhpdZ4y2

Review the python tutorial from Cotton. Why? Because Hail uses Python: https://www.youtube.com/watch?v=QIaunoHeP9Q

However, for the session, we will not be expecting too much coding from you. What we do want you to take away from the session is the ability to understand what each piece of code does. And obviously, to know how awesome Hail is for the analysis of sequencing data.

Finally, the practical sessions will be run on Google cloud services where we will provide a link and password to you tomorrow. All you need is a functioning web browser (and obviously internet access).

After the session, we will share the scripts and additional material (if any) via a github repo and a link on our Hail website.

If you have any questions in regards to the lectures that you have reviewed, in preparation for tomorrow’s session, and/or after tomorrow’s session, feel free to send us a slack message in this channel that’s solely for your teaching and learning experience in Sequencing and Hail.

YouTube

International Statistical Genetics Workshop (https://www.youtube.com/channel/UCoWilKMAP8sJpD2jHuHZEbw)

Sequencing and Hail Part 1

Original URL: https://www.youtube.com/watch?v=2N_VqmX22Xg&list=PL-A34BVyxWtXn9nxuj8Gk1yRfxhpdZ4y2

YouTube

International Statistical Genetics Workshop (https://www.youtube.com/channel/UCoWilKMAP8sJpD2jHuHZEbw)

Introduction to Python

Original URL: https://www.youtube.com/watch?v=QIaunoHeP9Q

🙌 Lucía Colodro-Conde

Sage Hawn (shawn1@bu.edu)

2021-06-15 16:52:05

I think a portion of the video differentiating genotyping from sequencing was cut out. Could you please clarify the distinction?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-15 18:36:54

*Thread Reply:* Genotyping chips measure a small subset of the sites in the genome, which have been selected ahead of time to be sites of common variation that contain a lot of information. This is a cheap technology, but there’s a drawback - there’s not much information about rare variation.

The most common sequencing technology is high-throughput sequencing, also called short-read sequencing or shotgun sequencing. In this technology, the genome is chopped up into small (<150BP) chunks, each of which is sequenced separately. The billions of short reads are then aligned to a human reference genome, and after a processing pipeline (“variant calling”) you get information about every site in the genome (or exome) that contains the best-guess alleles for each individual, and some metadata about the sequencing process.

🙌 Kumar Veerapen, Ravi Bhatt, Jet Termorshuizen
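The short-read pipeline described above (chop the genome into reads, align them to a reference, then call variants) can be sketched in plain Python. This is a toy under strong assumptions: the reference string, the reads, and the helper names `best_alignment`, `pileup` and `call_variants` are all made up for illustration, alignment is a brute-force mismatch count, and real aligners and variant callers are far more sophisticated.

```python
from collections import Counter, defaultdict

def best_alignment(read, reference):
    """Return the reference offset with the fewest mismatches (naive scan)."""
    best_pos, best_mismatches = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos

def pileup(reads, reference):
    """Count the bases observed at each reference position."""
    columns = defaultdict(Counter)
    for read in reads:
        pos = best_alignment(read, reference)
        for offset, base in enumerate(read):
            columns[pos + offset][base] += 1
    return columns

def call_variants(columns, reference):
    """Report positions where at least one non-reference base was seen."""
    calls = {}
    for pos, counts in sorted(columns.items()):
        if any(base != reference[pos] for base in counts):
            calls[pos] = dict(counts)
    return calls

# Hypothetical 16-base reference and five 6-base reads; two reads carry a
# T>C difference at position 8 (0-based).
reference = "ACGTTGCATGCCAGTA"
reads = ["ACGTTG", "TGCATG", "TGCACG", "CACGCC", "GCCAGT"]

calls = call_variants(pileup(reads, reference), reference)
print(calls)
```

At position 8 the pileup holds both the reference base and the alternate base, which is the kind of evidence a genotype call for a heterozygous sample is built from.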

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-15 18:37:28

*Thread Reply:* Sequencing data is much bigger than genotype data, even for the same sample size, because so much more information is collected from each person. This makes it tougher to analyze.

🙌 Kumar Veerapen

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-15 19:26:16

*Thread Reply:* @Sage Hawn if you compare the slide deck available on the main page with the video, do you recall which slide was cut out?

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-15 19:26:27

*Thread Reply:* Thanks, @Tim Poterba (he/him) ! Always to the rescue!

matthew keller (matthew.c.keller@gmail.com)

2021-06-15 20:57:32

*Thread Reply:* Re array vs. sequence data, I agree the main distinction by far is more rare variation in sequence. But sequence data also has much more info on non-SNP variants (indels). Do you agree @Tim Poterba (he/him)? And I’m not sure about the detection of larger CNVs and inversions/translocations in sequence data - can someone comment on that?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-15 21:00:50

*Thread Reply:* That’s a great point! I’m often guilty of forgetting structural variation. Having worked on copy number variant calling from genotype data, it’s easy to find large deletions/duplications from genotype data, but sequencing data allows you to observe small deletions/duplications (at the level of a single base), and complex events like balanced translocations or rearrangements. But someone else should comment on why this is so important!

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-15 21:09:13

*Thread Reply:* For larger indels, you would have to use SV tools such as listed here : https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5

I’ve had experience with pindel and breakdancer, whose small nuances let you distinguish things like transposable elements from large indels.

Genome Biology

Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing

Background Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SVs with high precision and high recall. Results We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potential good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SVs. The results demonstrate that both the precision and recall for overlapping calls vary depending on the combinations of specific algorithms rather than the combinations of methods used in the algorithms. Conclusion These results suggest that careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve the SV detection accuracy.

Original URL: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5

Sage Hawn (shawn1@bu.edu)

2021-06-16 09:15:06

*Thread Reply:* Thanks to you all - very helpful!

🙌 Kumar Veerapen

Jet Termorshuizen (jet.termorshuizen@ki.se)

2021-06-16 04:00:12

Hi! I might be completely missing the point here, but... How do PLINK (http://zzz.bwh.harvard.edu/plink/) and Hail relate to each other?

Mark Adams (mark.adams@ed.ac.uk)

2021-06-16 04:58:46

*Thread Reply:* They are both tools for managing and analysing genetic data. The difference is that PLINK is a standalone program that you run on the command line. Hail, in contrast, is a programming library that you use by writing Python code.

Jet Termorshuizen (jet.termorshuizen@ki.se)

2021-06-16 05:17:04

*Thread Reply:* Okay! And is there any recommendation of which tool is better? Or is that personal preference? For example, I believe the PGC (Psychiatric Genomics Consortium) mostly uses PLINK.

Mark Adams (mark.adams@ed.ac.uk)

2021-06-16 05:21:35

*Thread Reply:* PLINK is easier to learn and get started with, and works well with genotyping data on hundreds of thousands of samples and 10s of millions of variants.

Mark Adams (mark.adams@ed.ac.uk)

2021-06-16 05:24:21

*Thread Reply:* Hail requires more background to use, since you need to know Python, but it is ultimately more flexible since if it doesn't have the analysis you want to do, you can program it yourself (compared with PLINK, which can only do whatever commands it has built-in). Hail also scales better to sequencing data with 100s of millions of variants on millions of samples, since it can use cloud computing.

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 05:24:25

*Thread Reply:* The two tools have some overlap in functionality, but very different goals. PLINK is designed to be a set of prepackaged modules that work fast and well on genotype data (or well-QCed genotype calls that originally came from sequencing data).

PLINK only handles hard-call biallelic genotypes or genotype dosages. Sequencing data comes with more data (and more problems!) like allelic depth, genotype quality, and multiallelic variants.

Hail doesn’t have nearly as much pre-packaged functionality as PLINK does (you’ll see today, using Hail involves programming to ask the questions you want), but the expressiveness of Hail is required to handle sequencing data well, because every sequencing dataset has slightly different analysis needs.

As to which tool is “better” — that will depend on your application. My general rule of thumb is that if PLINK works for your project, use it! But if you are dealing with sequencing data or large datasets with genotype data (UKBB, for instance), Hail may offer a better experience.

👍 Mark Adams, Jet Termorshuizen

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 05:24:49

*Thread Reply:* Thanks Mark, those are great answers!

Mark Adams (mark.adams@ed.ac.uk)

2021-06-16 05:26:43

*Thread Reply:* As an analogy, using PLINK is a bit like manipulating data using shell tools like awk and sort whereas Hail is more like manipulating data in R using dplyr.

🙌 Tim Poterba (he/him)

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 05:33:20

*Thread Reply:* (Hail’s interfaces are heavily inspired by dplyr!)

Jet Termorshuizen (jet.termorshuizen@ki.se)

2021-06-16 05:55:29

*Thread Reply:* Thanks for the clarifications!

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-16 06:41:28

*Thread Reply:* I love this entire discussion! Tim will later be talking a little bit about this in our intro/primer to your practicals 🙂

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 06:48:00

Here are the instructions for connecting to the Hail workshop service! You can log in now, but please don’t start until you’re in breakout rooms so that everyone is going through materials together.

Hail workshop system.pdf

Cotton Seed (cseed@broadinstitute.org)

2021-06-16 08:06:13

There have been a few questions about the meaning of the VCF fields (GT, AD, DP, etc.) Here is a description of those fields: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format

GATK

VCF - Variant Call Format

This document describes "regular" VCF files produced for GERMLINE short variant (SNP and indel) calls (e.g. by HaplotypeCaller in "normal" mode and by GenotypeGVCFs). For inform...

Original URL: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format

Cotton Seed (cseed@broadinstitute.org)

2021-06-16 08:06:42

See section 5, "Interpreting genotype and other sample-level information"
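As a rough illustration of how those per-sample fields fit together, here is a minimal sketch that splits one sample column of a VCF line using its FORMAT string. The function name `parse_sample` and the example values are assumptions made for this sketch; it only handles GT, AD, DP and GQ, and the GATK page linked above remains the authoritative description.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (e.g. GT:AD:DP:GQ) with one sample's values."""
    entry = dict(zip(format_field.split(":"), sample_field.split(":")))
    parsed = {}
    # GT: allele indices (0 = reference, 1 = first alternate);
    # "/" separates unphased alleles, "|" phased alleles, "." is missing.
    if "GT" in entry:
        sep = "|" if "|" in entry["GT"] else "/"
        parsed["GT"] = [None if a == "." else int(a)
                        for a in entry["GT"].split(sep)]
    # AD: number of reads supporting each allele, reference first.
    if entry.get("AD", ".") != ".":
        parsed["AD"] = [int(x) for x in entry["AD"].split(",")]
    # DP: total read depth at the site; GQ: genotype quality.
    for key in ("DP", "GQ"):
        if entry.get(key, ".") != ".":
            parsed[key] = int(entry[key])
    return parsed

# Hypothetical heterozygous call: 12 reference reads, 9 alternate reads.
call = parse_sample("GT:AD:DP:GQ", "0/1:12,9:21:99")
alt_fraction = call["AD"][1] / sum(call["AD"])
print(call, round(alt_fraction, 2))
```

For a heterozygous call, the fraction of alternate reads would be expected to sit somewhere near one half, which is one reason AD is useful alongside GT.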

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 08:31:11

Many groups have questions about the SNP count exercises.

Question 1 - The reason why C/T and G/A SNPs occur at the same frequency is that they are the same SNP read from different directions! The ‘reference’ strand is arbitrary, so we should see each of these with the same frequency.

Question 2 - The various base mutations (C>T, T>A, etc) happen with different frequency due to the biochemistry of the nucleotides themselves. In particular, C>T mutations happen because when C nucleotides are methylated by nuclear machinery (called Cp), these bases look a lot like T nucleotides, and replication errors from Cp > T happen at a higher frequency than any other substitution.
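The point in Question 1 can be checked with a small sketch: complementing both alleles turns a C/T SNP into a G/A SNP, so the two labels describe the same change seen from opposite strands. The list of SNPs and the helper name `canonical_snp` below are invented for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def canonical_snp(ref, alt):
    """Collapse a SNP and its opposite-strand equivalent into one label."""
    forward = (ref, alt)
    reverse = (COMPLEMENT[ref], COMPLEMENT[alt])
    # Picking the alphabetically smaller pair is an arbitrary but
    # consistent way to choose one representative.
    return min(forward, reverse)

# Hypothetical (ref, alt) pairs as they might appear in a call set.
snps = [("C", "T"), ("G", "A"), ("C", "T"), ("A", "G"),
        ("T", "C"), ("G", "T"), ("C", "A")]

counts = Counter(canonical_snp(ref, alt) for ref, alt in snps)
print(counts)
```

After collapsing, the twelve possible SNP labels reduce to six classes, each combining a change with its strand-flipped twin.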

Daniel Howrigan (howrigan@atgu.mgh.harvard.edu)

2021-06-16 08:34:37

Quick descriptor of the ti/tv ratio: https://genome.sph.umich.edu/wiki/SNP_Call_Set_Properties

Daniel Howrigan (howrigan@atgu.mgh.harvard.edu)

2021-06-16 08:35:50

Also wikipedia: https://en.wikipedia.org/wiki/Transversion#Ratio_of_transitions_to_transversions

Wikipedia (https://en.wikipedia.org/)

Transversion

Transversion, in molecular biology, refers to a point mutation in DNA in which a single (two ring) purine (A or G) is changed for a (one ring) pyrimidine (T or C), or vice versa. A transversion can be spontaneous, or it can be caused by ionizing radiation or alkylating agents. It can only be reversed by a spontaneous reversion.

Original URL: https://en.wikipedia.org/wiki/Transversion#Ratio_of_transitions_to_transversions
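The ti/tv ratio itself is simple to compute once each SNP is classified: a transition swaps purine for purine (A, G) or pyrimidine for pyrimidine (C, T), and a transversion crosses between the two groups. The sketch below assumes a made-up list of SNPs; the function names are illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True when both alleles are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snps):
    """Number of transitions divided by number of transversions."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# Hypothetical call set: four transitions and two transversions.
snps = [("C", "T"), ("G", "A"), ("A", "G"), ("T", "C"),
        ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(snps)
print(ratio)
```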

Elinor Bridges (e.c.bridges@sms.ed.ac.uk)

2021-06-16 08:49:42

Apologies if I missed this during the session: how long will we have access to the notebook, so that we can have a look at the practical again later?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 08:50:20

*Thread Reply:* we’re going to answer that when rooms close. We’ll keep the rooms open for 2h or so, but if you want to run the practical after that, we can help you install Hail on your own computer, and you can download the materials!

Elinor Bridges (e.c.bridges@sms.ed.ac.uk)

2021-06-16 08:52:46

*Thread Reply:* Great, thanks!

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-16 09:06:49

@Isabella Loft If you could share the link to improving your knowledge?

Isabella Loft (ilof@regionsjaelland.dk)

2021-06-16 09:18:04

*Thread Reply:* So I don't really have any good links; it is more knowledge built up by working with Python data structures, and a lot of boring parsing of files over the years.

But the MatrixTable reminds me a bit of a nested dictionary visualised as a big table: it consists of a lot of nested dictionaries, where the value for a key can also be a separate table structure. The MatrixTable is visualised way better than raw data printed in Python. Also, dictionaries in Python are not always the easiest to get your head around, and not always that efficient.

So here is the result of a quick Google search describing nested dictionaries in Python.

programiz.com

Python Nested Dictionary (With Examples)

In this article, you’ll learn about nested dictionary in Python. More specifically, you’ll learn to create nested dictionary, access elements, modify them and so on with the help of examples.

Original URL: https://www.programiz.com/python-programming/nested-dictionary#:~:text=In%20Python%2C%20a%20nested%20dictionary,dictionaries%20into%20one%20single%20dictionary.&text=Here%2C%20the%20nested_dict%20is%20a,having%20own%20key%20and%20value.
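To make the nested-dictionary analogy concrete, here is a toy picture of a MatrixTable's four parts (globals, row fields, column fields, entry fields) built from ordinary Python dictionaries. This is only an analogy: Hail does not store data this way, and every key, variant, sample and field value below is hypothetical.

```python
# A toy, dictionary-based picture of a MatrixTable (an analogy only).
mt_like = {
    "globals": {"dataset": "toy_example"},
    "rows": {  # one record per variant
        "1:1000:A:G": {"gene": "GENE_A"},
        "1:2000:C:T": {"gene": "GENE_B"},
    },
    "cols": {  # one record per sample
        "sample_1": {"is_case": True},
        "sample_2": {"is_case": False},
    },
    "entries": {  # one record per (variant, sample) pair
        ("1:1000:A:G", "sample_1"): {"GT": (0, 1), "DP": 21},
        ("1:1000:A:G", "sample_2"): {"GT": (0, 0), "DP": 30},
        ("1:2000:C:T", "sample_1"): {"GT": (1, 1), "DP": 18},
        ("1:2000:C:T", "sample_2"): {"GT": (0, 1), "DP": 25},
    },
}

def alt_allele_count(table, variant):
    """Sum alternate alleles across samples (biallelic sites only)."""
    return sum(sum(entry["GT"])
               for (v, _sample), entry in table["entries"].items()
               if v == variant)

def mean_depth(table, sample):
    """Average DP across all variants for one sample."""
    depths = [entry["DP"]
              for (_variant, s), entry in table["entries"].items()
              if s == sample]
    return sum(depths) / len(depths)

print(alt_allele_count(mt_like, "1:2000:C:T"), mean_depth(mt_like, "sample_1"))
```

Aggregating over one axis of the entries (per variant or per sample) is the part of the analogy that carries over best to how a matrix of genotypes is summarised.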

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-16 09:27:44

*Thread Reply:* Thank you for sharing!!! ❤

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 09:07:12

Breakout rooms and the workshop service will remain open for the next 2 hours!

Anna Furtjes (anna.furtjes@kcl.ac.uk)

2021-06-16 09:47:54

The practical session was great! Very helpful, thank you 🙂

Giulio Centorame (giulio.centorame@outlook.it)

2021-06-16 10:19:02

Hi, I might be missing this, but how do we download the files from the practical?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 10:36:31

*Thread Reply:* The notebooks and data we used today are here: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/resources

GitHub

mkveerapen/2021_IBG_Hail

Contribute to mkveerapen/2021_IBG_Hail development by creating an account on GitHub.

Original URL: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/resources

😍 Giulio Centorame, Kumar Veerapen

Giulio Centorame (giulio.centorame@outlook.it)

2021-06-16 10:37:14

*Thread Reply:* Great, thank you!

Emilie Hegelund (emhe@sund.ku.dk)

2021-06-17 02:48:26

*Thread Reply:* Would it be possible for you to upload pdf versions of the two notebooks with all the code printed so we can see the plots? An answer sheet to check all our answers would also be very helpful 🙂

👍 Abigail ter Kuile

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-17 05:47:32

*Thread Reply:* Here are PDF versions with all code run: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/Materials/outputPDF

They don’t have the plots, though — the plots are interactive and work best as HTML. The GWAS tutorial on the Hail website (https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html) has examples of similar workflows, though due to a bug the code output/plots are not showing up right now — if you check back in a few days it should be fixed.

GitHub

mkveerapen/2021_IBG_Hail

Contribute to mkveerapen/2021_IBG_Hail development by creating an account on GitHub.

Original URL: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/Materials/outputPDF

Emilie Hegelund (emhe@sund.ku.dk)

2021-06-17 05:50:51

*Thread Reply:* Thank you! Would it also be possible to upload an answer sheet with answers to all the questions in the two practicals?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-17 05:52:14

*Thread Reply:* yes, that’s on my to-do list for today! 🙂

👍 Emilie Hegelund, Giulio Centorame

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-17 14:55:03

*Thread Reply:* Answer sheets are up: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/resources

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 11:33:27

Hi all, we’re going to shut down the notebooks from session A in 10 minutes! The materials are available here: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/resources

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-16 14:25:02

Results from this AM’s poll. Excited to see everyone for session B! 🙂

How did you feel about the Hail practical?.png

What is your academic background?.png

Where are you from?.png

Which section did you feel best about?.png

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 14:27:08

you should add [clinical] psychology to the 2nd poll for session B 😉

👍 Michel Nivard

😂 Michel Nivard, Kumar Veerapen

Rob Kirkpatrick (robert.kirkpatrick@vcuhealth.org)

2021-06-16 14:27:27

*Thread Reply:* I agree with Tim.

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 14:27:42

*Thread Reply:* maybe “social science” too

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-16 14:36:54

*Thread Reply:* Lol I just added “Psychology”

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 16:03:01

Instructions for connecting to the notebook server for today’s practical: https://boulder-workshop.slack.com/archives/C0201TSKBNE/p1623847680019900

Tim Poterba (he/him) (https://boulder-workshop.slack.com/team/U01Q7P11LKH)

Here are the instructions for connecting to the Hail workshop service! You can log in now, but please don’t start until you’re in breakout rooms so that everyone is going through materials together.

Original URL: https://boulder-workshop.slack.com/archives/C0201TSKBNE/p1623847680019900

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 17:43:12

Some followup information about the HWE exercise — some questions have come up about how there could be sites with 100% of samples called heterozygous. What error mode could lead to this?

Here is one of these sites pulled up on the gnomAD browser: https://gnomad.broadinstitute.org/variant/1-125165544-G-T?dataset=gnomad_r3

We have theorized that a site could have 100% heterozygotes if there is a read mapping error upstream. Suppose that there are two 500-base-pair regions in different parts of the genome which are identical aside from a single base (one has a G, one has a T). When we use short-read sequencing, we’ll get some reads from both of these regions. If we align them to the same place in the reference genome (this is wrong), we’ll get a variant where every sample is a heterozygote. But there’s a way we can detect this — samples will have too many reads. The samples in gnomAD have mean depth 30, but the distribution for this site looks very different — most samples have more than 100 reads at this site!

image.png
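A quick way to see why an all-heterozygote site fails HWE is to compare observed genotype counts with the counts expected from the allele frequency. The sketch below uses a plain chi-square statistic and an invented sample of 100 individuals; it is not the test Hail runs, just the arithmetic behind the intuition.

```python
def hwe_expected(n_hom_ref, n_het, n_hom_alt):
    """Expected genotype counts under Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p                              # alternate allele frequency
    return (p * p * n, 2 * p * q * n, q * q * n)

def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square statistic comparing observed and expected counts."""
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = hwe_expected(n_hom_ref, n_het, n_hom_alt)
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)

# Hypothetical site where all 100 samples are called heterozygous.
expected = hwe_expected(0, 100, 0)
chi2 = hwe_chi_square(0, 100, 0)
print(expected, chi2)
```

With every sample heterozygous the allele frequency is 0.5, so HWE predicts only half the samples to be heterozygous and a quarter in each homozygous class, which the observed counts miss by a wide margin.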

Jay Ross (jay.ross@mail.mcgill.ca)

2021-06-16 17:57:39

Thank you for the Hail tutorial, it was very nice to use. Because our lab only has access to a slurm-based scheduler cluster, is it completely impossible to install hail on this type of computing cluster? It would be great to integrate hail into our workflows, but we will not have access to a spark cluster.

⬆ Zoe Schmilovich

Zoe Schmilovich (zoe.schmilovich@mail.mcgill.ca)

2021-06-16 17:58:33

*Thread Reply:* I’m also wondering! Thank you!

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 17:58:44

*Thread Reply:* It’s possible but quite hard to run Spark (the way Hail runs on multiple nodes right now) on Slurm. However, it’s totally possible to run Hail on a single big node (32 cores or more) to do some heavy lifting. I think you could analyze 1000WGS in this model with no problem.

Jay Ross (jay.ross@mail.mcgill.ca)

2021-06-16 18:01:53

*Thread Reply:* okay perfect! We have access (on Compute Canada clusters) to nodes with up to 48 cores and 752G memory. Will definitely check with our team about working with it.

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 18:02:06

*Thread Reply:* that’s a boatload of memory 😱

Jay Ross (jay.ross@mail.mcgill.ca)

2021-06-16 18:03:41

*Thread Reply:* I think we would use a large proportion of our yearly allocation if we took the whole node there, can't say I've ever requested all of that!

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 18:07:58

*Thread Reply:* To be clear, 48 cores would be great and help your pipelines speed along, but you don’t need a huge amount of memory to run Hail. Something like 4-6G per core should be plenty.

👍 Zoe Schmilovich

Jay Ross (jay.ross@mail.mcgill.ca)

2021-06-16 18:08:55

*Thread Reply:* but it's not really possible to distribute these jobs across different nodes, right? Is that the limitation of slurm vs spark?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 18:10:41

*Thread Reply:* correct.

Mark Adams (mark.adams@ed.ac.uk)

2021-06-17 02:23:52

*Thread Reply:* with 48 cores would the invocation command be hail.init(local="local[48]") ?

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-17 05:48:52

*Thread Reply:* I think master="local[48]", though it’s possible your line would work as well. If you leave this argument out, the default is local[*], which uses all available cores.
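Putting the thread together, a single-node setup might look like the sketch below. It assumes the `master` keyword suggested above, and the helper names `local_master`, `memory_range_gb` and `init_hail_single_node` are invented for this example; check the `hl.init` documentation for your Hail version before relying on it.

```python
def local_master(n_cores=None):
    """Build a Spark master string for single-node mode."""
    if n_cores is None:
        return "local[*]"  # all available cores
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return f"local[{n_cores}]"

def memory_range_gb(n_cores, low_per_core=4, high_per_core=6):
    """Rough memory request, using the 4-6G per core guidance above."""
    return (n_cores * low_per_core, n_cores * high_per_core)

def init_hail_single_node(n_cores=None):
    """Start Hail on one machine; needs Hail installed to run."""
    # Imported here so the helpers above work without Hail installed.
    import hail as hl
    hl.init(master=local_master(n_cores))
    return hl

print(local_master(48), memory_range_gb(48))
# To start Hail on a 48-core node:
# hl = init_hail_single_node(48)
```

On a 48-core node this suggests requesting roughly 192 to 288G rather than the full 752G mentioned earlier in the thread.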

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 18:01:03

Notebooks and files for today’s practical are here: https://boulder-workshop.slack.com/archives/C0201TSKBNE/p1623861391032200?thread_ts=1623860342.031900&cid=C0201TSKBNE

Tim Poterba (he/him) (https://boulder-workshop.slack.com/team/U01Q7P11LKH)

The notebooks and data we used today are here: https://github.com/mkveerapen/2021_IBG_Hail/tree/main/resources

Original URL: https://boulder-workshop.slack.com/archives/C0201TSKBNE/p1623861391032200?thread_ts=1623860342.031900&cid=C0201TSKBNE

👍 Zoe Schmilovich

Katerina Zorina-Lichtenwalter (kazo7929@colorado.edu)

2021-06-16 18:05:04

thanks very much for the lectures and tutorials! Hail is indeed a nifty platform to do many analyses, QC steps, and visualisations in one place. I am pretty impressed. One minor note is that I might consider renaming the "impute" parameter in the hl.import_table function to something else, like "infer", because it sounds very much like imputing missing values (at least in the genetics context)!

Tim Poterba (he/him) (tpoterba@broadinstitute.org)

2021-06-16 18:07:21

*Thread Reply:* Thanks for this feedback! This is actually already on our list of things to change when we make a “breaking” version change (0.2 => 0.3 or 0.2 => 1.0). We can’t change it right now or it will break pipelines of existing users!

Katerina Zorina-Lichtenwalter (kazo7929@colorado.edu)

2021-06-16 21:07:02

*Thread Reply:* sounds good. thanks!
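For anyone puzzled by the name, "impute" in `hl.import_table` refers to inferring each column's type from the text in the file, not to imputing missing genotypes. The toy below shows the general idea in plain Python; the rules, the column names and the choice of "NA" as the missing marker are assumptions of this sketch, and Hail's actual inference may differ.

```python
MISSING = {"", "NA"}  # strings treated as missing in this sketch

def infer_type(values):
    """Pick the narrowest type that fits every non-missing string."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "str"
    for name, caster in (("int", int), ("float", float)):
        try:
            for v in present:
                caster(v)
        except ValueError:
            continue  # this type does not fit; try the next one
        return name
    return "str"

# Hypothetical columns as they would be read from a text file.
columns = {
    "sample_id": ["s1", "s2", "s3"],
    "age": ["34", "NA", "51"],
    "height_m": ["1.70", "1.82", "NA"],
}
inferred = {name: infer_type(values) for name, values in columns.items()}
print(inferred)
```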

Kumar Veerapen (veerapen@broadinstitute.org)

2021-06-17 08:06:10

We have updated the github repo containing all the materials used for teaching: https://github.com/mkveerapen/2021_IBG_Hail. We also included a link at the bottom if you’d like to download everything as a tarball.

It was fantastic to have such an engaged community to share an excellent teaching and learning experience with. Thank you everyone. We hope that you enjoyed yesterday as much as we did! Please keep us posted on your adventures in Hail on either hail.zulipchat.com or discuss.hail.is.

Lastly, we would also like to share the screenshots from Session B’s live polls in the following message.

GitHub

mkveerapen/2021_IBG_Hail

Contribute to mkveerapen/2021_IBG_Hail development by creating an account on GitHub.