@Jeff Lessem (he/him) has renamed the channel from "rare-saige" to "day09-rare-saige"
Hi @channel, excited to see you tomorrow in the rare+SAIGE session! We will try 4 methods, one for each of the 4 videos, to perform genetic association tests for binary phenotypes. We will use RStudio to run the commands. Here is the material for tomorrow’s practical: https://github.com/weizhou0/ISGW_rare_SAIGE_hands_on/wiki/Day-9-Rare-and-SAIGE Please feel free to post any questions in this Slack channel.
Hello! Will the lecture slides be made available? Thank you!
*Thread Reply:* Hi! They will be put on the website on the day’s page shortly. Thanks!
*Thread Reply:* Parts 2-4 are up, and part 1 will be added when it's available to me.
Hi! In the GWAS in large-scale biobanks and cohorts lecture, one of the limitations is that ‘asymptotic approaches were used to achieve scalability for large data sizes, whose performance may be poor when sample sizes are too small’. I was wondering how small is ‘too small’ and if you could elaborate on why this is the case?
*Thread Reply:* Hi, we have tried sample sizes as low as 1000 in the UKBB data and it still works fine. It depends on how heavy the sample relatedness is in the data set.
@Wei Zhou I am trying to run Step 2 (set-based association tests) of Part 4 and get the error below. However, when I look at this location in my home drive the SKAT.so file is sitting there. Everything has worked up to there. How do I fix this?
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '/home/penell44/R/x86_64-pc-linux-gnu-library/4.0/SKAT/libs/SKAT.so':
  libR.so: cannot open shared object file: No such file or directory
Calls: SPAGMMATtest ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Timing stopped at: 0.007 0 0.014
Execution halted
*Thread Reply:* Hi, did you try to install SAIGE on the cluster or directly call the singularity?
*Thread Reply:* I ran this using the singularity
*Thread Reply:* All good - it worked this morning.
Hi! When is a case-control ratio considered "unbalanced", and is it suitable to use the saddlepoint approximation (SPA) test? Is the border a ratio of 1:5?
*Thread Reply:* Usually we start seeing inflation when the case-control ratio is < 1:10
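To make the 1:10 rule of thumb from the reply above concrete, here is a minimal Python sketch. The function names and the example counts are made up for illustration; only the 1:10 threshold comes from the thread, and it is a rough guide, not a hard cutoff.

```python
def case_control_ratio(n_cases, n_controls):
    """Return cases per control, e.g. 0.1 for a 1:10 ratio."""
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("need at least one case and one control")
    return n_cases / n_controls


def is_unbalanced(n_cases, n_controls, threshold=1 / 10):
    """True when the ratio falls below the rule-of-thumb threshold (1:10)."""
    return case_control_ratio(n_cases, n_controls) < threshold


# Hypothetical sample counts, for illustration only
print(is_unbalanced(1000, 5000))   # 1:5   -> False
print(is_unbalanced(500, 50000))   # 1:100 -> True
```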
Hi, I am very intrigued by the Phecodes, would you mind expanding a little bit on what they are and what is the difference from manually-constructed phenotypes? Can you list some examples?
*Thread Reply:* Phecodes are a curated database that helps map ICD codes to diseases https://phewascatalog.org/phecodes_icd10
*Thread Reply:* Here is a paper on phecodes https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175508
Thanks for the great lectures! What are the advantages of using SAIGE for GWAS of common variants in binary phenotypes over the more recently developed methods Regenie and FastGWA-GLMM?
*Thread Reply:* This is a great question. These methods have pros and cons, and it would be nice to compare them systematically in different scenarios. SAIGE uses Average Information REML to fit the null logistic mixed model, which is different from what Regenie uses. Regenie improves computational efficiency by analyzing multiple phenotypes together, which requires imputing missing phenotypes so that all phenotypes are analyzed on the same samples; this may not be the ideal approach for some data sets. FastGWA-GLMM fits the null logistic mixed model using a sparse GRM instead of a full GRM, which is certainly much faster. This works well for data sets with light sample relatedness, such as UKBB, but for data with very heavy sample relatedness, using a sparse GRM is not quite feasible. BTW, SAIGE can also fit the null model using a sparse GRM with the argument --useSparseGRMtoFitNULL
Hi, in the lectures you mention sparse GRM matrix? How can you obtain these? Any references on that topic?
*Thread Reply:* SAIGE-GENE has the step 0 script to generate a sparse GRM https://github.com/weizhou0/ISGW_rare_SAIGE_hands_on/wiki/Part-4-SAIGE-GENE#step-0-creating-a-sparse-grm There are also other programs that can be used to generate GRM, such as GCTA and KING https://kingrelatedness.com/
*Thread Reply:* Thank you for the links 😊 , are there also papers that describe sparse GRM matrices in more detail?
*Thread Reply:* A sparse GRM is one in which off-diagonal values of pihat that are small enough (e.g., < .05) are set to 0. I think it was first described in Zaitlen et al (2013) PLoS Genetics paper.
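The thresholding described in the reply above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SAIGE, GCTA, or KING build a sparse GRM; the function name, the toy matrix, and the default cutoff of 0.05 (taken from the "e.g., < .05" in the reply) are assumptions.

```python
def sparsify_grm(grm, cutoff=0.05):
    """Return a copy of a square GRM with small off-diagonal entries set to 0.

    grm    : list of lists (n x n), pairwise relatedness estimates
    cutoff : off-diagonal values below this are zeroed (assumed 0.05)
    """
    n = len(grm)
    sparse = [row[:] for row in grm]  # copy so the input is left untouched
    for i in range(n):
        for j in range(n):
            if i != j and grm[i][j] < cutoff:
                sparse[i][j] = 0.0
    return sparse


# Toy 3-sample GRM (made-up values): samples 0 and 1 look like first-degree
# relatives, sample 2 is unrelated to both.
toy_grm = [
    [1.00, 0.50, 0.01],
    [0.50, 1.02, -0.02],
    [0.01, -0.02, 0.98],
]
for row in sparsify_grm(toy_grm):
    print(row)
```

Only the 0-1 pair keeps a non-zero off-diagonal entry, which is why a sparse GRM is cheap to store and work with when most pairs are unrelated.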
Hi! I have a question related to the practical. You have probably covered this, but do you mind elaborating a little more on what is meant/happening when we “call the singularity container of SAIGE/SAIGE-GENE”?
*Thread Reply:* The brief answer is when you call singularity you are setting up a special environment designed to run SAIGE.
The longer answer is below.
There are some definitions to get out of the way first. A virtual machine is a computer that is emulated by another computer. For example, it is a way to run Linux within your Windows computer.
A container is sort of a mini-virtual machine. It is like a zip file which contains all of the files and programs necessary to do some task.
SAIGE depends on certain versions of python and R, so it is easiest to install as a container, so it doesn't interfere with other things that require different versions of python and R.
"singularity" is just one method of running containers. Perhaps you've heard of docker or kubernetes as other methods of running containers.
So when you call the singularity container for SAIGE, what you're doing is "booting" another computer (which is just emulated in the computer you're logged into) that is running a system set up in the special way necessary to run SAIGE.
I don't know if that is more or less or different than what you wanted to know.
Some more info on inverse normalization which you may have noticed was recommended for quantitative traits: • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2921808/ • https://cran.r-project.org/web/packages/RNOmni/vignettes/RNOmni.html
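For anyone who wants to see what rank-based inverse normalization does mechanically, here is a minimal Python sketch. It assumes the Blom-style offset of 3/8, average ranks for ties, and no missing values; the function name and the toy data are made up, and for real analyses the RNOmni package linked above is the better choice.

```python
from statistics import NormalDist


def rank_based_int(values, offset=3 / 8):
    """Rank-based inverse normal transform of a list of numbers.

    Each value is replaced by the standard normal quantile of
    (rank - offset) / (n - 2 * offset + 1), with ties given average ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


# Toy skewed trait (made-up values); the outlier 250.0 is pulled in
skewed = [3.1, 10.0, 2.2, 250.0, 4.7]
print(rank_based_int(skewed))
```

The transform keeps the ordering of the samples but discards the original scale, so the outlier ends up only as far out as the largest of five normal quantiles would be.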
I was wondering if you could please share a reference discussing that heritability estimates from LMMs are not accurate, but that it's okay for genetic correlations? Thank you 🙂
*Thread Reply:* I haven’t found a published paper that discusses it. The notes on the LDSC wiki page https://github.com/bulik/ldsc/wiki mention it
hello! if I am performing a logistic regression, is it recommended that I account for relatives in my sample by adding family ID as a random variable (ie.: (1|FID)) or to include the GRM as a variable? thank you!
*Thread Reply:* Hello! It depends on what sample relatedness you’d like to account for. Using (1|FID) accounts for sample relatedness within families, while the GRM accounts for relatedness between every pair of samples in the data, whether or not the samples are in the same family.
I am wondering why the theta output in Step1 of SAIGE is not a good estimate of heritability even though it estimates the variance in the phenotype explained by the GRM. And then, for what purpose may it be used?
*Thread Reply:* Great question! Tau (stored as theta in the SAIGE output) is a vector with 2 elements. The first element is the variance component parameter for the error term and the second is the one for the GRM (genetic relationship matrix). Tau can be extracted from the SAIGE null model results in R with load("model.rda"); tau = modglmm$theta
For quantitative traits from the linear mixed model, h2 = tau[2]/(tau[1]+tau[2])
For binary traits from the logistic mixed model (tau[1] is always 1), h2_liability = tau[2]/(tau[2]+pi^2/3)
But note that this heritability is the point estimate of the proportion of phenotypic variance explained by the GRM, which is not the same as the heritability estimated by LDSC. Also, we have noticed that SAIGE underestimates h2 for binary traits: the penalized quasi-likelihood that SAIGE uses to fit the null logistic mixed model is known to be biased for heritability estimation, but it works well for adjusting for sample relatedness.
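As a quick worked version of the two formulas above, here is a small Python sketch. The function names and the tau values are made up for illustration; note that Python lists are 0-indexed, so tau[0] and tau[1] here correspond to tau[1] and tau[2] in the R notation.

```python
import math


def h2_quantitative(tau):
    """Linear mixed model: h2 = tau_GRM / (tau_error + tau_GRM)."""
    tau_error, tau_grm = tau
    return tau_grm / (tau_error + tau_grm)


def h2_liability_binary(tau):
    """Logistic mixed model: h2 on the liability scale.

    The error variance component is fixed at 1 and is not used; the
    residual variance of the logistic distribution, pi^2 / 3, takes its place.
    """
    _, tau_grm = tau
    return tau_grm / (tau_grm + math.pi ** 2 / 3)


# Hypothetical variance components, not from any real SAIGE run
print(h2_quantitative([0.6, 0.4]))
print(h2_liability_binary([1.0, 0.5]))
```

The same caveats from the thread apply: these are point estimates of variance explained by the GRM, and the binary-trait value is expected to be an underestimate.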
*Thread Reply:* hmmmm. I'll have to think more about this 🙂 It's comparable to the h2 estimated in BOLT-LMM, right? which is also not a good estimator of h2?
*Thread Reply:* Yes, sorry, I forgot to mention that for quantitative traits using linear mixed models, the heritability estimates in SAIGE and BOLT-LMM are the same. But for binary phenotypes using logistic mixed models, the h2 in SAIGE is underestimated
*Thread Reply:* ahh, ok. Got it!