---
title: "SNP heritability and ascertainment"
author: "Michel G Nivard"
format:
  revealjs:
    transition: slide
    navigation-mode: vertical
    theme: [default, custom.scss]
    highlight-style: "nord"
---

# Logistics

`cp -r /faculty/michel/2023/practical/ .`

# learning goals

- Be able to assess the limits of heritability in the context of our field
- Gain familiarity with SNP heritability and the logic behind LD score regression
- Appreciate the influence of sampling and measurement (ascertainment) on our estimates.

# What is heritability?

- Briefly discussed Monday in [**What Genetics has taught us about life** (Nick)]{.blue} and [**Biometrical Model/Genome and its secrets** (Ben)]{.blue}
- Briefly discussed in [**Family-based association**]{.blue} on Tuesday

## What *is* heritability?

- Expressed as the ratio of the additive genetic variance in a trait to the total variance of the trait.
- This is the narrow-sense heritability, which is enough for today

## Key nuances related to h2

- Depends on population (Loic on Tuesday and Ben on Monday)
- Doesn't always imply biology!

## Heritability is dependent on population, time, age

- The strongest GWAS hit for lung cancer is in a nicotine receptor subunit gene.
- Would the nicotine receptor subunit gene have been a lung cancer hit if we had had a UKB in the year 1300?
- Will it be in the year 2100 if smoking rates are near 0?

## Heritability doesn't always imply biology

- If lung cancer were 100% caused by smoking (it isn't), would it be heritable?
- Would this heritability imply biology of lung cancer?
- How sure are you that your trait of interest has substantial heritability that is orthogonal to heritable environmental causes?
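## Heritability doesn't always imply biology: a simulation sketch

A minimal sketch of the thought experiment above, under stated assumptions: a single hypothetical genotype affects smoking, and lung cancer liability depends only on smoking plus non-genetic noise. All effect sizes and variable names are invented for illustration.

``` r
set.seed(1)
n <- 100000

# Hypothetical standardized genotype (allele frequency 0.3 is an assumption)
g <- as.numeric(scale(rbinom(n, size = 2, prob = 0.3)))

# The genotype influences smoking (assumed effect size 0.3)
smoking <- 0.3 * g + rnorm(n, sd = sqrt(1 - 0.3^2))

# Liability depends on smoking and noise only, never directly on g
liability <- 0.5 * smoking + rnorm(n, sd = sqrt(1 - 0.5^2))

# Variance in liability "explained" by the genotype
r2_marginal <- summary(lm(liability ~ g))$r.squared

# Effect of the genotype once smoking is accounted for
b_direct <- unname(coef(lm(liability ~ g + smoking))["g"])

r2_marginal
b_direct
```

In this setup the genotype explains a small share of the variance in liability, yet its effect is close to zero once smoking is in the model: the trait is heritable without any direct biology of lung cancer.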
## narrow sense heritability definition in GWAS context

The standardized phenotype ($y$) is a sum of the effects of $n$ standardized (mean = 0, sd = 1) genotypes ($g$) and the environment $e$:

$$y = \sum_{i=1}^{n} b_{i}*g_{i} + e$$

## narrow sense heritability definition

The additive genetic variance then equals the sum of the squared effects:

$$\sigma^2_a = \sum_{i=1}^{n} b^2_{i}$$

$$h^2 = \sigma^2_a / (\sigma^2_a + \sigma^2_e)$$

# What is [SNP]{.yellow} heritability?

- Briefly discussed Monday in [**What Genetics has taught us about life**]{.blue}
- The proportion of trait variance explained by the genetic variation that is measured or tagged by SNPs on genotyping chips.
- Poorly covers rare and structural (CNVs etc.) genetic variation.

## What is [SNP]{.yellow} heritability?

$$y = \sum_{i=1}^{n} b_{i}*g_{i} + e$$

$$\sigma^2_a = \sum_{i=1}^{n} ? * b^2_{i}$$

Where $?$ is an unknown loss of precision, because we will not always measure, or tag by LD, the true causal variants.

## What is [SNP]{.yellow} heritability?

- What do we know about the loss of precision?
- We tag rare(r) variants less well
- We tag CNVs and other structural variants less well
- It's not as simple as rare vs common though

## estimating [SNP]{.yellow} heritability: LD Score regression

- In GWAS we estimate a form of this regression:
- [$trait = \hat{b}_{0} + \hat{b}_{snp} * SNP + error$]{.blue}
- This gives us $\hat{b}_{snp}$, an estimate of the true effect of differences in allele count at this SNP, $b_{snp}$. The little hat marks the estimate, which can differ from the true value.
- Can you come up with systematic reasons $b_{snp}$ and $\hat{b}_{snp}$ differ?

## beta, and beta hat...

if our SNP has "LD buddies" snp2 and snp3...

- [$\hat{b}_{snp} = b_{snp} + LD*b_{snp2} + LD*b_{snp3} + bias + \epsilon$]{.blue}
- What can we learn from this?
- if [$b_{snp}$ is 0]{.yellow}, $\hat{b}_{snp}$ need not be
- [$\hat{b}_{snp} = 0 + LD*b_{snp2} + LD*b_{snp3} + bias + \epsilon$]{.blue}
- If LD is stronger, or more SNPs are in LD, the absolute value of $\hat{b}_{snp}$ can grow well beyond that of $b_{snp}$

## relating these equations to SNP heritability

- $\hat{b}_{snp}$ contains 3 pieces:
- 1. [$b_{snp} + r_{12}*b_{snp2} + r_{13}*b_{snp3}$]{.yellow} (the true effect plus effects tagged through LD, with $r$ the correlation between SNPs)
- 2. [$bias$]{.blue} (drift/stratification uncorrelated to LD)
- 3. $\epsilon$ (sampling error, goes down with GWAS N)
- The variance in $\hat{b}_{snp}$ goes up with LD (squaring the estimate turns $r$ into $r^2$).

## estimating [SNP]{.yellow} heritability: LD Score regression

For convenience LDSC works with Z-stats, not betas:

$Z = \hat{b}_{snp} / se_{b}$

And we summarize the LD a SNP $j$ has with its neighbors as:

$LDscore_j = \sum_{k=1}^{M}r^2_{jk}$

$E[Z^2_j] = 1 + N*a + \frac{N*h^2_{snp}}{M} *LDscore_j$

Here $N$ is the GWAS sample size and $M$ the number of SNPs.

## some intuitions

Why $E[Z^2_j]$ and not $E[Z_j]$?

Why is the 1 here: $E[Z^2_j] = 1 + \dots$?

``` r
#| echo: true
Z <- rnorm(1000,mean=0,sd=1) # no signal
mean(Z^2)
```

mean(Z\^2) is approximately 1!

## Let's confirm the LDscore relations empirically

- Get a schizophrenia GWAS, and the east-west (longitude) location of your home in UKB
- $E[Z^2_j] = 1 + N*a + \frac{N*h^2_{snp}}{M} *LDscore_j$
- What is your expectation of $a$ or the intercept (pop-strat/bias) for each?
- What is your expectation of $\frac{N*h^2_{snp}}{M}$ or the slope (heritability) for each?
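## LD score regression on simulated Z-statistics (sketch)

A minimal sketch of the regression $E[Z^2_j] = 1 + N*a + \frac{N*h^2_{snp}}{M} *LDscore_j$ on simulated data. Assumptions: LD scores are drawn from a gamma distribution, Z-statistics are independent across SNPs, and the fit is a plain unweighted `lm()`. The real LDSC software additionally uses regression weights and a block jackknife, so this only illustrates the logic. All parameter values are hypothetical.

``` r
set.seed(2023)

M  <- 200000   # number of SNPs (hypothetical)
N  <- 50000    # GWAS sample size (hypothetical)
h2 <- 0.4      # true SNP heritability (hypothetical)
Na <- 0.05     # N * a, the confounding contribution to the intercept (hypothetical)

# Simulated LD scores (the gamma distribution is an assumption)
ldscore <- rgamma(M, shape = 2, scale = 10)

# Expected Z^2 for each SNP under the LD score regression model
expected_z2 <- 1 + Na + (N * h2 / M) * ldscore

# Z-statistics with mean 0 and variance equal to the expected Z^2
Z <- rnorm(M, mean = 0, sd = sqrt(expected_z2))

# Regress Z^2 on the LD score
fit <- lm(I(Z^2) ~ ldscore)

intercept_hat <- unname(coef(fit)["(Intercept)"])   # estimates 1 + N*a
h2_hat <- unname(coef(fit)["ldscore"]) * M / N      # slope rescaled to h2

intercept_hat
h2_hat
```

The intercept picks up the part of the inflation that does not depend on LD, while the slope, rescaled by $M/N$, recovers the simulated heritability.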
## Let's confirm the LDscore relations empirically

`cp -r /faculty/michel/2023/practical/ .`

## Practical failsafe

``` {.r code-line-numbers="|3-5|11-12|16|19-24"}
library(ggplot2)
# read the GWAS summary statistics and the LD scores
scz2.sumstats <- read.delim("scz2.sumstats.gz")
ldscore <- read.delim("1.l2.ldscore")
eastwest.sumstats <- read.delim("eastwest.sumstats.bgz")

# Make Z^2 from Z
scz2.sumstats$Z2 <- scz2.sumstats$Z^2
eastwest.sumstats$Z2 <- eastwest.sumstats$Z^2
# mean Z^2 is expected to be about 1 without signal or bias
mean(scz2.sumstats$Z2)
mean(eastwest.sumstats$Z2)

# heritable trait sanity check:
# merge the LD scores with the summary statistics by SNP
scz.merged <- merge(ldscore,scz2.sumstats,by="SNP")

# plot Z^2 against LD score; the red line marks Z^2 = 1
ggplot(scz.merged, aes(x=L2, y=Z2)) +
  geom_point(alpha = 1/10,col="azure4") +
  xlim(0,80) +
  ylim(0,25) +
  geom_smooth(method='lm') +
  geom_hline(yintercept = 1,col="red")

# pop-strat sanity check
eastwest.merged <- merge(ldscore,eastwest.sumstats,by="SNP")

ggplot(eastwest.merged, aes(x=L2, y=Z2)) +
  geom_point(alpha = 1/10,col="azure4") +
  xlim(0,80) +
  ylim(0,25) +
  geom_smooth(method='lm') +
  geom_hline(yintercept = 1,col="red")
```

## Practical failsafe Schizophrenia

![](images/image-1550151228.png){fig-align="center"}

## Practical failsafe east-west

![](images/image-1006150185.png){fig-align="center"}

# Ascertainment

![](images/image-918671449.png){fig-align="center"}

## Ascertainment

- The people in your study aren't always representative of the population ([sampling]{.blue})
- The measurement of your trait is not the same as your trait ([measurement]{.blue})
- These aspects of a study can follow from [design]{.yellow}, or arise [unintentionally]{.yellow}

## Ascertainment by [design]{.yellow}

- Over-sample cases in a schizophrenia GWAS (because it's rare)
- Target a study at specific populations with specific health needs
- You will need to adjust for this when computing $h^2_{snp}$!!

## [unintentional]{.yellow} ascertainment (sampling)

- Low-SES participants might not have the time to spare for a day-long lab study at a location that has poor access via public transport
- Elderly people might only respond to email if they are 1.
online 2. able to
- Level of institutional trust may influence people's willingness to consent

## [unintentional]{.yellow} ascertainment (sampling)

- Why would I care?
- It will bias all(!!) statistical estimates and inference
- There is a long causal chain between population and sample

## [unintentional]{.yellow} ascertainment (sampling): Collider bias

- if: outcome1 -\> ascertainment & outcome2 -\> ascertainment
- in the ascertained sample outcome1 and outcome2 will correlate!

## [unintentional]{.yellow} ascertainment (sampling): dating example

- Why do people feel their more attractive partners were also more toxic?
- Maybe it's true?
- Or is it collider bias?

## [unintentional]{.yellow} ascertainment (sampling): dating example

![](images/image-34604593.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): dating example

![](images/image-1360480714.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): dating example

![](images/image-211131327.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Genetics example

![](images/image-1875974755.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Genetics example

![](images/image-832458917.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Genetics example

![](images/image-1937634137.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Genetics example

![](images/image-1354719607.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Genetics example

![](images/image-145156236.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Solutions

![](images/image-1264342565.png){width="211" fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Solutions

![](images/image-1420922157.png){fig-align="center"}

## [unintentional]{.yellow} ascertainment (sampling): Solutions
![](images/image-405175823.png){fig-align="center"}

## Measurement: Does it matter?

- There is a long causal chain between a trait "ideal" and the phenotype in your file
- "true" ADHD -> detection in school/home -> GP -> referral -> (mis)diagnosis
- "Alcohol use disorder" -> questionnaire -> standard cutoffs -> AUD phenotype
- These chains are biased with respect to sex, age, SES, ethnicity...

## Measurement: Does it matter?

- Example: Martin et al., rg between MDD & BIP in females: 0.55, in males: 0.05
- Example: AUD. The AUDIT (scale) measures quantity and consequences. Do you combine them? How do you treat former drinkers?

## Measurement: Solution?

- Be internally clear (when designing the study) that you are studying the end of a long social process
- Be externally clear (in writing) that you are studying the end of a very long causal chain.

## Conclusion

- Correct inference requires a good statistical genetics model
- Correct inference also requires an adequate sampling model
- Correct inference also requires an adequate measurement model
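
## Collider bias: a simulation sketch

A minimal sketch of the collider-bias argument from the ascertainment slides, where two outcomes both influence who ends up in the sample. Assumptions: the two outcomes are independent standard normal variables in the population, and only people whose sum of the two exceeds a hypothetical threshold are ascertained. The threshold rule and all names are invented for illustration.

``` r
set.seed(42)
n <- 100000

# Two outcomes that are independent in the population
outcome1 <- rnorm(n)
outcome2 <- rnorm(n)

# Hypothetical ascertainment rule: you are sampled only if the
# sum of both outcomes exceeds a threshold of 1
ascertained <- (outcome1 + outcome2) > 1

# Correlation in the full population versus the ascertained sample
cor_population <- cor(outcome1, outcome2)
cor_sample <- cor(outcome1[ascertained], outcome2[ascertained])

cor_population
cor_sample
```

In the full population the correlation is close to zero, while in the ascertained sample the two outcomes are clearly negatively correlated, purely because both influence selection.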