---
title: "Causes of Covariation"
author: "Michel Nivard"
format: 
  revealjs:
    transition: slide
    navigation-mode: vertical
    fig-align: center
    theme: sky
highlight-style: "nord"
---

## Flamingos

{{< video images/IMG_2070.mp4 width="338" height="600"  >}}

# Causes of covariation

**Today** we will cover ways to model the *genetic* covariance and correlation between two or more traits.

**This hour** will cover:

-   What a correlation is (and isn't)
-   How to relate what you want to know to a statistical result

# What is a correlation?

-   a quantification of the degree to which two variables are linearly related
-   correlation implies dependence
-   dependence DOES NOT imply correlation

## Examples of dependence vs correlation

## Uncorrelated

```{r}
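library(ggplot2)
# x and y are drawn independently, so they are uncorrelated in expectation
y <- rnorm(1000,0,1)
x <- rnorm(1000,0,1)

# data.frame() keeps both columns; as.data.frame(x=x,y=y) silently drops y
data <- data.frame(x=x,y=y)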

ggplot(data = data,aes(x=x,y=y)) +
   geom_point()

```

## Correlated

```{r}
library(ggplot2)
# the shared component u makes x and y correlated
u <- rnorm(1000,0,1)
y <- u + .5*rnorm(1000,0,1)
x <- u + .5*rnorm(1000,0,1)

data <- data.frame(x=x,y=y)

ggplot(data = data,aes(x=x,y=y)) +
   geom_point()

```

## Functionally related, but is it correlated?

```{r}
library(ggplot2)

# y depends on x only through x^2, a symmetric non-linear relation
x <-  rnorm(1000,0,1)
y <- x^2 + .5*rnorm(1000,0,1)

data <- data.frame(x=x,y=y)

ggplot(data = data,aes(x=x,y=y)) +
   geom_point()

```

## Functionally related, but is it correlated?

```{r}
library(ggplot2)

# y depends on x through x^3, a monotone non-linear relation
x <-  rnorm(1000,0,1)
y <- x^3 + .5*rnorm(1000,0,1)

data <- data.frame(x=x,y=y)

ggplot(data = data,aes(x=x,y=y)) +
   geom_point()

```
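To answer the question on the last two slides numerically, here is a minimal sketch that computes the Pearson correlation for both functional relations. The seed and the larger sample size are arbitrary choices made for this illustration, not part of the original simulation. For a symmetric `x` the quadratic relation has zero correlation with `x` in expectation (because $E[x^3] = 0$), while the cubic relation is clearly correlated.

```{r}
#| echo: true
set.seed(1)                         # arbitrary seed, for reproducibility only
n <- 1e5                            # large n keeps sampling noise small
x <- rnorm(n, 0, 1)
y_sq <- x^2 + .5 * rnorm(n, 0, 1)   # same model as the x^2 slide
y_cu <- x^3 + .5 * rnorm(n, 0, 1)   # same model as the x^3 slide

cor_sq <- cor(x, y_sq)              # functionally related, yet close to 0
cor_cu <- cor(x, y_cu)              # functionally related and correlated
round(c(quadratic = cor_sq, cubic = cor_cu), 3)
```

Both `y_sq` and `y_cu` are fully determined by `x` up to noise, so the difference in correlation reflects the shape of the relation, not the strength of the dependence.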

## Dependent, likely uncorrelated...

```{r}
library(ggplot2)
# points on a noisy unit circle: x and y are dependent through the angle u
u <- runif(1000,-pi,pi)
y <- sin(u) + .05*rnorm(1000,0,1)
x <- cos(u) + .05*rnorm(1000,0,1)

data <- data.frame(x=x,y=y)

ggplot(data = data,aes(x=x,y=y)) +
   geom_point()

```
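The circle can be checked numerically as well. The sketch below reuses the same simulation with an arbitrary seed and a larger sample (both are assumptions made for illustration). It reports the correlation, which should be near zero, next to the distance of each point from the origin, which shows how tightly `x` and `y` constrain each other.

```{r}
#| echo: true
set.seed(2)                    # arbitrary seed
n <- 1e5
u <- runif(n, -pi, pi)
y <- sin(u) + .05 * rnorm(n, 0, 1)
x <- cos(u) + .05 * rnorm(n, 0, 1)

r_xy <- cor(x, y)              # linear association between x and y
radius <- sqrt(x^2 + y^2)      # dependence: points sit near the unit circle
round(c(correlation = r_xy, mean_radius = mean(radius), sd_radius = sd(radius)), 3)
```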

## scatterplots go brrr

```{r}
library("datasauRus")
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset))+
  geom_point() +
  theme_void() +
  theme(legend.position = "none")+
  facet_wrap(~dataset, ncol = 5)
```

## A common estimator of covariance

$$cov_{x,y} = \frac{\sum_{i = 1}^{n}{(x_i-\color{blue}{\bar{x}})(y_i-\color{red}{\bar{y}})}}{n-1}$$

```{r}
library(ggplot2)
# small sample so individual deviations from the means are visible
u <- rnorm(50,0,1)
y <- u + .5*rnorm(50,0,1)
x <- u + .5*rnorm(50,0,1)

data <- data.frame(x=x,y=y)

ggplot(data = data,aes(x=x,y=y)) +
   geom_point() +
   geom_hline(yintercept=mean(y),col="red") +
   geom_vline(xintercept=mean(x),col="blue") 

```
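The estimator can be written out step by step. This is a minimal sketch on freshly simulated data (the seed is arbitrary), compared against R's built-in `cov()`. Points in the upper-right and lower-left quadrants formed by the two mean lines contribute positive cross-products; points in the other two quadrants contribute negative ones.

```{r}
#| echo: true
set.seed(3)                    # arbitrary seed
n <- 50
u <- rnorm(n, 0, 1)
y <- u + .5 * rnorm(n, 0, 1)
x <- u + .5 * rnorm(n, 0, 1)

# product of each point's deviation from mean(x) and from mean(y)
cross_products <- (x - mean(x)) * (y - mean(y))
cov_by_hand <- sum(cross_products) / (n - 1)
c(by_hand = cov_by_hand, built_in = cov(x, y))
```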

## A common estimator of variance

$$var_{x} = \frac{\sum_{i = 1}^{n}{(x_i-\color{blue}{\bar{x}})(x_i-\color{blue}{\bar{x}})}}{n-1}$$

$$var_{x} = \frac{\sum_{i = 1}^{n}{(x_i-\color{blue}{\bar{x}})^2}}{n-1}$$

## A common estimator of correlation

$$cor_{x,y} = {\frac{cov_{x,y}}{\sqrt{var_x * var_y}}}$$
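As a sketch of how the pieces fit together, the correlation can be assembled from the variance and covariance estimators and checked against the built-in `cor()`. The simulated data and the seed are arbitrary choices for illustration.

```{r}
#| echo: true
set.seed(4)                    # arbitrary seed
n <- 50
u <- rnorm(n, 0, 1)
y <- u + .5 * rnorm(n, 0, 1)
x <- u + .5 * rnorm(n, 0, 1)

var_x  <- sum((x - mean(x))^2) / (n - 1)
var_y  <- sum((y - mean(y))^2) / (n - 1)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# the covariance rescaled by the two standard deviations
cor_by_hand <- cov_xy / sqrt(var_x * var_y)
c(by_hand = cor_by_hand, built_in = cor(x, y))
```

Because the same `n - 1` appears in the numerator and the denominator, it cancels, which is why the correlation is bounded between -1 and 1 regardless of the units of `x` and `y`.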

## Two definitions of genetic correlation...

$$p1 = a1 + c1 + e1$$

$$p2 = a2 + c2 + e2$$

$$r_g = cor(a1,a2)$$

## Two definitions of genetic correlation...

$$p1_i = \sum_{j = 1}^{m}{(b1_j*snp_{ij})} + e1_i$$

$$p2_i = \sum_{j = 1}^{m}{(b2_j*snp_{ij})} + e2_i$$

$$r_g = cor(b1,b2)$$
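The two definitions can be compared in a small simulation. This is a sketch under strong simplifying assumptions that are not part of the slides: the SNPs are independent (no LD), standardized, and all have allele frequency 0.5, and the sample size, number of SNPs and true correlation `rho` are hypothetical values. Under these assumptions the correlation between SNP effects and the correlation between the summed genetic values should be close; in real data they are similar but not identical quantities.

```{r}
#| echo: true
set.seed(5)      # arbitrary seed
n   <- 2000      # individuals (hypothetical)
m   <- 500       # independent SNPs (hypothetical)
rho <- 0.5       # true correlation between SNP effects (hypothetical)

# standardized genotypes: rows are individuals, columns are SNPs
snp <- scale(matrix(rbinom(n * m, 2, 0.5), nrow = n, ncol = m))

# correlated SNP effects on trait 1 and trait 2
b1 <- rnorm(m)
b2 <- rho * b1 + sqrt(1 - rho^2) * rnorm(m)

# additive genetic values: a_i = sum_j b_j * snp_ij
a1 <- as.vector(snp %*% b1)
a2 <- as.vector(snp %*% b2)

rg_effects <- cor(b1, b2)   # correlation between SNP effects
rg_values  <- cor(a1, a2)   # correlation between genetic values
round(c(effects = rg_effects, values = rg_values), 3)
```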

## Let's play a game!

[correlation game](https://www.rossmanchance.com/applets/2021/guesscorrelation/GuessCorrelation.html)

# From research question, to statistical output

-   How to relate what you want to know, to a statistical result?

## What is it that you want to know?

::: columns
::: {.column width="40%"}
*"are risk for depression and BMI genetically correlated?"*
:::

::: {.column width="60%"}
![](images/clipboard-3210655481.png)
:::
:::

## How will you go and find out?

::: columns
::: {.column width="40%"}
"We will apply LD score regression to two sets of GWAS summary data from two different consortia that studied BMI and MDD"
:::

::: {.column width="60%"}
![](images/clipboard-2241931969.png)
:::
:::

## What did you find?

::: columns
::: {.column width="40%"}
"The estimate of the genetic correlation between the PGC MDD and GIANT BMI GWASs, using LD score regression, is 0.09"
:::

::: {.column width="60%"}
![](images/clipboard-3592335646.png)
:::
:::

## Let's go over this step by step

An **estimand** is a quantity that is to be estimated in a statistical analysis. The term is used to distinguish the target of inference (**estimand**) from the method used to obtain an approximation of this target (i.e., **the estimator**) and the specific value obtained from a given method and dataset (i.e., **the estimate**).
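The three terms can be made concrete in a simulation, where the estimand is known because we set it ourselves. This is a minimal sketch: the value of `rho`, the sample size and the number of replications are arbitrary assumptions, and `one_estimate()` is a hypothetical helper written for this illustration.

```{r}
#| echo: true
set.seed(6)                               # arbitrary seed
rho <- 0.5                                # estimand: the population correlation
estimator <- function(x, y) cor(x, y)     # estimator: a rule applied to data

# hypothetical helper: draw one sample of size n and return one estimate
one_estimate <- function(n) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  estimator(x, y)                         # estimate: the value from this dataset
}

estimates <- replicate(1000, one_estimate(100))
summary(estimates)
```

Every replication uses the same estimand and the same estimator, yet returns a different estimate. The spread of `estimates` is the sampling variation that sits between the estimand and any single reported number.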

## (genetic) correlation, estimands and estimate

-   We almost always want to know about processes that move the estimand
-   The diagram below **depends on your estimand!**

```{mermaid}
flowchart LR
  A(common cause) --> D(Estimand correlation)
  B(BMI -> Dep) --> D
  C(Dep -> BMI) --> D
  
  D --> E[estimator]
  E --> H[Estimate correlation]
  F[sampling] --> H[Estimate correlation]
  G[measurement] --> H[Estimate correlation]
```

## correlation, estimands and estimate

-   We almost always want to know about processes that move the estimand
-   The diagram below **depends on your estimand!**

![](images/clipboard-623555496.png)

## causation, estimands and estimate

-   If we change the estimand or the estimator, the diagram shifts!
-   Estimand: "*The causal effect of BMI on Depression*"

```{mermaid}
flowchart LR
  B(BMI -> Dep) --> D(Estimand)
  
  
  A(common cause) --> H[Estimate correlation]
  C(Dep -> BMI) --> H
  D --> E[estimator]
  E --> H
  F[sampling] --> H
  G[measurement] --> H
```

## causation, estimands and estimate

-   If we change the estimand or the estimator, the diagram shifts!
-   Estimand: "*The causal effect of BMI on Depression*"

![](images/clipboard-3297154981.png)

## Let's look at some specific cases...

```{mermaid}
flowchart LR
  D(Estimand correlation) --> G[Estimate]
  E[sampling] --> G[Estimate]
  F[measurement] --> G[Estimate]
```

There are some very specific causes of correlation we need to discuss:

-   ascertainment (and collider bias)
-   measurement (and measurement error)

## Ascertainment & measurement

-   The people in your study aren't always representative of the population (***sampling***)
-   The measurement of your trait is not the same as your trait (***measurement***)
-   These aspects of a study can arise by ***design***, or ***unintentionally***

## Ascertainment by design

-   Over-sample cases in a schizophrenia GWAS (because it's rare)

-   Target a study at specific populations with specific health needs

-   You will need to adjust your estimator of $h^2$!!
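One commonly used adjustment for over-sampled cases is to transform the observed-scale $h^2$ to the liability scale. The sketch below uses the transformation usually attributed to Lee et al. (2011); this formula is an assumption brought in for illustration and is not stated on the slides. The function name `liability_h2()` and all input values are hypothetical. `K` is the population prevalence and `P` is the proportion of cases in the sample.

```{r}
#| echo: true
# hypothetical helper: observed-scale h2 from a case-control sample -> liability scale
liability_h2 <- function(h2_obs, K, P) {
  z <- dnorm(qnorm(1 - K))   # normal density at the liability threshold
  h2_obs * (K * (1 - K))^2 / (z^2 * P * (1 - P))
}

# same observed-scale value, two sampling schemes (all numbers hypothetical)
h2_oversampled <- liability_h2(h2_obs = 0.30, K = 0.01, P = 0.50)  # half the sample are cases
h2_population  <- liability_h2(h2_obs = 0.30, K = 0.01, P = 0.01)  # cases sampled at prevalence
c(oversampled = h2_oversampled, population = h2_population)
```

The same observed-scale number maps to very different liability-scale values depending on `P`, which is why the sampling design has to enter the estimator.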

## Unintentional ascertainment (usually sampling)

-   Participants whose socioeconomic position is fragile might not have time to spare for a day-long lab study at a location with poor access via public transport

-   Elderly people might only respond to email if they are (1) online and (2) able to

-   Level of institutional trust may influence people's willingness to consent

## unintentional ascertainment (sampling)

-   Why would I care?

-   It will bias all(!!) statistical estimates and inference

-   There is a long causal chain between population and sample

## unintentional ascertainment (sampling): Collider bias

-   if: outcome1 -\> ascertainment & outcome2 -\> ascertainment

-   in the ascertained sample outcome1 and outcome2 will correlate!
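A minimal simulation of this mechanism, assuming two outcomes that are independent in the population and a hypothetical ascertainment rule in which people enter the sample when the sum of the two outcomes exceeds a threshold. The threshold, seed and sample size are arbitrary.

```{r}
#| echo: true
set.seed(7)                  # arbitrary seed
n <- 1e5
outcome1 <- rnorm(n)         # independent of outcome2 in the population
outcome2 <- rnorm(n)

# hypothetical rule: both outcomes raise the chance of being in the sample
ascertained <- (outcome1 + outcome2) > 1

cor_population <- cor(outcome1, outcome2)
cor_sample     <- cor(outcome1[ascertained], outcome2[ascertained])
round(c(population = cor_population, ascertained = cor_sample), 3)
```

Under this rule the induced correlation is negative: among the ascertained, someone with a low value on one outcome must have a high value on the other to have been sampled at all.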

## Collider bias: dating example

-   Why do people feel their more attractive partners were also more toxic?

-   Maybe it's true? (maybe it affects my estimand)

-   Or is it collider bias? (or it affects my estimate)

## Collider bias: dating example

![](images/image-34604593.png){fig-align="center"}

## Collider bias: dating example

![](images/image-1360480714.png){fig-align="center"}

## How common is this? Should I care?

![](images/clipboard-3461306808.png)

## How common is this? Should I care?

![](images/clipboard-3618054889.png)

## The causes of a (genetic) correlation that we do care about?

-   (latent) common cause

-   causal relation between two traits

## A common cause

```{mermaid}
flowchart TB
  D(SNP) --> G[asthma]
  D(SNP) --> H[stress]
```

## ALSO a common cause

```{mermaid}
flowchart TB
  D(SNP) --> E[smoking]
  E --> G[asthma]
  E --> H[stress]
```
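The second diagram can be simulated to see why it still counts as a common cause. In this sketch all effect sizes and the allele frequency are hypothetical, and the traits are treated as continuous for simplicity. The SNP acts on both traits only through smoking, so the two traits correlate, and that correlation disappears once smoking is regressed out of each trait.

```{r}
#| echo: true
set.seed(8)                          # arbitrary seed
n <- 1e5
snp     <- rbinom(n, 2, 0.3)         # hypothetical allele frequency
smoking <- 0.5 * snp + rnorm(n)      # hypothetical effect sizes throughout
asthma  <- 0.5 * smoking + rnorm(n)
stress  <- 0.5 * smoking + rnorm(n)

cor_traits <- cor(asthma, stress)
# correlation left after removing the part of each trait explained by smoking
cor_given_smoking <- cor(resid(lm(asthma ~ smoking)),
                         resid(lm(stress ~ smoking)))
round(c(marginal = cor_traits, given_smoking = cor_given_smoking), 3)
```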

## A causal effect

```{mermaid}
flowchart LR
  D(SNP) --> E[lung cancer]
```

## ALSO a causal effect

```{mermaid}
flowchart LR
  C(SNP) --> D[smoking] --> E[lung cancer]
```

## Take home

-   You have to consider what you want to know (estimand) carefully
-   This will help you understand what your estimate actually means
-   When analyzing the relation between two or more traits, consider all the causes of covariation!

# Glance at the rest of the day:

-   **Margot** will discuss estimating genetic correlation between two traits, using twin/family data.
-   **Brad** will discuss models for the genetic correlations between more than 2 traits in family data.
-   **I** will discuss estimators of genetic correlation based on GWAS summary data (LDSC/Genomic SEM)
-   **Andrew** will discuss models for the genetic correlations between more than 2 traits based on GWAS summary data (LDSC/Genomic SEM)

## bivariate twin model with Margot

$$p1 = a1 + c1 + e1$$

$$p2 = a2 + c2 + e2$$

$$r_g = cor(a1,a2)$$ 

## bivariate twin model with Margot

$$Vp1_{mz} = Va1 + Vc1 + Ve1$$

$$cov(p1_{mz1},p1_{mz2}) = Va1 + Vc1$$ 

## bivariate twin model with Margot

$$Vp1 = Va1 + Vc1 + Ve1$$

$$Vp2 = Va2 + Vc2 + Ve2$$

$$cov(p1_{mz1},p2_{mz2}) = Cov(a1,a2) + Cov(c1,c2)$$
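These MZ expectations can be checked in a simulation. This is a sketch with hypothetical variance components and hypothetical correlations `ra` and `rc` between the A and C components of the two traits. MZ twins share `a` and `c` completely, while `e` is unique to each twin.

```{r}
#| echo: true
set.seed(9)                            # arbitrary seed
n  <- 1e5                              # MZ pairs
Va <- 0.5; Vc <- 0.2; Ve <- 0.3        # hypothetical, equal for both traits
ra <- 0.6; rc <- 0.3                   # hypothetical cor(a1,a2) and cor(c1,c2)

# components shared by both twins of a pair, correlated across traits
a1 <- rnorm(n, 0, sqrt(Va)); a2 <- ra * a1 + sqrt(1 - ra^2) * rnorm(n, 0, sqrt(Va))
c1 <- rnorm(n, 0, sqrt(Vc)); c2 <- rc * c1 + sqrt(1 - rc^2) * rnorm(n, 0, sqrt(Vc))

# phenotypes: each twin gets its own unique environment e
p1_twin1 <- a1 + c1 + rnorm(n, 0, sqrt(Ve))
p1_twin2 <- a1 + c1 + rnorm(n, 0, sqrt(Ve))
p2_twin2 <- a2 + c2 + rnorm(n, 0, sqrt(Ve))

within_trait <- cov(p1_twin1, p1_twin2)   # compare with Va1 + Vc1
cross_trait  <- cov(p1_twin1, p2_twin2)   # compare with Cov(a1,a2) + Cov(c1,c2)
rbind(simulated = c(within_trait, cross_trait),
      expected  = c(Va + Vc, ra * Va + rc * Vc))
```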

## Bivariate molecular model with Me

-   We can do a very similar thing with GWAS summary data.

$$p1_i = \sum_{j = 1}^{m}{(b1_j*snp_{ij})} + e1_i$$

$$p2_i = \sum_{j = 1}^{m}{(b2_j*snp_{ij})} + e2_i$$

$$r_g = cor(b1,b2)$$

## genetic correlations

-   The bivariate twin model and LDSC are complementary estimators of a similar quantity
-   It's not an identical quantity(!)

## Latent variable models with Brad & Andrew

```{mermaid}
flowchart TB
  D(latent_variable) --> E[Depression]
  D --> G[Anxiety]
  D --> H[PTSD]
```
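A phenotypic version of this diagram can be sketched with base R's `factanal()`. This is only an illustration of the common-factor idea with hypothetical loadings, not the family-data or Genomic SEM models covered later in the day. With three indicators and one factor the model is just identified, so the loadings can be recovered but the fit cannot be tested.

```{r}
#| echo: true
set.seed(10)                                                     # arbitrary seed
n <- 5000
loadings_true <- c(Depression = 0.8, Anxiety = 0.7, PTSD = 0.6)  # hypothetical

latent <- rnorm(n)
# each indicator = loading * latent variable + unique residual (unit total variance)
indicators <- sapply(loadings_true,
                     function(l) l * latent + sqrt(1 - l^2) * rnorm(n))

fit <- factanal(indicators, factors = 1)
loadings_est <- abs(as.vector(fit$loadings))   # abs(): the factor's sign is arbitrary
round(rbind(true = loadings_true, estimated = loadings_est), 2)
```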

## Latent variable models with Brad & Andrew

```{mermaid}
flowchart TB
  A(A) --> D(latent_variable)
  B(E) --> D(latent_variable)
  D --> E[Depression]
  D --> G[Anxiety]
  D --> H[PTSD]
```

Or...

```{mermaid}
flowchart TB
  D(E) --> F[Depression]
  D --> G[Anxiety]
  D --> H[PTSD]
  
  E(A) --> F[Depression]
  E --> G[Anxiety]
  E --> H[PTSD]
```

## Genetics in the context of genetic latent variable modeling

-   Brad and Andrew discuss complementary estimators of genetic latent variable models
-   The methods and code might look very different, but many concepts are shared