This practical follows directly from the fine-mapping practical. There, you simulated a locus, ran a GWAS, and fine-mapped it with SuSiE, recovering credible sets of putative causal variants. Here we take a second step. We ask whether a genetic signal at the same locus is shared between two traits, a question known as colocalization.
The motivating biology is simple. A GWAS variant is most often non-coding, and the most proximal molecular trait it can affect is gene expression. If a GWAS signal and an expression QTL (eQTL) signal are driven by the same causal variant, the eQTL nominates a candidate gene and direction of effect for the GWAS locus. If they merely sit nearby but are driven by distinct variants in linkage disequilibrium (LD), the eQTL is a red herring. Colocalization is the formal test that separates these two cases.
coloc casts colocalization as a Bayesian model
comparison over five mutually exclusive hypotheses for a locus tested in
two studies.
| Hypothesis | Meaning |
|---|---|
| H0 | No association with either trait |
| H1 | Association with trait 1 only |
| H2 | Association with trait 2 only |
| H3 | Association with both traits, distinct causal variants |
| H4 | Association with both traits, one shared causal variant |
Each hypothesis receives a posterior probability (PP.H0
… PP.H4) that sums to one. A large PP.H4
is evidence of colocalization. A large PP.H3 means both
traits are associated but through different variants.
coloc.abf needs only marginal summary
statistics and no LD matrix. That is its real advantage. Use it as the
fallback when you lack a reliable in-sample LD
matrix.coloc.abf assumes at most one causal variant
per locus. Treat this as a limitation to work around, not a
property you can usually assume holds. Choosing coloc.abf
because you believe the locus has a single signal is the wrong reason to
choose it.coloc.susie relaxes that assumption to
multiple causal variants per locus. It takes the SuSiE
fine-mapping output for each trait and tests every pair of credible
sets, reporting a posterior probability of a shared causal variant
(PP.H4) for each pair. Whenever you have a trustworthy
in-sample LD matrix, this is the method to prefer.Note. Real loci frequently harbor multiple causal variants, and their causal effects can be positively or negatively correlated through LD. A second signal can then distort, or even mask, the marginal association of the first, which breaks the single-causal-variant assumption (Kanai et al. 2021; Zhang et al. 2023).
We will run both on the same locus and see why the distinction matters.
library(susieR) # SuSiE fine-mapping
library(coloc) # colocalization
We start from the locus built in the fine-mapping practical, loading
its frozen genotype matrix ISG_2026_genotypes.rds. We use a
saved matrix because MASS::mvrnorm performs differently
across platforms even with the same seed, so regenerating it live would
not be reproducible. The commented chunk below shows how the file was
generated once.
p <- 100 # variants in the locus
N <- 1000 # individuals
X <- readRDS("ISG_2026_genotypes.rds") # frozen 1000 x 100 genotype matrix
in_sample_LD <- cov(scale(X)) # in-sample LD used by SuSiE
# How ISG_2026_genotypes.rds was generated:
# library(MASS)
# LD <- matrix(0.3, nrow = p, ncol = p)
# diag(LD) <- 1.0
# set.seed(4)
# X <- mvrnorm(n = N, mu = rep(0, p), Sigma = LD)
# saveRDS(X, "ISG_2026_genotypes.rds")
# Trait 1: a GWAS phenotype with L = 3 causal variants
L <- 3
h2g <- 0.1
set.seed(1)
causal_config <- rep(0, p)
causal_ind1 <- sample(1:p, L, replace = FALSE) # the true causal variants
causal_config[causal_ind1] <- 1
per_snp_h2g <- h2g / L
effect_sizes <- rnorm(L, mean = 0, sd = sqrt(per_snp_h2g))
beta1 <- causal_config
beta1[causal_ind1] <- effect_sizes
genetic_effect <- X %*% beta1
var_g <- var(genetic_effect)
sigma_squared <- ifelse(1 - var_g > 0.05, 1 - var_g, 0.1) # environmental noise
y1 <- as.numeric(genetic_effect + rnorm(N, mean = 0, sd = sqrt(sigma_squared)))
cat(sprintf("Trait 1 causal variants: %s\n", paste(causal_ind1, collapse = ", ")))
## Trait 1 causal variants: 68, 39, 1
A small helper to run a single-variant GWAS, the same marginal regression as the fine-mapping practical.
run_gwas <- function(X, y) {
p <- ncol(X)
beta <- se <- numeric(p)
for (i in 1:p) {
fit <- summary(lm(y ~ X[, i] - 1))$coefficients
beta[i] <- fit[1, "Estimate"]
se[i] <- fit[1, "Std. Error"]
}
data.frame(beta = beta, se = se, z = beta / se,
pval = pchisq((beta / se)^2, df = 1, lower.tail = FALSE))
}
gwas1 <- run_gwas(X, y1)
Fine-map trait 1 with SuSiE, exactly as in the fine-mapping practical.
fm1 <- susie_rss(bhat = gwas1$beta, shat = gwas1$se, n = N,
R = in_sample_LD, var_y = var(y1),
L = L + 1, estimate_residual_variance = TRUE)
fm1$sets$cs # credible sets for trait 1
## $L1
## [1] 39
Trait 1 has three causal variants (68, 39, and 1), but only variant 39 carries a strong enough signal to form a credible set. SuSiE localizes the signals it can resolve confidently.
Now we create the QTL phenotype to colocalize against the GWAS. We reuse the same genotypes, so both traits share the same LD, as a GWAS and an eQTL measured in the same region would. We give the eQTL two causal variants:
This mirrors the realistic situation that motivates
coloc.susie. A locus carries several independent signals,
only some of which are shared across traits.
set.seed(2026)
shared_snp <- 39 # also causal for the GWAS (trait 1)
private_snp <- 20 # causal for the eQTL only
causal_ind2 <- c(shared_snp, private_snp)
L2 <- length(causal_ind2)
h2g2 <- 0.15 # eQTLs often explain more variance than GWAS loci
per_snp_h2g2 <- h2g2 / L2
beta2 <- rep(0, p)
eff2 <- rnorm(L2, mean = 0, sd = sqrt(per_snp_h2g2))
beta2[causal_ind2] <- eff2
genetic_effect2 <- X %*% beta2
var_g2 <- var(genetic_effect2)
sigma_squared2 <- ifelse(1 - var_g2 > 0.05, 1 - var_g2, 0.1)
y2 <- as.numeric(genetic_effect2 + rnorm(N, mean = 0, sd = sqrt(sigma_squared2)))
cat(sprintf("Trait 2 (eQTL) causal variants: %s\n", paste(causal_ind2, collapse = ", ")))
## Trait 2 (eQTL) causal variants: 39, 20
Run the eQTL GWAS and fine-map it with SuSiE, the same workflow as trait 1.
gwas2 <- run_gwas(X, y2)
fm2 <- susie_rss(bhat = gwas2$beta, shat = gwas2$se, n = N,
R = in_sample_LD, var_y = var(y2),
L = L2 + 1, estimate_residual_variance = TRUE)
fm2$sets$cs # credible sets for trait 2
## $L1
## [1] 20
##
## $L2
## [1] 39
Before testing colocalization, always look at the locus. First the marginal association, plotted as \(-\log_{10}(p)\) against variant position. True causal variants are red, and the lead (most significant) variant of each trait is labeled.
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
plot_assoc <- function(gwas, causal, trait) {
logp <- -log10(gwas$pval)
lead <- which.max(logp)
plot(1:p, logp, pch = 16, col = "grey60",
xlab = "Variant", ylab = expression(-log[10](p)),
main = sprintf("%s marginal association", trait))
points(causal, logp[causal], pch = 16, col = "red")
text(lead, logp[lead], labels = paste0("lead: variant ", lead), pos = 2, cex = 0.9)
}
plot_assoc(gwas1, causal_ind1, "Trait 1 (GWAS)")
plot_assoc(gwas2, causal_ind2, "Trait 2 (eQTL)")
The GWAS lead is variant 39, whereas the eQTL lead is variant 20. The
shared variant 39 is present in the eQTL but it is not the
eQTL’s strongest signal. Keep this in mind when we run
coloc.abf, which only ever compares the single lead per
trait.
Next, the SuSiE fine-mapping output. susieR ships its
own plotting function, susie_plot. With
y = "PIP" it draws the posterior inclusion probabilities
and colors variants by credible-set membership. Passing the true effect
vector to b overlays the simulated causal variants, which
is handy for a teaching example. (susieR also provides
susie_plot(model, y = "z") for z-scores, plus
susie_plot_iteration and
susie_plot_changepoint for diagnostics.)
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
b1_true <- numeric(p); b1_true[causal_ind1] <- 1
susie_plot(fm1, y = "PIP", b = b1_true, main = "Trait 1 (GWAS) SuSiE")
b2_true <- numeric(p); b2_true[causal_ind2] <- 1
susie_plot(fm2, y = "PIP", b = b2_true, main = "Trait 2 (eQTL) SuSiE")
Each credible set is drawn in its own color, and the true causal variants are circled. Note that trait 1’s two weaker causal variants (68 and 1) are not captured in any credible set. SuSiE reports only the signals it can localize confidently.
Q1: Which variant falls in a credible set for both traits, and which is private to the eQTL?
coloc.abf will therefore
lean toward H3 (distinct variants), as we confirm below.
coloc.abf needs a small list per trait. For a
quantitative trait we supply the marginal effect size beta,
its variance varbeta = se^2, the variant names
snp, the sample size N, and sdY,
the standard deviation of the phenotype. We attach simple variant names
so both traits refer to the same SNPs.
snp_names <- paste0("SNP", 1:p)
D1 <- list(beta = gwas1$beta, varbeta = gwas1$se^2, snp = snp_names,
type = "quant", N = N, sdY = sd(y1))
D2 <- list(beta = gwas2$beta, varbeta = gwas2$se^2, snp = snp_names,
type = "quant", N = N, sdY = sd(y2))
# coloc checks that each dataset is well formed
check_dataset(D1)
## NULL
check_dataset(D2)
## NULL
res.abf <- coloc.abf(dataset1 = D1, dataset2 = D2)
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 5.05e-15 2.17e-11 2.32e-04 1.00e+00 6.20e-06
## [1] "PP abf for shared variant: 0.00062%"
round(res.abf$summary, 4)
## nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 100.0000 0.0000 0.0000 0.0002 0.9998 0.0000
Q2: Which hypothesis does coloc.abf favor, and
what is it concluding about the locus?
PP.H3 is close to 1, so coloc.abf concludes
that both traits are associated but through distinct
causal variants. Under its one-causal-variant assumption it compares
only the single strongest signal in each trait. The GWAS lead is variant
39, whereas the eQTL lead is the private variant 20, so the single
signals look different and the verdict is H3.
Notice the trap. The locus genuinely does share variant 39,
but coloc.abf cannot see it because the eQTL’s strongest
signal sits elsewhere. The single-causal-variant assumption has thrown
away the shared signal. This is exactly the situation
coloc.susie was built for.
coloc.susie takes the SuSiE objects directly. It needs
the variant names on the fitted object so it can match SNPs across
traits, so we attach them. (runsusie() does this for you
automatically; here we add them to the objects produced above.)
name_susie <- function(S, snps) {
colnames(S$alpha) <- snps
colnames(S$lbf_variable) <- snps
names(S$pip) <- snps
S
}
fm1 <- name_susie(fm1, snp_names)
fm2 <- name_susie(fm2, snp_names)
res.susie <- coloc.susie(fm1, fm2)
# coloc.susie returns a data.table, which does not auto-print inside knitr; coerce to print it
as.data.frame(res.susie$summary)
coloc.susie returns one row per pair of credible
sets, with hit1 and hit2 naming the
lead variant of each set. Read the table by looking for rows with high
PP.H4.
Q3: Which pair of signals colocalizes, and what is its PP.H4?
hit1 = SNP39, hit2 = SNP39 has
PP.H4 close to 1. The shared variant 39 is recovered as a
colocalizing signal. Every pairing that involves the eQTL’s private
variant 20 returns high PP.H3 instead, correctly flagging
those as distinct.
Q4: coloc.abf said H3 and
coloc.susie found a shared signal. Are they contradicting
each other?
coloc.abf asks whether
the single strongest signal is shared, and at this
locus it is not (39 versus 20). coloc.susie decomposes the
locus into independent signals and asks the question per signal, so it
can find that one of several signals, variant 39, is shared while
another is private. When a locus may have more than one causal variant,
prefer coloc.susie.
coloc.susie whenever you have
a reliable in-sample LD matrix, because you usually cannot know in
advance that a locus carries a single causal variant. Fall back to
coloc.abf when no trustworthy LD is available, accepting
its single-causal-variant assumption as the price of needing only
summary statistics.coloc.susie are sensitive to LD misspecification. Use an
in-sample LD matrix, or a reference panel that matches the GWAS
ancestry, exactly as in the fine-mapping practical.coloc.abf uses prior
probabilities p1, p2, and p12.
The default p12 = 1e-5 is reasonable but worth a
sensitivity check, for example with
sensitivity(res.abf, "H4 > 0.5").Q5: You fine-map a GWAS locus and an eQTL and each resolves a single credible set, both centered on the same variant. Which method would you reach for, and what result do you expect?
coloc.susie. It reduces to the same shared-variant
conclusion here, with high PP.H4, but it stays correct if a
weak second signal is present that the single credible set did not
resolve. Reserve coloc.abf for the case where you have only
marginal summary statistics and no reliable LD matrix. Its
single-causal-variant assumption is a real limitation, not a safe
default, because correlated or anti-correlated causal effects at the
same locus are common (Kanai et al. 2021; Zhang et al. 2023).
coloc.abf
method. PLoS
Genetcoloc.susie for multiple causal variants.
PLoS
Genetcoloc package vignettes: chr1glh24.github.io/colocrunsusie insteadWhen you fine-map specifically with colocalization downstream,
coloc::runsusie() is a cleaner entry point than calling
susie_rss directly. It wraps susie_rss,
carries the LD and SNP names through automatically, and returns an
object ready for coloc.susie. It expects a coloc-style
dataset that also includes the LD matrix.
# runsusie checks that the LD matrix is labelled with the same SNP names as the dataset
rownames(in_sample_LD) <- snp_names
colnames(in_sample_LD) <- snp_names
D1_ld <- c(D1, list(LD = in_sample_LD, position = 1:p))
D2_ld <- c(D2, list(LD = in_sample_LD, position = 1:p))
s1 <- runsusie(D1_ld)
s2 <- runsusie(D2_ld)
coloc.susie(s1, s2)$summary