Background

This practical follows directly from the fine-mapping practical. There, you simulated a locus, ran a GWAS, and fine-mapped it with SuSiE, recovering credible sets of putative causal variants. Here we take a second step. We ask whether a genetic signal at the same locus is shared between two traits, a question known as colocalization.

The motivating biology is simple. A GWAS variant is most often non-coding, and the most proximal molecular trait it can affect is gene expression. If a GWAS signal and an expression QTL (eQTL) signal are driven by the same causal variant, the eQTL nominates a candidate gene and direction of effect for the GWAS locus. If they merely sit nearby but are driven by distinct variants in linkage disequilibrium (LD), the eQTL is a red herring. Colocalization is the formal test that separates these two cases.

The five hypotheses

coloc casts colocalization as a Bayesian model comparison over five mutually exclusive hypotheses for a locus tested in two studies.

Hypothesis Meaning
H0 No association with either trait
H1 Association with trait 1 only
H2 Association with trait 2 only
H3 Association with both traits, distinct causal variants
H4 Association with both traits, one shared causal variant

Each hypothesis receives a posterior probability (PP.H0PP.H4) that sums to one. A large PP.H4 is evidence of colocalization. A large PP.H3 means both traits are associated but through different variants.

Two flavors of coloc

  • coloc.abf needs only marginal summary statistics and no LD matrix. That is its real advantage. Use it as the fallback when you lack a reliable in-sample LD matrix.
  • coloc.abf assumes at most one causal variant per locus. Treat this as a limitation to work around, not a property you can usually assume holds. Choosing coloc.abf because you believe the locus has a single signal is the wrong reason to choose it.
  • coloc.susie relaxes that assumption to multiple causal variants per locus. It takes the SuSiE fine-mapping output for each trait and tests every pair of credible sets, reporting a posterior probability of a shared causal variant (PP.H4) for each pair. Whenever you have a trustworthy in-sample LD matrix, this is the method to prefer.

Note. Real loci frequently harbor multiple causal variants, and their causal effects can be positively or negatively correlated through LD. A second signal can then distort, or even mask, the marginal association of the first, which breaks the single-causal-variant assumption (Kanai et al. 2021; Zhang et al. 2023).

We will run both on the same locus and see why the distinction matters.


Setup

library(susieR)  # SuSiE fine-mapping
library(coloc)   # colocalization

Recap: the fine-mapping locus and trait-1 SuSiE

We start from the locus built in the fine-mapping practical, loading its frozen genotype matrix ISG_2026_genotypes.rds. We use a saved matrix because MASS::mvrnorm performs differently across platforms even with the same seed, so regenerating it live would not be reproducible. The commented chunk below shows how the file was generated once.

p <- 100   # variants in the locus
N <- 1000  # individuals

X <- readRDS("ISG_2026_genotypes.rds")  # frozen 1000 x 100 genotype matrix
in_sample_LD <- cov(scale(X))           # in-sample LD used by SuSiE
# How ISG_2026_genotypes.rds was generated:
# library(MASS)
# LD <- matrix(0.3, nrow = p, ncol = p)
# diag(LD) <- 1.0
# set.seed(4)
# X <- mvrnorm(n = N, mu = rep(0, p), Sigma = LD)
# saveRDS(X, "ISG_2026_genotypes.rds")
# Trait 1: a GWAS phenotype with L = 3 causal variants
L <- 3
h2g <- 0.1

set.seed(1)
causal_config <- rep(0, p)
causal_ind1 <- sample(1:p, L, replace = FALSE)   # the true causal variants
causal_config[causal_ind1] <- 1

per_snp_h2g <- h2g / L
effect_sizes <- rnorm(L, mean = 0, sd = sqrt(per_snp_h2g))
beta1 <- causal_config
beta1[causal_ind1] <- effect_sizes

genetic_effect <- X %*% beta1
var_g <- var(genetic_effect)
sigma_squared <- ifelse(1 - var_g > 0.05, 1 - var_g, 0.1)  # environmental noise
y1 <- as.numeric(genetic_effect + rnorm(N, mean = 0, sd = sqrt(sigma_squared)))

cat(sprintf("Trait 1 causal variants: %s\n", paste(causal_ind1, collapse = ", ")))
## Trait 1 causal variants: 68, 39, 1

A small helper to run a single-variant GWAS, the same marginal regression as the fine-mapping practical.

run_gwas <- function(X, y) {
  p <- ncol(X)
  beta <- se <- numeric(p)
  for (i in 1:p) {
    fit <- summary(lm(y ~ X[, i] - 1))$coefficients
    beta[i] <- fit[1, "Estimate"]
    se[i]   <- fit[1, "Std. Error"]
  }
  data.frame(beta = beta, se = se, z = beta / se,
             pval = pchisq((beta / se)^2, df = 1, lower.tail = FALSE))
}

gwas1 <- run_gwas(X, y1)

Fine-map trait 1 with SuSiE, exactly as in the fine-mapping practical.

fm1 <- susie_rss(bhat = gwas1$beta, shat = gwas1$se, n = N,
                 R = in_sample_LD, var_y = var(y1),
                 L = L + 1, estimate_residual_variance = TRUE)

fm1$sets$cs  # credible sets for trait 1
## $L1
## [1] 39

Trait 1 has three causal variants (68, 39, and 1), but only variant 39 carries a strong enough signal to form a credible set. SuSiE localizes the signals it can resolve confidently.


Simulate a second trait: an eQTL

Now we create the QTL phenotype to colocalize against the GWAS. We reuse the same genotypes, so both traits share the same LD, as a GWAS and an eQTL measured in the same region would. We give the eQTL two causal variants:

  1. one shared with the GWAS (variant 39, the strong trait-1 signal), and
  2. one private to the eQTL.

This mirrors the realistic situation that motivates coloc.susie. A locus carries several independent signals, only some of which are shared across traits.

set.seed(2026)

shared_snp  <- 39                 # also causal for the GWAS (trait 1)
private_snp <- 20                 # causal for the eQTL only
causal_ind2 <- c(shared_snp, private_snp)

L2  <- length(causal_ind2)
h2g2 <- 0.15                      # eQTLs often explain more variance than GWAS loci
per_snp_h2g2 <- h2g2 / L2

beta2 <- rep(0, p)
eff2 <- rnorm(L2, mean = 0, sd = sqrt(per_snp_h2g2))
beta2[causal_ind2] <- eff2

genetic_effect2 <- X %*% beta2
var_g2 <- var(genetic_effect2)
sigma_squared2 <- ifelse(1 - var_g2 > 0.05, 1 - var_g2, 0.1)
y2 <- as.numeric(genetic_effect2 + rnorm(N, mean = 0, sd = sqrt(sigma_squared2)))

cat(sprintf("Trait 2 (eQTL) causal variants: %s\n", paste(causal_ind2, collapse = ", ")))
## Trait 2 (eQTL) causal variants: 39, 20

Run the eQTL GWAS and fine-map it with SuSiE, the same workflow as trait 1.

gwas2 <- run_gwas(X, y2)

fm2 <- susie_rss(bhat = gwas2$beta, shat = gwas2$se, n = N,
                 R = in_sample_LD, var_y = var(y2),
                 L = L2 + 1, estimate_residual_variance = TRUE)

fm2$sets$cs  # credible sets for trait 2
## $L1
## [1] 20
## 
## $L2
## [1] 39

Visualize the two signals

Before testing colocalization, always look at the locus. First the marginal association, plotted as \(-\log_{10}(p)\) against variant position. True causal variants are red, and the lead (most significant) variant of each trait is labeled.

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

plot_assoc <- function(gwas, causal, trait) {
  logp <- -log10(gwas$pval)
  lead <- which.max(logp)
  plot(1:p, logp, pch = 16, col = "grey60",
       xlab = "Variant", ylab = expression(-log[10](p)),
       main = sprintf("%s marginal association", trait))
  points(causal, logp[causal], pch = 16, col = "red")
  text(lead, logp[lead], labels = paste0("lead: variant ", lead), pos = 2, cex = 0.9)
}

plot_assoc(gwas1, causal_ind1, "Trait 1 (GWAS)")
plot_assoc(gwas2, causal_ind2, "Trait 2 (eQTL)")

The GWAS lead is variant 39, whereas the eQTL lead is variant 20. The shared variant 39 is present in the eQTL but it is not the eQTL’s strongest signal. Keep this in mind when we run coloc.abf, which only ever compares the single lead per trait.

Next, the SuSiE fine-mapping output. susieR ships its own plotting function, susie_plot. With y = "PIP" it draws the posterior inclusion probabilities and colors variants by credible-set membership. Passing the true effect vector to b overlays the simulated causal variants, which is handy for a teaching example. (susieR also provides susie_plot(model, y = "z") for z-scores, plus susie_plot_iteration and susie_plot_changepoint for diagnostics.)

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

b1_true <- numeric(p); b1_true[causal_ind1] <- 1
susie_plot(fm1, y = "PIP", b = b1_true, main = "Trait 1 (GWAS) SuSiE")

b2_true <- numeric(p); b2_true[causal_ind2] <- 1
susie_plot(fm2, y = "PIP", b = b2_true, main = "Trait 2 (eQTL) SuSiE")

Each credible set is drawn in its own color, and the true causal variants are circled. Note that trait 1’s two weaker causal variants (68 and 1) are not captured in any credible set. SuSiE reports only the signals it can localize confidently.

Q1: Which variant falls in a credible set for both traits, and which is private to the eQTL?

Answer Variant 39 is in a credible set for both the GWAS and the eQTL, so it is the shared signal. Variant 20 is in a credible set for the eQTL only, so it is private to the eQTL. Importantly, variant 20 is the eQTL’s strongest signal, so the two traits’ lead variants differ. A single-causal-variant method like coloc.abf will therefore lean toward H3 (distinct variants), as we confirm below.

coloc.abf: the single-causal-variant test

coloc.abf needs a small list per trait. For a quantitative trait we supply the marginal effect size beta, its variance varbeta = se^2, the variant names snp, the sample size N, and sdY, the standard deviation of the phenotype. We attach simple variant names so both traits refer to the same SNPs.

snp_names <- paste0("SNP", 1:p)

D1 <- list(beta = gwas1$beta, varbeta = gwas1$se^2, snp = snp_names,
           type = "quant", N = N, sdY = sd(y1))
D2 <- list(beta = gwas2$beta, varbeta = gwas2$se^2, snp = snp_names,
           type = "quant", N = N, sdY = sd(y2))

# coloc checks that each dataset is well formed
check_dataset(D1)
## NULL
check_dataset(D2)
## NULL
res.abf <- coloc.abf(dataset1 = D1, dataset2 = D2)
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  5.05e-15  2.17e-11  2.32e-04  1.00e+00  6.20e-06 
## [1] "PP abf for shared variant: 0.00062%"
round(res.abf$summary, 4)
##     nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  100.0000    0.0000    0.0000    0.0002    0.9998    0.0000

Q2: Which hypothesis does coloc.abf favor, and what is it concluding about the locus?

Answer PP.H3 is close to 1, so coloc.abf concludes that both traits are associated but through distinct causal variants. Under its one-causal-variant assumption it compares only the single strongest signal in each trait. The GWAS lead is variant 39, whereas the eQTL lead is the private variant 20, so the single signals look different and the verdict is H3.

Notice the trap. The locus genuinely does share variant 39, but coloc.abf cannot see it because the eQTL’s strongest signal sits elsewhere. The single-causal-variant assumption has thrown away the shared signal. This is exactly the situation coloc.susie was built for.


coloc.susie: allowing multiple causal variants

coloc.susie takes the SuSiE objects directly. It needs the variant names on the fitted object so it can match SNPs across traits, so we attach them. (runsusie() does this for you automatically; here we add them to the objects produced above.)

name_susie <- function(S, snps) {
  colnames(S$alpha)        <- snps
  colnames(S$lbf_variable) <- snps
  names(S$pip)             <- snps
  S
}

fm1 <- name_susie(fm1, snp_names)
fm2 <- name_susie(fm2, snp_names)
res.susie <- coloc.susie(fm1, fm2)
# coloc.susie returns a data.table, which does not auto-print inside knitr; coerce to print it
as.data.frame(res.susie$summary)

coloc.susie returns one row per pair of credible sets, with hit1 and hit2 naming the lead variant of each set. Read the table by looking for rows with high PP.H4.

Q3: Which pair of signals colocalizes, and what is its PP.H4?

Answer The pair hit1 = SNP39, hit2 = SNP39 has PP.H4 close to 1. The shared variant 39 is recovered as a colocalizing signal. Every pairing that involves the eQTL’s private variant 20 returns high PP.H3 instead, correctly flagging those as distinct.

Q4: coloc.abf said H3 and coloc.susie found a shared signal. Are they contradicting each other?

Answer No. They answer different questions. coloc.abf asks whether the single strongest signal is shared, and at this locus it is not (39 versus 20). coloc.susie decomposes the locus into independent signals and asks the question per signal, so it can find that one of several signals, variant 39, is shared while another is private. When a locus may have more than one causal variant, prefer coloc.susie.

Interpretation and practical guidance

  • Always look at the locus first. Marginal Manhattan plots and SuSiE PIPs tell you how many signals are present before you trust any single number.
  • Let LD availability pick the method, not your guess about signal count. Prefer coloc.susie whenever you have a reliable in-sample LD matrix, because you usually cannot know in advance that a locus carries a single causal variant. Fall back to coloc.abf when no trustworthy LD is available, accepting its single-causal-variant assumption as the price of needing only summary statistics.
  • LD must match the summary statistics. SuSiE and coloc.susie are sensitive to LD misspecification. Use an in-sample LD matrix, or a reference panel that matches the GWAS ancestry, exactly as in the fine-mapping practical.
  • Priors matter. coloc.abf uses prior probabilities p1, p2, and p12. The default p12 = 1e-5 is reasonable but worth a sensitivity check, for example with sensitivity(res.abf, "H4 > 0.5").
  • PP.H4 is necessary, not sufficient. A high PP.H4 says the data are consistent with one shared variant. It does not by itself prove causality, and it can be inflated by shared LD structure or poorly matched reference panels.

Q5: You fine-map a GWAS locus and an eQTL and each resolves a single credible set, both centered on the same variant. Which method would you reach for, and what result do you expect?

Answer You already have the in-sample LD that fine-mapping required, so reach for coloc.susie. It reduces to the same shared-variant conclusion here, with high PP.H4, but it stays correct if a weak second signal is present that the single credible set did not resolve. Reserve coloc.abf for the case where you have only marginal summary statistics and no reliable LD matrix. Its single-causal-variant assumption is a real limitation, not a safe default, because correlated or anti-correlated causal effects at the same locus are common (Kanai et al. 2021; Zhang et al. 2023).

Resources

  • Giambartolomei et al. 2014, the original coloc.abf method. PLoS Genet
  • Wallace 2021, coloc.susie for multiple causal variants. PLoS Genet
  • Kanai et al. 2021, fine-mapping across diverse populations, with examples of negatively correlated causal variants at a locus (e.g. FGFR4, CETP). medRxiv
  • Zhang et al. 2023, pervasive correlations between causal effects of proximal SNPs, from linkage masking under stabilizing selection. medRxiv
  • coloc package vignettes: chr1glh24.github.io/coloc

Appendix: using runsusie instead

When you fine-map specifically with colocalization downstream, coloc::runsusie() is a cleaner entry point than calling susie_rss directly. It wraps susie_rss, carries the LD and SNP names through automatically, and returns an object ready for coloc.susie. It expects a coloc-style dataset that also includes the LD matrix.

# runsusie checks that the LD matrix is labelled with the same SNP names as the dataset
rownames(in_sample_LD) <- snp_names
colnames(in_sample_LD) <- snp_names

D1_ld <- c(D1, list(LD = in_sample_LD, position = 1:p))
D2_ld <- c(D2, list(LD = in_sample_LD, position = 1:p))

s1 <- runsusie(D1_ld)
s2 <- runsusie(D2_ld)

coloc.susie(s1, s2)$summary