Background & Setup

Statistical fine-mapping is the process of identifying the likely causal genetic variants (having non-zero effect size on a trait) that drive a Genome-Wide Association Study (GWAS) signal. Because variants are often highly correlated due to Linkage Disequilibrium (LD), GWAS alone often cannot distinguish causal variants from their non-causal LD neighbors.

Workshop Objectives

In this practical, you will get hands-on experience running fine-mapping methods and interpreting the results. You will:

  1. Simulate a genetic locus with known causal variants and specific LD structures.
  2. Run a standard marginal GWAS and apply the Approximate Bayes Factor (ABF) method.
  3. Apply a modern Bayesian fine-mapping method (SuSiE) to tease apart multiple causal signals.
  4. Explore how Sample Size and LD impact fine-mapping resolution.
  5. Interpret core outputs like Posterior Inclusion Probabilities (PIPs) and Credible Sets.
  6. Review fine-mapping best practices for real-world scenarios.

Environment Setup

Stop! Before running the code below, ensure you have the required packages installed. If you haven’t already, run install.packages(c("MASS", "susieR", "ggplot2")) in your R console.

library(MASS)    
library(susieR)
library(ggplot2)

Simulating Data

To see how fine-mapping works, we need a dataset where we know the ground truth. We will simulate a genomic region with 100 variants for 1,000 individuals. We will set a pairwise LD (correlation) of 0.3, and randomly assign 3 variants to be truly causal.

# -----------------1. Simulate Genotypes-------------------
p <- 100 
LD <- matrix(0.3, nrow = p, ncol = p) 
diag(LD) <- 1.0 
N <- 1000 
set.seed(4)
X <- mvrnorm(n = N, mu = rep(0, p), Sigma = LD) 
in_sample_LD <- cov(scale(X))

# Plot LD
# Convert the 100x100 matrix into a long-format data frame for ggplot
ld_df <- expand.grid(SNP1 = 1:p, SNP2 = 1:p)
ld_df$LD <- as.vector(in_sample_LD)

# Plot using ggplot2
ggplot(ld_df, aes(x = SNP1, y = SNP2, fill = LD)) +
  geom_raster() +
  scale_fill_gradient(low = "white", high = "red", limits = c(-0.1, 1.05)) +
  theme_minimal() +
  coord_fixed() +
  labs(title = "In-Sample LD Matrix", 
       x = "Variants", 
       y = "Variants", 
       fill = "LD (r)")

# ------------------2. Simulate Phenotypes-----------------
L <- 3 
h2g <- 0.1 

set.seed(1) # setting seed for reproducibility 
causal_config <- rep(0, p) 
causal_ind <- sample(1:p, L, replace = FALSE) 
causal_config[causal_ind] <- 1 

# Assign effect sizes 
per_snp_h2g <- h2g / L 
effect_sizes <- rnorm(L, mean = 0, sd = sqrt(per_snp_h2g)) 
beta <- causal_config
beta[causal_ind] <- effect_sizes 

# Generate phenotypes deterministically 
genetic_effect <- X %*% beta 
var_g <- var(genetic_effect)

# Calculate environmental noise 
sigma_squared <- ifelse(1 - var_g > 0.05, 1 - var_g, 0.1) 

epsilon <- rnorm(N, mean = 0, sd = sqrt(sigma_squared)) 
y <- as.numeric(genetic_effect + epsilon) 

Q1: What are the indices of our ground truth causal variants? (Hint: Look at the causal_ind variable).

Answer

Because we set the seed to 1 before drawing causal indices, the chosen causal variants are at indices 68, 39, and 1.


Running GWAS and ABF

Now that we have our simulated genotype (X) and phenotype (y), let’s run a standard marginal GWAS, testing one variant at a time. Then, we will apply Jon Wakefield’s Approximate Bayes Factor (ABF) method to calculate a PIP for each variant.

# -----------------3. Run Marginal GWAS--------------------
GWAS_df <- data.frame(
  beta_marginal = numeric(p),
  se_marginal = numeric(p),
  z = numeric(p),
  chisq = numeric(p),
  p_val = numeric(p),
  minus_log10p = numeric(p)
)

for (i in 1:p) {
  X_i <- X[, i]
  mod <- lm(y ~ X_i - 1) 
  summary_mod <- summary(mod)
  
  coef <- summary_mod$coefficients[1, "Estimate"]
  sd <- summary_mod$coefficients[1, "Std. Error"]
  z <- coef / sd
  chisq <- z^2
  p_val <- pchisq(chisq, df = 1, lower.tail = FALSE) 
  
  GWAS_df[i, ] <- c(coef, sd, z, chisq, p_val, -log10(p_val))
}

# Plot Regional Marginal GWAS
GWAS_df$SNP <- 1:p
GWAS_df$Variant_Type <- factor(
  ifelse(GWAS_df$SNP %in% causal_ind, "True Causal SNPs", "All SNPs"),
  levels = c("All SNPs", "True Causal SNPs")
)

ggplot(GWAS_df, aes(x = SNP, y = minus_log10p, color = Variant_Type)) +
  geom_point(size = 2.5) +
  geom_hline(yintercept = -log10(5e-8), linetype = "dashed", color = "blue") +
  scale_color_manual(values = c("All SNPs" = rgb(0, 0, 0, 0.5), "True Causal SNPs" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom") + 
  labs(title = "Regional GWAS Results",
       x = "SNP Location",
       y = "-log10(p-value)",
       color = NULL)

# --------4. Run ABF (Approximate Bayes Factor)------------
run_abf <- function(beta, stderr, W = 0.04) {
  z <- beta / stderr
  V <- stderr^2
  r <- W / (W + V)
  lbf <- (log(1 - r) + (r * (z^2))) / 2
  lbf_max <- max(lbf)
  denom <- lbf_max + log(sum(exp(lbf - lbf_max)))
  prob <- exp(lbf - denom)
  return(list(lbf = lbf, prob = prob))
}

abf_results <- run_abf(GWAS_df$beta_marginal, GWAS_df$se_marginal)
GWAS_df$prob_abf <- abf_results$prob

# --------5. Plot ABF PIPs---------------------------------
# Add SNP index
GWAS_df$SNP <- 1:p

# Create Variant_Type as a factor. Putting "True Causal SNPs" second ensures they draw on top!
GWAS_df$Variant_Type <- factor(
  ifelse(GWAS_df$SNP %in% causal_ind, "True Causal SNPs", "All SNPs"),
  levels = c("All SNPs", "True Causal SNPs")
)

# Plot ABF PIPs using ggplot2 (No dataframe sorting needed!)
ggplot(GWAS_df, aes(x = SNP, y = prob_abf, color = Variant_Type)) +
  geom_point(size = 2.5) +
  scale_color_manual(values = c("All SNPs" = rgb(0, 0, 0, 0.5), "True Causal SNPs" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom") + 
  labs(title = "ABF Fine-mapping",
       x = "SNP Location",
       y = "ABF PIP",
       color = NULL)

Q2: Based on the plot, did the ABF method assign a highly confident PIP (e.g., > 0.90) to any of the true causal variants?

Answer

Yes, but only one of the three causal variants got assigned a high PIP.

Q3: Conceptually, why does ABF not assign high-PIP to the other two causal variants in this specific region?

Answer

The ABF algorithm assumes there is exactly one causal variant in the region, this means the sum of PIPs in this region must equal to 1. The model is not equipped to assign any other variant a high PIP in this region. Because we simulated a region with three causal variants, ABF missed two of them.

[Optional] Look at the GWAS_df dataframe for the causal variants by running GWAS_df[causal_ind, ]. Did all of their p-values stand out?

Answer

Yes, some of them stands out more than others.


Running SuSiE & Interpreting Results

To see how a multiple-causal-variant method works in this region, we will apply SuSiE (Sum of Single Effects) method. SuSiE is designed to tease apart multiple causal signals within the same locus.

# ---------------5. Run SuSiE------------------------------
allowed_causal <- L + 1 # Tell SuSiE to look for up to 4 signals

# Run SuSiE using summary statistics and in-sample LD
susie_results <- susie_rss(bhat = GWAS_df$beta_marginal,
                           shat = GWAS_df$se_marginal,
                           n = N,
                           R = in_sample_LD,
                           var_y = var(y),
                           L = allowed_causal,
                           estimate_residual_variance = TRUE)

GWAS_df$prob_susie <- susie_results$pip

# ---------------Plot SuSiE PIPs----------------------------
# Ensure the dataframe has the SNP index and causality labels
GWAS_df$SNP <- 1:p
GWAS_df$Variant_Type <- ifelse(GWAS_df$SNP %in% causal_ind, "True Causal SNPs", "All SNPs")

ggplot(GWAS_df, aes(x = SNP, y = prob_susie, color = Variant_Type)) +
  geom_point(size = 2.5) +
  scale_color_manual(values = c("All SNPs" = rgb(0, 0, 0, 0.5), "True Causal SNPs" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom") + 
  labs(title = "SuSiE Fine-mapping",
       x = "SNP Location",
       y = "SuSiE PIP",
       color = NULL)

Q4: Compare the SuSiE PIP plot to your ABF PIP plot. Did the marginal PIPs for our true causal variants improve?

Answer

Yes, by allowing for multiple causal variants in a region, SuSiE was able to assign higher PIPs to one more true causal variant.

[Optional] In the susie_rss run, we set L = allowed_causal (which equals 4). Try temporarily changing this to L = 1 in the code and re-running the SuSiE block. How do the SuSiE PIPs compare to the ABF PIPs now?

The Impact of Sample Size and LD

Fine-mapping resolution is heavily influenced by sample size and the correlation structure of the region (LD). We wrapped our simulation into a safe function to let you adjust these parameters interactively.

simulate_and_finemap <- function(N_sim, LD_baseline) {
  set.seed(1)
  # 1. Simulate LD and X
  LD_sim <- matrix(LD_baseline, nrow = p, ncol = p)
  diag(LD_sim) <- 1.0
  X_sim <- mvrnorm(n = N_sim, mu = rep(0, p), Sigma = LD_sim)
  in_sample_LD_sim <- cov(scale(X_sim))
  
  # 2. Simulate Phenotype 
  eff_sim <- rnorm(L, mean = 0, sd = sqrt(h2g/L)) 
  b_sim <- rep(0, p)
  b_sim[causal_ind] <- eff_sim
  
  gen_eff_sim <- X_sim %*% b_sim
  var_g_sim <- var(gen_eff_sim)
  sigma_sq_sim <- ifelse(1 - var_g_sim > 0.05, 1 - var_g_sim, 0.1)
  
  y_sim <- as.numeric(gen_eff_sim + rnorm(N_sim, 0, sqrt(sigma_sq_sim)))
  
  # 3. Get Sumstats
  bhat <- numeric(p); shat <- numeric(p)
  for (i in 1:p) {
    mod <- summary(lm(y_sim ~ X_sim[, i] - 1))
    bhat[i] <- mod$coefficients[1, 1]
    shat[i] <- mod$coefficients[1, 2]
  }
  
  # 4. Run SuSiE
  susie_sim <- susie_rss(bhat = bhat, shat = shat, n = N_sim, 
                         R = in_sample_LD_sim, var_y = var(y_sim), L = 4)
  
  return(susie_sim)
}

Q5: Run the simulation with a much larger sample size (\(N = 5000\)) while keeping the LD baseline at 0.3. What happens to the PIPs?

res_large_N <- simulate_and_finemap(N_sim = 5000, LD_baseline = 0.3)
print(res_large_N$pip[causal_ind])
Answer

With the higher sample size, all three causal variants are now assigned high PIPs.

Q6: Now run the simulation with our original sample size (\(N = 1000\)), but increase the baseline LD to 0.95 (simulating a region where almost all variants are highly correlated). What happens?

res_high_LD <- simulate_and_finemap(N_sim = 1000, LD_baseline = 0.95)
print(res_high_LD$pip[causal_ind])
Answer

Because the variants are highly correlated, the algorithm cannot easily distinguish the true causal variant from its LD neighbors. The probability is split across many variants, resulting in lower PIPs for the causal variants.

[Optional] Change the LD settings to 0 and see what happens to the causal PIPs.

[Optional] Try altering the simulate_and_finemap function directly. Lower the heritability from h2g = 0.1 to h2g = 0.01 (representing a trait with much weaker genetic effects), re-compile the function, and run it with the baseline \(N = 1000\). What happens to the causal PIPs?


Investigating Credible Sets

SuSiE outputs Credible Sets (CS), which groups highly correlated variants together when the algorithm is uncertain which specific one is causal. By default, SuSiE returns 95% Credible Sets.

# View the credible sets found by SuSiE
print(susie_results$sets$cs)

Q7: How many credible sets did SuSiE output? Do they capture the true causal variants?

Answer

One credible sets. They captured the one causal variants that were assigned very high-PIP.

[Optional] Check the PIPs for the other causal variant 68 and reason why it did not get assigned into a credible set
Answer

This has to do with the credible set “coverage” setting. The default setting is 0.95. The other causal variants PIPs are not high enough.

Let’s run another simulation that is more challenging for SuSiE, raising LD_baseline to 0.6

res_challenging <- simulate_and_finemap(N_sim = 1000, LD_baseline = 0.6)
print(res_challenging$pip[causal_ind])
print(res_challenging$sets$cs)

Q8: How many credible sets did SuSiE yield in this challenging scenario? Which true causal variant(s) is(are) in any credible set? What happens if we relax the credible set coverage threshold from 95% to 80%? Try running the code below.

# Extract 80% credible sets from the existing SuSiE model
susie_80_cs <- susie_get_cs(res_challenging, X = in_sample_LD, coverage = 0.80)
print(susie_80_cs$cs)
Answer

One credible set. The true causal variant with index 68 is in a credible set of three variants. After lowering the coverage threshold, we see the credible set size shrink to containing only one variant: variant 68, because now the credible sets only need to be 80% sure that each contains a causal variant.


Fine-Mapping Best Practices

When applying fine-mapping to real datasets, the pipeline depends on what kind of data you have access to. A mismatch between the GWAS cohort and the LD reference panel can introduce false positives.

Review your Fine-Mapping Best Practices Flowchart (link: https://www.colorado.edu/ibg/media/1720) to answer the following scenarios.

Q9: Scenario A: You only have single-cohort GWAS summary statistics for an African ancestry cohort, but the only available LD reference panel with sufficient sample size is European. What should your pipeline look like?

Answer

Single-Cohort GWAS (Sumstats Only)Select External Reference LDSevere Ancestry Mismatch. You should Restrict to L=1 or 2 when applying multiple-causal-variant methods or use ABF. Using ABF is a better choice because it does not require an LD matrix, it is not affected by reference LD mismatch.

Q10: Scenario B: You have full individual-level genotype and phenotype data for a mixed ancestry cohort (consists of one European cohort, one African cohort, and one Asian cohort). What is the recommended pipeline?

Answer

Individual-Level DataCohort Composition: Mixed Ancestries. You should Impute & QC uniformly, run GWAS & LD separately for the populations, and then Apply Joint fine-mapping (e.g., SuSiEx, Multi-SuSiE). Standard single-cohort fine-mapping methods (SuSiE/FINEMAP) assume a homogeneous LD structure and effect size, so mixed ancestries must be handled by joint models.

Q11: Scenario C: You have summary statistics from a massive Meta-Analysis GWAS of mixed ancestry cohorts. You have the cohort-specific GWAS summary stats, so you decide to perform single-cohort fine-mapping. But you do not have good reference LD panels for every cohort. What should you do next?

Answer

Depending on whether the single cohort has good reference LD or not, follow protocols in the middle column. Run ABF or SuSiE/FINEMAP. After running fine-mapping on separate cohorts, follow Meta-analysis (sumstats only)Cohort-specific GWAS availableRun FastMap