Expression quantitative trait loci (eQTL) analysis identifies genetic variants associated with gene expression levels. Proper quality control (QC) is essential to ensure reliable results. Below, we outline standard QC steps for genotype data, RNA-seq count data, and covariates, along with checks for proper data integration.
Genotype data typically comes in formats such as VCF, PLINK (.bed/.bim/.fam), or similar.
The RNA-seq count file used for eQTL analysis is a gene expression count matrix (TSV/CSV format). The matrix represents raw or normalized gene expression counts for multiple samples.
Structure of the File: 1. Rows: Genes (or transcripts, depending on the quantification method). 2. Columns: Individual samples (biological replicates, conditions, etc.). 3. First column: Unique gene/transcript identifiers (e.g., Ensembl IDs or gene symbols). 4. Remaining columns: Integer values representing raw or normalized counts per sample.
You are going to be performing an eQTL analysis today. You will be using data from the GEUVADIS project. The GEUVADIS (Genetic European Variation in Health and Disease) Project is an international initiative aimed at characterizing the impact of genetic variation on gene expression. By integrating whole-genome sequencing with RNA sequencing data from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, GEUVADIS has provided a rich resource for studying expression quantitative trait loci (eQTLs). This dataset has been instrumental in understanding how genetic variants influence gene regulation across populations and has contributed to advances in functional genomics, disease mapping, and precision medicine. The project’s findings serve as a foundation for linking genetic variation to molecular phenotypes, enhancing our ability to interpret genome-wide association studies (GWAS) and their implications for human health.
You have been provided with 3 files for this practical:
geuvadis_chr22_genotypes.txt: QC’d genotype data from chr22 (MAF > 5%, SNVs only, HWE p-value < 1e-6 removed).
geuvadis_chr22_rnaseq.txt: PEER-residualized RPKM on chr22 genes (some non-coding genes removed).
geuvadis_covariates.txt: Sample covariates (including sex and genotype PCs).
Load required R packages for the practical and set your working directory.
# Load necessary libraries
library(data.table) # data I/O
library(dplyr) # data frame manipulation
library(stringr) # string manipulation
library(MatrixEQTL) # eQTL calculation
library(coloc) # Colocalization analysis
library(locuszoomr) # Locuszoom plots
library(EnsDb.Hsapiens.v75) # Gene mapping for locuszoom plots
# Set working directory (update from the provided directory below)
setwd("E:/isgw/")
You will now load the genotype, RNA-Seq, and covariate files and answer some basic questions about their contents.
Genotyping was performed Variants in the 1000 Genomes Project were identified using low-coverage whole-genome sequencing, deep exome sequencing, and dense SNP-array genotyping
The genotype data was filtered using PLINK2.
Original genotypes were downloaded from the 1000 Genomes Project.
The dataset has been restricted to subjects from the YRI (Yoruba in Ibadan, Nigeria) population.
geno = fread("geuvadis_chr22_genotypes.txt", data.table = F)
head(geno[,1:10])
Q1: How many samples are there?
nrow(geno)
Q2: How many variants are included?
ncol(geno) - 6
The RNA-Seq data was obtained from the GEUVADIS project.
It contains normalized gene expression counts.
rna = fread("geuvadis_chr22_rnaseq.txt", data.table = F)
head(rna)[,1:10]
Q3: How many subjects are there?
ncol(rna) - 4
Q4: How many genes are there?
nrow(rna)
Q5: Confirm that all genes are on chromosome 22.
table(rna$Chr)
Covariate information was downloaded from the 1000 Genomes Project.
Genetic principal components were previously generated for the 2024 IGES tutorial on PRS calculation.
covar = fread("geuvadis_covariates.txt", data.table = F)
head(covar)
Q6: How many subjects are included?
nrow(covar)
Q7: How many columns are included?
ncol(covar)
Q8: Are there any missing data in the covariate file?
Q9: What are some of the options you could do if you had missing values in the matrix?
For our purposes, we will subset to samples present across all 3 files. This will remove missing data from the covariates file.
Q10: Do all files have the expected format?
Q11: Do all files have the same set of subjects?
MatrixEQTL analysis requires 3 files: genotypes, normalized expression, and covariates. Now you will focus on preparing the genotypes, RNA-Seq, and covariate data for use by MatrixEQTL.
To prepare genotype data for MatrixEQTL, we need to prepare 2 files:
# Extract genotype values
geno_vals = geno %>% dplyr::select(IID, starts_with("22"))
rownames(geno_vals) = geno_vals$IID
geno_vals$IID = NULL
geno_vals = geno_vals[rownames(geno_vals) %in% colnames(rna), ]
geno_vals = as.data.frame(t(geno_vals))
# Filter rows where each homozygous genotype appears at least 4 times
geno_vals = geno_vals[rowSums(geno_vals == 0) >= 3 & rowSums(geno_vals == 2) >= 3, ]
# Write out formatted genotypes
fwrite(geno_vals, "geno_vals.txt", row.names = T)
# genotype locus information
geno_locs = str_split(rownames(geno_vals), ":", simplify = T)
geno_locs = as.data.frame(cbind(rownames(geno_vals), geno_locs))
colnames(geno_locs) = c("snp", "chr", "pos")
geno_locs$pos = str_split(geno_locs$pos, "\\_", simplify = T)[,1]
# Write out formatted genotype positions
fwrite(geno_locs, "geno_locs.txt")
To prepare RNA-Seq data for MatrixEQTL, we need to prepare 2 files:
rna_vals = rna[, c(2, 5:ncol(rna))]
rownames(rna_vals) = rna_vals$Gene_Symbol
rna_vals$Gene_Symbol = NULL
rna_vals = rna_vals[, colnames(rna_vals) %in% colnames(geno_vals)] #445
# Write out formatted covariates
fwrite(rna_vals, "rna_vals.txt", row.names = T)
# RNA locus information
rna_locs = rna[, 2:4]
rna_locs$Coord = rna_locs$Coord - 1e6
rna_locs$Coord2 = rna_locs$Coord + 2e6
names(rna_locs) = c("geneid", "chr", "s1", "s2")
# Write out formatted covariates
fwrite(rna_locs, "rna_locs.txt")
We will use sex and 5 genetic principal components as covariates in modeling.
rownames(covar) = covar$`Sample name`
covar = covar %>% dplyr::select(Sex, PC1, PC2, PC3, PC4, PC5)
covar = covar[rownames(covar) %in% colnames(rna_vals), , drop = F]
# Convert sex to a numeric (binary) variable
covar$Sex[covar$Sex == "male"] = 0
covar$Sex[covar$Sex == "female"] = 1
# Order covariates by sample name
covar = covar[order(rownames(covar)), ]
# Confirm sample order matches between files
all(rownames(covar) == colnames(rna_vals))
# Write out formatted covariates
fwrite(as.data.frame(t(covar)), "covariates.txt")
You are going to perform both cis-eQTL and trans-eQTL analyses simultaneously. A cis-eQTL analysis focuses on testing SNPs physically near (or in) the gene being tested. The cis definition must be specified. A common cis-definition is: +/-Mb from the TSS, such that for a given gene, only SNPs inside the specified window will be tested for association with gene expression. Because each gene has its own location in the genome, the set of SNPs being tested varies by gene. A cis-eQTL analysis focuses on regions surrounding the gene that are most likely to harbor regulatory elements that affect the gene. Where the cis-eQTL analysis focuses on SNPs physically close to a given gene, a trans-eQTL analysis will test for associations with all SNPs that are not close to the gene (ie on the same chromosome, but far away, or on different chromosomes. A trans-eQTL analysis involves conducting many statistical tests (MANY! SNPs per gene x many genes), thus multiple testing correction is imperative to control for false positive associations.
You will use the software package MatrixEQTL, designed for fast eQTL analysis on large datasets. MatrixEQTL can test for association between genotype and gene expression using linear regression with either additive or ANOVA genotype effects. The models can include covariates to account for factors as population stratification, sex, and clinical/experimental variables. For more details see Shabalin (2012) [LINK].
To perform both eQTL analyses, you will specify the following model:
PEER-normalized gene expression ~ Sex + PC1 + PC2 + PC3 + PC4 + PC5
snps = SlicedData$new()
snps$fileDelimiter = ","
snps$fileOmitCharacters = "NA"
snps$fileSkipRows = 1
snps$fileSkipColumns = 1
snps$fileSliceSize = 2000
snps$LoadFile("geno_vals.txt")
# Read in genotype positions
snpspos = read.table("geno_locs.txt", sep = ",", header = T)
gene = SlicedData$new()
gene$fileDelimiter = ","
gene$fileOmitCharacters = "NA"
gene$fileSkipRows = 1
gene$fileSkipColumns = 1
gene$fileSliceSize = 2000
gene$LoadFile("rna_vals.txt")
# Read in gene positions
genepos = read.table("rna_locs.txt", sep = ",", header = T)
# Read in covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = ","
cvrt$fileOmitCharacters = "NA"
cvrt$fileSkipRows = 1
cvrt$fileSkipColumns = 0
cvrt$LoadFile("covariates.txt")
# Run MatrixEQTL
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = "trans_eqtls.txt",
pvOutputThreshold = 1e-3, # p-value output threshold for trans-eqtls
useModel = modelLINEAR,
errorCovariance = numeric(),
verbose = TRUE,
output_file_name.cis = "cis_eqtls.txt",
pvOutputThreshold.cis = 1e-3, # p-value output threshold for cis-eqtls
snpspos = snpspos,
genepos = genepos,
cisDist = 1e6, # distance threshold for cis associations
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)
You will now do some preliminary evaluation of the eQTLs you have generated.
# Check the structure of results
str(me)
You will now plot a quantile-quantile (QQ) plot, which visually compares the distribution of observed p-values from your analysis to the expected uniform distribution under the null hypothesis. This plot helps assess whether there is an enrichment of small p-values, indicating potential true associations, or if the distribution follows the null expectation. A well-calibrated analysis should show most points aligning along the diagonal reference line, while deviations (particularly at the lower end of p-values) suggest significant signals. Systematic inflation or deflation of p-values can indicate confounding effects, population structure, or technical biases, which may require further investigation and correction. MatrixEQTL provides functionality that automatically returns a QQ plot stratified by cis and trans eQTLs.
# Generate QQ plots
plot(me)
Q12: Do cis and trans eQTLs seem well-calibrated?
Q13: Do cis or trans eQTLs demonstrate stronger hits?
The title of the QQ plot displays how many tests were performed for cis- and trans-eqtls, respectively.
Q14: How many cis- and trans- associations were tested?
# Create objects for cis and trans eqtls
cis = me$cis$eqtls
trans = me$trans$eqtls
Q15: What is the smallest nominal p-value you’ve found?
Q16: What is the smallest FDR-adjusted p-value you’ve found?
cis[1, ]
A cis-eQTL analysis involves conducting many statistical tests (many SNPs per gene x many genes), thus multiple testing correction is imperative to control for false positive associations. MatrixEQTL uses the qvalue method to perform multiple testing correction. EXPAND!!!!!!!!!!
Q17: How many significant (FDR < 0.05) cis-eQTLs (SNP-gene pairs) have you found?
nrow(cis)
Q18: How many cis eGenes have you found?
length(table(cis$gene))
Q19: How many trans-eQTLs (SNP-gene pairs) have you found?
nrow(trans)
Q20: How many trans eGenes have you found?
length(table(trans$gene))
head(cis)
head(trans)
You will now calculate distance between SNPs and eGenes. Once you’ve done so, plot histograms of distances for both cis and trans eQTLs.
# define positions
cis$chrom = str_split(cis$snps, ":", simplify = T)[,1]
cis$pos = str_split(cis$snps, ":", simplify = T)[,2]
cis$pos = str_split(cis$pos, "_", simplify = T)[,1]
trans$chrom = str_split(trans$snps, ":", simplify = T)[,1]
trans$pos = str_split(trans$snps, ":", simplify = T)[,2]
trans$pos = str_split(trans$pos, "_", simplify = T)[,1]
cis$distance = as.numeric(cis$pos) - as.numeric(genepos$s1[match(cis$gene, genepos$geneid)] + 1e6)
trans$distance = as.numeric(trans$pos) - as.numeric(genepos$s1[match(trans$gene, genepos$geneid)] + 1e6)
# Plot histograms of distances for cis and trans eQTLs
hist(cis$distance, main = "Cis eQTL Distances", xlab = "Distance from TSS (bp)", col = "red", breaks = 50)
Q21: Do cis-eQTLs appear enriched for a certain
distance?
Now plot trans-eQTL distances.
hist(trans$distance, main = "Trans eQTL Distances", xlab = "Distance from TSS (bp)", col = "blue", breaks = 50)
Next you are going to visualize some of the cis-eQTLs. The idea is to select a specific SNP-gene pair and plot gene expression by genotype.
# Extract top eGenes for cis and trans
top_cis_gene = cis$gene[1]
top_trans_gene = trans$gene[1]
geno_sub_cis = geno_vals[rownames(geno_vals) %in% cis$snps[1], , drop = FALSE]
rna_sub_cis = rna_vals[rownames(rna_vals) %in% top_cis_gene, , drop = FALSE]
df_cis = data.frame(SNP = as.factor(t(geno_sub_cis)), Gene = as.numeric(t(rna_sub_cis)))
boxplot(Gene ~ SNP, data = df_cis, main = paste("Expression of", top_cis_gene, "by Genotype"))
geno_sub_trans = geno_vals[rownames(geno_vals) %in% trans$snps[1], , drop = FALSE]
rna_sub_trans = rna_vals[rownames(rna_vals) %in% top_trans_gene, , drop = FALSE]
df_trans = data.frame(SNP = as.factor(t(geno_sub_trans)), Gene = as.numeric(t(rna_sub_trans)))
boxplot(Gene ~ SNP, data = df_trans, main = paste("Expression of", top_trans_gene, "by Genotype"))
Q22: Which allele is associated with higher gene expression of TTC38?
Q23: Is the eQTL effect size larger for the TTC38 cis-eQTL or for the IGLL3P trans-eQTL? Hint: the eQTL effect size can be quantified as the slope of the regression, or the log allelic fold change (aFC), which represents the fold difference in gene expression between haplotypes carrying the reference and alternative alleles at a given genetic variant.
Another way to visualize eQTL results is to plot summary statistics along the genome. In GWAS, we typically use a Manhattan plot to display associations across the entire genome. However, for cis-eQTL analysis, we are often interested in specific regions surrounding a gene, as each gene has its own cis-window (~20K genes in total). To explore associations at the gene level, LocusZoom is a widely used tool for visualizing genetic associations in their genomic context.
You will generate LocusZoom plots for both the top cis-eQTL and top trans-eQTL associations.
To add linkage disequilibrium (LD) information to the plots, we use results generated from LDlink, an NIH resource that provides population-specific LD estimates, proxy variants, haplotypes, and related annotations. LDlink is typically accessed through an API and requires a personal API token. Because API access can create practical issues during a group workshop, I generated the LDlink results ahead of time myself.
For this practical, you will not query LDlink directly. Instead, you
will load precomputed locuszoomr objects that already
contain the LDlink-derived LD information. These objects have been saved
as .rds files and can be read into R using
readRDS().
Now load the precomputed cis-eQTL locus object. This object already contains the association results, gene annotations, recombination information, and LD information retrieved from LDlink.
# Plot top cis eGene
loc = readRDS("loc_cis_TTC38_YRI.rds")
summary(loc)
locus_plot(loc)
Q24: Where, relative to TTC38, are the strongest associations?
Next, load the precomputed trans-eQTL locus object. As above, the LDlink results have already been added, so this step only reads the saved object and produces the plot.
# Plot top trans eGene
loc = readRDS("loc_trans_IGLL3P_YRI.rds")
summary(loc)
locus_plot(loc, filter_gene_name = "IGLL3P")
Q25: How far is the SNP from IGLL3P?
Now that you have identified eQTLs, the next step is to explore the SNPs and eGenes using publicly available databases. These resources provide insights into the functional context of genetic variants and their role in gene regulation and disease.
Use the UCSC genome browser to visualize the genomic location of your top cis-eQTL.
Q26: What rsid is associated with our variant of interest?
This rsID can now be used for further exploration in Open Targets Genetics.
Use Open Targets Genetics to investigate associations between your variant and gene expression, disease traits, and drug targets.
Q27: Which population has the highest effect allele frequency?
Q28: What genes does this variant act as an eQTL for?
Use the GTEx Portal to analyze tissue-specific gene expression and regulatory variants.
Q29: What tissue is TTC38 most highly expressed in?
Q30: What cell type is TTC38 most frequently expressed in?
Q31: What is the top single tissue eqtl?
Q32: What tissue is this eQTL in?
Q33: What is the top single tissue sQTL?
Q34: What tissue is this sQTL in?
Below are some potentially useful resources.
If you have any questions, concerns, or quibbles, please reach out to me at iain.konigsberg@cuanschutz.edu
The code below is included for transparency and reproducibility. You
do not need to run this section during the practical. I
ran this code in advance to query LDlink, add YRI-specific LD
information to the locuszoomr objects, and save the
resulting objects as .rds files.
library(dplyr)
library(stringr)
library(locuszoomr)
library(EnsDb.Hsapiens.v75)
ldlink_token = "YOUR_TOKEN_HERE"
# Clean SNP IDs once
cis2 = cis %>%
mutate(
snps = str_split(snps, "_", simplify = TRUE)[, 1],
snps = paste0("chr", snps)
)
trans2 = trans %>%
mutate(
snps = str_split(snps, "_", simplify = TRUE)[, 1],
snps = paste0("chr", snps)
)
# -------------------------
# cis locus
# -------------------------
loc_cis = locus(
data = cis2[cis2$gene == cis2$gene[1], ],
gene = cis2$gene[1],
flank = 1e5,
ens_db = "EnsDb.Hsapiens.v75"
)
loc_cis = link_LD(
loc_cis,
pop = "YRI",
token = ldlink_token,
method = "proxy"
)
loc_cis = link_recomb(
loc_cis,
table = "hapMapRelease24YRIRecombMap"
)
saveRDS(loc_cis, file = "loc_cis_TTC38_YRI.rds")
# -------------------------
# trans locus
# -------------------------
loc_trans = locus(
data = trans2[trans2$gene == trans2$gene[1], ],
gene = trans2$gene[1],
flank = c(1e4, 8e6),
ens_db = "EnsDb.Hsapiens.v75"
)
loc_trans = link_LD(
loc_trans,
pop = "YRI",
token = ldlink_token,
method = "proxy"
)
loc_trans = link_recomb(
loc_trans,
table = "hapMapRelease24YRIRecombMap"
)
saveRDS(loc_trans, file = "loc_trans_IGLL3P_YRI.rds")