Background

Expression quantitative trait loci (eQTL) analysis identifies genetic variants associated with gene expression levels. Proper quality control (QC) is essential to ensure reliable results. Below, we outline standard QC steps for genotype data, RNA-seq count data, and covariates, along with checks for proper data integration.

Quality Control for Genotype Data

Genotype data typically comes in formats such as VCF, PLINK (.bed/.bim/.fam), or similar.

1.1 Sample-Level QC

  • Check sex consistency by comparing inferred genetic sex (from X and Y chromosome markers) with reported sex in metadata.
  • Remove individuals with high missingness (typically >5% missing genotypes).
  • Check for relatedness and remove duplicates to ensure no unexpected familial relationships or sample duplications.

1.2 Variant-Level QC

  • Remove SNPs with high missingness (e.g., variants missing in more than 5% of samples).
  • Apply Hardy-Weinberg Equilibrium (HWE) filtering to exclude variants that significantly deviate from expected genetic distributions (p > 1e-6).
  • Filter for minor allele frequency (MAF) to retain only common variants (e.g., MAF > 5%).

1.3 Population Structure Correction

  • Perform principal component analysis (PCA) on the genotype data to detect population stratification.
  • Retain key principal components (PCs) as covariates in the eQTL analysis to account for ancestry-related effects.

Quality Control for RNA-Seq Data

The RNA-seq count file used for eQTL analysis is a gene expression count matrix (TSV/CSV format). The matrix represents raw or normalized gene expression counts for multiple samples.

Structure of the File:
1. Rows: Genes (or transcripts, depending on the quantification method).
2. Columns: Individual samples (biological replicates, conditions, etc.).
3. First column: Unique gene/transcript identifiers (e.g., Ensembl IDs or gene symbols).
4. Remaining columns: Integer values representing raw or normalized counts per sample.

2.1 Sample-Level QC

  • Identify and remove outliers using PCA or hierarchical clustering.
  • Check sex consistency using expression of sex-specific genes (e.g., XIST for females, Y-linked genes for males).
  • Remove low-quality samples based on quality metrics such as RNA integrity number (RIN < 6) and sequencing depth (< 10M reads per sample).

2.2 Gene-Level QC

  • Remove lowly expressed genes, keeping only those expressed in a sufficient number of samples (e.g., > 1-10 CPM in > 20% of samples).
  • Transformation for Normality for eQTL Models:
    • Most eQTL analyses assume a linear relationship between genotype and expression.
    • Raw counts follow a negative binomial distribution, which is not suitable for regression.
    • Standard transformations applied to raw counts:
    • log2(CPM + 1) (log-transformed counts per million)
    • voom transformation (for modeling mean-variance trends, see limma-voom)
    • inverse normal transformation (ranks gene expression per gene across samples and applies normal quantile transformation (mean = 0, SD =1). Most common for eQTL mapping

Quality Control for Covariates

  • Address missing data by either imputing values or removing samples with incomplete covariate information.
  • Assess correlations between covariates, ensuring that highly correlated variables (e.g., PEER factors and genotype PCs) do not introduce collinearity (r < 0.8).
  • Include genotype principal components (typically >3, depending on cohort) and PEER factors (typically 10-50) to account for unknown batch effects and confounding.
  • Correlate PEER factors with known covariates to assess whether they capture relevant experimental or biological variations.

Ensuring Sample Matching Across Data Types

  • Ensure sample IDs are consistent across genotype, expression, and covariate files.
  • Remove unmatched samples to ensure a 1:1 correspondence across data types.

Overview of the GEUVADIS Dataset

You are going to be performing an eQTL analysis today. You will be using data from the GEUVADIS project. The GEUVADIS (Genetic European Variation in Health and Disease) Project is an international initiative aimed at characterizing the impact of genetic variation on gene expression. By integrating whole-genome sequencing with RNA sequencing data from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, GEUVADIS has provided a rich resource for studying expression quantitative trait loci (eQTLs). This dataset has been instrumental in understanding how genetic variants influence gene regulation across populations and has contributed to advances in functional genomics, disease mapping, and precision medicine. The project’s findings serve as a foundation for linking genetic variation to molecular phenotypes, enhancing our ability to interpret genome-wide association studies (GWAS) and their implications for human health.

You have been provided with 3 files for this practical:

  • geuvadis_chr22_genotypes.txt: QC’d genotype data from chr22 (MAF > 5%, SNVs only, HWE p-value < 1e-6 removed).

  • geuvadis_chr22_rnaseq.txt: PEER-residualized RPKM on chr22 genes (some non-coding genes removed).

  • geuvadis_covariates.txt: Sample covariates (including sex and genotype PCs).


Setup

Load required R packages for the practical and set your working directory.

# Load necessary libraries
library(data.table) # data I/O
library(dplyr) # data frame manipulation
library(stringr) # string manipulation
library(MatrixEQTL) # eQTL calculation
library(coloc) # Colocalization analysis
library(locuszoomr) # Locuszoom plots
library(EnsDb.Hsapiens.v75) # Gene mapping for locuszoom plots

# Set working directory
#setwd("~/practicals/4.3.GenomeAnnotation_WeiZhou/final")

Read and Prepare Data

You will now load the genotype, RNA-Seq, and covariate files and answer some basic questions about their contents.

Genotypes

  • Genotyping was performed through low-coverage whole-genome sequencing.

  • The genotype data was filtered using PLINK2.

  • Original genotypes were downloaded from the 1000 Genomes Project.

  • The dataset has been restricted to subjects from the YRI (Yoruba in Ibadan, Nigeria) population.

geno = fread("/usr/local/data/geuvadis_chr22_genotypes.txt", data.table = F)
head(geno[,1:10])

Q1: How many samples are there? A: 108

nrow(geno) 
## [1] 108

Q2: How many variants are included? A: 95,190

ncol(geno) - 6 
## [1] 95190

RNA-Seq

  • The RNA-Seq data was obtained from the GEUVADIS project.

  • It contains normalized gene expression counts.

rna = fread("/usr/local/data/geuvadis_chr22_rnaseq.txt", data.table = F)
head(rna)[,1:10]

Q3: How many subjects are there? A: 462

ncol(rna) - 4
## [1] 462

Q4: How many genes are there? A: 420

nrow(rna)
## [1] 420

Q5: Confirm that all genes are on chromosome 22.

table(rna$Chr) 
## 
##  22 
## 420

Covariates

covar = fread("/usr/local/data/geuvadis_covariates.txt", data.table = F)

head(covar)

Q6: How many subjects are included? A: 4,978

nrow(covar) 
## [1] 4978

Q7: How many columns are included? A: 14

ncol(covar)
## [1] 14

Q8: Are there any missing data in the covariate file? A: Yes

Q9: What are some of the options you could do if you had missing values in the matrix?

  • do nothing
  • impute missing values
  • Remove the sample or covariate

For our purposes, we will subset to samples present across all 3 files. This will remove missing data from the covariates file.

Q10: Do all files have the expected format?

  • Genotypes? A: Yes
  • RNA-seq? A: Yes
  • Covariates? A: Yes

Q11: Do all files have the same set of subjects? A: No


Formatting for MatrixEQTL

MatrixEQTL analysis requires 3 files: genotypes, normalized expression, and covariates. Now you will focus on preparing the genotypes, RNA-Seq, and covariate data for use by MatrixEQTL.

Genotypes

To prepare genotype data for MatrixEQTL, we need to prepare 2 files:

  • Genotype values
  • Genotype locations
# Extract genotype values
geno_vals = geno %>% dplyr::select(IID, starts_with("22")) 
rownames(geno_vals) = geno_vals$IID
geno_vals$IID = NULL
geno_vals = geno_vals[rownames(geno_vals) %in% colnames(rna), ] 
geno_vals = as.data.frame(t(geno_vals)) 

# Filter rows where each homozygous genotype appears at least 4 times
geno_vals = geno_vals[rowSums(geno_vals == 0) >= 3 & rowSums(geno_vals == 2) >= 3, ]

# Write out formatted genotypes
fwrite(geno_vals, "geno_vals.txt", row.names = T)

# genotype locus information
geno_locs = str_split(rownames(geno_vals), ":", simplify = T)
geno_locs = as.data.frame(cbind(rownames(geno_vals), geno_locs))
colnames(geno_locs) = c("snp", "chr", "pos")
geno_locs$pos = str_split(geno_locs$pos, "\\_", simplify = T)[,1]

# Write out formatted genotype positions
fwrite(geno_locs, "geno_locs.txt")

RNA-Seq

To prepare RNA-Seq data for MatrixEQTL, we need to prepare 2 files:

  • RNA-Seq values
  • Gene locations
rna_vals = rna[, c(2, 5:ncol(rna))]
rownames(rna_vals) = rna_vals$Gene_Symbol
rna_vals$Gene_Symbol = NULL
rna_vals = rna_vals[, colnames(rna_vals) %in% colnames(geno_vals)] #445

# Write out formatted covariates
fwrite(rna_vals, "rna_vals.txt", row.names = T)

# RNA locus information
rna_locs = rna[, 2:4]
rna_locs$Coord = rna_locs$Coord - 1e6
rna_locs$Coord2 = rna_locs$Coord + 2e6
names(rna_locs) = c("geneid", "chr", "s1", "s2")

# Write out formatted covariates
fwrite(rna_locs, "rna_locs.txt")

Covariates

We will use sex and 5 genetic principal components as covariates in modeling.

rownames(covar) = covar$`Sample name`
covar = covar %>% dplyr::select(Sex, PC1, PC2, PC3, PC4, PC5)
covar = covar[rownames(covar) %in% colnames(rna_vals), , drop = F]

# Convert sex to a numeric (binary) variable
covar$Sex[covar$Sex == "male"] = 0
covar$Sex[covar$Sex == "female"] = 1

# Order covariates by sample name
covar = covar[order(rownames(covar)), ]

# Confirm sample order matches between files
all(rownames(covar) == colnames(rna_vals))
## [1] TRUE
# Write out formatted covariates
fwrite(as.data.frame(t(covar)), "covariates.txt")

MatrixEQTL Analysis

You are going to perform both cis-eQTL and trans-eQTL analyses simultaneously. A cis-eQTL analysis focuses on testing SNPs physically near (or in) the gene being tested. The cis definition must be specified. A common cis-definition is: +/-Mb from the TSS, such that for a given gene, only SNPs inside the specified window will be tested for association with gene expression. Because each gene has its’ own location in the genome, the set of SNPs being tested varies by gene. A cis-eQTL analysis focuses on regions surrounding the gene that are most likely to harbor regulatory elements that affect the gene. Where the cis-eQTL analysis focuses on SNPs physically close to a given gene, a trans-eQTL analysis will test for associations with all SNPs that are not close to the gene (ie on the same chromosome, but far away, or on different chromosomes. A trans-eQTL analysis involves conducting many statistical tests (MANY! SNPs per gene x many genes), thus multiple testing correction is imperative to control for false positive associations.

You will use the software package MatrixEQTL, designed for fast eQTL analysis on large datasets. MatrixEQTL can test for association between genotype and gene expression using linear regression with either additive or ANOVA genotype effects. The models can include covariates to account for factors as population stratification, sex, and clinical/experimental variables variables. For more details see Shabalin (2012) [LINK].

To perform both eQTL analyses, you will specify the following model:

PEER-normalized gene expression ~ Sex + PC1 + PC2 + PC3 + PC4 + PC5

Load genotype data

snps = SlicedData$new()
snps$fileDelimiter = ","      
snps$fileOmitCharacters = "NA" 
snps$fileSkipRows = 1          
snps$fileSkipColumns = 1       
snps$fileSliceSize = 2000      
snps$LoadFile("geno_vals.txt")
## Rows read: 2,000
## Rows read: 4,000
## Rows read: 6,000
## Rows read: 8,000
## Rows read: 10,000
## Rows read: 12,000
## Rows read: 14,000
## Rows read: 16,000
## Rows read: 18,000
## Rows read: 20,000
## Rows read: 22,000
## Rows read: 24,000
## Rows read: 26,000
## Rows read: 28,000
## Rows read: 30,000
## Rows read: 32,000
## Rows read: 34,000
## Rows read: 36,000
## Rows read: 38,000
## Rows read: 40,000
## Rows read: 42,000
## Rows read: 44,000
## Rows read: 46,000
## Rows read: 48,000
## Rows read: 50,000
## Rows read: 52,000
## Rows read: 54,000
## Rows read: 56,000
## Rows read: 56684 done.
# Read in genotype positions
snpspos = read.table("geno_locs.txt", sep = ",", header = T)

Load gene expression data

gene = SlicedData$new()
gene$fileDelimiter = ","      
gene$fileOmitCharacters = "NA" 
gene$fileSkipRows = 1          
gene$fileSkipColumns = 1      
gene$fileSliceSize = 2000      
gene$LoadFile("rna_vals.txt")
## Rows read: 420 done.
# Read in gene positions
genepos = read.table("rna_locs.txt", sep = ",", header = T)

Load covariates

# Read in covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = ","      
cvrt$fileOmitCharacters = "NA"
cvrt$fileSkipRows = 1
cvrt$fileSkipColumns = 0
cvrt$LoadFile("covariates.txt")
## Rows read: 6 done.

Run the analysis

# Run MatrixEQTL
me = Matrix_eQTL_main(
  snps = snps,
  gene = gene,
  cvrt = cvrt,
  output_file_name = "trans_eqtls.txt",
  pvOutputThreshold = 1e-3, # p-value output threshold for trans-eqtls
  useModel = modelLINEAR,
  errorCovariance = numeric(),
  verbose = TRUE,
  output_file_name.cis = "cis_eqtls.txt",
  pvOutputThreshold.cis = 1e-3, # p-value output threshold for cis-eqtls
  snpspos = snpspos,
  genepos = genepos,
  cisDist = 1e6, # distance threshold for cis associations
  pvalue.hist = "qqplot",
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE) 
## Matching data files and location files
## 420 of 420 genes matched
## 56684 of 56684 SNPs matched
## Task finished in 0.011 seconds
## Reordering genes
## Task finished in 1.368 seconds
## Processing covariates
## Task finished in 0.002 seconds
## Processing gene expression data (imputation, residualization)
## Task finished in 0.002 seconds
## Creating output file(s)
## Task finished in 0.042 seconds
## Performing eQTL analysis
##  3.44% done, 40 cis-eQTLs, 1,054 trans-eQTLs
##  6.89% done, 142 cis-eQTLs, 1,819 trans-eQTLs
## 10.34% done, 243 cis-eQTLs, 2,635 trans-eQTLs
## 13.79% done, 523 cis-eQTLs, 3,793 trans-eQTLs
## 17.24% done, 583 cis-eQTLs, 4,309 trans-eQTLs
## 20.68% done, 786 cis-eQTLs, 5,395 trans-eQTLs
## 24.13% done, 1,104 cis-eQTLs, 6,173 trans-eQTLs
## 27.58% done, 1,213 cis-eQTLs, 6,877 trans-eQTLs
## 31.03% done, 1,392 cis-eQTLs, 7,785 trans-eQTLs
## 34.48% done, 1,467 cis-eQTLs, 8,576 trans-eQTLs
## 37.93% done, 1,600 cis-eQTLs, 9,278 trans-eQTLs
## 41.37% done, 1,690 cis-eQTLs, 9,953 trans-eQTLs
## 44.82% done, 1,764 cis-eQTLs, 10,663 trans-eQTLs
## 48.27% done, 1,785 cis-eQTLs, 11,436 trans-eQTLs
## 51.72% done, 1,908 cis-eQTLs, 12,824 trans-eQTLs
## 55.17% done, 2,008 cis-eQTLs, 13,475 trans-eQTLs
## 58.62% done, 2,175 cis-eQTLs, 14,164 trans-eQTLs
## 62.06% done, 2,308 cis-eQTLs, 15,107 trans-eQTLs
## 65.51% done, 2,536 cis-eQTLs, 15,670 trans-eQTLs
## 68.96% done, 2,949 cis-eQTLs, 16,335 trans-eQTLs
## 72.41% done, 3,064 cis-eQTLs, 17,031 trans-eQTLs
## 75.86% done, 3,159 cis-eQTLs, 17,839 trans-eQTLs
## 79.31% done, 3,235 cis-eQTLs, 18,704 trans-eQTLs
## 82.75% done, 3,466 cis-eQTLs, 19,876 trans-eQTLs
## 86.20% done, 3,517 cis-eQTLs, 20,541 trans-eQTLs
## 89.65% done, 3,571 cis-eQTLs, 21,413 trans-eQTLs
## 93.10% done, 3,683 cis-eQTLs, 22,201 trans-eQTLs
## 96.55% done, 3,816 cis-eQTLs, 22,917 trans-eQTLs
## 100.00% done, 3,944 cis-eQTLs, 23,162 trans-eQTLs
## Task finished in 1.495 seconds
## 

Downstream Results

You will now do some preliminary evaluation of the eQTLs you have generated.

# Check the structure of results
str(me)
## List of 5
##  $ time.in.sec: Named num 2.71
##   ..- attr(*, "names")= chr "elapsed"
##  $ param      :List of 11
##   ..$ output_file_name     : chr "trans_eqtls.txt"
##   ..$ pvOutputThreshold    : num 0.001
##   ..$ useModel             : int 117348
##   ..$ errorCovariance      : num[0 , 1] 
##   ..$ verbose              : logi TRUE
##   ..$ output_file_name.cis : chr "cis_eqtls.txt"
##   ..$ pvOutputThreshold.cis: num 0.001
##   ..$ cisDist              : num 1e+06
##   ..$ pvalue.hist          : chr "qqplot"
##   ..$ min.pv.by.genesnp    : logi FALSE
##   ..$ dfFull               : num 79
##  $ all        :List of 4
##   ..$ ntests     : num 23807280
##   ..$ neqtls     : num 27106
##   ..$ hist.bins  : num [1:6175] 0.00 2.24e-309 2.51e-309 2.82e-309 3.16e-309 ...
##   ..$ hist.counts: num [1:6174] 0 0 0 0 0 0 0 0 0 0 ...
##  $ trans      :List of 5
##   ..$ ntests     : num 21306927
##   ..$ neqtls     : num 23162
##   ..$ eqtls      :'data.frame':  23162 obs. of  6 variables:
##   .. ..$ snps     : chr [1:23162] "22:33006654_C" "22:33007374_G" "22:33007514_G" "22:33008121_T" ...
##   .. ..$ gene     : chr [1:23162] "IGLL3P" "IGLL3P" "IGLL3P" "IGLL3P" ...
##   .. ..$ statistic: num [1:23162] -6.53 -6.53 -6.53 -6.53 -6.4 ...
##   .. ..$ pvalue   : num [1:23162] 5.85e-09 5.85e-09 5.85e-09 5.85e-09 1.03e-08 ...
##   .. ..$ FDR      : num [1:23162] 0.0311 0.0311 0.0311 0.0311 0.0437 ...
##   .. ..$ beta     : num [1:23162] -0.33 -0.33 -0.33 -0.33 -0.319 ...
##   ..$ hist.bins  : num [1:6175] 0.00 2.24e-309 2.51e-309 2.82e-309 3.16e-309 ...
##   ..$ hist.counts: num [1:6174] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cis        :List of 5
##   ..$ ntests     : num 2500353
##   ..$ neqtls     : num 3944
##   ..$ eqtls      :'data.frame':  3944 obs. of  6 variables:
##   .. ..$ snps     : chr [1:3944] "22:46687220_G" "22:46688208_C" "22:46688321_T" "22:46689229_C" ...
##   .. ..$ gene     : chr [1:3944] "TTC38" "TTC38" "TTC38" "TTC38" ...
##   .. ..$ statistic: num [1:3944] -16.8 -16.8 -16.8 -16.8 -16.3 ...
##   .. ..$ pvalue   : num [1:3944] 7.43e-28 7.43e-28 7.43e-28 7.43e-28 5.38e-27 ...
##   .. ..$ FDR      : num [1:3944] 4.64e-22 4.64e-22 4.64e-22 4.64e-22 2.69e-21 ...
##   .. ..$ beta     : num [1:3944] -5.07 -5.07 -5.07 -5.07 -5.08 ...
##   ..$ hist.bins  : num [1:6175] 0.00 2.24e-309 2.51e-309 2.82e-309 3.16e-309 ...
##   ..$ hist.counts: num [1:6174] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "class")= chr [1:2] "list" "MatrixEQTL"

QQ Plot

You will now plot a quantile-quantile (QQ) plot, which visually compares the distribution of observed p-values from your analysis to the expected uniform distribution under the null hypothesis. This plot helps assess whether there is an enrichment of small p-values, indicating potential true associations, or if the distribution follows the null expectation. A well-calibrated analysis should show most points aligning along the diagonal reference line, while deviations (particularly at the lower end of p-values) suggest significant signals. Systematic inflation or deflation of p-values can indicate confounding effects, population structure, or technical biases, which may require further investigation and correction. MatrixEQTL provides functionality that automatically returns a QQ plot stratified by cis and trans eQTLs.

# Generate QQ plots
plot(me)

Q12: Do cis and trans eQTLs seem well-calibrated? A: Yes

Q13: Do cis or trans eQTLs demonstrate stronger hits? A: cis-eQTLs

The title of the QQ plot displays how many tests were performed for cis- and trans-eqtls, respectively.

Q14: How many cis- and trans- associations were tested? A: 2,500,353 cis, 21,306,927 trans


Basic Results

# Create objects for cis and trans eqtls
cis = me$cis$eqtls
trans = me$trans$eqtls

Q15: What is the smallest nominal p-value you’ve found? A: 7.43e-28

Q16: What is the smallest FDR-adjusted p-value you’ve found? A: 4.64e-22

cis[1, ]

Cis-eQTL analysis tests numerous SNP-gene pairs, increasing the risk of false positives. MatrixEQTL controls the false discovery rate (FDR) using the q-value method, which estimates the proportion of false positives among significant findings. Q-values are less conservative than the Bonferroni correction and balance statistical power and false positive control, making them ideal for large-scale eQTL studies. A threshold of q < 0.05 is typically used to identify significant eQTLs while minimizing false associations.

Q17: How many significant (FDR < 0.05) cis-eQTLs (SNP-gene pairs) have you found? A: 510

nrow(cis[cis$FDR <= 0.05, ])
## [1] 510

Q18: How many cis eGenes have you found? A: 40

length(table(cis[cis$FDR <= 0.05, ]$gene))
## [1] 40

Q19: How many trans-eQTLs (SNP-gene pairs) have you found? A: 5

nrow(trans[trans$FDR <= 0.05, ])
## [1] 5

Q20: How many trans eGenes have you found? A: 1

length(table(trans[trans$FDR <= 0.05, ]$gene))
## [1] 1
head(cis)
head(trans)

Distance Between eGenes and SNPs

You will now calculate distance between SNPs and eGenes. Once you’ve done so, plot histograms of distances for both cis and trans eQTLs.

# define positions
cis$chrom = str_split(cis$snps, ":", simplify = T)[,1]
cis$pos = str_split(cis$snps, ":", simplify = T)[,2]
cis$pos = str_split(cis$pos, "_", simplify = T)[,1]

trans$chrom = str_split(trans$snps, ":", simplify = T)[,1]
trans$pos = str_split(trans$snps, ":", simplify = T)[,2]
trans$pos = str_split(trans$pos, "_", simplify = T)[,1]

cis$distance = as.numeric(cis$pos) - as.numeric(genepos$s1[match(cis$gene, genepos$geneid)] + 1e6) 
trans$distance = as.numeric(trans$pos) - as.numeric(genepos$s1[match(trans$gene, genepos$geneid)] + 1e6) 

# Plot histograms of distances for cis and trans eQTLs
hist(cis$distance, main = "Cis eQTL Distances", xlab = "Distance from TSS (bp)", col = "red", breaks = 50)

Q21: Do cis-eQTLs appear enriched for a certain distance? A: Gene start sites

Now plot trans-eQTL distances.

hist(trans$distance, main = "Trans eQTL Distances", xlab = "Distance from TSS (bp)", col = "blue", breaks = 50)


Boxplots for Top eGenes

Next you are going to visualize some of the cis-eQTLs. The idea is to select a specific SNP-gene pair and plot gene expression by genotype.

# Extract top eGenes for cis and trans
top_cis_gene = cis$gene[1]
top_trans_gene = trans$gene[1]

geno_sub_cis = geno_vals[rownames(geno_vals) %in% cis$snps[1], , drop = FALSE]
rna_sub_cis = rna_vals[rownames(rna_vals) %in% top_cis_gene, , drop = FALSE]

df_cis = data.frame(SNP = as.factor(t(geno_sub_cis)), Gene = as.numeric(t(rna_sub_cis)))
boxplot(Gene ~ SNP, data = df_cis, main = paste("Expression of", top_cis_gene, "by Genotype"))

geno_sub_trans = geno_vals[rownames(geno_vals) %in% trans$snps[1], , drop = FALSE]
rna_sub_trans = rna_vals[rownames(rna_vals) %in% top_trans_gene, , drop = FALSE]

df_trans = data.frame(SNP = as.factor(t(geno_sub_trans)), Gene = as.numeric(t(rna_sub_trans)))
boxplot(Gene ~ SNP, data = df_trans, main = paste("Expression of", top_trans_gene, "by Genotype"))

Q22: Which allele is associated with higher gene expression of TTC38? A: the minor allele

Q23: Is the eQTL effect size larger for the TTC38 cis-eQTL or for the IGLL3P trans-eQTL? Hint: the eQTL effect size can be quantified as the slope of the regression, or the log allelic fold change (aFC), which represents the fold difference in gene expression between haplotypes carrying the reference and alternative alleles at a given genetic variant. A: TTC38


LocusZoom Plots

Another way to visualize eQTL results is to plot summary statistics along the genome. In GWAS, we typically use a Manhattan plot to display associations across the entire genome. However, for cis-eQTL analysis, we are often interested in specific regions surrounding a gene, as each gene has its own cis-window (~20K genes in total). To explore associations at the gene level, LocusZoom is a widely used tool for visualizing genetic associations in their genomic context.

You will generate LocusZoom plots for both the top cis-eQTL and top trans-eQTL associations.

First, extract summary statistics for all SNP-gene pairs tested for the gene of interest from the MatrixEQTL output file.

Next, you will generate a LocusZoom plot for each gene.

To further refine association signals, you can use LDlink to retrieve LD information. LDlink requires an API token, which you can obtain by registering an account at the following link: https://ldlink.nih.gov/?tab=apiaccess. After receiving your token via email, replace the ‘token =’ argument in the code below with your access token.

cis$snps = stringr::str_split(cis$snps, "_", simplify = T)[,1]
cis$snps = paste0("chr", cis$snps)

# Plot top cis eGene
loc = locus(data = cis[cis$gene == cis$gene[1], ], gene = cis$gene[1], flank = 1e5, ens_db = "EnsDb.Hsapiens.v75")
## TTC38, chromosome 22, position 46563858 to 46789905
## 181 SNPs/datapoints
summary(loc)
## Gene TTC38 
## Chromosome 22, position 46,563,858 to 46,789,905
## 181 SNPs/datapoints
## 7 gene transcripts
## 7 protein_coding 
## Ensembl version: 75 
## Organism: Homo sapiens 
## Genome build: GRCh37
loc = link_LD(loc, pop = "YRI", token = "47d02c2de92e") 
## 
## LDlink server is working...
## Matched 178 SNPs (26.7 secs)
loc = link_recomb(loc, table = "hapMapRelease24YRIRecombMap") 
## Retrieving recombination data from UCSC
locus_plot(loc)

Q24: Where, relative to TTC38, are the strongest associations? A: gene body

  • 5’
  • 3’
  • gene body
# Trans
trans$snps = stringr::str_split(trans$snps, "_", simplify = T)[,1]
trans$snps = paste0("chr", trans$snps)

# Plot top trans eGene
loc = locus(data = trans[trans$gene == trans$gene[1], ], gene = trans$gene[1], flank = c(1e4, 8e6), ens_db = "EnsDb.Hsapiens.v75")
## IGLL3P, chromosome 22, position 25704223 to 33716047
## 6 SNPs/datapoints
summary(loc)
## Gene IGLL3P 
## Chromosome 22, position 25,704,223 to 33,716,047
## 6 SNPs/datapoints
## 240 gene transcripts
## 89 protein_coding, 39 pseudogene, 34 lincRNA, 33 antisense, 9 misc_RNA, 9 snRNA, 8 sense_intronic, 7 miRNA, 4 rRNA, 3 snoRNA, 2 IG_C_pseudogene, 2 IG_V_pseudogene, 1 processed_transcript 
## Ensembl version: 75 
## Organism: Homo sapiens 
## Genome build: GRCh37
loc = link_LD(loc, pop = "YRI", token = "47d02c2de92e") 
## 
## LDlink server is working...
## Matched 6 SNPs (23.7 secs)
loc = link_recomb(loc, table = "hapMapRelease24YRIRecombMap") 
## Retrieving recombination data from UCSC
locus_plot(loc, filter_gene_name = "IGLL3P") # only plot eGene for gene track due to large chromosomal window

Q25: How far is the SNP from IGLL3P? A: ~ 7 Mb


Database Look-ups

Now that you have identified eQTLs, the next step is to explore the SNPs and eGenes using publicly available databases. These resources provide insights into the functional context of genetic variants and their role in gene regulation and disease.

Key Databases:

  • UCSC Genome Browser: A web-based tool for visualizing and analyzing genomic data, integrating reference genomes, annotations, and experimental datasets.
  • GTEx (Genotype-Tissue Expression Project): A resource linking genetic variation to gene expression across multiple human tissues to study regulatory effects and disease associations.
  • Open Targets Genetics: A platform leveraging genetic association data to prioritize and explore potential drug targets based on human genetics.

UCSC Genome Browser

Use the UCSC genome browser to visualize the genomic location of your top cis-eQTL.

  1. Go to the UCSC Genome Browser.
  2. Ensure you are using GRCh37/hg19 as the reference genome.
  3. Enter the chromosome:position of your top cis-eQTL (e.g., 22:46687220) into the search bar and press enter.
  4. Locate the dbSNP track and click on your variant to retrieve its rsID.

Q26: What rsid is associated with our variant of interest? A: rs73886792

This rsID can now be used for further exploration in Open Targets Genetics.


Open Targets Genetics

Use Open Targets Genetics to investigate associations between your variant and gene expression, disease traits, and drug targets.

  1. Go to Open Targets Genetics.
  2. Enter the rsid you obtained from the UCSC Genome Browser into the search bar.
  3. Select the autocompleted variant, which should have the same rsid.

Q27: Which population has the highest effect allele frequency? A: African/African-American

Q28: What genes does this variant act as an eQTL for? A: TTC38, TRMU, PKDREJ


GTEx

Use the GTEx Portal to analyze tissue-specific gene expression and regulatory variants.

  1. Go to the the Genotype-Tissue Expression (GTEx) Portal.
  2. Enter TTC38 in the search bar at the top right corner of the page.
  3. Explore the bulk tissue expression data to determine where TTC38 is most highly expressed.

Q29: What tissue is TTC38 most highly expressed in? A: Liver

  1. Scroll down to the single-cell expression tab to determine its primary cell type of expression.

Q30: What cell type is TTC38 most frequently expressed in? A: Adipocytes

  1. Scroll further down to the single-tissue eQTL tab to identify top eQTL associations.

Q31: What is the top single tissue eqtl? A: chr22_46292311_C_A_b38; rs6008552

Q32: What tissue is this eQTL in? A: Artery - Tibial

  1. Finally, explore the single-tissue sQTL tab to identify splicing quantitative trait loci (sQTLs).

Q33: What is the top single tissue sQTL? A: chr22_46292311_C_A_b38; rs6008552

Q34: What tissue is this sQTL in? A: Nerve - Tibial


Resources

Below are some potentially useful resources.