Background

Expression quantitative trait loci (eQTL) analysis identifies genetic variants associated with gene expression levels. Proper quality control (QC) is essential to ensure reliable results. Below, we outline standard QC steps for genotype data, RNA-seq count data, and covariates, along with checks for proper data integration.

Quality Control for Genotype Data

Genotype data typically comes in formats such as VCF, PLINK (.bed/.bim/.fam), or similar.

1.1 Sample-Level QC

Check sex consistency by comparing inferred genetic sex (from X and Y chromosome markers) with reported sex in metadata.
Remove individuals with high missingness (typically >5% missing genotypes).
Check for relatedness and remove duplicates to ensure no unexpected familial relationships or sample duplications.

1.2 Variant-Level QC

Remove SNPs with high missingness (e.g., variants missing in more than 5% of samples).
Apply Hardy-Weinberg Equilibrium (HWE) filtering to exclude variants that significantly deviate from expected genetic distributions (p < 1e-6).
Filter for minor allele frequency (MAF) to retain only common variants (e.g., MAF > 5%).

1.3 Population Structure Correction

Perform principal component analysis (PCA) on the genotype data to detect population stratification.
Retain key principal components (PCs) as covariates in the eQTL analysis to account for ancestry-related effects.

Quality Control for RNA-Seq Data

The RNA-seq count file used for eQTL analysis is a gene expression count matrix (TSV/CSV format). The matrix represents raw or normalized gene expression counts for multiple samples.

Structure of the File: 1. Rows: Genes (or transcripts, depending on the quantification method). 2. Columns: Individual samples (biological replicates, conditions, etc.). 3. First column: Unique gene/transcript identifiers (e.g., Ensembl IDs or gene symbols). 4. Remaining columns: Integer values representing raw or normalized counts per sample.

2.1 Sample-Level QC

Identify and remove outliers using PCA or hierarchical clustering.
Check sex consistency using expression of sex-specific genes (e.g., XIST for females, Y-linked genes for males).
Remove low-quality samples based on quality metrics such as RNA integrity number (RIN < 6) and sequencing depth (< 10M reads per sample).

2.2 Gene-Level QC

Remove lowly expressed genes, keeping only those expressed in a sufficient number of samples (e.g., > 1-10 CPM in > 20% of samples).
Transformation for Normality for eQTL Models:
- Most eQTL analyses assume a linear relationship between genotype and expression.
- Raw counts follow a negative binomial distribution, which is not suitable for regression.
- Standard transformations applied to raw counts:
- log2(CPM + 1) (log-transformed counts per million)
- voom transformation (for modeling mean-variance trends, see limma-voom)
- inverse normal transformation (ranks gene expression per gene across samples and applies normal quantile transformation (mean = 0, SD =1). Most common for eQTL mapping

Quality Control for Covariates

Address missing data by either imputing values or removing samples with incomplete covariate information.
Assess correlations between covariates, ensuring that highly correlated variables (e.g., PEER factors and genotype PCs) do not introduce collinearity (r < 0.8).
Include genotype principal components (typically >3, depending on cohort) and PEER factors (typically 10-50) to account for unknown batch effects and confounding.
Correlate PEER factors with known covariates to assess whether they capture relevant experimental or biological variations.

Ensuring Sample Matching Across Data Types

Ensure sample IDs are consistent across genotype, expression, and covariate files.
Remove unmatched samples to ensure a 1:1 correspondence across data types.

Overview of the GEUVADIS Dataset

You are going to be performing an eQTL analysis today. You will be using data from the GEUVADIS project. The GEUVADIS (Genetic European Variation in Health and Disease) Project is an international initiative aimed at characterizing the impact of genetic variation on gene expression. By integrating whole-genome sequencing with RNA sequencing data from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, GEUVADIS has provided a rich resource for studying expression quantitative trait loci (eQTLs). This dataset has been instrumental in understanding how genetic variants influence gene regulation across populations and has contributed to advances in functional genomics, disease mapping, and precision medicine. The project’s findings serve as a foundation for linking genetic variation to molecular phenotypes, enhancing our ability to interpret genome-wide association studies (GWAS) and their implications for human health.

You have been provided with 3 files for this practical:

geuvadis_chr22_genotypes.txt: QC’d genotype data from chr22 (MAF > 5%, SNVs only, HWE p-value < 1e-6 removed).
geuvadis_chr22_rnaseq.txt: PEER-residualized RPKM on chr22 genes (some non-coding genes removed).
geuvadis_covariates.txt: Sample covariates (including sex and genotype PCs).

Setup

Load required R packages for the practical and set your working directory.

# Load necessary libraries
library(data.table) # data I/O
library(dplyr) # data frame manipulation
library(stringr) # string manipulation
library(MatrixEQTL) # eQTL calculation
library(coloc) # Colocalization analysis
library(locuszoomr) # Locuszoom plots
library(EnsDb.Hsapiens.v75) # Gene mapping for locuszoom plots


# Run in terminal to copy tutorial files to home directory
# cp -r /home/ben/practicals/eQTL .

# Set working directory (UPDATE TO YOUR HOME DIRECTORY)
setwd("/home/iaink/eQTL")

Read and Prepare Data

You will now load the genotype, RNA-Seq, and covariate files and answer some basic questions about their contents.

Genotypes

Genotyping was performed Variants in the 1000 Genomes Project were identified using low-coverage whole-genome sequencing, deep exome sequencing, and dense SNP-array genotyping
The genotype data was filtered using PLINK2.
Original genotypes were downloaded from the 1000 Genomes Project.
The dataset has been restricted to subjects from the YRI (Yoruba in Ibadan, Nigeria) population.

geno = fread("geuvadis_chr22_genotypes.txt", data.table = F)
head(geno[,1:10])

Q1: How many samples are there?

Answer

108

nrow(geno)

Q2: How many variants are included?

Answer

95,190

ncol(geno) - 6

RNA-Seq

The RNA-Seq data was obtained from the GEUVADIS project.
It contains normalized gene expression counts.

rna = fread("geuvadis_chr22_rnaseq.txt", data.table = F)
head(rna)[,1:10]

Q3: How many subjects are there?

Answer

462

ncol(rna) - 4

Q4: How many genes are there?

Answer

420

nrow(rna)

Q5: Confirm that all genes are on chromosome 22.

table(rna$Chr)

Covariates

Covariate information was downloaded from the 1000 Genomes Project.
Genetic principal components were previously generated for the 2024 IGES tutorial on PRS calculation.

covar = fread("geuvadis_covariates.txt", data.table = F)

head(covar)

Q6: How many subjects are included?

Answer

4,978

nrow(covar)

Q7: How many columns are included?

Answer

ncol(covar)

Q8: What are some of the options you could do if you had missing values in the matrix?

do nothing
impute missing values
Remove the sample or covariate

For our purposes, we will subset to samples present across all 3 files. This will remove missing data from the covariates file.

Q9: Do all files have the expected format?

Genotypes?

Answer

Yes

RNA-seq?

Answer

Yes

Covariates?

Answer

Yes

Q10: Do all files have the same set of subjects?

Answer

Formatting for MatrixEQTL

MatrixEQTL analysis requires 3 files: genotypes, normalized expression, and covariates. Now you will focus on preparing the genotypes, RNA-Seq, and covariate data for use by MatrixEQTL.

Genotypes

To prepare genotype data for MatrixEQTL, we need to prepare 2 files:

Genotype values
Genotype locations

# Extract genotype values
geno_vals = geno %>% dplyr::select(IID, starts_with("22")) 
rownames(geno_vals) = geno_vals$IID
geno_vals$IID = NULL
geno_vals = geno_vals[rownames(geno_vals) %in% colnames(rna), ] 
geno_vals = as.data.frame(t(geno_vals)) 

# Filter rows where each homozygous genotype appears at least 4 times
geno_vals = geno_vals[rowSums(geno_vals == 0) >= 3 & rowSums(geno_vals == 2) >= 3, ]

# Write out formatted genotypes
fwrite(geno_vals, "geno_vals.txt", row.names = T)

# genotype locus information
geno_locs = str_split(rownames(geno_vals), ":", simplify = T)
geno_locs = as.data.frame(cbind(rownames(geno_vals), geno_locs))
colnames(geno_locs) = c("snp", "chr", "pos")
geno_locs$pos = str_split(geno_locs$pos, "\\_", simplify = T)[,1]

# Write out formatted genotype positions
fwrite(geno_locs, "geno_locs.txt")

RNA-Seq

To prepare RNA-Seq data for MatrixEQTL, we need to prepare 2 files:

RNA-Seq values
Gene locations

rna_vals = rna[, c(2, 5:ncol(rna))]
rownames(rna_vals) = rna_vals$Gene_Symbol
rna_vals$Gene_Symbol = NULL
rna_vals = rna_vals[, colnames(rna_vals) %in% colnames(geno_vals)] #445

# Write out formatted covariates
fwrite(rna_vals, "rna_vals.txt", row.names = T)

# RNA locus information
rna_locs = rna[, 2:4]
rna_locs$Coord = rna_locs$Coord - 1e6
rna_locs$Coord2 = rna_locs$Coord + 2e6
names(rna_locs) = c("geneid", "chr", "s1", "s2")

# Write out formatted covariates
fwrite(rna_locs, "rna_locs.txt")

Covariates

We will use sex and 5 genetic principal components as covariates in modeling.

rownames(covar) = covar$`Sample name`
covar = covar %>% dplyr::select(Sex, PC1, PC2, PC3, PC4, PC5)
covar = covar[rownames(covar) %in% colnames(rna_vals), , drop = F]

# Convert sex to a numeric (binary) variable
covar$Sex[covar$Sex == "male"] = 0
covar$Sex[covar$Sex == "female"] = 1

# Order covariates by sample name
covar = covar[order(rownames(covar)), ]

# Confirm sample order matches between files
all(rownames(covar) == colnames(rna_vals))

# Write out formatted covariates
fwrite(as.data.frame(t(covar)), "covariates.txt")

MatrixEQTL Analysis

You are going to perform both cis-eQTL and trans-eQTL analyses simultaneously. A cis-eQTL analysis focuses on testing SNPs physically near (or in) the gene being tested. The cis definition must be specified. A common cis-definition is: +/-Mb from the TSS, such that for a given gene, only SNPs inside the specified window will be tested for association with gene expression. Because each gene has its own location in the genome, the set of SNPs being tested varies by gene. A cis-eQTL analysis focuses on regions surrounding the gene that are most likely to harbor regulatory elements that affect the gene. Where the cis-eQTL analysis focuses on SNPs physically close to a given gene, a trans-eQTL analysis will test for associations with all SNPs that are not close to the gene (ie on the same chromosome, but far away, or on different chromosomes. A trans-eQTL analysis involves conducting many statistical tests (MANY! SNPs per gene x many genes), thus multiple testing correction is imperative to control for false positive associations.

You will use the software package MatrixEQTL, designed for fast eQTL analysis on large datasets. MatrixEQTL can test for association between genotype and gene expression using linear regression with either additive or ANOVA genotype effects. The models can include covariates to account for factors as population stratification, sex, and clinical/experimental variables. For more details see Shabalin (2012) [LINK].

To perform both eQTL analyses, you will specify the following model:

PEER-normalized gene expression ~ Sex + PC1 + PC2 + PC3 + PC4 + PC5

Load genotype data

snps = SlicedData$new()
snps$fileDelimiter = ","      
snps$fileOmitCharacters = "NA" 
snps$fileSkipRows = 1          
snps$fileSkipColumns = 1       
snps$fileSliceSize = 2000      
snps$LoadFile("geno_vals.txt")

# Read in genotype positions
snpspos = read.table("geno_locs.txt", sep = ",", header = T)

Load gene expression data

gene = SlicedData$new()
gene$fileDelimiter = ","      
gene$fileOmitCharacters = "NA" 
gene$fileSkipRows = 1          
gene$fileSkipColumns = 1      
gene$fileSliceSize = 2000      
gene$LoadFile("rna_vals.txt")

# Read in gene positions
genepos = read.table("rna_locs.txt", sep = ",", header = T)

Load covariates

# Read in covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = ","      
cvrt$fileOmitCharacters = "NA"
cvrt$fileSkipRows = 1
cvrt$fileSkipColumns = 0
cvrt$LoadFile("covariates.txt")

Run the analysis

# Run MatrixEQTL
me = Matrix_eQTL_main(
  snps = snps,
  gene = gene,
  cvrt = cvrt,
  output_file_name = "trans_eqtls.txt",
  pvOutputThreshold = 1e-3, # p-value output threshold for trans-eqtls
  useModel = modelLINEAR,
  errorCovariance = numeric(),
  verbose = TRUE,
  output_file_name.cis = "cis_eqtls.txt",
  pvOutputThreshold.cis = 1e-3, # p-value output threshold for cis-eqtls
  snpspos = snpspos,
  genepos = genepos,
  cisDist = 1e6, # distance threshold for cis associations
  pvalue.hist = "qqplot",
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE)

Downstream Results

You will now do some preliminary evaluation of the eQTLs you have generated.

# Check the structure of results
str(me)

QQ Plot

You will now plot a quantile-quantile (QQ) plot, which visually compares the distribution of observed p-values from your analysis to the expected uniform distribution under the null hypothesis. This plot helps assess whether there is an enrichment of small p-values, indicating potential true associations, or if the distribution follows the null expectation. A well-calibrated analysis should show most points aligning along the diagonal reference line, while deviations (particularly at the lower end of p-values) suggest significant signals. Systematic inflation or deflation of p-values can indicate confounding effects, population structure, or technical biases, which may require further investigation and correction. MatrixEQTL provides functionality that automatically returns a QQ plot stratified by cis and trans eQTLs.

# Generate QQ plots
plot(me)

Q11: Do cis and trans eQTLs seem well-calibrated?

Answer

Yes

Q12: Do cis or trans eQTLs demonstrate more significant associations?

Answer

cis-eQTLs

The title of the QQ plot displays how many tests were performed for cis- and trans-eqtls, respectively.

Q13: How many cis- and trans- associations were tested?

Answer

2,500,353 cis, 21,306,927 trans

Basic Results

# Create objects for cis and trans eqtls
cis = me$cis$eqtls
trans = me$trans$eqtls

Q14: What is the smallest nominal p-value you’ve found?

Answer

7.43e-28

Q15: What is the smallest FDR-adjusted p-value you’ve found?

Answer

4.64e-22

cis[1, ]

A cis-eQTL analysis involves conducting many statistical tests (many SNPs per gene x many genes), thus multiple testing correction is imperative to control for false positive associations. MatrixEQTL uses the qvalue method to perform multiple testing correction.

Q16: How many significant (FDR < 0.05) cis-eQTLs (SNP-gene pairs) have you found?

Answer

510

nrow(cis[cis$FDR <= 0.05, ])

Q17: How many cis eGenes have you found?

Answer

length(table(cis[cis$FDR <= 0.05, ]$gene))

Q18: How many trans-eQTLs (SNP-gene pairs) have you found?

Answer

nrow(trans[trans$FDR <= 0.05, ])

Q19: How many trans eGenes have you found?

Answer

length(table(trans[trans$FDR <= 0.05, ]$gene))

head(cis)

head(trans)

Distance Between eGenes and SNPs

You will now calculate distance between SNPs and eGenes. Once you’ve done so, plot histograms of distances for both cis and trans eQTLs.

# define positions
cis$chrom = str_split(cis$snps, ":", simplify = T)[,1]
cis$pos = str_split(cis$snps, ":", simplify = T)[,2]
cis$pos = str_split(cis$pos, "_", simplify = T)[,1]

trans$chrom = str_split(trans$snps, ":", simplify = T)[,1]
trans$pos = str_split(trans$snps, ":", simplify = T)[,2]
trans$pos = str_split(trans$pos, "_", simplify = T)[,1]

cis$distance = as.numeric(cis$pos) - as.numeric(genepos$s1[match(cis$gene, genepos$geneid)] + 1e6) 
trans$distance = as.numeric(trans$pos) - as.numeric(genepos$s1[match(trans$gene, genepos$geneid)] + 1e6) 

# Plot histograms of distances for cis and trans eQTLs
hist(cis$distance, main = "Cis eQTL Distances", xlab = "Distance from TSS (bp)", col = "red", breaks = 50)

Q20: Do cis-eQTLs appear enriched for a certain distance?

Answer

Transcription Start Sites

Now plot trans-eQTL distances.

hist(trans$distance, main = "Trans eQTL Distances", xlab = "Distance from TSS (bp)", col = "blue", breaks = 50)

Boxplots for Top eGenes

Next you are going to visualize some of the cis-eQTLs. The idea is to select a specific SNP-gene pair and plot gene expression by genotype.

# Extract top eGenes for cis and trans
top_cis_gene = cis$gene[1]
top_trans_gene = trans$gene[1]

geno_sub_cis = geno_vals[rownames(geno_vals) %in% cis$snps[1], , drop = FALSE]
rna_sub_cis = rna_vals[rownames(rna_vals) %in% top_cis_gene, , drop = FALSE]

df_cis = data.frame(SNP = as.factor(t(geno_sub_cis)), Gene = as.numeric(t(rna_sub_cis)))
boxplot(Gene ~ SNP, data = df_cis, main = paste("Expression of", top_cis_gene, "by Genotype"))

geno_sub_trans = geno_vals[rownames(geno_vals) %in% trans$snps[1], , drop = FALSE]
rna_sub_trans = rna_vals[rownames(rna_vals) %in% top_trans_gene, , drop = FALSE]

df_trans = data.frame(SNP = as.factor(t(geno_sub_trans)), Gene = as.numeric(t(rna_sub_trans)))
boxplot(Gene ~ SNP, data = df_trans, main = paste("Expression of", top_trans_gene, "by Genotype"))

Q21: Which allele is associated with higher gene expression of TTC38?

Answer

the reference allele [0]

Q22: Is the eQTL effect size larger for the TTC38 cis-eQTL or for the IGLL3P trans-eQTL? Hint: the eQTL effect size can be quantified as the slope of the regression, or the log allelic fold change (aFC), which represents the fold difference in gene expression between haplotypes carrying the reference and alternative alleles at a given genetic variant.

Answer

cis-eQTL

LocusZoom Plots

Another way to visualize eQTL results is to plot summary statistics along the genome. In GWAS, we typically use a Manhattan plot to display associations across the entire genome. However, for cis-eQTL analysis, we are often interested in specific regions surrounding a gene, as each gene has its own cis-window (~20K genes in total). To explore associations at the gene level, LocusZoom is a widely used tool for visualizing genetic associations in their genomic context.

You will generate LocusZoom plots for both the top cis-eQTL and top trans-eQTL associations.

To add linkage disequilibrium (LD) information to the plots, we use results generated from LDlink, an NIH resource that provides population-specific LD estimates, proxy variants, haplotypes, and related annotations. LDlink is typically accessed through an API and requires a personal API token. Because API access can create practical issues during a group workshop, I generated the LDlink results ahead of time myself.

For this practical, you will not query LDlink directly. Instead, you will load precomputed locuszoomr objects that already contain the LDlink-derived LD information. These objects have been saved as .rds files and can be read into R using readRDS().

Now load the precomputed cis-eQTL locus object. This object already contains the association results, gene annotations, recombination information, and LD information retrieved from LDlink.

# Plot top cis eGene
loc = readRDS("loc_cis_TTC38_YRI.rds")
summary(loc)
locus_plot(loc)

Next, load the precomputed trans-eQTL locus object. As above, the LDlink results have already been added, so this step only reads the saved object and produces the plot.

# Plot top trans eGene
loc = readRDS("loc_trans_IGLL3P_YRI.rds")
summary(loc)
locus_plot(loc, filter_gene_name = "IGLL3P")

Q23: How far is the SNP from IGLL3P?

Answer

7 Mb

Bonus Section: Database Look-ups

Now that you have identified eQTLs, the next step is to explore the SNPs and eGenes using publicly available databases. These resources provide insights into the functional context of genetic variants and their role in gene regulation and disease.

Key Databases:

UCSC Genome Browser: A web-based tool for visualizing and analyzing genomic data, integrating reference genomes, annotations, and experimental datasets.
GTEx (Genotype-Tissue Expression Project): A resource linking genetic variation to gene expression across multiple human tissues to study regulatory effects and disease associations.
Open Targets Genetics: A platform leveraging genetic association data to prioritize and explore potential drug targets based on human genetics.

UCSC Genome Browser

Use the UCSC genome browser to visualize the genomic location of your top cis-eQTL.

Go to the UCSC Genome Browser.
Ensure you are using GRCh37/hg19 as the reference genome.
Enter the chromosome:position of your top cis-eQTL (e.g., 22:46687220) into the search bar and press enter.
Locate the dbSNP track and click on your variant to retrieve its rsID.

Q24: What rsid is associated with our variant of interest?

Answer

rs73886792

This rsID can now be used for further exploration in Open Targets Genetics.

Open Targets Genetics

Use Open Targets Genetics to investigate associations between your variant and gene expression, disease traits, and drug targets.

Go to Open Targets Genetics.
Enter the rsid you obtained from the UCSC Genome Browser into the search bar.
Select the autocompleted variant, which should have the same rsid.

Q25: Which population has the highest effect allele frequency?

Answer

African/African-American

Q26: What genes does this variant act as an eQTL for?

Answer

TTC38, TRMU, PKDREJ

GTEx

Use the GTEx Portal to analyze tissue-specific gene expression and regulatory variants.

Go to the the Genotype-Tissue Expression (GTEx) Portal.
Enter TTC38 in the search bar at the top right corner of the page.
Explore the bulk tissue expression data to determine where TTC38 is most highly expressed.

Q27: What tissue is TTC38 most highly expressed in?

Answer

Liver

Scroll down to the single-cell expression tab to determine its primary cell type of expression.

Q28: What cell type is TTC38 most frequently expressed in?

Answer

Adipocytes

Scroll further down to the single-tissue eQTL tab to identify top eQTL associations.

Q29: What is the top single tissue eqtl?

Answer

chr22_46292311_C_A_b38; rs6008552

Q30: What tissue is this eQTL in?

Answer

Artery - Tibial

Finally, explore the single-tissue sQTL tab to identify splicing quantitative trait loci (sQTLs).

Q31: What is the top single tissue sQTL?

Answer

chr22_46292311_C_A_b38; rs6008552

Q32: What tissue is this sQTL in?

Answer

Nerve - Tibial

Resources

Below are some potentially useful resources.

If you have any questions, concerns, or quibbles, please reach out to me at iain.konigsberg@cuanschutz.edu

Appendix I: Code Used to Generate Locus LD Information for LocusZoom Plotting

The code below is included for transparency and reproducibility. You do not need to run this section during the practical. I ran this code in advance to query LDlink, add YRI-specific LD information to the locuszoomr objects, and save the resulting objects as .rds files.

library(dplyr)
library(stringr)
library(locuszoomr)
library(EnsDb.Hsapiens.v75)

ldlink_token = "YOUR_TOKEN_HERE"

# Clean SNP IDs once
cis2 = cis %>%
  mutate(
    snps = str_split(snps, "_", simplify = TRUE)[, 1],
    snps = paste0("chr", snps)
  )

trans2 = trans %>%
  mutate(
    snps = str_split(snps, "_", simplify = TRUE)[, 1],
    snps = paste0("chr", snps)
  )

# -------------------------
# cis locus
# -------------------------

loc_cis = locus(
  data = cis2[cis2$gene == cis2$gene[1], ],
  gene = cis2$gene[1],
  flank = 1e5,
  ens_db = "EnsDb.Hsapiens.v75"
)

loc_cis = link_LD(
  loc_cis,
  pop = "YRI",
  token = ldlink_token,
  method = "proxy"
)

loc_cis = link_recomb(
  loc_cis,
  table = "hapMapRelease24YRIRecombMap"
)

saveRDS(loc_cis, file = "loc_cis_TTC38_YRI.rds")


# -------------------------
# trans locus
# -------------------------

loc_trans = locus(
  data = trans2[trans2$gene == trans2$gene[1], ],
  gene = trans2$gene[1],
  flank = c(1e4, 8e6),
  ens_db = "EnsDb.Hsapiens.v75"
)

loc_trans = link_LD(
  loc_trans,
  pop = "YRI",
  token = ldlink_token,
  method = "proxy"
)

loc_trans = link_recomb(
  loc_trans,
  table = "hapMapRelease24YRIRecombMap"
)

saveRDS(loc_trans, file = "loc_trans_IGLL3P_YRI.rds")

eQTL Practical: GEUVADIS Consortium Data

Iain Konigsberg, Wei Zhou, & Barbara Stranger

Updated: 2026-06-11

Background

Quality Control for Genotype Data

1.1 Sample-Level QC

1.2 Variant-Level QC

1.3 Population Structure Correction

Quality Control for RNA-Seq Data

2.1 Sample-Level QC

2.2 Gene-Level QC

Quality Control for Covariates

Ensuring Sample Matching Across Data Types

Overview of the GEUVADIS Dataset

Setup

Read and Prepare Data

Genotypes

RNA-Seq

Covariates

Formatting for MatrixEQTL

Genotypes

RNA-Seq

Covariates

MatrixEQTL Analysis

Load genotype data

Load gene expression data

Load covariates

Run the analysis

Downstream Results

QQ Plot

Basic Results

Distance Between eGenes and SNPs

Boxplots for Top eGenes

LocusZoom Plots

Bonus Section: Database Look-ups

Key Databases:

UCSC Genome Browser

Open Targets Genetics

GTEx

Resources

Appendix I: Code Used to Generate Locus LD Information for LocusZoom Plotting