Counting reads & differential expression analysis

We count the number of reads for each gene using HTSeq tool. After this we move the data to R and perform differential expression analysis with two different tools: DESeq2 and edgeR. Finally we annotate the ensembl IDs with gene names. For demonstration purposes, we now use different dataset (with more samples) in the R part -when you are performing your own analysis, use the whole dataset.

Count reads per genes using HTSeq

htseq-count -f bam --stranded=yes results-hisat/hesc_sorted.bam hisat-indexes/hg38chr19.gtf > hesc-counts-raw.tsv

Check what the different parameters mean. What is the minimum mapping quality for a read to be counted? Was the strandedness parameter correct?

htseq-count --help

Look at the beginning and the end of the result file. Can you see genes with counts? How many alignments did not overlap with any gene?

head hesc-counts-raw.tsv
tail hesc-counts-raw.tsv

Remove the last 5 rows starting with "__" (keep all but the last 5 rows).

head -n -5 hesc-counts-raw.tsv > hesc-counts.tsv


Bonus exercise: Can you repeat the steps to the other sample (GM12878.fastq.gz)?

For the next step, you need to combine your count files into one file, so that each sample has one count column. To make this tutorial run smooth, we switch now to another data set with several samples -in real life, you would repeat the previous preprocessing and alignment steps for each sample, and then combine the samples into one count table. You would run these steps in our supercomputers, and through a script -we will take a look into that a bit later, once we first have learned all the individual steps of the analysis.

Note: in real life, you would use a script to process all the samples at once. We will learn this a bit later.


RNA-seq differential expression analysis using R

Next, we load the count table and another table describing the samples to R. We are performing the differential expression analysis with two packages/tools, DeSeq2 and edgeR, so we load those packages to R. ggplot package is used for visualisations.

These R commands are also available in this R-script file:

1. Preparations

Open RStudio (or R) in the virtual machine: activate the Conda base environment and type rstudio. 

conda activate base

Note: you can also use R and RStudio on CSC's Puhti supercomputer. The Bioconductor packages needed here are already installed there.

...or you can also run these steps on your own computer, in which case you need to install these packages as descriped below.

Open a new R Script file (File -> New File -> R Script). Type or copy your commands here. You can run the commands from there (paint the command and use ctrl + enter, or use the Run button). This way it is easier to track what you have done so far.

Note that in the ready-made virtual machine, you can skip the installation steps, but you need to run the library commands to load the packages.

To install and load Bioconductor-packages to R:

# Note: everything after # is a comment -R won't run it!
# No need to run the installation commands on the VM:
# source("")
# biocLite("DESeq2")
# biocLite("edgeR")
# Update all/some/none? [a/s/n]: n
# install.packages("ggplot2")

# Load the libraries:

Change the directory in RStudio to the rna-seq folder. Either choose Session -> Set Working Directory -> Choose Directory -> rna-seq -> Open, or type:


Read in the data in, and examine it in R. What information is stored in phenodata.tsv, and what's in counts_data.tsv? How many samples there are? How many genes there are counts from? 

You downloaded the files already with wget, and they should now be in the rna-seq folder. 

counts_data <- read.table("ngs-data-table.tsv", header=T, sep="\t", row.names=1)
phenodata <- read.table("phenodata.tsv", header=T, sep="\t")

Get the experimental group information from the phenodata file. By checking the phenodata.tsv you learned that "group" is the column you want.

group <- "group"
groups <- as.character (phenodata[,group])


2A. Diffrential expression analysis with DESeq2

Let's perform differential expression analysis with DeSeq2 tool. We can also draw a PCA plot and a dispersion plot. From the PCA plot we can see if the samples are separating nicely, and whether there are some outliers in the data.

# Create a DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData=counts_data, colData=data.frame(groups=groups), design = ~ groups)

# Calculate statistic for differential expression, merge with original data table, keep significant DEGs, remove NAs and sort by FDR.
dds <- DESeq(dds) # The standard differential expression analysis steps are wrapped into a single function, DESeq

res <- results(dds) # Results tables are generated using the function "results"

sig <- cbind(counts_data, res) # combine original table and results table
sig <- # DataFrame != data.frame :/

p.value.cutoff <- 0.05
sig <- sig[sig$padj <= p.value.cutoff, ] # choose adj. p-values < the cut off (0.05)
sig <- sig[! ($padj)), ] # remove NAs
sig <- sig[ order(sig$padj), ] # order accroding to the adj. p-values

DESEq2_DEGs <- sig

# Get summary and write a .tsv-table. Open the table in Excel.
summary(res, alpha=p.value.cutoff)
write.table(sig, file="significant_DEGs_pval005_with_DESeq.tsv", sep="\t", row.names=T, quote=F)

How many differentially expressed genes were there between the samples? How many were up and how many were downregulated?

Experiment level quality control: PCA plot

Let's draw a PCA plot, so we can see how well the samples are separated. PCA plot can be used in experimental level quality control. From the PCA plot, we can see whether there are possible outliers, some underlying batch effects, and whether our samples are separated as expected (based on the experimental sample groups).

We also draw a dispersion plot of the normalized read counts, which shows us how the read counts are modified within DESeq2.

## PCA plot (using DESeq2):

#Perform transformation (sort of normalization, PCA needs these)

# Make PCA plot
data <- plotPCA(vst, intgroup=c("groups"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar")) # Tidy the variance percentage
desc <- phenodata[,"description"] # Get the sample descriptions from phenodata.
p <- ggplot(data, aes(PC1, PC2, color=groups, title="PCA plot for TREATMENT")) +
  geom_point(size=6) +
  geom_text(aes(label=desc),hjust=0, vjust=1.7, color="black", size=4) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance"))
# adjust figure size and save it.
ggsave(p, filename = "PCA plot.pdf", width = 15, height = 12)

## Dispersion plot:
plotDispEsts(dds, main="Dispersion plot", cex=0.2)

You should get the following plots as outputs:

Do the experimental groups separate from each other? Are there any outliers? How much (%) variation is explained by the first principal component (PC1)? What about PC2? What do you think these components represent?

Are any counts left as outliers?

Bonus: Drawing a heatmap

Let's draw a heatmap of the most highly variable genes: 


# get the vst transformed read counts:
trans_counts <- assay(vst)

# subset: get the top 500 most variable genes
var_genes <- apply(trans_counts, 1, var) # compute variance of the transformed counts
select_var <- names(sort(var_genes, decreasing=TRUE))[1:50] # sort based on variance and select top 50 most variable genes
highly_variable_genes <- trans_counts[select_var,] # select these genes from trans counts table

# Get the annotations:
ensembl <- useMart("ensembl", host= "", dataset="hsapiens_gene_ensembl")
variable_gene_names <- getBM(attributes <- "external_gene_name", filters = "ensembl_gene_id", values = rownames(highly_variable_genes), mart = ensembl, uniqueRows=T)
rownames(highly_variable_genes) <- as.matrix(variable_gene_names)



Heatmap of most variable genes


2B. Diffrential expression analysis with edgeR

Let's perform the same differential expression analysis with another tool, edgeR.

# Form the DGElist object:

# Filter out genes which have less than 5 counts in user-defined number of samples

filter <- 5 # "Analyze only genes which have counts in at least this many sample
keep <- rowSums(dge$counts>5) >= filter
dge <- dge[keep,]
dge$lib.size <- colSums(dge$counts)

# Calculate normalization factors
dge<-calcNormFactors(dge) # take a look at the dge-object!

# Modeling formula & design matrix
group <- factor(phenodata$group)

design <- model.matrix(~group)
rownames(design) <- phenodata$description # take a look at the design matrix!

# Estimate dispersions
dge <- estimateDisp(dge, design)
# Take a look at the dge-object, and compare the common dispersion value to trended dispersion values

# Dispersion plot
plotBCV(dge, main="Biological coefficient of variation")

# Estimate DE genes
# Generalized Linear Model (GLM) Likelihood Ratio Test
fit<-glmFit(dge, design) # fits the negative binomial generalized linear model (GLM)
lrt<-glmLRT(fit, coef=2) # likelihood ratio test (LRT)
# Note: coef=2 -> comparison between the "groups"
tt<-topTags(lrt, n=nrow(counts_data))
ttres <- data.frame(tt)

# Rounding:
ttres2<-round(ttres,6) # Compare the first 15 rows of ttres and ttres2 to see the effect of rounding

# Add count columns to the result table
counts_table <- getCounts(dge) # pick the counts from the dge object
counts_table<-counts_table[pmatch(rownames(ttres), rownames(counts_table)),] # organize counts_table according the FDR
ttres2<-cbind(ttres2, counts_table) # combine the two

# Write the results to file
write.table(ttres2, file="edger-results.tsv", sep="\t", row.names=T, col.names=T, quote=F)

# Filtering the EdgeR results table:
# Check the colnames:
# You want to filter those DEGs that have FDR < 0.05 in this treatment-comparison:
filt_dat <- ttres2[ttres2$FDR<0.05,]
write.table(filt_dat, "filtered-edgeR-results.tsv", sep="\t", row.names=F, col.names=T, quote=F)


As a result, you should get the following plot and the results.tsv table:

Compare the dispersion plots from DESeq2 and edgeR, do they look different? Why do you think this is?

How many differentially expressed genes did you get now?


3. Annotate the Ensembl IDs

Finally, let's annotate our lists of genes with actual gene names. We are using biomaRt package.

# Load the biomaRt package:
# source("")
# biocLite("biomaRt")

# Make a list of Ensembl-identifiers which we want to annotate:
dat <- filt_dat # = the filtered edgeR table
genes <- rownames(dat)

# Get the annotations:
# Choose which database and dataset to use (we use ensembl and hsapiens).
# You could use commands listMarts(), listDatasets() and ListAttributes() to see what is available.
ensembl <- useMart("ensembl", host= "", dataset="hsapiens_gene_ensembl")
genes_ensembl_org <- getBM(attributes <- c("entrezgene_id", "ensembl_gene_id", "external_gene_name", "description"), filters = "ensembl_gene_id", values = genes, mart = ensembl, uniqueRows=T)

# Build a table: combine the DEG values to the annotation information
pmatch_table    <- pmatch(genes, genes_ensembl_org[,2], duplicates.ok=T)
ensembl_table      <-, nrow=length(genes), ncol=8))
ensembl_table[which(!,] <- genes_ensembl_org[pmatch_table[(!], ];
rownames(ensembl_table)    <- genes;
colnames(ensembl_table) <- colnames(genes_ensembl_org);
results <- cbind(ensembl_table[,3:4], dat);
colnames(results) <- c("symbol", "description",  colnames(dat));
rownames(results) <- rownames(dat)

# write result table to output
write.table(results, file="annotated_results_edgeR.tsv", col.names=T, quote=F, sep="\t", row.names=T)


What was the most differentially expressed gene? 

4. Enrichment analysis

Lastly, we learn how to study the enrichment of certain kind of genes amongst the differentially expressed genes.

setwd("/home/rnaseq/rnaseq/Enrich")  # Set the working directory
library(enrichR) # Load enrichR package, an enrichment analysis tool
library(fgsea) # Load r version of GSEA package, a popular enrichment analysis tool
library(biomaRt) # Load annotation package
library(pheatmap) #Load package for drawing heatmaps

# Use this ready-made toy example dataset prepared for enrichment analysis from the DESeq2 output.
# Only data with gene names as queried from Biomart package are retained here.
rnaseq_data <- read.table("dataset_with_gene_names_DESeq.tsv", header=T, sep="\t", row.names=1)

Part 1: Over representation analysis using enrichR package, where we keep only most significant results to perform enrichment analysis. The first steps are similar to the steps performed earlier to the DESeq2 results.

p.value.cutoff <- 0.05
rnaseq_sig <- rnaseq_data
rnaseq_sig <- rnaseq_sig[rnaseq_sig$padj <= p.value.cutoff, ] # choose adj. p-values < the cut off (0.05)
rnaseq_sig <- rnaseq_sig[! ($padj)), ] # remove NAs
rnaseq_sig <- rnaseq_sig[ order(rnaseq_sig$padj), ] # order accroding to the adj. p-values

# Draw heatmap just to see the clusters in the dataset especially when you have several singficant genes.
# Please note that this is not a requirement for doing enrichment analysis;

clust <- pheatmap(rnaseq_sig[,1:10], scale = "row" , color = colorRampPalette(c("blue","white","red"))(255), border_color = NA, cluster_cols = F)

gene_clusters <- cutree(clust$tree_row,k = 2) # cut the hclust tree to get the groups by specifying
# number of groups

# listEnrichrDbs() # uncomment to see the available gene lists
# Get gene names of each cluster and characterise the cluster with enrichment analysis
cluster1 <- names(gene_clusters)[gene_clusters == 1]
cluster2 <- names(gene_clusters)[gene_clusters == 2]

# Enrichment analysis with GO terms; you can modify the gene sets by checking 'listEnrichrDbs()' function
dbs <- c("GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018")
go_cluster1 <- enrichr(genes = cluster1,databases = dbs[1]) # just query one Database for illustration.
# 1 = Molecular Function, 2 = Cellular Component, 3 = Biological process
go_cluster1 <- go_cluster1[[dbs[1]]]
go_cluster1 <- go_cluster1[order(go_cluster1$Adjusted.P.value),] # get top enriched pathways

cbind(go_cluster1$Term[1:15],go_cluster1$Adjusted.P.value[1:15]) # check top enrichedpathways

# Plot top enriched pathways
barplot( -log10(go_cluster1$Adjusted.P.value[10:1]), xlab="Adjusted P value", horiz = T, names.arg = go_cluster1$Term[10:1],las=1)
# Exercise 1
# Repeat similar analysis with other clusters and pathways databases like Reactome and KEGG
#React_cluster2 <- enrichr(genes = cluster2,databases = "Reactome_2016")
#React_cluster2 <- React_cluster2$Reactome_2016
#React_cluster2 <- React_cluster2[order(React_cluster2$Adjusted.P.value),]


Part 2: Perform genome-wide enrichment analysis (GSEA) with unfiltered data set.
Perform Gene Set Enrichment Analysis (GSEA) using gene sets from MSigDB as downloaded from ''
Check meaning of gene sets here:

# Register and download DBs files and use gene set as it relates to your experiments
KEGG_pathways <- gmtPathways("./MSigDB/c2.cp.kegg.v6.2.symbols.gmt.txt")
# Order of genes matter here.
# One needs to give sorted list according certian criteria.
# We use logfold change value as sorting criteria;
# one may use t-statistic,p-value, combination of fold change and p - value
gene_rank <- rnaseq_data[,"log2FoldChange"]
names(gene_rank) <-  rownames(rnaseq_data) # need gene names (check gmtPathways to see accepted Gene IDs )

# Run actual gene set enrichment analysis:
fgRes <- fgsea(pathways=KEGG_pathways, stats=gene_rank, minSize=15, maxSize=500, nperm=10000)

# Retrieve enriched up/down regulated pathways based on the adjusted p-values
# NES = Normalised Enrichment Score
Up_regulated <- fgRes[NES > 0][head(order(padj), n=10), pathway]
Down_regulated <- fgRes[NES < 0][head(order(padj), n=10), pathway]
topPathways <- c(Up_regulated, rev(Down_regulated))

# Plot top enriched pathways
plotGseaTable(KEGG_pathways[topPathways], gene_rank, fgRes, gseaParam = 0.5)
# Draw enrichment plot for selected pathways
plotEnrichment(KEGG_pathways[["KEGG_DNA_REPLICATION"]],  gene_rank)



RNA-seq tutorial front page

1. Preprocessing and trimming

2. Alignment

3. Analysis

4. Analysing in Puhti