Alignment

We retrieve a reference sequence, modify it a bit for testing purposes, create the indexes and finally align our reads to this reference using HISAT2 aligner. Then we check our alignment quality with RSeQC tool. We also learn how to check the strandedness of the data.

1. Retrieve reference genome sequence, a GTF and a BED file

In this exercise we retrieve reference genome (fasta) in order to build index files for HISAT2. For the interest of time, we'll use reference data only for chromosome 19 (as our reads in the fastq file come from genes located in that chromosome). Normally you would use the whole genome. We also fetch the gene locations as a GTF and BED file formats for later use.

With your web browser, go to http://www.ensembl.org/info/data/ftp/.

In the Single species data table, click on the link for DNA (FASTA) for Homo Sapiens. Right-click on the chromosome 19 file and copy the link (ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz)

Get the file to the virtual machine by typing wget and pasting the link to the command line:

wget ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz

In the same way, obtain the GTF file: In the Single species data table, click on the link for Gene sets, GTF for Homo Sapiens. Copy the link for Homo_sapiens.GRCh38.82.gtf.gz (ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz)

Get the file by typing wget and pasting the link:

wget ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz

Check that you have the files and unzip the files in the folder.

ls -lh
gunzip Homo_sapiens.GRCh38.*

Check the beginning of the GTF file to see where the chromosome information is located for each entry.

head Homo_sapiens.GRCh38.82.gtf

Make a subset of the GTF file so that it contains only genes for chromosome 19 (take rows starting with 19).

grep -w ^19 Homo_sapiens.GRCh38.82.gtf > hg38chr19.gtf

Have a look at the beginning of the new GTF file. Does the first Havana transcript entry have different values in gene_id and transcript_id fields? This is important for the HTSeq tool that we will run later on.

head -2 hg38chr19.gtf

Remove the original, whole genome GTF file using

rm Homo_sapiens.GRCh38.82.gtf 

RseQC tool used after the alignment needs a BED file with gene locations and a BAM file. You can fetch a BED file for chromosome 19 using UCSC Table browser. BED file included in the virtual machine (refseq_chr19_actual.bed) was originally obtained this way. It has a different chromosome naming to what is used in your BAM file (chr19 vs 19), so you need to remove the chr prefix first.

Substitute chr with nothing using the command sed:

sed s/^chr// refseq_chr19_actual.bed > refseq_19.bed

Have a look at the BED file to see if the chr is gone with:

head refseq_19.bed

 

2. Make an index file for HISAT2

Note: in the course, indezes are already done for you: you don't need to run the commands with # in beginning! 

# mkdir hisat-indexes
# hisat2-build Homo_sapiens.GRCh38.dna.chromosome.19.fa hisat-indexes/hs_19

You can build the index with just the command above. However, we are learning to use also the gene locations information from the GTF file to create the index. For this, you first need to generate these HISAT specific splice site and exon files:

hisat2_extract_splice_sites.py hg38chr19.gtf > splice_sites.txt
hisat2_extract_exons.py hg38chr19.gtf > exons.txt
# hisat2-build --ss splice_sites.txt --exon exons.txt Homo_sapiens.GRCh38.dna.chromosome.19.fa hisat-indexes/hs_19

This command will take some time to run, so in the course we won't run this command. Instead, these indexes are included in the VM. You can find them in the folder called hisat-indexes.

Check the index files. You should have there files hs_19.1.ht2, hs_19.2.ht2...

ls -lh hisat-indexes

Why are there so many files? 

Please note that creating the full indexes takes some time. You want to do this only once, or maybe download the ready-made indexes from HISAT2 page (banner on the right side).

 

3. Align reads to reference genome with HISAT2

mkdir results-hisat

Align reads to reference genome using the index which is located in the hisat-indexes folder. Put the result files to results-hisat. Use

hisat2 --help

to understand the syntax. 

 

3A Bonus: Check the strandedness of your data

For alignment (HISAT2) and read counting (HTSeq), it is useful to know if our data was created with a directional protocol. This means that with some protocols, it is possible to tell from which strand the data originally came from: are the reads from the same strand as the genes/transcripts, or from the opposing ones. This helps in scenarios where there are genes present in both strands at the same location. You might now how your data is created, but if you don't know or aren't sure, you can check it. 

For checking the strandedness, we want to get a subset of the data (especially when our data would be a bit bigger than in this example case of ours!), try aligning it (with the unstranded option) and then use RSeQc tool infer_experiment.py to get an estimate of the strandedness. 

Subset the data: take 20 000 reads using Seq Kit sample tool:

seqkit sample -s 15 hesc.fastq -n 200000 -o results-hisat/hesc_subset.fastq

Run a "test alignment" with HISAT2 default parameters:

hisat2 -q -x hisat-indexes/hs_19 -U results-hisat/hesc_subset.fastq -S results-hisat/hesc_subset.sam

run the RSeqQC tool infer_experiment:

infer_experiment.py -i results-hisat/hesc_subset.sam -r refseq_19.bed > results-hisat/strandedness_data.txt
Check the output. Does the data seem to be ++,-- or +-,-+ type? The strandedness parameters in different tools can be a bit confusing. Study from RSeQC infer-experiment.py manual page what this output means. Compare this knowledge to the HISAT2 manual (check the - - rna-strandness parameter). Do you know which parameter to use for this data? Maybe the table on Chipster software's strandedness summary helps!
 
Bonus: Similarly, you could check the inner distance with the RSeQC inner-distance.py function. You need this information for the alignment of paired-end reads. Check the HISAT2 manual to see what are the default parameters for the minimum (-I/--minins ) and maximum (-X/--maxins) inner distances.
 

 

3B Alignment and BAM file

Finally, let's run HISAT2 to align our data:

hisat2 -q -x hisat-indexes/hs_19 -U results-trimmomatic/hesc-trimmed.fq.gz --rna-strandness F -S results-hisat/hesc.sam
 

What was the alignment rate? How many reads aligned multiple times? 

Let's convert the SAM into BAM.

cd results-hisat
samtools view -bS hesc.sam > hesc.bam
ls -lth

View the hesc.bam as text and check in the header if the alignments have been sorted:

samtools view -h hesc.bam | less

you can see the next page by pressing the spacebar and quit with q. Check the first couple of alignments. Can you find the different fields and tags?

Sort and index the BAM file:

samtools sort hesc.bam -o hesc_sorted.bam
ls -lth
samtools index hesc_sorted.bam
ls -lth

Return to the rnaseq folder with

cd ..

 

4. Perform quality control at alignment level with RseQC

Make a directory for RseQC result files:

mkdir results-rseqc

RseQC consists of several Python scripts, which we run separately. Check how many alignments are there and how many of them are non-primary alignments. Note that some of the result comes to the stderr stream, which you direct to a file using 2>, and some to stdout stream, which you direct to file using >. Open the folder results-rseqc so that you can view the files when they are generated.

Check some statistics about the alignment. How many alignments are there? How many reads are spliced?

bam_stat.py -i results-hisat/hesc_sorted.bam > results-rseqc/hesc-stats.txt

Get an overview of the read distribution. Do most of the reads map to exons?

read_distribution.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam > results-rseqc/hesc-distribution.txt

Check if this tool counts all reads:

read_distribution.py -help

Check if the coverage is uniform along the transcripts:

geneBody_coverage.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam -o results-rseqc/hesc
xdg-open results-rseqc/hesc.geneBodyCoverage.curves.pdf 

Check how many of the splice junctions are novel:

junction_annotation.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam -o results-rseqc/hesc 2> results-rseqc/hesc-junction-counts.txt 
xdg-open results-rseqc/hesc.splice_events.pdf 
xdg-open results-rseqc/hesc.splice_junction.pdf 

Check the saturation status of the splice junctions:

junction_saturation.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam -o results-rseqc/hesc
xdg-open results-rseqc/hesc.junctionSaturation_plot.pdf 

 

Bonus: Check the alignment in IGV genome browser

You can view the BAM files in IGV Genome Browser. Practise using IGV by browsing to https://igv.org/app/ and uploading the BAM and BAI files: Tracks -> Local files -> select hesc_sorted.bam and hesc_sorted.bai in the results-hisat folder (use ctrl to select two files). Check that you have the correct reference (hg19). Navigate for example to location chr19:281,093-281,252. Try zooming in and out. Which gene are we now looking at? Can you see any reads? What is the coverage? Can you spot any SNPs? 

 

RNA-seq tutorial front page

1. Preprocessing and trimming

2. Alignment

3. Analysis

4. Analysing in Puhti