RNAseq-alignment - Services for Research
We retrieve a reference sequence, modify it a bit for testing purposes, create the indexes and finally align our reads to this reference using HISAT2 aligner. Then we check our alignment quality with RSeQC tool. We also learn how to check the strandedness of the data.
1. Retrieve reference genome sequence, a GTF and a BED file
In this exercise we retrieve reference genome (fasta) in order to build index files for HISAT2. For the interest of time, we'll use reference data only for chromosome 19 (as our reads in the fastq file come from genes located in that chromosome). Normally you would use the whole genome. We also fetch the gene locations as a GTF and BED file formats for later use.
With your web browser, go to http://www.ensembl.org/info/data/ftp/.
In the Single species data table, click on the link for DNA (FASTA) for Homo Sapiens. Right-click on the chromosome 19 file and copy the link (ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz)
Get the file to the virtual machine by typing wget and pasting the link to the command line:
In the same way, obtain the GTF file: In the Single species data table, click on the link for Gene sets, GTF for Homo Sapiens. Copy the link for Homo_sapiens.GRCh38.82.gtf.gz (ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz)
Get the file by typing wget and pasting the link:
Check that you have the files and unzip the files in the folder.
Check the beginning of the GTF file to see where the chromosome information is located for each entry.
Make a subset of the GTF file so that it contains only genes for chromosome 19 (take rows starting with 19).
grep -w ^19 Homo_sapiens.GRCh38.82.gtf > hg38chr19.gtf
Have a look at the beginning of the new GTF file. Does the first Havana transcript entry have different values in gene_id and transcript_id fields? This is important for the HTSeq tool that we will run later on.
head -2 hg38chr19.gtf
Remove the original, whole genome GTF file using
RseQC tool used after the alignment needs a BED file with gene locations and a BAM file. You can fetch a BED file for chromosome 19 using UCSC Table browser. BED file included in the virtual machine (refseq_chr19_actual.bed) was originally obtained this way. It has a different chromosome naming to what is used in your BAM file (chr19 vs 19), so you need to remove the chr prefix first.
Substitute chr with nothing using the command sed:
sed s/^chr// refseq_chr19_actual.bed > refseq_19.bed
Have a look at the BED file to see if the chr is gone with:
2. Make an index file for HISAT2
Note: in the course, indezes are already done for you: you don't need to run the commands with # in beginning!
# mkdir hisat-indexes
# hisat2-build Homo_sapiens.GRCh38.dna.chromosome.19.fa hisat-indexes/hs_19
You can build the index with just the command above. However, we are learning to use also the gene locations information from the GTF file to create the index. For this, you first need to generate these HISAT specific splice site and exon files:
hisat2_extract_splice_sites.py hg38chr19.gtf > splice_sites.txt hisat2_extract_exons.py hg38chr19.gtf > exons.txt # hisat2-build --ss splice_sites.txt --exon exons.txt Homo_sapiens.GRCh38.dna.chromosome.19.fa hisat-indexes/hs_19
This command will take some time to run, so in the course we won't run this command. Instead, these indexes are included in the VM. You can find them in the folder called hisat-indexes.
Check the index files. You should have there files hs_19.1.ht2, hs_19.2.ht2...
ls -lh hisat-indexes
Why are there so many files?
Please note that creating the full indexes takes some time. You want to do this only once, or maybe download the ready-made indexes from HISAT2 page (banner on the right side).
3. Align reads to reference genome with HISAT2
Align reads to reference genome using the index which is located in the hisat-indexes folder. Put the result files to results-hisat. Use
to understand the syntax.
3A Bonus: Check the strandedness of your data
For alignment (HISAT2) and read counting (HTSeq), it is useful to know if our data was created with a directional protocol. This means that with some protocols, it is possible to tell from which strand the data originally came from: are the reads from the same strand as the genes/transcripts, or from the opposing ones. This helps in scenarios where there are genes present in both strands at the same location. You might now how your data is created, but if you don't know or aren't sure, you can check it.
For checking the strandedness, we want to get a subset of the data (especially when our data would be a bit bigger than in this example case of ours!), try aligning it (with the unstranded option) and then use RSeQc tool infer_experiment.py to get an estimate of the strandedness.
Subset the data: take 20 000 reads using Seq Kit sample tool:
seqkit sample -s 15 hesc.fastq -n 200000 -o results-hisat/hesc_subset.fastq
Run a "test alignment" with HISAT2 default parameters:
hisat2 -q -x hisat-indexes/hs_19 -U results-hisat/hesc_subset.fastq -S results-hisat/hesc_subset.sam
run the RSeqQC tool infer_experiment:
infer_experiment.py -i results-hisat/hesc_subset.sam -r refseq_19.bed > results-hisat/strandedness_data.txt
3B Alignment and BAM file
Finally, let's run HISAT2 to align our data:
hisat2 -q -x hisat-indexes/hs_19 -U results-trimmomatic/hesc-trimmed.fq.gz --rna-strandness F -S results-hisat/hesc.sam
What was the alignment rate? How many reads aligned multiple times?
Let's convert the SAM into BAM.
cd results-hisat samtools view -bS hesc.sam > hesc.bam ls -lth
View the hesc.bam as text and check in the header if the alignments have been sorted:
samtools view -h hesc.bam | less
you can see the next page by pressing the spacebar and quit with q. Check the first couple of alignments. Can you find the different fields and tags?
Sort and index the BAM file:
samtools sort hesc.bam -o hesc_sorted.bam ls -lth samtools index hesc_sorted.bam ls -lth
Return to the rnaseq folder with
4. Perform quality control at alignment level with RseQC
Make a directory for RseQC result files:
RseQC consists of several Python scripts, which we run separately. Check how many alignments are there and how many of them are non-primary alignments. Note that some of the result comes to the stderr stream, which you direct to a file using 2>, and some to stdout stream, which you direct to file using >. Open the folder results-rseqc so that you can view the files when they are generated.
Check some statistics about the alignment. How many alignments are there? How many reads are spliced?
bam_stat.py -i results-hisat/hesc_sorted.bam > results-rseqc/hesc-stats.txt
Get an overview of the read distribution. Do most of the reads map to exons?
read_distribution.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam > results-rseqc/hesc-distribution.txt
Check if this tool counts all reads:
Check if the coverage is uniform along the transcripts:
geneBody_coverage.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam -o results-rseqc/hesc xdg-open results-rseqc/hesc.geneBodyCoverage.curves.pdf
Check how many of the splice junctions are novel:
junction_annotation.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam -o results-rseqc/hesc 2> results-rseqc/hesc-junction-counts.txt
Check the saturation status of the splice junctions:
junction_saturation.py -r refseq_19.bed -i results-hisat/hesc_sorted.bam -o results-rseqc/hesc xdg-open results-rseqc/hesc.junctionSaturation_plot.pdf
Bonus: Check the alignment in IGV genome browser
You can view the BAM files in IGV Genome Browser. Practise using IGV by browsing to https://igv.org/app/ and uploading the BAM and BAI files: Tracks -> Local files -> select hesc_sorted.bam and hesc_sorted.bai in the results-hisat folder (use ctrl to select two files). Check that you have the correct reference (hg19). Navigate for example to location chr19:281,093-281,252. Try zooming in and out. Which gene are we now looking at? Can you see any reads? What is the coverage? Can you spot any SNPs?