Quality control & trimming/filtering

Here we check the quality of the raw reads in the FASTQ files with FastQC and PRINSEQ. After this, we trim the reads with Trimmomatic to remove low-quality bases and reads. As a bonus exercise, you can also try the trimming step with PRINSEQ.

 

1. Check the quality of the reads with FastQC

Make a directory for the FastQC result files:

mkdir results-fastqc

Run FastQC:

fastqc -o results-fastqc hesc.fastq.gz

Check what files were created with:

ls -lh results-fastqc

Use:

fastqc --help

to see what the parameter -o means.

Open the hesc_fastqc.html file in a browser:

firefox results-fastqc/hesc_fastqc.html

You can compare your FastQC report to examples of good and bad Illumina data.

What does the quality look like? How long are the reads? How many reads are there? What do you think has happened?
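The read count and read lengths reported by FastQC can be cross-checked directly from the file. The following is a minimal sketch, assuming a gzip-compressed FASTQ file with exactly four lines per read (no wrapped sequences). The function names fastq_count_reads and fastq_length_table and the tiny demo file are made up for illustration; they are not part of FastQC.

```shell
#!/bin/sh
# Count the reads in a gzip-compressed FASTQ file.
# Assumption: every record takes exactly 4 lines.
fastq_count_reads() {
    gzip -cd < "$1" | awk 'END { print NR / 4 }'
}

# Print "length count" pairs, one line per distinct read length.
fastq_length_table() {
    gzip -cd < "$1" |
        awk 'NR % 4 == 2 { n[length($0)]++ }
             END { for (l in n) print l, n[l] }' |
        sort -n
}

# Tiny made-up FASTQ file so that the sketch runs on its own.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGT\n+\nIIII\n' |
    gzip -c > "$demo"

fastq_count_reads "$demo"    # prints 3
fastq_length_table "$demo"   # prints "4 2" and "6 1"
rm -f "$demo"
```

To apply the same functions to the tutorial data, call them with hesc.fastq.gz as the argument and compare the numbers with the Basic Statistics section of the FastQC report.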

 

 

2. Check the quality of the reads with PRINSEQ

Get a similar quality report using PRINSEQ. Compare the reports. Again, we start by creating a directory for the output files:

mkdir results-prinseq

Unzip the FASTQ file, because PRINSEQ cannot handle compressed files:

gunzip < hesc.fastq.gz > hesc.fastq

Now, since the file is unzipped, you can take a look at what the FASTQ file looks like. How many lines are there for each read? Can you spot the sequence and the quality codings for the reads?

head hesc.fastq
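Each character on the quality line encodes one base quality. As a minimal sketch, assuming the Phred+33 encoding (the same assumption that the -phred33 option makes in the Trimmomatic step of this exercise), a quality string can be decoded into numeric scores as follows. The function names phred33_score and phred33_decode are hypothetical helpers, not part of any tool used here.

```shell
#!/bin/sh
# Convert one quality character to its Phred score (Phred+33 encoding).
phred33_score() {
    # A leading single quote makes printf output the character's numeric code.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Decode a whole quality string into space-separated Phred scores.
phred33_decode() {
    qual=$1
    out=""
    while [ -n "$qual" ]; do
        rest=${qual#?}           # everything after the first character
        first=${qual%"$rest"}    # the first character itself
        out="$out$(phred33_score "$first") "
        qual=$rest
    done
    echo "${out% }"
}

phred33_decode '!5I'   # prints: 0 20 40
```

Copying a quality line from the head output into phred33_decode shows which positions of a read fall below a given threshold.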

First, make the graph file:

prinseq-lite.pl -fastq hesc.fastq -out_good null -out_bad null -graph_data results-prinseq/hescgraph -verbose

Check in the PRINSEQ manual what the different parameters mean using:

prinseq-lite.pl -help

Which statistics were chosen to be calculated for the graphs?

Check what files were created with:

ls -lh results-prinseq

Convert the graph file to an html report:

prinseq-graphs.pl -i results-prinseq/hescgraph -html_all -o results-prinseq/hesc_prinseq

Check what files were created with:

ls -lh results-prinseq

Open the hesc_prinseq.html file in a browser:

firefox results-prinseq/hesc_prinseq.html

Is there any new information compared to the FastQC report?

 

 

3. Trim reads based on base quality with Trimmomatic

mkdir results-trimmomatic

Trim bases from the 3' end if the base quality is less than 5, and keep only those reads which are at least 50 bases long after trimming:

trimmomatic SE -threads 1 -phred33 hesc.fastq.gz results-trimmomatic/hesc-trimmed.fq.gz TRAILING:5 MINLEN:50

Check in the screen output how many reads were dropped.

Check the Trimmomatic manual (linked below) to understand the syntax of the command. What is the TRAILING parameter for? 
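The effect of the two trimming steps can be illustrated with a short awk sketch. This is not Trimmomatic's implementation, only an illustration under stated assumptions: Phred+33 encoded FASTQ on standard input, four lines per read. The function name trim_trailing and the two-read demo input are made up for this example.

```shell
#!/bin/sh
# Illustrative sketch of "TRAILING:<minq> MINLEN:<minlen>"-style trimming.
# Usage: trim_trailing MINQ MINLEN < reads.fastq
trim_trailing() {
    awk -v minq="$1" -v minlen="$2" '
        # Lookup table: quality character -> Phred score (Phred+33).
        BEGIN { for (i = 33; i < 127; i++) phred[sprintf("%c", i)] = i - 33 }
        NR % 4 == 1 { name = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            # Walk back from the 3-prime end while the quality is below minq.
            keep = length($0)
            while (keep > 0 && phred[substr($0, keep, 1)] < minq) keep--
            # Keep the read only if enough bases remain after trimming.
            if (keep >= minlen) {
                print name
                print substr(seq, 1, keep)
                print plus
                print substr($0, 1, keep)
                kept++
            } else {
                dropped++
            }
        }
        END { printf "kept=%d dropped=%d\n", kept, dropped > "/dev/stderr" }
    '
}

# Demo: r1 loses its two low-quality 3-prime bases, r2 becomes too short.
printf '@r1\nACGTAC\n+\nIIII##\n@r2\nACGT\n+\nI###\n' | trim_trailing 5 3
```

In the demo, the character # stands for Phred score 2, so with a threshold of 5 and a minimum length of 3 only r1 survives, shortened to ACGT.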

Bonus: Check the ILLUMINACLIP parameter from the manual. What is the most likely cause for finding adapter sequences in the data? Do you need some extra files to remove the adapters?
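To get a rough feel for adapter content before reading about ILLUMINACLIP, you can count how many reads contain a given adapter fragment. The sketch below is only a crude exact-match count and assumes four lines per read; the function name count_adapter_reads is hypothetical. The fragment AGATCGGAAGAGC is the start of the common Illumina TruSeq adapter, and whether it applies to this data set is an assumption.

```shell
#!/bin/sh
# Count reads whose sequence line contains the given adapter fragment.
# Usage: count_adapter_reads FRAGMENT < reads.fastq
count_adapter_reads() {
    awk -v adapter="$1" '
        NR % 4 == 2 && index($0, adapter) { n++ }
        END { print n + 0 }
    '
}

# Demo on two made-up reads; only r2 contains the fragment.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACAGATCGGAAGAGCAC\n+\nIIIIIIIIIIIIIIIII\n' |
    count_adapter_reads AGATCGGAAGAGC    # prints 1
```

For the tutorial data, the equivalent call would be gunzip < hesc.fastq.gz | count_adapter_reads AGATCGGAAGAGC. The Adapter Content section of the FastQC report gives a more thorough view of the same question.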

After trimming, see if the quality of the data was improved: run FastQC again and compare the report to the original report.

mkdir results-fastqc-after-trimming
fastqc -o results-fastqc-after-trimming results-trimmomatic/hesc-trimmed.fq.gz 
firefox results-fastqc-after-trimming/hesc-trimmed_fastqc.html

What changes can you spot in the report?

 

4. BONUS exercise: Trim reads based on base quality with PRINSEQ

Try the trimming step with PRINSEQ as well, using identical parameters: trim bases from the 3' end if the base quality is less than 5, and keep only those reads which are at least 50 bases long after trimming.

prinseq-lite.pl -trim_qual_right 5 -trim_qual_rule lt -min_len 50 -no_qual_header -fastq hesc.fastq -out_good results-prinseq/hesc_trimmed -out_bad null -verbose

Check in the screen output: how many reads were dropped because of their quality, and how many because they became too short?

Use:

prinseq-lite.pl -help

to understand what the different parameters mean.

 
