Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements three algorithms, BWA-MEM (mem), BWA-Backtrack (aln) and BWA-SW (bwasw). BWA-Backtrack works for query sequences shorter than 200bp. The other two algorithms are used longer reads up to around 100kbp. BWA-MEM is recommend for reads longer than 70 gb. Both algorithms do gapped alignment.
BWA can be used to align both single-end and paired end reads to a reference genome or sequence set.
Version on CSC's Servers
Taito: 0.7.17, 0.7.15
FGI grid: 0.6.1
Chipster graphical user interface
To initialize the program in Taito you need to give command:
module load biokit
The basic syntax of BWA commands is:
bwa <command> [options]
Reference genome indexing
CSC does not maintain pre-compiled BWA indexes for reference genomes. Thus normally the fist step in creating alignment with BWA is downloading the reference genome and indexing it. Please note that your $HOME directory is often too small for working with complete genomes. In stead you should do the analysis in temporary directories like $WRKDIR.
You can use for example command ensemblfetch or wget to download a reference genome to CSC. For example
The command above retrieves the human genome sequence to a file called. Homo_sapiens.GRCh37.75.dna.toplevel.fa. You can calculate the BWA indexes for this file with command:
bwa index -a bwtsw homo_sapiens.GRCh37.75.dna.toplevel.fa
Note that for small less than 2 GB reference genomes you could use faster, "is" indexing algorithm ( bwa index -a is )
Once the indexing is ready you can carry out the alignment for singe-end reads with command:
bwa mem Homo_sapiens.GRCh37.63.dna.toplevel.fa reads.fastq > aln.sam
If you wish to use the aln (BWA-Backtrack) algorithm you need to do the alignment in two steps.
First calculate the actual alignment:
bwa aln Homo_sapiens.GRCh37.63.dna.toplevel.fa reads.fastq > aln_sa.sai
The result file is in BWA specific .sai format that you can convert to SAM format with bwa samse command:
bwa samse Homo_sapiens.GRCh37.63.dna.toplevel.fa aln_sa.sai reads.fastq > aln.sam
Paired end alignment
If you use the MEM algorithm you can do the paired-end alignment with just one command:
bwa mem Homo_sapiens.GRCh37.63.dna.toplevel.fa read1.fq read2.fq > aln.sam
In this case of BWA-Backtrack algorithm you should first do a separate alignment run for each read file:
bwa aln Homo_sapiens.GRCh37.63.dna.toplevel.fa reads1.fq > aln1.sai bwa aln Homo_sapiens.GRCh37.63.dna.toplevel.fa reads2.fq > aln2.sai
The two sai alignment files are combined with command bwa sampe:
bwa sampe Homo_sapiens.GRCh37.63.dna.toplevel.fa aln1.sai aln2.sai reads1.fq reads2.fq > aln.sam
Running BWA batch jobs in Taito
In Taito, BWA is initialized with command:
module load biokit
The biokit module sets up a set of commonly used bioinformatics tools. (Note however that there are bioinformatics tools in Taito, that have a separate setup commands.)
In Taito, BWA jobs should be run as batch jobs. Below is a sample batch job file, for running a BWA job in Taito:
#!/bin/bash -l #SBATCH -J bwa #SBATCH -o output_%j.txt #SBATCH -e errors_%j.txt #SBATCH -t 24:00:00 #SBATCH -n 1 #SBATCH --nodes=1 #SBATCH --cpus-per-task=4 #SBATCH --mem-per-cpu=8000 # #load the bio tools module load biokit # Index the reference genome bwa index -a bwtsw Homo_sapiens.GRCh37.71.dna.toplevel.fa # Run the alignnments bwa mem -t $SLURM_CPUS_PER_TASK Homo_sapiens.GRCh37.63.dna.toplevel.fa reads1.fq reads2.fq > aln.sam
In the batch job example above one BWA task (-n 1) is executed. The BWA job uses 4 cores (--cpus-per-task=4 ) with total of 32 GB of memory (--mem-per-cpu=8000). The maximum duration of the job is four hours (-t 04:00:00 ). All the cores are assigned from one computing node (--nodes=1 ).
You can submit the batch job file to the batch job system with command:
See the chapter 3 of the Taito user guide for more information about running batch jobs.
Running BWA alignments utilizing grid computing
Aligning millions of reads to a large reference genome can take several hours or even days. Using grid computing through grid_bwa command you can speed up the alignment process ten fold or more. grid_bwa command splits the alignment task into several subtasks that it submits to be simultaneously executed in the FGI grid environment. When all the subtasks are ready they are collected and combined into a single result alignment.
To be able to use grid_bwa command you should have:
- A valid grid certificate installed in the hippu.csc.fi server.
- Membership of FGI Virtual Organization
Please check the detailed instructions to obtain and install a grid certificate and to join FGI Virtual organization.
Once you have the certificate installed and the Virtual organization membership is approved, you can submit a grid_bwa jobs with for example following commands:
module load bwa module load nordugrid-arc arcproxy -c validityPeriod=72h grid_bwa aln -query seq_set1.fq -query2 seq_set2.fq -ref ref_genome.fasta -out paired_end_results
For more information about grid_bwa, run command:
More information about BWA can be found from: