Automate the analysis workflow using the array job functionality in Puhti

Once we know the steps we want to run for each sample, we can automate the analysis and run the tools in Puhti as a batch job. In our case, we use an array job, because we are running several similar, independent analyses.

Now we try a ready-made array script (rnaseq_array_job_script.sh) which needs the following things:

  • fastq.gz files for all samples
  • file fastq.list containing the fastq file names
  • BED file refseq_19.bed
  • the folder HISAT2-indexes

First, open Terminal and log in to Puhti with your user credentials.

ssh <csc_username>@puhti.csc.fi

More information about running jobs in Puhti:

Next, we move to our project's SCRATCH directory. You can't run analyses in your HOME directory.

Note: The SCRATCH directory is shared by all members of the project! Be a good researcher and don't mess with other people's data.

cd /scratch/project_xxxxxxx

More information about Puhti disk areas and projects:

Check whether there are already some files in the SCRATCH directory. Then make a new folder rnaseq_test_yourname (add your name!) and go there:

ls
mkdir rnaseq_test_yourname
cd rnaseq_test_yourname

Copy the files needed for the analysis to your folder and unzip the package:

wget https://a3s.fi/rnaseq_course_bucket/rnaseq_batch_job.tar.gz
tar -xvzf rnaseq_batch_job.tar.gz
cd rnaseq_batch_job
ls -lh

How many samples (=FASTQ files) are there? Check the files listed in fastq.list:

less fastq.list
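As a quick cross-check, the number of entries in the list can be counted and compared with the number of FASTQ files. This is a minimal sketch: the helper name count_entries is hypothetical, and it assumes that the FASTQ files sit in the current directory and end in .fastq.gz.

```shell
# Count the non-empty lines of a list file (hypothetical helper)
count_entries() {
    grep -c . "$1"
}

# Only report when the course files are present in the current directory
if [ -f fastq.list ]; then
    echo "Entries in fastq.list: $(count_entries fastq.list)"
    # Assumes the FASTQ files are here and named *.fastq.gz
    echo "fastq.gz files found:  $(ls ./*.fastq.gz 2>/dev/null | wc -l)"
fi
```

If the two numbers differ, the array job would either skip files or try to process files that do not exist.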

Check what the script looks like:

less rnaseq_array_job_script.sh

Are the parameters reasonable?

  • How many jobs are we launching? How many cores and how much memory does each job use? How much time is reserved for each job?
  • Which partition are we using? Is the project correctly assigned?
  • How is the sample name changed in each job?
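To help answer these questions, here is a minimal sketch of how an array job script typically picks its input. It is not the course script itself: every resource value, the helper nth_line, and the assumption that fastq.list holds one file name ending in .fastq.gz per line are illustrative only. Slurm sets SLURM_ARRAY_TASK_ID to a different index in each task, and that index selects one line of the list.

```shell
#!/bin/bash
#SBATCH --job-name=rnaseq_array        # all values here are illustrative
#SBATCH --account=project_xxxxxxx      # billing project (placeholder)
#SBATCH --partition=small
#SBATCH --time=00:30:00                # reserved per array task
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --array=1-4                    # one task per line of fastq.list
#SBATCH --output=array_job_%A_%a.out   # %A = job ID, %a = task index

# Print line number $2 of the list file $1 (hypothetical helper)
nth_line() {
    sed -n "${2}p" "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID in each task; default to 1 for a dry run
task_id="${SLURM_ARRAY_TASK_ID:-1}"

if [ -f fastq.list ]; then
    fastq_file=$(nth_line fastq.list "$task_id")
    sample="${fastq_file%.fastq.gz}"   # strip the assumed extension
    echo "Task ${task_id} would process ${fastq_file} as sample ${sample}"
fi
```

Run outside Slurm in the folder containing fastq.list, the sketch only prints what task 1 would process, which is a safe way to see how the sample name changes from task to task.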

To run all the tools, we need to load a couple of modules. Check which modules are loaded in the script. The software pages tell you which module each tool needs: for example, for FastQC you need to run module load biokit.

Run the script with the sbatch command.

# During a live course:
sbatch --reservation=our_reservation_name rnaseq_array_job_script.sh
# At other times:
sbatch rnaseq_array_job_script.sh
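On success, sbatch prints a line of the form "Submitted batch job 1062932". The job ID is needed in later steps, so it can be handy to keep it in a variable. A minimal sketch follows; the helper extract_job_id is hypothetical, and the message in the example is typed in by hand rather than produced by a real submission.

```shell
# Print the last word of the sbatch confirmation message (hypothetical helper)
extract_job_id() {
    echo "${1##* }"
}

# On Puhti the message would come from: msg=$(sbatch rnaseq_array_job_script.sh)
msg="Submitted batch job 1062932"
job_id=$(extract_job_id "$msg")
echo "Job ID: ${job_id}"
```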

Follow your job with the command:

squeue -l -u your_username

Check your finished job with the command:

sacct -u csc_username

Check the files created: 

ls -lh

Use the job ID (something like 1062932) to get some information about the run:

seff your_job_id

How much time did the job take? Were the reserved resources OK for the job (check the CPU and Memory Efficiency %)?
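To interpret the CPU efficiency figure, it helps to know that it relates the CPU time actually used to the reserved core-walltime (number of cores × elapsed time). The sketch below reproduces that arithmetic with made-up numbers; the helper cpu_efficiency is hypothetical and is not part of seff.

```shell
# CPU efficiency (%) = 100 * CPU seconds used / (cores * elapsed seconds)
# Arguments: $1 = CPU seconds used, $2 = reserved cores, $3 = elapsed seconds
cpu_efficiency() {
    awk -v c="$1" -v n="$2" -v w="$3" \
        'BEGIN { printf "%.1f\n", 100 * c / (n * w) }'
}

# Made-up example: 4 cores reserved for 10 minutes, 30 CPU minutes used
cpu_efficiency 1800 4 600   # prints 75.0
```

A low percentage suggests that fewer cores would have been enough; the same reasoning applies to the memory figure.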

 

Bonus: Working with paired-end data

 The lung and lymph node samples are paired-end. What modifications would the bash script need in order to run the analysis steps on these samples? Take a look at the rnaseq_array_job_script_PE.sh file. (Don't run the script during the course, as it takes 2 hours.)
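 One of the changes concerns the alignment step: single-end reads are given to HISAT2 with -U, whereas paired-end mates are given with -1 and -2. The following is a minimal sketch of that idea, not the content of the PE script. The file naming (_1 and _2 suffixes), the thread count, the sample name, and the INDEX_BASENAME placeholder are all assumptions, and the command is only printed, not run.

```shell
# Build the paired-end alignment command for one sample (dry run only).
# Assumed naming: <sample>_1.fastq.gz and <sample>_2.fastq.gz
pe_align_command() {
    sample="$1"
    echo "hisat2 -p 4 -x HISAT2-indexes/INDEX_BASENAME -1 ${sample}_1.fastq.gz -2 ${sample}_2.fastq.gz -S ${sample}.sam"
}

# Hypothetical sample name
pe_align_command lung
```

 Other steps that read the FASTQ files directly, such as quality control, need to handle both mate files in the same way.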

 

Moving your data to Allas

After you are done with the analysis, you want to store the data in Allas. From Allas, you can also easily share some of the files with your colleagues.

Load the allas module and set up a connection to Allas:

module load allas
allas-conf project_XXXXXXX

Now we can store our data in Allas. Once the data is in Allas, you can delete it from Puhti. The SCRATCH directory is automatically cleaned every 90 days, but keeping it tidy yourself makes it easier to remember to move your data to Allas in time. We use the a-commands to upload the data:

cd ..
ls -lh
a-put rnaseq_batch_job
a-check rnaseq_batch_job
a-list
a-list XXXXXX-puhti-SCRATCH

Where did the data go now? Try it a bit differently:

a-put rnaseq_batch_job -b yourname_rnaseq_bucket -o yourname_rnaseq_data
a-list yourname_rnaseq_bucket

Where is the data now?

Often we also want to share some of our data with our colleagues. If they are in the same project, they can of course access the data in Allas and in Puhti. If they are not, you can use the a-publish command and share the data via a URL.

NOTE: Remember that data shared this way is easily accessible to third parties as well! Before uploading the data with a-publish, consider encrypting it first with the gpg command.

cd rnaseq_batch_job
tar czf yourname_rnaseq_counts.tar.gz results-htseq
gpg --symmetric --cipher-algo AES256 yourname_rnaseq_counts.tar.gz
# give a password!
a-publish yourname_rnaseq_counts.tar.gz.gpg

Copy the public link and send it to your colleague. Send the password separately, for example by text message.

Your colleague can now download the package with wget and decrypt it; gpg asks for the password at that point. You can try this yourself:

wget link
gpg -o yourname_rnaseq_counts.tar.gz -d yourname_rnaseq_counts.tar.gz.gpg
# give the password when asked

 
