5.5 Using wget to download data from web sites to CSC

Wget is a handy command for downloading files from web sites and FTP servers. Once you know the URL of the file, give it as an argument to the wget command to download the file to your current working directory.

wget ftp://path/to/the/file


For example, you could download the nucleotide sequence of human chromosome Y from the FTP site of UCSC (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/) with the command:

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrY.fa.gz


The command above would produce a file called chrY.fa.gz (a gzip-compressed FASTA file) in your working directory at CSC. You can also retrieve a group of files from an FTP server by using the asterisk (*) as a wildcard. For example, all human chromosome files could be downloaded with the command:

wget "ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr*.fa.gz"


Note that quotation marks around the URL are obligatory when an asterisk (or any other special character) is used, so that the shell passes the pattern to wget unchanged. This command would retrieve all the files whose names start with chr and end with .fa.gz, i.e. the sequence files for all chromosomes.
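
The effect of the quotation marks can be tried locally without downloading anything. The following is a minimal sketch, assuming a scratch directory called glob_demo and two empty stand-in files; both names are made up for this illustration only.

```shell
# Work in a scratch directory containing two empty, hypothetical files
mkdir -p glob_demo && cd glob_demo
touch chr1.fa.gz chr2.fa.gz

# Unquoted: the shell replaces the pattern with the matching local file names
echo chr*.fa.gz

# Quoted: the pattern is passed on unchanged, which is what wget needs
echo "chr*.fa.gz"

# Return to the original directory
cd ..
```

The first echo prints the two local file names, while the second prints the pattern itself. With an FTP URL the unquoted pattern rarely matches a local file, but quoting removes any dependence on how the shell treats unmatched patterns.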

You can fine-tune the behaviour of the wget command with several options. You can see the full list of available options with the command:

man wget


Some of the most commonly used wget options are listed below.

Table 5.2 Wget command options

Option       Argument    Description
-i           file name   Read the URLs to retrieve from the given file.
-O           file name   Name of the output file.
-o           file name   Name of the download log file.
-P           directory   Directory where the downloaded data will be saved. The default is . (the current directory).
-c                       Continue getting a partially downloaded file.
--user=      username    User name for file retrieval.
--password=  password    Password for file retrieval.
-N                       Use time-stamping. Download the file only if it is newer than the file in the target directory.
-m                       Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.

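For example, if you would like to retrieve just chromosomes 2, 3 and 7, you could first collect the addresses of the chromosome files into a single text file, one URL per line:

ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr2.fa.gz
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr7.fa.gz
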
If the name of the file is chr_2.3.7.list, the chromosomes could now be retrieved with the command:

wget -i chr_2.3.7.list
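
For a longer selection of chromosomes, the list file could also be generated with a short shell loop instead of being typed by hand. The following is a minimal sketch, assuming that the files follow the same chrN.fa.gz naming as chrY.fa.gz above; the variable names base and n are illustrative.

```shell
# Base address of the UCSC hg19 chromosome directory used in this section
base="ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes"

# Start from an empty list file, then add one URL per line
: > chr_2.3.7.list
for n in 2 3 7; do
    echo "${base}/chr${n}.fa.gz" >> chr_2.3.7.list
done

# Show the resulting list for checking before it is given to wget -i
cat chr_2.3.7.list
```

The loop only writes the text file; nothing is downloaded until the list is passed to wget with the -i option.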


The command can be modified further. For example, the command:

wget -i chr_2.3.7.list -O $WRKDIR/chr_2.3.7.fa.gz


would retrieve the same files, but instead of producing three separate files, it would concatenate them all into the file chr_2.3.7.fa.gz in the work directory ($WRKDIR). The directory is given as part of the output file name because the -P option does not affect a file named with -O.
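
Concatenating gzip-compressed files in this way still gives a usable result, because gzip tools read several compressed members in one file as a single stream. The following is a minimal local sketch of that behaviour, assuming two tiny stand-in FASTA files (a.fa and b.fa, with made-up content) in place of real downloads.

```shell
# Two small stand-in FASTA files (hypothetical content, not real sequence data)
printf '>seqA\nACGT\n' > a.fa
printf '>seqB\nTTGA\n' > b.fa

# Compress each file separately, keeping the originals
gzip -c a.fa > a.fa.gz
gzip -c b.fa > b.fa.gz

# Concatenate the compressed files, as wget -O does with its downloads
cat a.fa.gz b.fa.gz > ab.fa.gz

# Count the sequence header lines in the decompressed stream
gunzip -c ab.fa.gz | grep -c '>'
```

Both header lines are found in the decompressed output, so the sequences from each original file remain available in the combined file.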


