2.6 Packing and compression tools

When large data sets are stored at CSC or transported over the net it is usually reasonable to archive (i.e., pack multiple files into a single file) and compress (i.e., reduce its size without losing any data) the data. Archiving makes file transfer easier while compressed files require less storage space and are thus faster to move from one system to another. In this chapter we will provide introduction to targzipbzip2, and zip tools that are frequently used for archiving and compression.

Table 2.5: Frequently used archiving and compression tools and common corresponding file name extensions
Type Extension
A zip archive .zip, .ZIP, or .Z
A gzip compressed file .gz
A bzip2 compressed file .bz2, or .bz
A tar archive .tar
A gzip compressed tar archive .tar.gz, or .tgz
A bzip2 compressed tar archive .tar.bz2, .tar.bz, .tbz, or .tbz2


2.6.1 tar: packing several files into one file

tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from tape archive, as it was originally developed to write data to sequential I/O devices with no file system of their own. However, nowadays tar – and its GNU version, gtar – are mostly used for data archiving within a normal disk environment. The archive files created by tar contain various file system parameters, such as name, time stamps, ownership, file access permissions, and directory structures, which makes moving and storing large file sets easier that trying to manage separate files.

By default, tar does not compress the data. This means that the size of the tar archive file is the same as the sum of the sizes of packed files, plus some overhead metadata. If data compression is needed, you can use compression tools like gzip or bzip2 with tar.

tar and gtar are compatible with each other; if you archive your data with tar you can unarchive it with gtar and vice versa. Normally tar and/or gtar can be found from any Unix, Linux or Mac OS X system. In Windows systems you can use for example 7zip program to manage tar archive files.

The syntax of a tar command is:

tar options tar_archive file …

For example, we may have a directory called project_3, which contains nine files called sample1.txtsample2.txt, …, sample9.txt. To list the contents of the directory we could use the command ls -lh:

$ ls -lh project_3
total 53M
-rw-r--r--+ 1 testuser csc  16M Nov  9 10:41 sample1.txt
-rw-r--r--+ 1 testuser csc  16M Nov  9 10:41 sample2.txt
-rw-r--r--+ 1 testuser csc 1.3M Nov  9 10:41 sample3.txt
-rw-r--r--+ 1 testuser csc 1.9M Nov  9 10:41 sample4.txt
-rw-r--r--+ 1 testuser csc 1.9M Nov  9 10:41 sample5.txt
-rw-r--r--+ 1 testuser csc 3.7M Nov  9 10:42 sample6.txt
-rw-r--r--+ 1 testuser csc 4.0M Nov  9 10:42 sample7.txt
-rw-r--r--+ 1 testuser csc 3.9M Nov  9 10:42 sample8.txt
-rw-r--r--+ 1 testuser csc 3.9M Nov  9 10:42 sample9.txt

We can archive all the files in the project_3 directory to a tar archive called  project_3.tar with tar's create (c) command:

$ tar cvf project_3.tar project_3

The command above creates a new tar archive file, project_3.tar, which contains the directory project_3 and all the files in it. Note that the command does not modify or remove the original files in the source directory:

$ ls -lh
drwxr-xr-x+ 2 testuser csc   11 Nov  9 10:44 project_3
-rw-r--r--+ 1 testuser csc  52M Nov  9 10:46 project_3.tar

The tar command does not require that you assign a certain extension, like .tar, for your archive files. However, applying commonly used file name extensions will help you and other users to select right commands for processing the file later on. The newly created tar archive file can now be easily moved to another directory or system and then unarchived with the extract (x) command:

$ tar xvf project_3.tar

This command will create a directory called project_3, which contains all the same files as the original directory did. Note that if the extract command encounters a file that already exists in the file system, the existing file will be overwritten by the extracted file. This causes a potential danger: possibly a newer version of a file will be lost if an older version of the same file (or, just a file with a same name) exists in the tar archive that is being unarchived!

Table 2.6: tar commands
Command Operation
A Append tar files to an archive (i.e. concatenate archive files)
c Create a new tar archive
d Find differences between an archive and the file system
r Append files at the end of an archive
t List the contents of an archive
u Only append files that are newer than the existing ones in the archive
x Extract files from an archive
--delete Delete files from the archive

In addition to the commands above, the option f (file) is almost always used as it defines the name of the tar file to read from or to write to. The file name must follow right after the option, thus it is commonly the last option given. Below are some frequently used tar options.

Table 2.7: tar command options
Option Function
f Use the given file name as the source or target archive
v Be verbose, i.e. list processed files to the screen while processing
z Use gzip compression/decompression while creating or extracting an archive
j Use bzip2 compression/decompression while creating or extracting an archive

With options zip or j (no meaning, it was chosen because no meaningful letter were available) you can filter the archive through gzip or bzip2 respectively to (de)compress the archive on the fly:

$ tar cvzf project_3.tar.gz project_3

Here, in this sample case, the size of the uncompressed archive file is 52 MB but the compressed file is only 15 MB.

$ ls -lh
drwxr-xr-x+ 2 testuser csc   11 Nov  9 10:44 project_3
-rw-r--r--+ 1 testuser csc  52M Nov  9 10:46 project_3.tar
-rw-r--r--+ 1 testuser csc  15M Nov  9 10:56 project_3.tar.gz

Listing the contents of a tar archive file can be done with command list:

$ tar tf project_3.tar

You can also retrieve just one file from the archive by specifying the full file name to the data extraction command. For example, to extract just file sample2.txt from the compressed archive project_3.tar.gz we could use command:

$ tar xvzf project_3.tar.gz project_3/sample2.txt

2.6.2 Compressing files

Compressing files saves storage space, but it may take a lot of time. Data compression is CPU intensive and requires computing time: compressing a gigabyte size file takes about a minute and when several terabytes needs to be compressed, the task can easily require overnight computing. Therefore, compressing should mainly be used for large files that will need to be stored for a long period of time or moved to another location.

There are numerous algorithms and software tools available for data compression. Here we will focus on three tools that are most frequently used in Unix/Linux systems: gzipbzip2 and zip. What is common to all these tools is that they do the compression without data loss, i.e. when the files are uncompressed the data will be 100% identical to the original data. Typically, a text file compressed with one of these tools is about 20-40% of the size of the original file. However, the compressibility of a file depends heavily on the content of the file. gzip and gunzip

gzip is probably the most commonly used packing tool in Unix and Linux systems. It uses the Lempel-Ziv coding (LZ77) for compressing the data. gzip was already briefly mentioned in the tar chapter above, but gzip can also be used as a totally separate tool. The normal usage of gzip is straight forward. To compress a file you give the command:

$ gzip file_name

Running this command creates a compressed file and names it using the original file name with an extension .gz. When the compression is ready the original file is removed. If you want to preserve the original file, you need to use redirection, like:

$ gzip < file_name > file_name.gz

Decompressing a gzipped file is done with command gunzip. In addition to gzip compressed files, gunzip can also decompress  files compressed with zip command. The basic syntax of gunzip is analogous to the compressing command:

$ gunzip file_name.gz

The command above removes the compressed file when decompression is ready. If you wish the keep the compressed file, you'll need redirection again:

$ gunzip < file_name.gz > file_name

gzip has several command line options that are not discussed here. Use  man gzip  command to see the full list of available options. Also, note that gunzip command is actually just a wrapper for the gzip -d command (option -d instructs gzip to decompress, rather than compress) so you will not find a separate manual page for that. gzip example

Lets assume we are in the $WRKDIR directory of Taito-shell and there we have just one file called my_data.dat. Lets first check the size of that file with command ls -lh:

$ ls -lh
total 1.5G
-rw-r--r--+ 1 testuser csc 1.5G Nov  4 13:07 my_data.dat

The listing tells us that the size of the file is approximately 1,5 GB. Next, we compress the file with gzip and then check the file size again.

$ gzip my_data.dat
$ ls -lh
total 834M
-rw-r--r--+ 1 testuser csc 833M Nov  4 13:07 my_data.dat.gz 

The original file has now been removed and been replaced by the compressed file. However, as a result we now have a compressed file that requires only 833MB of disk space (55% of the original size). Next we decompress the file:

$ gunzip my_data.dat.gz
$ ls -lh
total 1.5G
-rw-r--r--+ 1 testuser csc 1.5G Nov  4 13:07 my_data.dat 

The file listing now shows that the compressed file has disappeared and the original file is available again. bzip2 and bunzip2

bzip2 is a compression program that is used very similarly compared to gzip. The main difference between the two programs is that bzip2 uses Burrows-Wheeler block sorting text compression algorithm combined with Huffman coding instead of the LZ77 algorithm used in gzip. The compression algorithm of bzip2 produces more effective compression than gzip's. However, computing the bzip2 compression usually is more complex and takes longer (i.e., uses more CPU cycles) than gzip compression. The usage of bzip2 is very similar to that of gzip, but not all the command line options are  identical. The basic compression syntax is:

$ bzip2 file_name
$ bzip2 < file_name > file_name.bz2

Similarly, the decompression can be done with command:

$ bunzip2 file_name.bz2
$ bunzip2 < file_name.bz2 > file_name

Note that a file compressed with bzip2 can not be uncompressed with gunzip and vice versa.

In addition to the standard bzip2 and bunzip2 programs, you can also use the parallel versions of bzip2 command: pbzip2 and pbunzip2. When these commands are used, the user must use option -p to define the number of processor cores to be used. For example, compressing the file my_data.dat using four cores can be done with command:

$ pbzip2 -p4 my_data.dat

Similarly, to decompress the file with two cores you can use command:

$ punbzip2 -p2 my_data.dat.bz2

The pbzip2 and pbunzip2 commands scale well for small core numbers. Already with two cores the pbzip2 is about as fast as gzip. The number of processors used does not affect to the actual result file. Thus, a file that has been compressed with parallel pbzip2 can be uncompressed with normal bunzip2 command and vice versa.

2.6.3 zip and unzip: the combined compression and file archiving tool

The zip program can be used for both archiving and compressing files. Given a list of files or a directories the zip command archives and compresses all the files in to a single zip archive file. So, in principle zip is analogous to the combination of tar and gzip commands. Later on the whole archive, or just certain files, can be extracted from the archive. The basic syntax of zip command is:

$ zip -options archive_file source_name

The source_name can be a list of files, directories, or a combination of both, that will be packed in to the archive_file. If the archive_file already exists, zip will replace the existing files in the archive with new ones from the source_name list, or add files if they do not exist in the archive_file yet. Note that unlike the tar command, zip does not add any files from subdirectories in to the archive by default. The option -r is needed to recursively add all files and subfolders from a given directory to the zip archive. Below is listed some commonly used zip command options.

Table 2.8: Option for the zip command
Option Function
-d Remove (delete) entries from a zip archive
-e Encrypt the contents of the zip archive using a  password  which is  entered  on  the terminal in response to a prompt
-f Replace  (freshen)  an existing file in the zip archive only if it has been modified more recently than the version  already in the zip archive
-l Translate  the Unix end-of-line character LF into the MSDOS convention CR LF
-ll Translate the MSDOS end-of-line CR LF into Unix LF
-r Travel the directory structure recursively
-u Update existing entries if newer on the file system and add new files
-@ Take the list of input files from standard input. Only one filename per line.

Creating a zip archive does not affect the original files. Note that the zip command will add the .zip extension to the archive file name by default, if it is not already included.

Zip archives can be extracted and studied with the unzip command. To extract files from a zip archive, use the command:

$ unzip archive_file_name

To just see the files included in the zip archive use command:

$ unzip -l archive_file_name

You can also extract just one file from an archive with command:

$ unzip archive_file_name file_name
Table 2.9: Options for the unzip command
Option Function
-f Freshen existing files, i.e., extract only those files that already exist on disk and  are newer than the ones on the disk
-l List the content of an archive file
-u Update existing files and create new ones if needed
-o Overwrite existing files without prompting
-p password Use password to decrypt an encrypted zip file zip example

To archive and compress the sample directory project_3, which contains the files sample1.txt, sample2.txt, …, sample9.txt (the same example that was used in the tar chapter) use the command:

$ zip -r project_3.zip project_3
  adding: project_3/ (stored 0%)
  adding: project_3/sample5.txt (deflated 71%)
  adding: project_3/sample2.txt (deflated 72%)
  adding: project_3/sample3.txt (deflated 70%)
  adding: project_3/sample4.txt (deflated 71%)
  adding: project_3/sample9.txt (deflated 71%)
  adding: project_3/sample7.txt (deflated 71%)
  adding: project_3/sample1.txt (deflated 73%)
  adding: project_3/sample6.txt (deflated 72%)
  adding: project_3/sample8.txt (deflated 71%)

Note that if you use the same zip command without the -r option, the archive file will not include the sample files in the directory but just the directory. However, in this case the project_3 directory does not include any subdirectories, so you could also do archiving with command:

$ zip project_3.zip project_3/*
  adding: project_3/sample1.txt (deflated 73%)
  adding: project_3/sample2.txt (deflated 72%)
  adding: project_3/sample3.txt (deflated 70%)
  adding: project_3/sample4.txt (deflated 71%)
  adding: project_3/sample5.txt (deflated 71%)
  adding: project_3/sample6.txt (deflated 72%)
  adding: project_3/sample7.txt (deflated 71%)
  adding: project_3/sample8.txt (deflated 71%)
  adding: project_3/sample9.txt (deflated 71%)

In the same way, you can later add a new file to an existing zip archive, e.g.:

$ cp sample10.txt project_3/
$ zip project_3.zip project_3/sample10.txt
 adding: project_3/sample10.txt (deflated 69%)

You can check the contents of the zip archive with unzip and option -l (small letter L):

$ unzip -l project_3.zip
Archive:  project_3.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
 16662202  11-09-09 10:41   project_3/sample1.txt
 16397702  11-09-09 10:41   project_3/sample2.txt
  1303352  11-09-09 10:41   project_3/sample3.txt
  1925824  11-09-09 10:41   project_3/sample4.txt
  1989706  11-09-09 10:41   project_3/sample5.txt
  3813333  11-09-09 10:42   project_3/sample6.txt
  4176523  11-09-09 10:42   project_3/sample7.txt
  4056375  11-09-09 10:42   project_3/sample8.txt
  4085713  11-09-09 10:42   project_3/sample9.txt
  6541306  11-10-09 13:07   project_3/sample10.txt
 --------                   -------
 60952036                   10 files

To extract just the file sample3.txt from the zip archive, use the command:

$ unzip project_3.zip project_3/sample3.txt
Archive:  project_3.zip
replace project_3/sample3.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: project_3/sample3.txt 

To extract all the files from the archive, use the command:

$ unzip project_3.zip
Archive:  project_3.zip
replace project_3/sample1.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:A
  inflating: project_3/sample1.txt
  inflating: project_3/sample2.txt
  inflating: project_3/sample3.txt
  inflating: project_3/sample4.txt
  inflating: project_3/sample5.txt
  inflating: project_3/sample6.txt
  inflating: project_3/sample7.txt
  inflating: project_3/sample8.txt
  inflating: project_3/sample9.txt
  inflating: project_3/sample10.txt

2.6.4 Creating very large compressed tar files

Tasks on Taito login nodes have one hour time limit. This might not be enough if you archive and compress very large data files. One way around it is to run the tar command on taito-shell and yet another improvement would be using the following command, which will also utilize a parallel version of the compression algorithm and display possible error messages.

$ tar c - folder-to-store | pbzip2 -c -p4 > archive.tar.bz2