News: The new version, LW-FQZip 2, is now available.

LW-FQZip is a lossless, light-weight, reference-based compression tool for FASTQ data. It first splits the input FASTQ data into three streams (metadata, short reads, and quality scores) and then removes the redundancy of each stream independently, according to its own characteristics. The metadata and quality scores are compacted with incremental and run-length-limited encoding schemes, respectively. The short reads are stored using a reference-based compression scheme built on the light-weight mapping model. Afterward, the three processed data streams are packed together with a general-purpose compression algorithm such as LZMA, bzip, gzip, or Lzip. Currently, LZMA and Lzip are available in the package; an API for calling other compression algorithms will be provided in the next version. The general framework of the program is shown below.

[Figure: general framework of LW-FQZip]
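The three-stream split described above can be illustrated with a minimal Python sketch. It assumes the common four-line FASTQ layout (header, sequence, '+' separator, quality) with no line wrapping. The function name split_fastq is hypothetical and is not part of the LW-FQZip source, which is written in C.

```python
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FASTQ records into metadata, read, and quality streams."""
    metadata: List[str] = []
    reads: List[str] = []
    qualities: List[str] = []
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line and not record:
            continue  # ignore blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, separator, quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {record!r}")
            if len(sequence) != len(quality):
                raise ValueError("read and quality lengths differ")
            metadata.append(header)
            reads.append(sequence)
            qualities.append(quality)
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
    return metadata, reads, qualities
```

Each returned list can then be handed to a coder suited to that stream, which is the motivation for the split.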

For implementation details, please refer to our paper:
Y. Zhang, L. Li, Y. Yang, X. Yang, S. He, and Z. Zhu, Light-weight reference-based compression of FASTQ data, BMC Bioinformatics, 16:188, 2015.
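
As a rough illustration only, the two ideas named above for the metadata and quality streams can be sketched in Python. The sketch uses textbook incremental (front) coding and plain run-length coding with a capped run length. Both are assumptions made for illustration; the schemes actually used by LW-FQZip are described in the paper and may differ. All function names and the max_run parameter are hypothetical.

```python
from typing import List, Tuple


def incremental_encode(headers: List[str]) -> List[Tuple[int, str]]:
    """Store each header as (length of prefix shared with the previous one, suffix)."""
    encoded = []
    previous = ""
    for header in headers:
        shared = 0
        limit = min(len(previous), len(header))
        while shared < limit and previous[shared] == header[shared]:
            shared += 1
        encoded.append((shared, header[shared:]))
        previous = header
    return encoded


def incremental_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Invert incremental_encode."""
    headers = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        headers.append(previous)
    return headers


def run_length_encode(quality: str, max_run: int = 255) -> List[Tuple[str, int]]:
    """Collapse repeated symbols into (symbol, count) pairs, capping each count."""
    runs: List[Tuple[str, int]] = []
    for symbol in quality:
        if runs and runs[-1][0] == symbol and runs[-1][1] < max_run:
            runs[-1] = (symbol, runs[-1][1] + 1)
        else:
            runs.append((symbol, 1))
    return runs


def run_length_decode(runs: List[Tuple[str, int]]) -> str:
    """Invert run_length_encode."""
    return "".join(symbol * count for symbol, count in runs)
```

Consecutive FASTQ headers usually share a long prefix and quality strings often contain long runs of one symbol, which is why both streams shrink under schemes of this kind.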

Source Code:

LWFQZip

Sample Data:

FASTQ data: SRR1063349.fastq

Reference: NC_017634.1.fasta

Installation:

LW-FQZip is written in C and runs on 32-bit and 64-bit Linux. At least 1 GB of memory is recommended. To install the program, download and decompress the source code, then compile it in a GNU environment with the GCC compiler using the following commands:

make clean
make

The executable files ‘LWFQZip’ and ‘LWMapping’ will be generated in the same directory as the source code. The light-weight mapping model can also be used independently of LWFQZip. Make sure that the LZMA compression tool 7zip, Lzip, and the archiving tool tar are installed and configured correctly before running LWFQZip. The corresponding executables 7za, lzip, and tar must be located in the same directory as ‘LWFQZip’. Use ‘chmod’ to grant execute permission on these files if necessary.
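
The requirement above (helper executables in the same directory, with execute permission) can be checked before a run. The following is a minimal Python sketch, not part of the LW-FQZip package; the function name missing_tools is hypothetical, and the list of file names is taken from the paragraph above.

```python
import os
from typing import Iterable, List

# Executables the text says must sit in the same directory as LWFQZip.
REQUIRED_TOOLS = ("LWFQZip", "7za", "lzip", "tar")


def missing_tools(directory: str, tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names that are absent from `directory` or not executable."""
    missing = []
    for name in tools:
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

An empty result means every listed file exists and is executable; any name returned needs to be copied into the directory or fixed with ‘chmod’.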

The following simple example illustrates the use of ‘LWFQZip’. To compress the sample FASTQ file SRR1063349.fastq against the reference NC_017634.1.fasta, run ‘LWFQZip -c’:

LWFQZip -c -i SRR1063349.fastq -r NC_017634.1.fasta

Here, the target FASTQ file is first mapped to the reference, producing an intermediate output file ‘SRR1063349.fastq.map.txt’ (in SAM format) in the same directory. The original data is then compressed based on the mapping results, yielding the compressed file ‘SRR1063349.fastq.lw’.

To decompress the file, the command ‘LWFQZip -d’ should be called.

LWFQZip -d -i SRR1063349.fastq.lw -r NC_017634.1.fasta

More parameters can be specified for the mapping and compression stages as follows:

COMMANDS AND OPTIONS

LWFQZip <mode>...[options]

  Mode:
    -c  compression.
    -d  decompression.

  Compression/Decompression Options:
    -i  input FASTQ file or compressed file.
    -r  input reference file.
    -m  maximal read length, ranging from 30000 to 300000 (Default: '-m 30000').
    -h  help.
    -v  version.
    -t  general-purpose compression algorithm: '0' for LZMA, '1' for Lzip (Default: '-t 0'). With LZMA the compressed file is suffixed with '.lw'; with Lzip, '.lz'.

  Mapping Options:
    -p  kmer prefixes, e.g., 'CG', 'AT', and 'TAG' (Default: '-p CG'). 'AA' is not recommended as a prefix.
    -k  length of the kmers used to locate local alignments (Default: '-k 8').
    -l  minimum length of a legal alignment (Default: '-l 30').
    -e  tolerance rate of mismatches (Default: '-e 0.05').
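
For scripted use, the options above can be assembled programmatically. The sketch below is a hypothetical Python helper, not something shipped with LW-FQZip. It uses only the flags listed above and assumes that the optional flags are simply appended after the input and reference arguments, as in the examples.

```python
from typing import List, Optional

# Suffix of the compressed file for each value of '-t' (0 = LZMA, 1 = Lzip).
PACKER_SUFFIX = {0: ".lw", 1: ".lz"}


def build_command(mode: str, input_path: str, reference: str,
                  packer: Optional[int] = None,
                  kmer_length: Optional[int] = None,
                  min_alignment: Optional[int] = None,
                  mismatch_rate: Optional[float] = None,
                  executable: str = "LWFQZip") -> List[str]:
    """Build an LWFQZip argument list from the documented options."""
    if mode not in ("-c", "-d"):
        raise ValueError("mode must be '-c' or '-d'")
    if packer is not None and packer not in PACKER_SUFFIX:
        raise ValueError("packer must be 0 (LZMA) or 1 (Lzip)")
    command = [executable, mode, "-i", input_path, "-r", reference]
    optional = (("-t", packer), ("-k", kmer_length),
                ("-l", min_alignment), ("-e", mismatch_rate))
    for flag, value in optional:
        if value is not None:
            command += [flag, str(value)]
    return command


def compressed_name(input_path: str, packer: int = 0) -> str:
    """Name of the compressed file according to the '-t' description."""
    return input_path + PACKER_SUFFIX[packer]
```

The resulting list can be passed to subprocess.run(..., check=True) from the directory that contains the executables.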

We compared LW-FQZip with other state-of-the-art FASTQ data compression algorithms on eight real-world data sets from the Sequence Read Archive of the National Center for Biotechnology Information (NCBI). The experimental results and discussion are available in our paper. They show that LW-FQZip achieves better compression ratios than the other methods tested, including Quip, DSRC, DSRC2, and Fqzcomp. Supplementary results on running time and the effects of the parameters are provided below.

The running time of LW-FQZip on the test data sets is reported in Table S1. Since LW-FQZip is a reference-based method that has to map the reads to a reference, it can be slower than reference-free methods. The total running time of LW-FQZip is the sum of the running times of three sub-procedures: preprocessing, mapping, and packing. Three packing tools (LZMA, bzip2, and arithmetic coding) were tested, and LZMA consumed more time than the other two. Parallelism and more efficient coding schemes will be introduced to LW-FQZip in the coming version, which is expected to improve time and space efficiency significantly.

Table S1: Running time of LW-FQZip in seconds

Data        Preprocessing   Mapping   Packing (LZMA)   Packing (bzip2)   Packing (Arithmetic Coding)
ERR231645   88.62           46.60     913.08           145.07            130.56
ERR005143   44.53           56.21     358.75           63.95             59.06
SRR352384   709.02          289.02    5071.74          1007.42           694.04
SRR801793   139.13          79.19     1378.25          287.88            251.82
SRR554369   38.99           105.33    477.65           93.47             75.82
ERR654984   48.25           69.37     754.54           124.13            106.78
ERR233152   63.83           82.43     298.34           54.21             41.44
SRR327342   236.71          148.38    3567.75          759.06            589.05
Total running time = Preprocessing time + Mapping time + Packing time
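
As a worked example of the formula above, the following short Python sketch sums the three components for a given data set and packing tool. The numbers are copied from Table S1; the dictionary layout and the function name total_time are chosen here for illustration only.

```python
# (preprocessing, mapping, {packing tool: packing time}), all in seconds, from Table S1.
RUNNING_TIMES = {
    "ERR231645": (88.62, 46.60, {"LZMA": 913.08, "bzip2": 145.07, "arithmetic": 130.56}),
    "ERR005143": (44.53, 56.21, {"LZMA": 358.75, "bzip2": 63.95, "arithmetic": 59.06}),
    "SRR352384": (709.02, 289.02, {"LZMA": 5071.74, "bzip2": 1007.42, "arithmetic": 694.04}),
    "SRR801793": (139.13, 79.19, {"LZMA": 1378.25, "bzip2": 287.88, "arithmetic": 251.82}),
    "SRR554369": (38.99, 105.33, {"LZMA": 477.65, "bzip2": 93.47, "arithmetic": 75.82}),
    "ERR654984": (48.25, 69.37, {"LZMA": 754.54, "bzip2": 124.13, "arithmetic": 106.78}),
    "ERR233152": (63.83, 82.43, {"LZMA": 298.34, "bzip2": 54.21, "arithmetic": 41.44}),
    "SRR327342": (236.71, 148.38, {"LZMA": 3567.75, "bzip2": 759.06, "arithmetic": 589.05}),
}


def total_time(dataset: str, packer: str) -> float:
    """Total running time = preprocessing + mapping + packing (seconds)."""
    preprocessing, mapping, packing = RUNNING_TIMES[dataset]
    return preprocessing + mapping + packing[packer]


if __name__ == "__main__":
    for name in RUNNING_TIMES:
        print(name, f"{total_time(name, 'LZMA'):.2f}")
```

For ERR231645 with LZMA this gives 88.62 + 46.60 + 913.08 = 1048.30 seconds, of which packing accounts for the large majority.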

The effects of the parameters k, L, and e (options -k, -l, and -e) were investigated on three representative data sets. The three parameters were tuned within {8, 10, 12}, {12, 20, 30}, and {0.02, 0.05, 0.08}, respectively. The results reported in Table S2 suggest that the compression ratio does not differ significantly across the tested values of L and e. The parameter k sets the length of the kmers. Longer kmers result in more precise mapping but lower tolerance of mismatches. The user can increase k to allow higher alignment sensitivity at the cost of more memory consumption. Following the kmer settings used in many short-read assembly applications, k is set to about 10 in our experiments. Based on these empirical studies, the setting k=10 (or k=12), e=0.05, and L=12 is a safe choice in terms of mapping time and compression ratio.
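
To make the roles of k and the kmer prefix more concrete, here is a small Python sketch of prefix-restricted kmer indexing. It is an illustration inferred from the option descriptions, not the LWMapping implementation: it assumes that only reference kmers beginning with the chosen prefix are indexed and that a read is anchored wherever one of its kmers hits the index. The function names are hypothetical, and the subsequent local alignment step (governed by L and e) is omitted.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int = 8, prefix: str = "CG") -> Dict[str, List[int]]:
    """Index the start positions of reference kmers that begin with `prefix`."""
    if k < len(prefix):
        raise ValueError("k must be at least the prefix length")
    index: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(reference) - k + 1):
        kmer = reference[start:start + k]
        if kmer.startswith(prefix):
            index[kmer].append(start)
    return dict(index)


def candidate_positions(read: str, index: Dict[str, List[int]],
                        k: int = 8, prefix: str = "CG") -> List[int]:
    """Reference offsets where the read might start, implied by shared kmers."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        kmer = read[offset:offset + k]
        if kmer.startswith(prefix):
            for position in index.get(kmer, ()):
                candidates.add(position - offset)
    return sorted(candidates)
```

In this sketch a longer k makes each indexed kmer more specific, so fewer spurious candidates are produced, but a single mismatch inside the kmer is enough to lose the anchor.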

Table S2: Effects of the parameters k, L, and e.

k=10, L=12, e=0.05
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   66.46          0.128               336,470           5,421,935   585,634     18,066
ERR005143   118.41         0.152               33,164            0           3,517,969   1,418,260
ERR654984   141.46         0.140               1,211             0           1,166,084   1,107,540

k=10, L=12, e=0.02
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   68.07          0.128               340,086           5,349,379   654,574     18,431
ERR005143   117.90         0.152               33,936            0           3,517,197   1,418,416
ERR654984   141.02         0.141               1,216             0           1,166,079   1,107,543

k=10, L=12, e=0.08
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   68.18          0.128               336,470           5,455,223   552,346     17,880
ERR005143   117.56         0.152               33,164            1           3,517,968   1,417,890
ERR654984   142.10         0.140               1,211             0           1,166,084   1,107,540

k=8, L=12, e=0.05
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   375.30         0.129               397,485           5,165,072   781,482     53,572
ERR005143   929.47         0.159               126,480           0           3,424,653   1,022,413
ERR654984   1341.84        0.153               5,854             0           1,161,441   710,779

k=12, L=12, e=0.05
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   48.54          0.127               596,406           5,437,725   336,908     1,257
ERR005143   48.19          0.150               127,063           0           3,424,070   1,404,437
ERR654984   76.74          0.140               3,701             0           1,163,594   1,107,503

k=10, L=20, e=0.05
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   67.27          0.126               713,155           5,421,935   208,949     18,066
ERR005143   117.65         0.150               428,252           0           3,122,881   1,418,260
ERR654984   140.78         0.140               7,092             0           1,160,203   1,107,540

k=10, L=30, e=0.05
Data        Mapping Time   Compression ratio   #Unmapped reads   Exact       Inexact     N*
ERR231645   67.26          0.126               750,968           5,421,935   171,136     18,066
ERR005143   116.71         0.150               501,128           0           3,050,005   1,418,260
ERR654984   141.12         0.140               8,432             0           1,158,863   1,107,540

N*: the number of unmapped segments that can be realigned to the same reference genome.

Zexuan Zhu

College of Computer Science and Software Engineering
Shenzhen University
Address: Room A1036, College of Computer Science and Software Engineering, Shenzhen University, Nanhai Avenue 3688, Shenzhen 518060, China
E-mail: zhuzx@szu.edu.cn
Homepage: http://csse.szu.edu.cn/staff/zhuzx/