LW-FQZip is a lossless, light-weight, reference-based compression tool for FASTQ data. Specifically, LW-FQZip first splits the input FASTQ data into three streams (metadata, short reads, and quality scores) and then eliminates the redundancy of each stream independently according to its own characteristics. The metadata and quality scores are compacted with incremental and run-length-limited encoding schemes, respectively. The short reads are stored using a reference-based compression scheme built on the light-weight mapping model. Afterward, the three processed data streams are packed together with a general-purpose compression algorithm such as LZMA, bzip2, gzip, or Lzip (currently, LZMA and Lzip are available in the package; an API for calling other compression algorithms will be provided in the next version). The general framework of the program is shown as follows.
For implementation details, please refer to our paper:
Y. Zhang, L. Li, Y. Yang, X. Yang, S. He, and Z. Zhu, Light-weight reference-based compression of FASTQ data, BMC Bioinformatics, 16:188, 2015.
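As a rough illustration of the stream splitting and the two encodings described above, the following Python sketch separates FASTQ records into metadata, reads, and quality scores, then applies a simple front-coding (incremental) scheme to the headers and a capped run-length scheme to the quality strings. This is not the LW-FQZip implementation: the function names, the `max_run` cap of 255, the assumption of strict four-line records, and the exact form of both encodings are illustrative assumptions. The paper describes the actual schemes.

```python
from itertools import islice


def split_fastq(lines):
    """Split FASTQ text lines into metadata, read, and quality streams.

    Assumes every record occupies exactly four lines (illustrative only).
    """
    metadata, reads, qualities = [], [], []
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # header, bases, '+', qualities
        if len(record) < 4:
            break
        header, bases, _plus, quals = (s.rstrip("\n") for s in record)
        metadata.append(header)
        reads.append(bases)
        qualities.append(quals)
    return metadata, reads, qualities


def front_code(headers):
    """Incremental encoding sketch: (shared-prefix length, new suffix) per header."""
    out, prev = [], ""
    for h in headers:
        n = 0
        while n < min(len(h), len(prev)) and h[n] == prev[n]:
            n += 1
        out.append((n, h[n:]))
        prev = h
    return out


def run_length_limited(quals, max_run=255):
    """Run-length encode a quality string, capping each run at max_run (assumed cap)."""
    runs = []
    for ch in quals:
        if runs and runs[-1][0] == ch and runs[-1][1] < max_run:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]


if __name__ == "__main__":
    # Hypothetical two-record input; the header layout is made up for the demo.
    sample = [
        "@SRR1063349.1 1 length=8", "ACGTACGT", "+", "IIIIHHH#",
        "@SRR1063349.2 2 length=8", "ACGTTCGT", "+", "IIIIIII#",
    ]
    meta, reads, quals = split_fastq(sample)
    print(front_code(meta))
    print([run_length_limited(q) for q in quals])
```

Consecutive FASTQ headers usually share a long prefix, and quality strings often contain long runs of the same symbol, which is why schemes of this kind can shrink both streams before the general-purpose packer is applied.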
Source Code:
Sample Data:
Installation:
LW-FQZip is developed in C and runs on 32/64-bit Linux. At least 1 GB of memory is recommended. To install the program, download and decompress the source code, then compile it in a GNU environment with the GCC compiler using the following commands:
make clean
make
The executable files ‘LWFQZip’ and ‘LWMapping’ will be generated in the same directory as the source code. The light-weight mapping model (‘LWMapping’) can be used independently of LWFQZip. Make sure that the LZMA compression tool 7zip, Lzip, and the archiving tool tar have been installed and configured correctly before running LWFQZip. The corresponding executable files 7za, lzip, and tar must be located in the same directory as ‘LWFQZip’. Use ‘chmod’ to change the execution permission of these files if necessary.
A simple example is provided as follows to illustrate the use of ‘LWFQZip’. To compress the sample FASTQ file SRR1063349.fastq, the command ‘LWFQZip -c’ is executed with a reference NC_017634.1.fasta.
LWFQZip -c -i SRR1063349.fastq -r NC_017634.1.fasta
To decompress the file, the command ‘LWFQZip -d’ should be called.
LWFQZip -d -i SRR1063349.fastq.lw -r NC_017634.1.fasta
More parameters can be specified for the mapping and compression parts as follows:
COMMANDS AND OPTIONS
LWFQZip <mode> ... [options]
Mode:
Compression/Decompression Options:
Mapping Options:
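For batch use, the compress and decompress invocations shown earlier can be wrapped in a small script. The sketch below uses only the flags that appear in the examples on this page (`-c`, `-d`, `-i`, `-r`). The helper names, the default executable path `./LWFQZip`, and the assumption that the tool signals failure with a non-zero exit status are illustrative and not confirmed by the package.

```python
import subprocess


def lwfqzip_command(mode, input_path, reference_path, exe="./LWFQZip"):
    """Build the argument list for a compress ('c') or decompress ('d') run.

    `exe` is an assumed location of the compiled binary.
    """
    if mode not in ("c", "d"):
        raise ValueError("mode must be 'c' (compress) or 'd' (decompress)")
    return [exe, "-" + mode, "-i", input_path, "-r", reference_path]


def run_lwfqzip(mode, input_path, reference_path, exe="./LWFQZip"):
    """Run LWFQZip; raises CalledProcessError if the exit status is non-zero."""
    cmd = lwfqzip_command(mode, input_path, reference_path, exe)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the command line without executing it.
    print(" ".join(lwfqzip_command("c", "SRR1063349.fastq", "NC_017634.1.fasta")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.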
We performed a comparison study between LW-FQZip and other state-of-the-art FASTQ data compression algorithms using eight real-world data sets from the Sequence Read Archive of the National Centre for Biotechnology Information (NCBI). The experimental results and discussion are available in our paper. The results show that LW-FQZip achieves better compression ratios than the other state-of-the-art FASTQ compression methods tested, namely Quip, DSRC, DSRC2, and Fqzcomp. The following provides supplementary results on running time and the effects of the parameters.
The running time of LW-FQZip on the test data sets is reported in Table S1. Since LW-FQZip is a reference-based method that has to map the reads to a reference, it can be slower than reference-free methods. The total running time of LW-FQZip consists of the running times of three sub-procedures: preprocessing, mapping, and packing. Three packing tools (LZMA, bzip2, and arithmetic coding) were tested, and LZMA consumes considerably more time than the other two. Parallelism and more efficient coding schemes will be introduced in the next version of LW-FQZip, which is expected to improve time and space efficiency significantly.
Data | Preprocessing | Mapping | Packing (LZMA) | Packing (bzip2) | Packing (Arithmetic Coding) |
ERR231645 | 88.62 | 46.60 | 913.08 | 145.07 | 130.56 |
ERR005143 | 44.53 | 56.21 | 358.75 | 63.95 | 59.06 |
SRR352384 | 709.02 | 289.02 | 5071.74 | 1007.42 | 694.04 |
SRR801793 | 139.13 | 79.19 | 1378.25 | 287.88 | 251.82 |
SRR554369 | 38.99 | 105.33 | 477.65 | 93.47 | 75.82 |
ERR654984 | 48.25 | 69.37 | 754.54 | 124.13 | 106.78 |
ERR233152 | 63.83 | 82.43 | 298.34 | 54.21 | 41.44 |
SRR327342 | 236.71 | 148.38 | 3567.75 | 759.06 | 589.05 |
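Because the total running time is the sum of the preprocessing, mapping, and packing times, the totals for each packing tool can be derived directly from Table S1. The short Python sketch below does this for the first two data sets. The values are copied from the table and are in the same (unstated) time unit; the dictionary layout and function names are illustrative.

```python
# Times copied from Table S1: (preprocessing, mapping, {packing tool: packing time}).
TABLE_S1 = {
    "ERR231645": (88.62, 46.60,
                  {"LZMA": 913.08, "bzip2": 145.07, "Arithmetic Coding": 130.56}),
    "ERR005143": (44.53, 56.21,
                  {"LZMA": 358.75, "bzip2": 63.95, "Arithmetic Coding": 59.06}),
}


def total_time(dataset, packer):
    """Total running time = preprocessing + mapping + packing."""
    pre, mapping, packing = TABLE_S1[dataset]
    return pre + mapping + packing[packer]


def packing_share(dataset, packer):
    """Fraction of the total running time spent in the packing step."""
    return TABLE_S1[dataset][2][packer] / total_time(dataset, packer)


if __name__ == "__main__":
    for name in TABLE_S1:
        for packer in ("LZMA", "bzip2", "Arithmetic Coding"):
            print(name, packer,
                  round(total_time(name, packer), 2),
                  round(packing_share(name, packer), 3))
```

For ERR231645, for example, the total with LZMA is 88.62 + 46.60 + 913.08 = 1048.30, of which packing accounts for roughly 87%, whereas with bzip2 the total drops to 280.29.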
The effects of the parameters k, L, and e are investigated on three representative data sets. The three parameters are tuned within {8, 10, 12}, {12, 20, 30}, and {0.02, 0.05, 0.08}, respectively. The results reported in Table S2 suggest that the compression ratios obtained with different L and e values do not differ significantly. The parameter k determines the length of the k-mers. Longer k-mers result in more precise mapping but lower tolerance of mismatches. The user can increase k to allow higher alignment sensitivity at the cost of more memory consumption. Following the k-mer settings used in many short-read assembly applications, k is set to ~10 in our experiments. Based on these empirical studies, the setting k=10 (or k=12), e=0.05, and L=12 is a safe choice in terms of mapping time and compression ratio.
k=10, L=12, e=0.05
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 66.46 | 0.128 | 336,470 | 5,421,935 | 585,634 | 18,066 |
ERR005143 | 118.41 | 0.152 | 33,164 | 0 | 3,517,969 | 1,418,260 |
ERR654984 | 141.46 | 0.140 | 1,211 | 0 | 1,166,084 | 1,107,540 |
k=10, L=12, e=0.02
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 68.07 | 0.128 | 340,086 | 5,349,379 | 654,574 | 18,431 |
ERR005143 | 117.90 | 0.152 | 33,936 | 0 | 3,517,197 | 1,418,416 |
ERR654984 | 141.02 | 0.141 | 1,216 | 0 | 1,166,079 | 1,107,543 |
k=10, L=12, e=0.08
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 68.18 | 0.128 | 336,470 | 5,455,223 | 552,346 | 17,880 |
ERR005143 | 117.56 | 0.152 | 33,164 | 1 | 3,517,968 | 1,417,890 |
ERR654984 | 142.10 | 0.140 | 1,211 | 0 | 1,166,084 | 1,107,540 |
k=8, L=12, e=0.05
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 375.30 | 0.129 | 397,485 | 5,165,072 | 781,482 | 53,572 |
ERR005143 | 929.47 | 0.159 | 126,480 | 0 | 3,424,653 | 1,022,413 |
ERR654984 | 1341.84 | 0.153 | 5,854 | 0 | 1,161,441 | 710,779 |
k=12, L=12, e=0.05
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 48.54 | 0.127 | 596,406 | 5,437,725 | 336,908 | 1,257 |
ERR005143 | 48.19 | 0.150 | 127,063 | 0 | 3,424,070 | 1,404,437 |
ERR654984 | 76.74 | 0.140 | 3,701 | 0 | 1,163,594 | 1,107,503 |
k=10, L=20, e=0.05
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 67.27 | 0.126 | 713,155 | 5,421,935 | 208,949 | 18,066 |
ERR005143 | 117.65 | 0.150 | 428,252 | 0 | 3,122,881 | 1,418,260 |
ERR654984 | 140.78 | 0.140 | 7,092 | 0 | 1,160,203 | 1,107,540 |
k=10, L=30, e=0.05
Data | Mapping Time | Compression ratio | #Unmapped reads | Exact | Inexact | N* |
ERR231645 | 67.26 | 0.126 | 750,968 | 5,421,935 | 171,136 | 18,066 |
ERR005143 | 116.71 | 0.150 | 501,128 | 0 | 3,050,005 | 1,418,260 |
ERR654984 | 141.12 | 0.140 | 8,432 | 0 | 1,158,863 | 1,107,540 |
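The trade-off for k discussed above (longer k-mers map more precisely but tolerate fewer mismatches) can be illustrated with a generic seed-and-vote mapper. The Python sketch below is not the LW-FQZip mapping model: it indexes every k-mer of a toy reference, lets each k-mer of the read vote for a start position, and accepts the best placement only if the mismatch fraction does not exceed e. Treating e as a maximum mismatch fraction is an assumption made for this example, the parameter L is not modeled, and the reference string is made up.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, e):
    """Return (start, mismatches) for the best-voted placement, or None.

    `e` is treated here as the maximum allowed mismatch fraction (an assumption).
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset  # implied start of the read in the reference
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1
    if not votes:
        return None  # no k-mer of the read occurs in the reference
    start, _ = votes.most_common(1)[0]
    window = reference[start:start + len(read)]
    mismatches = sum(a != b for a, b in zip(read, window))
    if mismatches > e * len(read):
        return None
    return start, mismatches


if __name__ == "__main__":
    ref = "ACGTTGCAAGCTTAGGCTAACGGATCCATGCA"    # made-up toy reference
    read = ref[5:15] + "T" + ref[16:25]          # 20 bases, one substitution
    for k in (4, 12):
        print(k, map_read(read, ref, build_kmer_index(ref, k), k, e=0.1))
```

In this toy case the read still finds seeds with k=4, but with k=12 every k-mer of the read spans the substituted base, so no seed matches and the read is left unmapped.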
Zexuan Zhu
College of Computer Science and Software Engineering
Shenzhen University
Address: Room A1036, College of Computer Science and Software Engineering, Shenzhen University, Nanhai Avenue 3688, Shenzhen 518060, China
E-mail: zhuzx@szu.edu.cn
Homepage: http://csse.szu.edu.cn/staff/zhuzx/