CompMap is a reference-based compression program that speeds up read mapping to related reference sequences. It uses reference-based compression to eliminate repeated subsequences from a curated input database, which may contain complete genomes, plasmids, and/or other sequences. Specifically, one or more sequences in the database are selected to form a reference, to which the other sequences are aligned. The repeat segments in the other sequences are then eliminated by recording only their aligned positions and lengths, at very little space cost. The non-aligned segments, together with the reference, form the non-redundant representation of the database. NGS short reads are then mapped to this representation rather than to the whole database. The corresponding mapped positions of the short reads in the original database can be readily recovered from the representation sequences and the alignment results.
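The idea of keeping only non-aligned segments plus position records can be illustrated with a toy Python sketch. It assumes the repeat segments are already known and match the reference exactly, whereas CompMap identifies repeats by approximate matching. The names `Segment`, `compress`, and `restore` are hypothetical and do not reflect CompMap's internal data structures or file formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A repeat: seq[start:start+length] equals ref[ref_start:ref_start+length]."""
    start: int      # 0-based start in the non-reference sequence
    ref_start: int  # 0-based start in the reference
    length: int


def compress(seq, segments):
    """Return the non-aligned pieces of seq as (original start, bases) pairs."""
    pieces = []
    cursor = 0
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.start > cursor:
            pieces.append((cursor, seq[cursor:seg.start]))
        cursor = max(cursor, seg.start + seg.length)
    if cursor < len(seq):
        pieces.append((cursor, seq[cursor:]))
    return pieces


def restore(ref, seq_len, segments, pieces):
    """Rebuild the original sequence from the reference, segments, and pieces."""
    out = [""] * seq_len
    for seg in segments:
        out[seg.start:seg.start + seg.length] = ref[seg.ref_start:seg.ref_start + seg.length]
    for start, bases in pieces:
        out[start:start + len(bases)] = bases
    return "".join(out)


# Toy data (assumption: two exact repeats of the reference inside seq).
ref = "ACGTACGTTTGACCA"
seq = "GGACGTACGTCCTTGACCAAT"
segments = [Segment(start=2, ref_start=0, length=8),
            Segment(start=12, ref_start=8, length=7)]

pieces = compress(seq, segments)
rebuilt = restore(ref, len(seq), segments, pieces)
```

In this toy case only 6 of the 21 bases of `seq` need to be stored alongside the two segment records; the rest is recoverable from the reference.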
If you use CompMap, please cite:
Z. Zhu, L. Li, Y. Zhang, Y. Yang, and X. Yang, CompMap: a reference-based compression program to speed up read mapping to related reference sequences, Bioinformatics, vol. 31, no. 3, pp. 426-428, 2015.
Source Code:
Sample Data:
Y. kristensenii short reads SRR031601
Y. ruckeri short reads SRR032505
Y. rohdei short reads SRR032501
Bacterial chromosomes database (170 complete chromosomes of B. anthracis, B. mallei, B. pseudomallei, Brucella sp., C. botulinum, E. coli O157:H7, F. tularensis, and Y. pestis)
E. coli short reads: SRR1063349, ERR385912, ERR231645
E. coli chromosomes database (5,338 E. coli genomes or plasmids)
Tools:
SeqCmp: an in-house tool that computes the pairwise similarity of input genome sequences. It can be used to optimize the selection of the reference(s), i.e., the sequence(s) sharing the most similarity with the other sequences, for CompMap compression.
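As a rough illustration of similarity-driven reference selection, the Python sketch below picks the sequence with the highest total similarity to all others. The similarity measure used here (Jaccard index over k-mer sets) is an assumption chosen for brevity; the measure actually used by SeqCmp is not described on this page, and the function names are hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reference(seqs, k=4):
    """Return the name of the sequence most similar, in total, to the others."""
    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    totals = {name: 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        sim = jaccard(profiles[a], profiles[b])
        totals[a] += sim
        totals[b] += sim
    return max(totals, key=totals.get)


# Toy input (assumption: three short made-up sequences).
toy = {"a": "ACGTACGT", "b": "ACGTACGTTT", "c": "GGGGTTTT"}
best = pick_reference(toy, k=3)
```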
CompMap is developed in C and runs on 32/64-bit Linux. At least 1 GB of memory is suggested for a good user experience. To install the program, download and decompress the source code, then compile it in a GNU environment with the GCC compiler using the following commands:
make
make clean
The executable file, named ‘compMAP’, will be generated in the same directory as the source code. Make sure a short read mapping tool, e.g., BWA or Bowtie 2, has been installed and configured correctly before running CompMap.
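One way to check for an installed mapper before running CompMap is sketched below in Python. The executable names ‘bwa’ and ‘bowtie2’ are assumptions about how the tools are installed on a given system, and `find_mapper` is a hypothetical helper, not part of CompMap.

```python
import shutil


def find_mapper(candidates=("bwa", "bowtie2"), which=shutil.which):
    """Return (name, path) of the first candidate found on PATH, else None.

    The `which` argument is injectable so the lookup can be replaced in tests.
    """
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


found = find_mapper()
if found is None:
    print("No short read mapper found on PATH; install BWA or Bowtie 2 first.")
else:
    print("Using %s at %s" % found)
```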
The following simple example illustrates the use of CompMap. To map the short reads in the file ‘reads.fq’ to the related sequences (in FASTA format) stored in the directory ‘db_dir’, first run the ‘compMAP comp’ command to compress the sequences and obtain their reusable non-redundant representation, say ‘ref.fa’:
compMAP comp db_dir ref.fa
Here, the executable compMAP is assumed to be located in the same directory as ‘db_dir’. If it is not, specify the exact path to compMAP. Afterward, call a short read mapping tool such as BWA or Bowtie 2 to map the short reads ‘reads.fq’ to ‘ref.fa’:
bwa index ref.fa
bwa mem -a ref.fa reads.fq > aln.sam
or
bowtie2-build ref.fa bt2.index
bowtie2-align -a -x bt2.index -U reads.fq -S aln.sam
The mapping results are written to ‘aln.sam’ in standard SAM format. The authors have tested the compatibility of BWA and Bowtie 2 with CompMap. Nevertheless, CompMap can also work with other short read mapping tools, provided they accept reference sequences in standard FASTA format and output results in standard SAM format.
Finally, the ‘compMAP map’ command should be executed to recover the mapped positions in the original sequences based on ‘aln.sam’ and ‘ref.fa’.
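The recovery step can be pictured with the following Python sketch. It assumes a read was mapped inside the reference part of ‘ref.fa’ and that the recorded repeat segments are available as simple records; the `Repeat` record, the 0-based coordinates, and the rule of reporting only segments that fully contain the read are simplifications for illustration, not the format of ‘ref.fa.mat’ or the logic of ‘compMAP map’.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    """A segment of a non-reference sequence recorded as a copy of the reference."""
    seq_name: str
    seq_start: int  # 0-based start in the original sequence
    ref_start: int  # 0-based start in the reference
    length: int


def recover_hits(ref_name, ref_pos, read_len, repeats):
    """Expand one hit on the reference into hits on every sequence sharing it."""
    hits = [(ref_name, ref_pos)]
    for r in repeats:
        offset = ref_pos - r.ref_start
        # Keep only repeats that cover the whole read.
        if offset >= 0 and offset + read_len <= r.length:
            hits.append((r.seq_name, r.seq_start + offset))
    return hits


# Toy records (assumption: two sequences share parts of the reference).
repeats = [Repeat("seqB", 2, 0, 8), Repeat("seqC", 100, 4, 20)]
hits_a = recover_hits("ref", 5, 3, repeats)
hits_b = recover_hits("ref", 6, 3, repeats)
```

In this sketch a read of length 3 at reference position 5 is also placed at position 7 of ‘seqB’ and position 101 of ‘seqC’, while the same read at position 6 no longer fits inside the ‘seqB’ segment and is placed only on the reference and ‘seqC’.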
In the final output SAM file (see the SAM Format Specification), when a junction alignment is detected, the POS field is set to 0, FLAG is set randomly, and RNAME is set to ‘junction’. Note that because approximate matching is adopted in the identification of repeats, the final mapping results may deviate slightly in the POS and CIGAR fields.
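Downstream scripts may want to separate junction records from ordinary ones. The Python sketch below does this using only the convention stated above (RNAME equal to ‘junction’ and POS equal to 0) and ignores FLAG, since it is set randomly for these records. The function names and the sample lines are made up for illustration.

```python
def is_junction(sam_line):
    """True if a SAM alignment line has RNAME 'junction' and POS 0."""
    fields = sam_line.rstrip("\n").split("\t")
    return len(fields) >= 4 and fields[2] == "junction" and fields[3] == "0"


def split_records(lines):
    """Split SAM alignment lines into (placed, junction); header lines are skipped."""
    placed, junctions = [], []
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        target = junctions if is_junction(line) else placed
        target.append(line.rstrip("\n"))
    return placed, junctions


# Made-up sample lines in SAM layout (QNAME, FLAG, RNAME, POS, ...).
sample = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t101\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r2\t16\tjunction\t0\t0\t5M\t*\t0\t0\tTTGAC\tIIIII\n",
]
placed, junctions = split_records(sample)
```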
COMMANDS AND OPTIONS
comp
compMAP comp <db_dir> [option] <ref.fa>
Compress the database sequences in the ‘db_dir’ directory and output them in FASTA format.
OPTIONS
Examples of using ‘compMAP comp’:
compMAP comp db_dir ref.fa
compMAP comp db_dir -R 2 f1.fasta ref.fa
compMAP comp db_dir -L 200 -e 0.03 -N 16 -k 10 -P CG, AT -R 2 @text ref.fa
The ‘compMAP comp’ command generates three files, namely ‘ref.fa’, ‘ref.fa.mat’, and ‘ref.fa.log’. ‘ref.fa’ contains the non-redundant representative sequences of the database. ‘ref.fa.mat’ and ‘ref.fa.log’ store the positions of all valid repeat segments in all non-reference sequences.
map
compMAP map [option] <aln.sam> <ref.fa>
Recover the mapped positions in the original sequences based on ‘aln.sam’ and ‘ref.fa’. ‘aln.sam’ is the output of an alignment tool such as BWA or Bowtie 2; ‘ref.fa’ is the output of ‘compMAP comp...’. Make sure ‘ref.fa.mat’ and ‘ref.fa.log’ are located in the same directory as ‘ref.fa’.
OPTIONS
Example of using ‘compMAP map’:
compMAP map -D aln.sam ref.fa
For the implementation details of CompMap and guidelines on parameter setting, please refer to the supplementary materials.
Zexuan Zhu
College of Computer Science and Software Engineering
Shenzhen University
Mail: Administration Building 446, Nanhai Avenue 3688, Shenzhen, China 518060
E-mail: zhuzx@szu.edu.cn
Homepage: http://csse.szu.edu.cn/staff/zhuzx/
Xiao Yang
Broad Institute of MIT & Harvard
Mail: 7 Cambridge Center, Cambridge, MA 02142
E-mail: xiaoyang@broadinstitute.org
Homepage: http://www.broadinstitute.org/~xiaoyang/