The family of Dense Codes (as text compressors).
This web is dedicated to the family of dense compressors. A new family of compression techniques well-suited for text compression that was born in 2003. Some of these techniques are semistatic(2) and others permit us to perform dynamic compression (4). All of them are statistical codes that use a word-based strategy and generate byte-oriented codewords. We also include here a dynamic word-based byte-oriented Huffman-based compressor (DPH), that was developed to obtain a base-line to compare our dynamic codes with.
The 2 semistatic techniques, which were developed in 2003, are called: End-Tagged Dense Code (ETDC) and (s,c)-Dense Code (SCDC).
The 4 dynamic dense codes are: Dynamic End-Tagged Dense Code (DETDC), Dynamic (s,c)-Dense Code (SCDC) , Dynamic Lightweight End-Tagged Dense Code (DLETDC), and Dynamic Lightweight(s,c)-Dense Code (DLSCDC) .
As a brief description, all these compressors obtain compression ratios around 30-35%, while being very fast at compression and decompression (some of them overcome gzip). As other techniques such as Plain Huffman or Tagged Huffman, our semistatic compressors (ETDC and SCDC) permit to search compressed text much faster than searching the uncompressed text. This is also applicable to our Lightweight compressors (DLETDC) and (DLSCDC), which represent an interesting contribution, since they become the first dynamic compression techniques that permit searching the compressed text faster than the uncompressed text. For a more detailled description, please refer to section publications.
We aim here at giving a small description of all our compressors and also at providing our source code under GNU General Public License. Moreover, we also include a small framework that allows to perform searches about our codes (except for DPH).
Our dense codes have been successfully used in different developments:
- Encoding integers: The encoding scheme of SCDC/ETDC, permits us to obtain a slightly stronger compression than that bytecodes (for being a dense coding technique).
- Boosting compression/indexing: We have shown that ETDC/SCDC can be used as text pre-processing techniques that boost the performance of general purpose compressors (bzip2, gzip,...), searchers such as lzgrep, and also self-indexing structures (ssa, affm-index,...). See section boosting.
- Self-indexing using wavelet-trees: See this sigir'08 paper.
- Variable-to-variable word-based text compression compression: See our papers in DCC'2010 and IPM'2011 .
About us...We are 3 researchers from the University of A Coruña (Spain) and 1 from the University of Chile (Chile).
- Gonzalo Navarro (phD). Computer Science Department. UChile
- Nieves R. Brisaboa (phD). Database Lab. UDC
- José R. Paramá (phD). Database Lab. UDC
- Antonio Fariña (phD). Database Lab. UDC
About our developments
Please note that the compressor/decompressor/search programs developed here are the prototypes used to validate our techniques empirically (as a part of our research interest), and not a final product.
Source code is available in Section downloads .