Dense Codes' site

This section is devoted to our byte-oriented word-based Dynamic (one-pass) statistical compressors. In this case, the idea is to present the dynamic counterparts of both the End-tagged Dense Code and the SC-Dense Codes. Such dynamic compressors are called Dynamic-ETDC and Dynamic SCDC [brisaboa et al. SP&E 2008]. Moreover, as we also needed a good baseline to compare our dynamic dense codes with, we also developed a word-based byte-oriented dynamic Huffman.

Our dynamic compressors are well-suited for transmission of compressed text, because being one-pass, they permit to start compression (and transmision) of the data in a compressed form as it appears. Both compressor and decompressor start with an initial empty model that is updated (keeping synchronism in a symmetric way) by both sender an receiver. In practice, our compressors permit real-time transmission of compressed text.

Both D-ETDC and D-SCDC are very fast compressors (faster than the semistatic counterparts as only one pass is performer through the source text file). They are also quite fast at decompression, but not so fast as their semistatic versions because the decompressor/receiver has to rebuild the model as it receives the compressed data. With respect to compression ratio, the dynamic compressors obtain actually the same values of ETDC-SCDC (for large files).

We have also implemented search-algorithms that permit to search the text compressed with D-ETDC and D-SCDC without a prior decompression, which are very simple as they just simulate the decompression process (skipping the output of words). The resulting algorithms permit to search the compressed text faster than other well-known techniques such as lzgrep working over text compressed with gzip, and faster than the pair "uncompress plus search".

More details about these 3 techniques are available our publications section. Please refer to the paper [brisaboa et al. SP&E 2008] and/or to Antonio Fariña's phd thesis.

If you are looking for the source code of such compressors probably you are interested in our download section.

Next we show the general ideas behind the D-ETDC technique.

Dynamic compression and pseudocode

D-ETDC

As presented in [brisaboa et al. SP&E 2008], both compressor and decompressor for D-ETDC are symmetric processes. Following a word-based statistical approach, it is necessary to gather the model (different words and their frequency) as the source text is read, and then the ETDC encoding schema is used to assign each word a codeword.

Basically, both compressor and decompressor start with an empty vocabulary and, as words are input, they add new words to the vocabulary and/or increase their number of occurrences.

When a new word comes, the compressor (sender) adds it to its vocabulary and then outputs a scape codeword followed by that new word (in plain form), to notify the decompressor (receiver) that a new word is being sent. When the compressor processes a word that is already in its vocabulary, it only sends the codeword associated to it (using ETDC encode(i) method), and then increases its frequency. Finally, the vocabulary might need to be "re-organized" to keep it sorted by frequency, so that the encoding schema can assign codewords properly in advance (smaller codewords being given to the most frequent words). As is shown in [brisaboa et al. SPE 2008], the re-organization consists just in a swap (in the vocabulary) between the word being processed and the first word of its same frequency. Next figure shows the pseudocode for the compressor process.

When the decompressor decodes a codeword, it checks if it is a scape codeword or not. If that codeword was a scape codeword, then it reads the next bytes to obtain the word sent in plain form. Otherwise the decoded value indicates the position in the vocabulary (words array) where that word is kept. Its pseudocode is shown in the following figure.

D-SCDC

As shown in [brisaboa et al. SP&E 2008], both compressor and decompressor for D-SCDC are symmetric processes as in the case of D-ETDC. The differences between D-SCDC and D-ETDC come from using SCDC encoding schema instead of ETDC. Based on SCDC, D-SCDC compressor needs to keep the encoding parameter (s) well-suited during the compression process. Therefore, encoding and decoding is identical to those in D-ETDC, but after processing each word D-SCDC (sender/receiver) has to check if "s" should change. This is done by calling TakeIntoAccount() algorithm after the call to update(). Its pseudocode follows:

Family of Dense Compressors

Dynamic compression and pseudocode

D-ETDC

D-SCDC