Sorry, this page is still under construction and some software downloads are not available yet.

Please ensure you compile with the -m32 flag (Makefiles).

Compressing with Dense Codes.

1. Removing ('\0') characters from the source text

In order to compress text with our codes, we have to ensure that it does not contain the null character ('\0'). Instead of doing that during parsing, we preferred to do it beforehand using the (trivial) program removeZeroes. You can download it from [download]

syntax: remover fsrc fdst
results: copies fsrc into fdst skipping '\0' values.
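
The behaviour described above amounts to a simple byte filter. The following Python sketch illustrates it; it is not the removeZeroes source, and the function names and the chunk size are assumptions made for the example.

```python
def strip_null_bytes(data: bytes) -> bytes:
    """Return a copy of data with every 0x00 byte removed."""
    return data.replace(b"\x00", b"")


def copy_without_nulls(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> None:
    """Copy src_path into dst_path, skipping 0x00 bytes, one chunk at a time."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of file
                break
            dst.write(strip_null_bytes(chunk))
```

Filtering chunk by chunk keeps memory use bounded for large collections; since single bytes are dropped, chunk boundaries do not affect the result.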

2. Compressing with ETDC.

ETDC sources include 2 programs: CETDC and DETDC (compressor and decompressor respectively). Download them here: [download]

syntax: ./CETDC inFile outFile [VOCSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
results: compresses inFile, generating file outFile and outFile.voc (the vocabulary)
  A file timesComp is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DETDC inFile outFile
params: inFile (a file compressed with ETDC). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
results: decompresses inFile, generating file outFile.
  A file timesDec is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)
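
For readers unfamiliar with the code itself, the following Python sketch shows how End-Tagged Dense Code maps the rank of a word (rank 0 is the most frequent word) to a byte sequence and back. It assumes the usual convention in which the last byte of every codeword has its high bit set (values 128 to 255) and all other bytes are below 128. The downloadable CETDC/DETDC sources may lay out bytes differently, and the function names are hypothetical.

```python
def etdc_encode(rank: int) -> bytes:
    """Return the ETDC codeword for a 0-based frequency rank."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]      # last byte: high bit set marks the end
    x = rank // 128
    while x > 0:
        x -= 1                    # dense numbering: no codeword is wasted
        out.append(x % 128)       # leading bytes: high bit clear
        x //= 128
    return bytes(reversed(out))


def etdc_decode(codeword: bytes) -> int:
    """Return the rank encoded by a single ETDC codeword."""
    x = 0
    for b in codeword[:-1]:
        x = x * 128 + b + 1
    return x * 128 + (codeword[-1] - 128)
```

Under this convention ranks 0 to 127 take one byte, the next 128 * 128 ranks take two bytes, and so on.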

3. Compressing with SCDC.

SCDC sources include 2 programs: CSCDC and DSCDC (compressor and decompressor respectively). Download them here: [download]

syntax: ./CSCDC inFile outFile [VOCSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
results: compresses inFile, generating file outFile and outFile.voc (the vocabulary).
  A file timesComp is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DSCDC inFile outFile
params: inFile (a file compressed with SCDC). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
results: decompresses inFile, generating file outFile.
  A file timesDec is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)

4. Compressing with DETDC.

DETDC sources include 2 programs: CDETDC and DDETDC (compressor and decompressor respectively). Download them here: [download]

syntax: ./CDETDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
  TOPSIZE Optional: specifies the maximum frequency value (if it is known previously).
results: compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE)
  A file timesComp is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DDETDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (a file compressed with DETDC). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value.
results: decompresses inFile, generating file outFile.
  A file timesDec is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)
  NOTE: In a pure dynamic scenario, the decompressor/receiver will not usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those values are known (that is, if file inFile.voc exists in the same directory as inFile), the decompressor will take advantage of them to initialize its data structures. If those values are not provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use Heaps' Law to estimate them.
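
The estimate mentioned in the NOTE can be illustrated with the usual form of Heaps' Law, V = K * n^beta, where n is the number of words in the text and V the expected vocabulary size. The Python sketch below is only an illustration: the constants K and beta, the function name, and the assumption that a word count n is available to the caller are all assumptions, since the values used by the decompressors are not stated here.

```python
import math


def heaps_vocabulary_estimate(num_words: int, k: float = 10.0, beta: float = 0.5) -> int:
    """Estimate the vocabulary size of a text with num_words words.

    Heaps' Law: V = k * n ** beta. The defaults for k and beta are
    placeholder values chosen for the example, not measured constants.
    """
    if num_words <= 0:
        return 0
    return math.ceil(k * num_words ** beta)
```

Because beta is below 1, the estimate grows sublinearly with the text size, which is what makes it practical for sizing data structures before the real vocabulary is known.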

5. Compressing with DSCDC.

We developed several heuristics that adjust the parameter s as compression/decompression progresses. The first one, countBytesApproach, is a bit slower than the second, rangesApproach. However, rangesApproach is based on a false assumption (which could lead DSCDC to unpredictable adjustments of s). Therefore, even though we include both variants here, we always use the countBytes version in our experiments. Sources for both variants include 2 programs: CDSCDC and DDSCDC (compressor and decompressor respectively). Download them through the following links: for the countBytes version of DSCDC [click here] and for the ranges-approach version of DSCDC [click here]
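
To illustrate what the parameter s controls, the following Python sketch computes codeword lengths in an (s,c)-Dense Code, where s byte values end a codeword, the remaining c = 256 - s values continue it, and words are ranked by decreasing frequency. It then picks the s that minimizes the total size by brute force. This is neither the countBytes nor the ranges heuristic, which adjust s on the fly; the function names are hypothetical.

```python
def scdc_codeword_length(rank: int, s: int) -> int:
    """Length in bytes of the codeword for a 0-based rank under (s, 256 - s)."""
    if not 1 <= s <= 255:
        raise ValueError("s must be in 1..255")
    c = 256 - s
    length = 1
    block = s        # number of codewords having the current length
    covered = s      # number of ranks covered by lengths 1..length
    while rank >= covered:
        block *= c
        covered += block
        length += 1
    return length


def compressed_size(freqs, s: int) -> int:
    """Total bytes used to encode words with the given frequencies."""
    ordered = sorted(freqs, reverse=True)
    return sum(f * scdc_codeword_length(i, s) for i, f in enumerate(ordered))


def best_s(freqs) -> int:
    """Smallest s in 1..255 that minimizes the compressed size."""
    return min(range(1, 256), key=lambda s: compressed_size(freqs, s))
```

With s = 128 the lengths coincide with those of ETDC; a larger s gives more one-byte codewords at the cost of fewer two-byte ones.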

syntax: ./CDSCDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
  TOPSIZE Optional: specifies the maximum frequency value (if it is known previously).
results: compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE)
  A file timesComp is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DDSCDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (a file compressed with DSCDC). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value.
results: decompresses inFile, generating file outFile.
  A file timesDec is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)
  NOTE: In a pure dynamic scenario, the decompressor/receiver will not usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those values are known (that is, if file inFile.voc exists in the same directory as inFile), the decompressor will take advantage of them to initialize its data structures. If those values are not provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use Heaps' Law to estimate them.

6. Compressing with DPH.

DPH sources include 2 programs: CDPH and DDPH (compressor and decompressor respectively). Download them here: [download]

syntax: ./CDPH inFile outFile [VOCSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
results: compresses inFile, generating file outFile and outFile.voc (the vocabulary)
  A file timesCDPH is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DDPH inFile outFile [VOCSIZE]
params: inFile (a file compressed with DPH). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
results: decompresses inFile, generating file outFile.
  A file timesDDPH is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)

7. Compressing with DLETDC.

DLETDC sources include 2 programs: CDLETDC and DDLETDC (compressor and decompressor respectively). Download them here: [download]

syntax: ./CDLETDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
  TOPSIZE Optional: specifies the maximum frequency value (if it is known previously).
results: compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE)
  A file timesComp is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DDLETDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (a file compressed with DLETDC). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value.
results: decompresses inFile, generating file outFile.
  A file timesDec is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)
  NOTE: In a pure dynamic scenario, the decompressor/receiver will not usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those values are known (that is, if file inFile.voc exists in the same directory as inFile), the decompressor will take advantage of them to initialize its data structures. If those values are not provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use Heaps' Law to estimate them.

8. Compressing with DLSCDC.

DLSCDC sources include 2 programs: CDLSCDC and DDLSCDC (compressor and decompressor respectively). Download them here: [download]

syntax: ./CDLSCDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (the uncompressed source text-file),
  outFile (the name of the compressed text created)
  VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously).
  TOPSIZE Optional: specifies the maximum frequency value (if it is known previously).
results: compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE)
  A file timesComp is also created after compression. It logs compression times (useful to perform several runs of the compressor and then collect all the time measures)
syntax: ./DDLSCDC inFile outFile [VOCSIZE TOPSIZE]
params: inFile (a file compressed with DLSCDC). A file inFile.voc should exist in the same directory
  outFile (the file that will be created)
  VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value.
results: decompresses inFile, generating file outFile.
  A file timesDec is also created after decompression. It logs decompression times (useful to perform several runs of the decompressor and then collect all the time measures)
  NOTE: In a pure dynamic scenario, the decompressor/receiver will not usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those values are known (that is, if file inFile.voc exists in the same directory as inFile), the decompressor will take advantage of them to initialize its data structures. If those values are not provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use Heaps' Law to estimate them.

Searching text compressed with dense codes.

Preliminaries: Searching framework

In order to perform random searches over our different codes and to obtain a fair comparison, we wanted to search for the same random patterns over all of them. Therefore, we created two programs that first build the sets of random patterns to be used later. These two programs are: [EXTRACTwords], which generates a list of candidate patterns by applying a filter by frequency or by pattern length, and [GenerateRandomPatterns], which creates N sets of k random patterns. These sets of random patterns are used by our searchers when random searches are needed.
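
The two-step workflow described above (filter candidates, then draw N sets of k patterns) can be sketched in Python as follows. This is an illustration only and not the code of EXTRACTwords or GenerateRandomPatterns: the function names, the parameter names, the input format (a word-to-frequency dictionary) and the use of a fixed seed are assumptions.

```python
import random


def candidate_patterns(word_freqs, min_freq=1, max_freq=None, min_len=1, max_len=None):
    """Return the words whose frequency and length fall in the given ranges."""
    selected = []
    for word, freq in word_freqs.items():
        if freq < min_freq or (max_freq is not None and freq > max_freq):
            continue
        if len(word) < min_len or (max_len is not None and len(word) > max_len):
            continue
        selected.append(word)
    return sorted(selected)  # sorted so results do not depend on dict order


def random_pattern_sets(candidates, n_sets, k, seed=0):
    """Return n_sets lists, each with k distinct patterns drawn at random."""
    if k > len(candidates):
        raise ValueError("not enough candidates to draw k distinct patterns")
    rng = random.Random(seed)  # fixed seed: every searcher sees the same sets
    return [rng.sample(candidates, k) for _ in range(n_sets)]
```

Fixing the seed is one way to guarantee that every searcher is run against exactly the same patterns, which is the point of generating the sets in advance.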

Searches over ETDC and SCDC are performed using variants of the Horspool and Set-Horspool algorithms that have been adapted to search over the dense codes.
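
As an illustration of the kind of adaptation involved, the following Python sketch runs a plain Horspool search for one codeword over an ETDC-compressed byte stream and discards matches that are not aligned to a codeword boundary. It assumes the convention in which the last byte of each codeword is 128 or higher and all other bytes are below 128. It is not the code of the searchers offered here, and the function name is hypothetical.

```python
def horspool_search_etdc(compressed: bytes, codeword: bytes):
    """Return the start offsets of codeword in an ETDC-compressed stream."""
    m, n = len(codeword), len(compressed)
    if m == 0 or n < m:
        return []
    # Horspool bad-character table: how far to shift for each byte value
    # found under the last position of the search window.
    shift = [m] * 256
    for j in range(m - 1):
        shift[codeword[j]] = m - 1 - j
    hits = []
    pos = 0
    while pos <= n - m:
        if compressed[pos:pos + m] == codeword:
            # A true occurrence must start right after the end of another
            # codeword (a byte >= 128) or at the start of the stream;
            # otherwise it is only the tail of a longer codeword.
            if pos == 0 or compressed[pos - 1] >= 128:
                hits.append(pos)
        pos += shift[compressed[pos + m - 1]]
    return hits
```

For example, in the stream `00 00 80 00 80 80` the bytes `00 80` appear at offsets 1 and 3, but only offset 3 is reported, because the candidate at offset 1 is preceded by a byte below 128.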

S.1. Searching code compressed with ETDC.

Two search algorithms have been developed: single-pattern search and multi-pattern search. Download them here:

[SINGLE-PATTERN download]

syntax: ./selSEARCHERetdc compressedFile patternsFile iterations FTIMES
syntax: ./rdnSEARCHERetdc compressedFile patternsFile iterations FTIMES
params: compressedFile (a file compressed with etdc),
  patternsFile (A file containing N patterns (one per line))
  iterations (The number of patterns that should be read from "patternsFile")
results: Searches compressedFile for iterations patterns from patternsFile
  A file FTIMES will log search times (useful to perform several runs of the searcher and later collect all the time measures)

[MULTI-PATTERN download]

syntax: ./SEARCHERetdc compressedFile patternsFile iterations FTIMES numPatternsPerIter strmatlab
params: compressedFile (a file compressed with etdc),
  patternsFile (A file containing N patterns (one per line). The format is "%9 %9 %s\n", as in rdnSEARCHERetdc).
  iterations (Number of sets of K random patterns being searched for.)
  numPatternsPerIter (The number K of patterns that should be read from "patternsFile" and searched in parallel over "compressedFile" in each iteration.)
results: Searches compressedFile for iterations sets of K patterns from patternsFile
  A file FTIMES will log search times (useful to perform several runs of the searcher and later collect all the time measures)
  The parameter strmatlab is required. It gives the name (e.g. multietdc) of the array that will hold the times logged in FTIMES for each search for K patterns.

S.2. Searching code compressed with SCDC.

Two search algorithms have been developed: single-pattern search and multi-pattern search. Download them here:

[SINGLE-PATTERN download]

syntax: ./selSEARCHERscdc compressedFile patternsFile iterations FTIMES
syntax: ./rdnSEARCHERscdc compressedFile patternsFile iterations FTIMES
params: compressedFile (a file compressed with scdc),
  patternsFile (A file containing N patterns (one per line))
  iterations (The number of patterns that should be read from "patternsFile")
results: Searches compressedFile for iterations patterns from patternsFile
  A file FTIMES will log search times (useful to perform several runs of the searcher and later collect all the time measures)

[MULTI-PATTERN download]

syntax: ./SEARCHERscdc compressedFile patternsFile iterations FTIMES numPatternsPerIter strmatlab
params: compressedFile (a file compressed with scdc),
  patternsFile (A file containing N patterns (one per line). The format is "%9 %9 %s\n", as in rdnSEARCHERscdc).
  iterations (Number of sets of K random patterns being searched for.)
  numPatternsPerIter (The number K of patterns that should be read from "patternsFile" and searched in parallel over "compressedFile" in each iteration.)
results: Searches compressedFile for iterations sets of K patterns from patternsFile
  A file FTIMES will log search times (useful to perform several runs of the searcher and later collect all the time measures)
  The parameter strmatlab is required. It gives the name (e.g. multiSCDC) of the array that will hold the times logged in FTIMES for each search for K patterns.

Others

Plain Huffman Code.

In order to have a baseline for our semistatic compressors ETDC and SCDC, we have also implemented a version of Plain Huffman Code (see http://www.dcc.uchile.cl/~gnavarro/ps/tois00.pdf) following the same ideas as ETDC and SCDC, but with different encoding and decoding algorithms. A canonical Huffman code is implemented based on http://www.cs.mu.oz.au/~alistair/inplace.c. Our source code can be downloaded here: [Plain Huffman]

Boosting Lzgrep.

Researchers interested in performing searches over text compressed with gzip can use the original Lzgrep software. Our modified Lzgrep, which works over text preprocessed with ETDC and finally compressed with gzip, can be downloaded here: [boosted-Lzgrep].

Summary Table of Downloads.

Supported in part by Ministerio de Ciencia e Innovación (PGE and FEDER) [TIN2006-15071-C03-03, TIN2009-14560-C03-02, TIN2010-21246-C02-01, and CDTI CEN-20091048]; Xunta de Galicia (Feder) [2010/17] and (Agrupación Estratéxica) [CN 2012/211]; and Fondecyt [1-110066]