Sorry, this page is still under construction, and some software downloads are not available yet.
Please ensure you compile with the -m32 flag (Makefiles)...
Compressing with Dense Codes.
1. Removing null ('\0') characters from the source text
To compress a text with our codes, we have to ensure that it does
not contain the null character ('\0'). Instead
of doing that during parsing, we preferred to do it beforehand by
using the (trivial) program removeZeroes. You can
download it from [download]
syntax: |
remover fsrc fdst |
results: |
copies fsrc into fdst skipping '\0' values. |
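As an illustration only, the effect of this preprocessing step can be sketched in Python. This is not the source of removeZeroes: the names strip_nulls and remove_zeroes and the chunk size are assumptions made for the example.

```python
def strip_nulls(data: bytes) -> bytes:
    """Return a copy of data with every '\\0' byte dropped."""
    return data.replace(b"\x00", b"")


def remove_zeroes(src_path: str, dst_path: str, chunk_size: int = 1 << 16) -> None:
    """Copy src_path into dst_path, skipping '\\0' bytes (hypothetical helper)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of file reached
                break
            dst.write(strip_nulls(chunk))
```

Processing the file in chunks keeps memory use constant, which matters for large text collections.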
2. Compressing with ETDC.
ETDC sources include 2 programs: CETDC and DETDC (compressor and decompressor respectively).
Download them here: [download]
syntax: |
./CETDC inFile outFile [VOCSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary) |
|
A file timesComp is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DETDC inFile outFile |
params: |
inFile (a file compressed with ETDC). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDec is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
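For readers unfamiliar with the code itself, the following Python sketch shows how an End-Tagged Dense Code codeword can be derived from the rank of a word in the frequency-sorted vocabulary. It assumes the usual ETDC convention (the last byte of a codeword has its highest bit set, all other bytes do not). The names etdc_encode and etdc_decode are hypothetical and do not reflect the internals of CETDC or DETDC.

```python
def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with the given 0-based frequency rank."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]      # last byte: tagged, value in 128..255
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense: every byte combination is used
        out.append(rank % 128)    # leading bytes: value in 0..127
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single, complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)
```

Under this convention the 128 most frequent words get one-byte codewords, the next 128*128 words get two bytes, and so on.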
3. Compressing with SCDC.
SCDC sources include 2 programs: CSCDC and DSCDC (compressor and decompressor respectively).
Download them here: [download]
syntax: |
./CSCDC inFile outFile [VOCSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary). |
|
A file timesComp is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DSCDC inFile outFile |
params: |
inFile (a file compressed with SCDC). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDec is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
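The role of the parameter s in an (s,c)-dense code can be illustrated with a short Python sketch. It assumes byte-oriented codewords with s stopper values and c = 256 - s continuer values, so that s words get one byte, s*c words get two bytes, and so on. The function names are hypothetical, and the exhaustive search over s is a simplification for the example, not necessarily what CSCDC does.

```python
from typing import List


def scdc_codeword_length(rank: int, s: int) -> int:
    """Bytes needed for the word with 0-based frequency rank `rank`."""
    if not 1 <= s <= 255:
        raise ValueError("s must be in 1..255")
    c = 256 - s
    length, block = 1, s          # `block` = number of codewords of this length
    while rank >= block:
        rank -= block
        block *= c
        length += 1
    return length


def compressed_size(freqs_desc: List[int], s: int) -> int:
    """Total bytes for a text whose word frequencies are sorted decreasingly."""
    return sum(f * scdc_codeword_length(i, s) for i, f in enumerate(freqs_desc))


def best_s(freqs_desc: List[int]) -> int:
    """Smallest s in 1..255 that minimises the compressed size."""
    return min(range(1, 256), key=lambda s: compressed_size(freqs_desc, s))
```

With s = 128 the codeword lengths coincide with those of ETDC; other values of s trade one-byte codewords against the number of longer ones.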
4. Compressing with DETDC.
DETDC sources include 2 programs: CDETDC and DDETDC (compressor and decompressor respectively).
Download them here: [download]
syntax: |
./CDETDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
|
TOPSIZE Optional: maximum frequency value (if it is known previously). |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary) |
|
A file timesComp is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DDETDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (a file compressed with DETDC). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value. |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDec is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
|
NOTE: In a pure dynamic scenario, the decompressor/receiver will not
usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those
values are known (that is, if file inFile.voc exists in the same directory as inFile)
the decompressor will take advantage of them to initialize its data structures. If those values are not
provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use
Heaps' Law to estimate them. |
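Heaps' Law states that a text of n words has a vocabulary of roughly V = K * n^beta distinct words, with K and beta depending on the collection. A minimal sketch of such an estimate follows. The function name and the parameter values used in the example are assumptions for illustration; this page does not document the constants that the decompressors actually use.

```python
def heaps_vocabulary_estimate(n_words: int, k: float, beta: float) -> int:
    """Estimate the number of distinct words in a text of n_words words.

    k and beta are collection-dependent constants supplied by the caller.
    """
    if n_words < 0:
        raise ValueError("n_words must be non-negative")
    return int(round(k * n_words ** beta))


# Illustrative values only: k = 10 and beta = 0.5 are not taken from the tools.
estimate = heaps_vocabulary_estimate(10_000, k=10, beta=0.5)
```

Since beta is smaller than 1, the estimate grows sublinearly with the text size, so an initial table sized this way stays small relative to the text.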
5. Compressing with DSCDC.
We developed several heuristics that allow the parameter s to be adjusted as
compression/decompression progresses. The first one, countBytesApproach,
is a bit slower than the second, rangesApproach. However, rangesApproach
is based on a false assumption (which could lead DSCDC to unpredictable adjustments
of s). Therefore, even though we include both variants here, we always use the
countBytes version in our experiments. Sources for both variants include 2
programs: CDSCDC and DDSCDC (compressor and decompressor respectively).
Download them through the following links:
for countBytes version of DSCDC [click here] and
for ranges-approach version of DSCDC [click here]
syntax: |
./CDSCDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
|
TOPSIZE Optional: maximum frequency value (if it is known previously). |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary) |
|
A file timesComp is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DDSCDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (a file compressed with DSCDC). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value. |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDec is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
|
NOTE: In a pure dynamic scenario, the decompressor/receiver will not
usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those
values are known (that is, if file inFile.voc exists in the same directory as inFile)
the decompressor will take advantage of them to initialize its data structures. If those values are not
provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use
Heaps' Law to estimate them. |
6. Compressing with DPH.
DPH sources include 2 programs: CDPH and DDPH (compressor and decompressor respectively).
Download them here: [download]
syntax: |
./CDPH inFile outFile [VOCSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary) |
|
A file timesCDPH is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DDPH inFile outFile [VOCSIZE] |
params: |
inFile (a file compressed with DPH). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDDPH is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
7. Compressing with DLETDC.
DLETDC sources include 2 programs: CDLETDC and DDLETDC (compressor and decompressor respectively).
Download them here: [download]
syntax: |
./CDLETDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
|
TOPSIZE Optional: maximum frequency value (if it is known previously). |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary) |
|
A file timesComp is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DDLETDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (a file compressed with DLETDC). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value. |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDec is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
|
NOTE: In a pure dynamic scenario, the decompressor/receiver will not
usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those
values are known (that is, if file inFile.voc exists in the same directory as inFile)
the decompressor will take advantage of them to initialize its data structures. If those values are not
provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use
Heaps' Law to estimate them. |
8. Compressing with DLSCDC.
DLSCDC sources include 2 programs: CDLSCDC and DDLSCDC (compressor and decompressor respectively).
Download them here: [download]
syntax: |
./CDLSCDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (the uncompressed source text-file), |
|
outFile (the name of the compressed text created) |
results: |
compresses inFile, generating file outFile and outFile.voc (which saves VOCSIZE and TOPSIZE) |
|
VOCSIZE Optional: specifies the max-vocabulary size (if it is known previously). |
|
TOPSIZE Optional: maximum frequency value (if it is known previously). |
results: |
compresses inFile, generating file outFile and outFile.voc (the vocabulary) |
|
A file timesComp is also created after compression. It logs
compression times (useful to perform several runs of the compressor and
then collect all the time measures) |
syntax: |
./DDLSCDC inFile outFile [VOCSIZE TOPSIZE] |
params: |
inFile (a file compressed with DLSCDC). A file inFile.voc
should exist in the same directory |
|
outFile (the file that will be created) |
|
VOCSIZE and TOPSIZE Optional: specify the max-vocabulary size and max-frequency value. |
results: |
decompresses inFile, generating file outFile. |
|
A file timesDec is also created after decompression. It logs
decompression times (useful to perform several runs of the decompressor and
then collect all the time measures) |
|
NOTE: In a pure dynamic scenario, the decompressor/receiver will not
usually know the values VOCSIZE and TOPSIZE. However, in practice, in our dynamic decompressors, if those
values are known (that is, if file inFile.voc exists in the same directory as inFile)
the decompressor will take advantage of them to initialize its data structures. If those values are not
provided (inFile.voc does not exist or VOCSIZE and TOPSIZE are not given), the decompressor will use
Heaps' Law to estimate them. |
Searching text compressed with dense codes.
Preliminaries: Searching framework
In order to be able to perform random searches over our different codes,
and to obtain a fair comparison, we wanted to be able to search for the same random
patterns over those codes. Therefore, we created two programs that allow us to first create
some sets of random patterns that will be used later. These two programs are:
[EXTRACTwords], which generates a list of
candidate patterns by applying a filter by frequency or by pattern length,
and [GenerateRandomPatterns],
which permits us to create N sets of k random patterns. These sets of random patterns will be used by
our searchers when random searches are needed.
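The two-step framework described above can be sketched in Python as follows. This is an illustration under assumptions, not the code of EXTRACTwords or GenerateRandomPatterns: the function names, the parameter names and the use of a word-to-frequency dictionary as input are all hypothetical.

```python
import random
from typing import Dict, List, Optional


def extract_candidates(freqs: Dict[str, int],
                       min_freq: int = 1, max_freq: Optional[int] = None,
                       min_len: int = 1, max_len: Optional[int] = None) -> List[str]:
    """Keep the words whose frequency and length fall in the given ranges."""
    selected = []
    for word, f in freqs.items():
        if f < min_freq or (max_freq is not None and f > max_freq):
            continue
        if len(word) < min_len or (max_len is not None and len(word) > max_len):
            continue
        selected.append(word)
    return sorted(selected)       # sorted so that a fixed seed is reproducible


def generate_random_pattern_sets(candidates: List[str], n_sets: int, k: int,
                                 seed: Optional[int] = None) -> List[List[str]]:
    """Create n_sets sets, each with k distinct patterns drawn at random."""
    rng = random.Random(seed)
    return [rng.sample(candidates, k) for _ in range(n_sets)]
```

Fixing the seed makes it possible to reuse exactly the same pattern sets with every searcher, which is what a fair comparison requires.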
Searches over ETDC and SCDC are performed using variants of the Horspool and Set-Horspool algorithms, which have been adapted to deal with searches over the dense codes.
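To give an idea of what such an adaptation involves, here is a minimal Python sketch of a Horspool search for a single codeword in ETDC-compressed bytes. It assumes the usual ETDC convention in which only the last byte of each codeword has a value of 128 or more. The function name is hypothetical, and the sketch is not the code of the searchers offered for download. The only change with respect to plain Horspool is the check on the byte that precedes a candidate match, which rejects occurrences that are merely the tail of a longer codeword.

```python
from typing import List


def horspool_etdc(text: bytes, code: bytes) -> List[int]:
    """Return the positions where codeword `code` occurs in `text`."""
    m, n = len(code), len(text)
    if m == 0:
        raise ValueError("empty codeword")
    # Standard Horspool shift table, built from all but the last pattern byte.
    shift = [m] * 256
    for j in range(m - 1):
        shift[code[j]] = m - 1 - j
    hits = []
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == code:
            # Valid only if the previous byte ends a codeword (or there is none).
            if pos == 0 or text[pos - 1] >= 128:
                hits.append(pos)
        pos += shift[text[pos + m - 1]]
    return hits
```

For example, in the byte sequence 0x00 0x80 0x80 0x80 the codeword 0x80 occurs at positions 2 and 3 only, because the 0x80 at position 1 is the end of the two-byte codeword 0x00 0x80.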
S.1. Searching code compressed with ETDC.
Two search algorithms were developed: single-pattern search and multi-pattern search.
Download them here:
[SINGLE-PATTERN download]
syntax: |
./selSEARCHERetdc compressedFile patternsFile iterations FTIMES |
syntax: |
./rdnSEARCHERetdc compressedFile patternsFile iterations FTIMES |
params: |
compressedFile (a file compressed with etdc), |
|
patternsFile (A file containing N patterns (one at each line)) |
|
iterations (The number of patterns that should be read from "patternsFile") |
results: |
Searches compressedFile, searching for iterations patterns from patternsFile |
|
A file FTIMES will log search
times (useful to perform several runs of the searcher and later
collect all the time measures) |
[MULTI-PATTERN download]
syntax: |
./SEARCHERetdc compressedFile patternsFile iterations FTIMES numPatternsPerIter strmatlab |
params: |
compressedFile (a file compressed with etdc), |
|
patternsFile (A file containing N patterns (one at each line). The format is "%9 %9 %s\n", as in rdnSEARCHERetdc). |
|
iterations (Number of sets of K random patterns being searched for.) |
|
numPatternsPerIter (The number =K of patterns that should be read from "patternsFile" and searched in parallel over "compressedFile", for each iteration.) |
results: |
Searches compressedFile, searching for iterations patterns from patternsFile |
|
A file FTIMES will log search
times (useful to perform several runs of the searcher and later
collect all the time measures) |
|
The parameter strmatlab is required. It indicates the name (e.g. multietdc) for
the array that will keep the logged times in FTIMES for each search for K-patterns. |
S.2. Searching code compressed with SCDC.
Two search algorithms were developed: single-pattern search and multi-pattern search.
Download them here: [SINGLE-PATTERN download]
syntax: |
./selSEARCHERscdc compressedFile patternsFile iterations FTIMES |
syntax: |
./rdnSEARCHERscdc compressedFile patternsFile iterations FTIMES |
params: |
compressedFile (a file compressed with scdc), |
|
patternsFile (A file containing N patterns (one at each line)) |
|
iterations (The number of patterns that should be read from "patternsFile") |
results: |
Searches compressedFile, searching for iterations patterns from patternsFile |
|
A file FTIMES will log search
times (useful to perform several runs of the searcher and later
collect all the time measures) |
[MULTI-PATTERN download]
syntax: |
./SEARCHERscdc compressedFile patternsFile iterations FTIMES numPatternsPerIter strmatlab |
params: |
compressedFile (a file compressed with scdc), |
|
patternsFile (A file containing N patterns (one at each line). The format is "%9 %9 %s\n", as in rdnSEARCHERscdc). |
|
iterations (Number of sets of K random patterns being searched for.) |
|
numPatternsPerIter (The number =K of patterns that should be read from "patternsFile" and searched in parallel over "compressedFile", for each iteration.) |
results: |
Searches compressedFile, searching for iterations patterns from patternsFile |
|
A file FTIMES will log search
times (useful to perform several runs of the searcher and later
collect all the time measures) |
|
The parameter strmatlab is required. It indicates the name (e.g. multiSCDC) for
the array that will keep the logged times in FTIMES for each search for K-patterns. |
Others
Plain Huffman Code.
In order to have a baseline for our semistatic compressors ETDC and SCDC, we have also implemented a version of Plain Huffman Code (see http://www.dcc.uchile.cl/~gnavarro/ps/tois00.pdf) following the same ideas as ETDC and SCDC, but with different encode and decode algorithms. A canonical Huffman code is implemented based on http://www.cs.mu.oz.au/~alistair/inplace.c Our source code can be downloaded here.
[Plain Huffman]
Boosting Lzgrep.
Researchers interested in performing searches over text compressed with gzip can use the original Lzgrep software. Our modified Lzgrep, which works over text preprocessed with ETDC and finally compressed with gzip, can be downloaded here: [boosted-Lzgrep].
Summary Table of Downloads.