Word-based self-indexing (WCSA, WSSA, flexible WCSA)

This website presents the first word-based self-indexes designed for indexing natural language text.

We first present our word-based self-indexes (word-CSA and word-SSA), which were born as adaptations of their character-based counterparts so that they can handle very large alphabets. On natural language text collections, they typically require in-memory space of around 35-45% of the size of the source data. Within that space, they both include an implicit representation of the text (so that it does not need to be kept elsewhere) and allow fast indexed searches. In particular, owing to their suffix-array-based origins, they are very successful when searching for phrases. Both WCSA and WSSA offer several parameters that allow the space/time trade-off to be adjusted.
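To illustrate why a suffix-array-based design handles phrase queries well, here is a minimal sketch of a plain, uncompressed word-level suffix array. It is not the WCSA or WSSA implementation: the function names `build_word_suffix_array` and `locate_phrase` are hypothetical, the text is kept explicitly rather than being represented implicitly, and no compression is applied. It only shows that all occurrences of a phrase fall in one contiguous range of the suffix array, which two binary searches can delimit.

```python
from typing import List


def build_word_suffix_array(words: List[str]) -> List[int]:
    """Return word positions sorted by the word sequence starting at each one.

    Illustrative only: this naive construction compares whole suffixes and
    is far too slow and too large for real collections.
    """
    return sorted(range(len(words)), key=lambda i: words[i:])


def locate_phrase(words: List[str], sa: List[int], phrase: List[str]) -> List[int]:
    """Return the sorted word positions where `phrase` occurs in `words`."""
    m = len(phrase)

    # First binary search: leftmost suffix whose first m words are >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if words[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose first m words are > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if words[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the phrase.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "to be or not to be that is the question".split()
    sa = build_word_suffix_array(text)
    print(locate_phrase(text, sa, ["to", "be"]))
```

Because the alphabet here is the vocabulary of words rather than characters, a phrase of a few words is a very short pattern for the index, which is one intuition behind the good phrase-search behaviour described above.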

We also include our flexible word-based self-indexes. They follow the same ideas behind their non-flexible counterparts, while adding support for operations that are typical in natural language processing, such as stemming, disregarding word separators, skipping stopwords, and case folding.
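As a rough illustration of the kind of preprocessing listed above, the following sketch maps a text to a sequence of canonical words while remembering where each word came from. It is an assumption-laden toy, not the mechanism used by our flexible indexes: the `normalize` function and the tiny `STOPWORDS` set are hypothetical, and stemming is left out because it would need a real stemmer.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately tiny stopword list used only for this example.
STOPWORDS = {"the", "a", "of", "to"}


def normalize(text: str) -> List[Tuple[str, int]]:
    """Return (canonical word, character offset in the original text) pairs."""
    tokens = []
    for match in re.finditer(r"\w+", text):  # separators are disregarded
        word = match.group().casefold()      # case folding
        if word in STOPWORDS:                # stopword skipping
            continue
        tokens.append((word, match.start()))
    return tokens


if __name__ == "__main__":
    for word, offset in normalize("The Art of Computer-Programming"):
        print(offset, word)
```

Keeping the original offsets alongside the canonical words is one simple way to let a search over the normalized sequence report matches in terms of the original text.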

We have developed our indexes following well-defined interfaces, so that future developments in self-indexing can be compared against our prototypes with little effort.

We describe the basics and structure of our prototypes in sections WSI.html, ISI.html, and FWSI.html, where we also discuss some details of our source code. Finally, all the source code is available for download at downloads.html, and a list of related publications can be found in publications.html.

For a more detailed description, please refer to this publication:

Antonio Fariña, Nieves Brisaboa, Gonzalo Navarro, Francisco Claude, Ángeles Places, and Eduardo Rodríguez. Word-based Self-Indexes for Natural Language Text. ACM Transactions on Information Systems (TOIS), 30(1), 34 pages, 2012. [abstract] [bibtex] [pdf]

About us...

We are researchers from the University of A Coruña (Spain), the University of Chile (Chile), the University of Waterloo (Canada), and Yahoo! Research Santiago (Chile).

Gonzalo Navarro (PhD). Computer Science Department. UChile
Nieves R. Brisaboa (PhD). Database Lab. UDC
Antonio Fariña (PhD). Database Lab. UDC
Ángeles S. Places (PhD). Database Lab. UDC
Eduardo Rodríguez (PhD student). Database Lab. UDC
Francisco Claude (PhD student). Univ. Waterloo
Diego Arroyuelo (PhD). Yahoo! Research. Santiago. Chile

About our developments

Please note that the software provided here consists of the prototypes we used to validate our techniques empirically (as part of our research), and is not a final product.

Source code is available in the downloads section.