You can download our software, including the source code for the prototypes of our word-based self-indexes and some additional utilities to create search patterns from a given text. In addition, we provide links to source text collections you can use (from the pizza-chili site), and the lists of stopwords we used in our flexible word-based self-indexes.

Int-based self-indexes (ISI)

- ICSA source code: icsa.tar.gz.
- ISSA source code: issa.tar.gz.

Word-based self-indexes (WSI)

- WCSA source code: wcsa.tar.gz.
- WSSA source code: wssa.tar.gz.

Flexible word-based self-indexes (FWSI)

- FWCSA source code (no-stemming): fwcsa.tar.gz.
- FWCSA source code (Porter's stemming): fwcsa.Porter.tar.gz.

Generating random patterns for a given text collection

- Generating patterns for words and phrases: extractorPatterns.tar.gz.
- Generating random intervals (for extractWords): genintervals.tar.gz.
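
To give an idea of what pattern generation involves, here is a minimal Python sketch that picks random phrases and random word intervals from a text. It is not the code shipped in extractorPatterns.tar.gz or genintervals.tar.gz: the function names (tokenize, sample_phrases, sample_intervals), the tokenization rule, and the fixed seed are all assumptions made for illustration.

```python
import random
import re


def tokenize(text):
    # Assumption: a "word" is a maximal run of alphanumeric characters.
    return re.findall(r"\w+", text)


def sample_phrases(words, phrase_len, count, seed=0):
    """Pick `count` random phrases of `phrase_len` consecutive words."""
    last_start = len(words) - phrase_len
    if last_start < 0:
        raise ValueError("text is shorter than the requested phrase length")
    rng = random.Random(seed)  # fixed seed so a pattern set can be regenerated
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [" ".join(words[s:s + phrase_len]) for s in starts]


def sample_intervals(n_words, interval_len, count, seed=0):
    """Pick `count` random (start, end) word intervals; `end` is exclusive."""
    last_start = n_words - interval_len
    if last_start < 0:
        raise ValueError("text is shorter than the requested interval length")
    rng = random.Random(seed)
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [(s, s + interval_len) for s in starts]


words = tokenize("the quick brown fox jumps over the lazy dog")
phrases = sample_phrases(words, phrase_len=2, count=5, seed=1)
intervals = sample_intervals(len(words), interval_len=3, count=4, seed=2)
```

Because every phrase is taken from the text itself, each generated pattern is guaranteed to occur at least once in the collection.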

Other stuff: Text collections, topics,...

Downloading text collections from the pizza-chili site

You can download some English texts (such as the 100MB text we used in our sample scripts) from the pizza-chili site.

List of Stop-words

We found three different lists of stopwords on the Internet.

In our experiments we used the shortest one (C). List (B) was the one pointed to by Wikipedia, yet it contains words such as "system" or "computer".
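
The choice of list matters because any word in it is discarded before searching. The following Python sketch illustrates the effect. The two tiny lists are made up for the example and are not lists (B) or (C), and remove_stopwords is a hypothetical helper, not part of our indexes.

```python
import re


def remove_stopwords(text, stopwords):
    # Lowercase the text, split it into words, and drop every stopword.
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in stopwords]


# Made-up lists: a short one, and a longer one that also contains
# content-bearing words such as "system" and "computer".
short_list = {"the", "a", "of", "is"}
long_list = short_list | {"system", "computer"}

query = "The architecture of a computer system"
kept_short = remove_stopwords(query, short_list)
kept_long = remove_stopwords(query, long_list)

print(kept_short)  # ['architecture', 'computer', 'system']
print(kept_long)   # ['architecture']
```

With the longer list, a query about computer systems loses most of its meaningful terms.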

INEX 2009-2010 Wikipedia Dataset and patterns

For researchers looking for a large dataset, a 50GiB XML Wikipedia collection is available at http://www.mpi-inf.mpg.de/departments/d5/software/inex. INEX also provides two query logs for download (free registration required): 2010 (https://inex.mmci.uni-saarland.de/protected/adhoc/2010-topics.xml) and 2009 (https://inex.mmci.uni-saarland.de/protected/adhoc/2009-topics.zip). From these query logs, we extracted a list of 429 queries, which is available here: [plain] [formatted patterns]. We also provide a simple Java program (filterDocs) that filters out the XML tags from the 50GiB dataset, reducing its size to around 8.7GiB of text content.
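
As a rough idea of what such tag filtering does, the following Python sketch keeps only the text content of a small XML document. It is not the filterDocs program and may not match its output: xml_to_text is a hypothetical helper, the sample document is made up, and the whole document is parsed in memory, which would not be suitable as-is for the full collection.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Return the text content of an XML document, with all tags removed."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order.
    pieces = (piece.strip() for piece in root.itertext())
    # Drop whitespace-only fragments and join the rest with single spaces.
    return " ".join(piece for piece in pieces if piece)


doc = (
    "<article><title>Suffix array</title>"
    "<p>A <b>suffix array</b> is sorted.</p></article>"
)
text = xml_to_text(doc)
print(text)  # Suffix array A suffix array is sorted.
```

The output is shorter than the input because the markup is dropped and only the character data is kept.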

Supported in part by MICINN (PGE and FEDER) grants TIN2006-15071-C03-03, TIN2009-14560-C03-02, TIN2010-21246-C02-01, and CDTI CEN-20091048; Xunta de Galicia grants (FEDER) 2010/17 and (Agrupación Estratéxica) CN 2012/211; and AECI grant A/8065/07.