How to create a new index
If you want to use a reference corpus other than the English Wikipedia, you will have to create your own Lucene index. We added some classes that should help you with this task.
First of all, we have to make clear that there are two types of indexes. The simpler index supports the boolean document, boolean paragraph, and boolean sentence based probability estimations. If the newly created index should also work with a window based probability estimation, e.g., the sliding window, you will have to create an index that stores the positions of the tokens inside the documents.
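The difference between the two estimation types can be illustrated with a small sketch. It is not Palmetto's implementation; the class name `CooccurrenceSketch` and both method names are hypothetical, and we assume that documents are whitespace-separated token strings and that a document shorter than the window counts as one window. The boolean document count only needs to know which documents contain a word, while the sliding window count needs to know where each token occurs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Palmetto.
final class CooccurrenceSketch {

    /** Boolean document: counts the documents that contain both words at least once. */
    static int booleanDocumentCount(List<String> docs, String w1, String w2) {
        int count = 0;
        for (String doc : docs) {
            List<String> tokens = Arrays.asList(doc.split("\\s+"));
            if (tokens.contains(w1) && tokens.contains(w2)) {
                count++;
            }
        }
        return count;
    }

    /** Sliding window: counts the windows of the given size that contain both words. */
    static int slidingWindowCount(List<String> docs, String w1, String w2, int windowSize) {
        int count = 0;
        for (String doc : docs) {
            String[] tokens = doc.split("\\s+");
            // Assumption: a document shorter than the window is treated as a single window.
            int lastStart = Math.max(0, tokens.length - windowSize);
            for (int start = 0; start <= lastStart; start++) {
                int end = Math.min(tokens.length, start + windowSize);
                List<String> window = Arrays.asList(tokens).subList(start, end);
                if (window.contains(w1) && window.contains(w2)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Two words that appear in the same document but far apart are counted by the first method and not by the second, which is why the window based approach cannot be served by an index without positions.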
In our paper, we showed that one could use not only a boolean document based probability estimation, but also a paragraph or sentence based estimation. Regarding the classes used for the index creation, all three types are the same. Their difference lies in the preprocessing. For the boolean paragraph based probability estimation, there has to be a preprocessing step that splits the documents into paragraphs. From that point on, the single paragraphs are handled like normal documents. The same can be done with sentences. Thus, for the rest of this page we will only differentiate between indexes that only support boolean documents and those that support a window based approach, too.
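A minimal sketch of such a paragraph splitting step could look as follows. It is only an illustration and not part of Palmetto: the class `ParagraphSplitter` is hypothetical, and we assume that paragraphs in the raw text are separated by blank lines and that the split happens before the tokens are joined with whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Palmetto.
final class ParagraphSplitter {

    /** Splits every document at blank lines and returns all paragraphs as separate documents. */
    static List<String> splitIntoParagraphs(List<String> documents) {
        List<String> paragraphs = new ArrayList<>();
        for (String document : documents) {
            // Assumption: a line break, optional whitespace, and another line break end a paragraph.
            for (String paragraph : document.split("\\n\\s*\\n")) {
                String trimmed = paragraph.trim();
                if (!trimmed.isEmpty()) {
                    paragraphs.add(trimmed);
                }
            }
        }
        return paragraphs;
    }
}
```

The returned paragraphs can then go through the same preprocessing and indexing as whole documents would.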
Preprocessing your documents is not part of Palmetto. Thus, we will assume that you have already preprocessed your documents. It is a very good idea to use the same preprocessing for the documents of the reference corpus as for the documents of the corpus on which your topics are trained. If you are using different preprocessings, e.g., the training documents are stemmed while the reference documents are lemmatized, the words of your topics may not match the tokens in the index and you could encounter some weird effects.
At the end of the preprocessing, you should have
- a list of Strings, each containing the tokens of one document separated by whitespace
- the number of tokens contained in each String (only for position storing indexes)
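As a rough sketch, the two values listed above could be produced like this. The tokenization used here (lowercasing and splitting at everything that is neither a letter nor a digit) is only a placeholder for your own preprocessing, and the class `PreprocessingSketch` with its holder `PreparedDocument` is hypothetical and not part of Palmetto.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration, not part of Palmetto.
final class PreprocessingSketch {

    /** Hypothetical holder for the token string and the token count of one document. */
    static final class PreparedDocument {
        final String text;        // tokens separated by single spaces
        final int numberOfTokens; // only needed for position storing indexes

        PreparedDocument(String text, int numberOfTokens) {
            this.text = text;
            this.numberOfTokens = numberOfTokens;
        }
    }

    /** Placeholder preprocessing: lowercases the text and keeps runs of letters and digits. */
    static PreparedDocument prepare(String rawText) {
        List<String> tokens = new ArrayList<>();
        for (String token : rawText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return new PreparedDocument(String.join(" ", tokens), tokens.size());
    }
}
```

Counting the tokens in the same step that produces the token string keeps both values consistent with each other.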
To create the simpler index, you can simply use the org.aksw.palmetto.corpus.lucene.creation.SimpleLuceneIndexCreator class.
File indexDir = // some directory that shall contain your index
Iterator<String> docIterator = // an iterator that can iterate over the reference documents
SimpleLuceneIndexCreator creator = new SimpleLuceneIndexCreator(Palmetto.DEFAULT_TEXT_INDEX_FIELD_NAME);
creator.createIndex(indexDir, docIterator);
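How the document iterator is obtained depends on how you store your corpus. As one possible sketch, assuming that the preprocessed corpus is a UTF-8 text file with one document per line (this file layout and the class `DocumentIterators` are our assumptions, not something Palmetto prescribes), the documents can be streamed lazily so that the whole corpus does not have to fit into memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical illustration, not part of Palmetto.
final class DocumentIterators {

    /**
     * Opens a lazy stream over a corpus file that contains one preprocessed document per line.
     * Empty lines are skipped. The caller has to close the returned stream.
     */
    static Stream<String> documents(Path corpusFile) throws IOException {
        return Files.lines(corpusFile, StandardCharsets.UTF_8)
                .map(String::trim)
                .filter(line -> !line.isEmpty());
    }
}
```

Calling iterator() on the returned stream inside a try-with-resources block yields an Iterator<String> that could be passed on as the docIterator, while the file is closed once indexing has finished.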
Creating the position storing index that supports the usage of window based coherences is a little bit more complicated. First of all, you will have to create a list of org.aksw.palmetto.corpus.lucene.creation.IndexableDocument objects. Every object contains the text of a document and the number of tokens the document contains. After that, you can create the index using the org.aksw.palmetto.corpus.lucene.creation.PositionStoringLuceneIndexCreator class. Note that after the creation of the index, you will have to create a histogram file using the org.aksw.palmetto.corpus.lucene.creation.LuceneIndexHistogramCreator class. (Please do not forget this last step, otherwise the coherence calculation won't work.)
PositionStoringLuceneIndexCreator creator = new PositionStoringLuceneIndexCreator(
        Palmetto.DEFAULT_TEXT_INDEX_FIELD_NAME, Palmetto.DEFAULT_DOCUMENT_LENGTH_INDEX_FIELD_NAME);
File indexDir = // some directory that shall contain your index
Iterator<IndexableDocument> docIterator = // an iterator that can iterate over the reference documents
if (creator.createIndex(indexDir, docIterator)) {
    LuceneIndexHistogramCreator histogramCreator = new LuceneIndexHistogramCreator(
            Palmetto.DEFAULT_DOCUMENT_LENGTH_INDEX_FIELD_NAME);
    histogramCreator.createLuceneIndexHistogram(indexDir.getAbsolutePath());
}
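To give an intuition for why a histogram is useful for window based estimations, here is a small sketch. It rests on assumptions that this page does not confirm: that the histogram records how many documents have each length (the creator is given the document length field name), and that a document shorter than the window counts as one window. The class `LengthHistogramSketch` is hypothetical and is not the code of LuceneIndexHistogramCreator.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of Palmetto.
final class LengthHistogramSketch {

    /** Maps each document length (in tokens) to the number of documents with that length. */
    static Map<Integer, Integer> lengthHistogram(int[] documentLengths) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int length : documentLengths) {
            histogram.merge(length, 1, Integer::sum);
        }
        return histogram;
    }

    /** Total number of sliding windows of the given size over all documents. */
    static long numberOfWindows(Map<Integer, Integer> histogram, int windowSize) {
        long windows = 0;
        for (Map.Entry<Integer, Integer> entry : histogram.entrySet()) {
            int length = entry.getKey();
            // Assumption: a document longer than the window yields (length - windowSize + 1)
            // windows, while a shorter document is counted as a single window.
            long windowsPerDocument = length > windowSize ? length - windowSize + 1 : 1;
            windows += windowsPerDocument * entry.getValue();
        }
        return windows;
    }
}
```

With such a histogram, the total number of windows for any window size can be computed without reading every document again, and a probability estimate needs this total as its denominator.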
We know that this creation process is awkward: the given texts are already tokenized, so Lucene could count the tokens as well. Unfortunately, we didn't have the time to implement it in a more elegant way. Sorry for this!