Upgrading code to Lucene 4


For a client last year, we had to upgrade some old Lucene code to Lucene 4. Lucene 4 was a rather large release, and there are many aspects to be aware of when upgrading non-trivial code. Let’s take a look at some of them.

Atomic Readers

As in previous versions, a Lucene 4 index is composed of multiple segments, each being a complete inverted index. In previous versions of Lucene, however, the API did not reflect this fact as clearly as it does in Lucene 4. For example, one could, using an IndexReader, obtain the fields or terms of the entire index; behind the scenes, Lucene iterated over all the segments and merged their contents to present a unified view. This process is not always efficient.

In Lucene 4, this API was substantially reworked to make it clear whether a single segment or multiple segments are being accessed. There are now two subclasses of IndexReader: AtomicReader and CompositeReader. The former reads from a single segment, while the latter reads from multiple segments. We can no longer obtain the fields or terms from an IndexReader; all the methods exposing such information now live on AtomicReader. This means that if we want this information for the entire index, it is our responsibility to iterate over all the AtomicReaders and collect what we need. Since we know exactly which information we need, doing this ourselves can be faster than letting Lucene merge everything from the segments.

How do we obtain those AtomicReaders? We need to call the leaves() method of IndexReader, which is new in Lucene 4. If the IndexReader is actually a CompositeReader, leaves() returns a list of AtomicReaderContexts, from which we can get the AtomicReaders. If one’s index is on disk and a DirectoryReader is used to open it, that instance is effectively a CompositeReader, because DirectoryReader is a subclass of it.

For example, the following code loads all the field names of an index:

Set<String> fieldNames = new TreeSet<String>();
for (AtomicReaderContext context : reader.leaves()) {
	AtomicReader atomicReader = context.reader();
	// fields() exposes only this segment's fields and may return null
	Fields fields = atomicReader.fields();
	if (fields != null) {
		for (String fieldName : fields) {
			fieldNames.add(fieldName);
		}
	}
}

What if you cannot change your code to iterate over all the AtomicReaders? In that case, SlowCompositeReaderWrapper.wrap() can be used: given a CompositeReader, it returns a merged AtomicReader view by iterating over all the segments (as in previous versions of Lucene). Be aware, however, that this is slower than iterating through the AtomicReaders yourself.
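
For example, here is a minimal sketch reusing the reader variable from the snippet above:

// expose the whole index behind a single (slow) AtomicReader
AtomicReader slowReader = SlowCompositeReaderWrapper.wrap(reader);
Fields fields = slowReader.fields();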

For more information on atomic readers, please read this excellent article on the subject by Lucene committer Uwe Schindler.

Deletions

In Lucene 4, IndexReader no longer has an isDeleted() method to indicate whether or not a document has been deleted. This information is now available at the AtomicReader level, but not via an isDeleted() method. Instead, we have to use the new getLiveDocs() method, which returns a bit set marking the documents that still exist.

Bits liveDocs = atomicReader.getLiveDocs();
if (liveDocs != null && !liveDocs.get(documentID)) {
	// this document is deleted...
}

The method returns null when there are currently no deleted documents, so we need to check for this. In essence, we no longer ask whether a document is deleted, but whether it still exists.

Again, this method is available at the AtomicReader level, so if we don’t know which segment a document belongs to, we need to iterate over all of them, as in the sketch below.
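
Here is a minimal sketch, assuming globalDocID is an index-wide document number (the docBase field of each AtomicReaderContext gives the start of that segment’s slice of the numbering):

for (AtomicReaderContext context : reader.leaves()) {
	AtomicReader atomicReader = context.reader();
	// map the index-wide document number to a segment-local one
	int localDocID = globalDocID - context.docBase;
	if (localDocID >= 0 && localDocID < atomicReader.maxDoc()) {
		Bits liveDocs = atomicReader.getLiveDocs();
		boolean exists = (liveDocs == null) || liveDocs.get(localDocID);
		// do something with the result...
	}
}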

IndexReader is read-only!

In previous Lucene versions, IndexReader had write operations, such as committing or deleting documents. This is no longer the case in Lucene 4: an IndexReader is now for read-only operations, as the name itself suggests. All the write operations are available via IndexWriter.
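
For example, deleting documents is now done through IndexWriter. Here is a minimal sketch, assuming an existing directory and analyzer, and a hypothetical id field:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter writer = new IndexWriter(directory, config);
// delete every document whose "id" field contains the term "42"
writer.deleteDocuments(new Term("id", "42"));
writer.commit();
writer.close();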

Terms iteration

The code needed to iterate over the terms of a field has changed a bit in Lucene 4. First, as we have mentioned a few times, the method to iterate over the terms is available at the AtomicReader level, so it only returns the terms stored in a single segment.

Terms terms = atomicReader.terms("title");
if (terms != null) { // null if this segment has no "title" field
	TermsEnum te = terms.iterator(null);
	BytesRef term;
	while ((term = te.next()) != null) {
		// do something with the term...
	}
}

A term is now represented by the BytesRef class. This highlights the fact that in Lucene 4, terms are stored as bytes; the default encoding is UTF-8. To convert a term to a Java string, we can use term.utf8ToString() (unless the terms were indexed with a different encoding, using a custom Analyzer, for example).
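
A quick round-trip illustration:

BytesRef bytes = new BytesRef("café"); // the string is encoded as UTF-8 bytes
String text = bytes.utf8ToString();    // and decoded back to "café"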

FieldSelector removed

In previous Lucene versions, loading a document with only some of its fields required using a FieldSelector implementation to accept or reject fields. Lucene 4 introduces an easier way: simply provide a set of field names, and only those fields will be loaded.

Set<String> fieldNames = new HashSet<String>(Arrays.asList("title", "summary"));
Document doc = reader.document(documentID, fieldNames);

Custom analyzers

To create a custom analyzer, we need to extend the Analyzer class. This has not changed in Lucene 4. The method to implement has changed a bit, however: in previous versions it was called tokenStream(); it is now called createComponents() and returns a TokenStreamComponents, which makes it easy for Lucene to reuse the TokenStream.

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // version is a Version constant, e.g. Version.LUCENE_40
        Tokenizer tokenizer = new StandardTokenizer(version, reader);
        TokenFilter filter = new LowerCaseFilter(version, tokenizer);
        // any stop word set can be used; here, Lucene's default English set
        filter = new StopFilter(version, filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(tokenizer, filter);
    }
}
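
A quick usage sketch, assuming the MyAnalyzer class above:

Analyzer analyzer = new MyAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", new StringReader("Some text to analyze"));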

Iterating the tokens out of a TokenStream

In Lucene 4, the code needed to iterate the tokens out of a TokenStream has also changed a bit. If one is interested only in the tokens themselves, here is how it can be done now:

CharTermAttribute termAttr = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
	String token = termAttr.toString();
	// do something with the token...
}
tokenStream.end();
tokenStream.close();

Other attributes can be added if needed, for example, to obtain the starting and ending character offsets of each token:

CharTermAttribute termAttr = tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offsetAttr = tokenStream.addAttribute(OffsetAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
	String token = termAttr.toString();
	int start = offsetAttr.startOffset();
	int end = offsetAttr.endOffset();
	// do something with the token and its offsets...
}
tokenStream.end();
tokenStream.close();

Don’t forget the call to reset()! In previous versions, this was optional, but not anymore.

Conclusion

This is only a small subset of all the changes we need to be aware of when upgrading code to Lucene 4. A more complete list is available in this blog post. Don’t forget to also consult the migration guide, which contains other useful information.

If you need help with any Lucene code, don’t hesitate to contact us!