Facets with Lucene

Posted on August 1, 2014 in Latest Articles

During the development of our latest product, Norconex Content Analytics, we decided to add facets to the search interface. They make it easy to explore the indexed content. Solr and Elasticsearch both have facet implementations that work on top of Lucene. But Lucene also offers simple facet implementations that can be used out of the box. And because Norconex Content Analytics is based on Lucene, we decided to go with those implementations.

We’ll look at those facet implementations in this blog post, but first, let’s talk about a new feature of Lucene 4 that is used by all of them.


DocValues are a new addition to Lucene 4. What are they? Simply put, they are a means of associating a value with each document. We can have multiple DocValues per document. Later on, we can retrieve the value associated with a specific document.

You may wonder how a DocValue differs from a stored field. The difference is that they are not optimized for the same usage. Whereas all stored fields of a single document are meant to be loaded together (for example, when we need to display the document as a search result), the DocValues of a field are meant to be loaded at once for all documents. For example, if you need to retrieve all the values of a field for all documents, then iterating over all documents and retrieving a stored field would be slow. But if a DocValue is used, loading the values for all documents is easy and efficient. DocValues are stored in a column-stride fashion, so all the values of a field are kept together for easy access. You can learn more about DocValues in multiple places on the Web.
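To make the difference concrete, here is a minimal sketch in plain Java (no Lucene involved). The class and method names are made up for this illustration; it only models the two layouts and says nothing about how Lucene encodes them on disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two layouts; not Lucene code.
final class ColumnStrideSketch {

    // Row-oriented layout (stored-field style): each document carries all its fields.
    static long sumFromRows(List<Map<String, Object>> storedDocs, String field) {
        long total = 0;
        for (Map<String, Object> doc : storedDocs) {
            total += (Long) doc.get(field); // every document has to be visited as a whole
        }
        return total;
    }

    // Column-oriented layout (DocValues style): one array per field, indexed by doc id.
    static long sumFromColumn(long[] column) {
        long total = 0;
        for (long value : column) {
            total += value; // the values of the field sit next to each other
        }
        return total;
    }

    public static void main(String[] args) {
        long[] prices = {10L, 25L, 7L};
        List<Map<String, Object>> storedDocs = new ArrayList<>();
        for (int docId = 0; docId < prices.length; docId++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("title", "doc " + docId);
            doc.put("price", prices[docId]);
            storedDocs.add(doc);
        }
        System.out.println(sumFromRows(storedDocs, "price"));
        System.out.println(sumFromColumn(prices));
    }
}
```

Both methods return the same total. The point is only that the second one reads a single contiguous structure instead of opening every document.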

Lucene allows different kinds of DocValues. We have numeric, binary, sorted (single string) and sorted set (multiple strings) DocValues. For example, a DocValue to store a “price” value should probably be a numeric DocValue. But if you want to save an alphanumeric identifier for each document, a sorted DocValue should be used.

Why is this important for faceting? Before DocValues, when an application wanted to do faceting, a common approach to building the facet values was field uninverting, that is, going over the terms of an indexed field and rebuilding the original association between documents and their values. This process needed to be redone every time new documents were indexed. But with DocValues, since the association between a document and a value is maintained in the index, the work needed to build facets is much simpler.

Let’s now look at the facet implementations in Lucene.
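As a rough illustration of what uninverting means, here is a small plain-Java sketch. It assumes a single-valued field and models the inverted index as a simple map from term to document IDs; the names are hypothetical and this is not how Lucene implements it internally.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of field uninverting; not Lucene code.
final class UninvertSketch {

    // Turns "term -> documents" back into "document -> term" (single-valued field assumed).
    static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    // Once the per-document values exist, counting facets is one pass over the hits.
    static Map<String, Integer> countFacets(String[] valueByDoc, int[] matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : matchingDocs) {
            String value = valueByDoc[docId];
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new HashMap<>();
        postings.put("adams", Arrays.asList(0, 2));
        postings.put("murakami", Arrays.asList(1));
        String[] valueByDoc = uninvert(postings, 3);
        System.out.println(countFacets(valueByDoc, new int[] {0, 1, 2}));
    }
}
```

With DocValues, the equivalent of the valueByDoc array is already maintained in the index, so the uninvert step is no longer needed.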

String Facet

The first facet implementation available in Lucene that we will look at is what we expect when we think of facets. It allows for counting the documents that share the same string value.

When indexing a document, you have to use a SortedSetDocValuesFacetField. Here is an example:

With this, Lucene will create a “facet_author” field with the author value indexed in it. But Lucene will also create a DocValue named “facet_author” containing the value. When building the facets at search-time, this DocValue will be used.

You’ve probably also noticed the FacetsConfig object. It allows us to associate a dimension name (“author”) with a field name (“facet_author”). Actually, when Lucene indexes the value in the “facet_author” field and DocValue, it also prefixes the value with the dimension name. This allows us to have different facets (dimensions) indexed in the same field and DocValue. If we had omitted the call to setIndexFieldName, the facets would have been indexed in a field called “$facets” (with the same name for the DocValue).
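The idea of prefixing values with the dimension name can be sketched in plain Java as follows. The separator character and all names here are assumptions chosen for the illustration; Lucene's actual encoding is an internal detail that this sketch does not claim to reproduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of several dimensions sharing one field; not Lucene code.
final class DimensionPrefixSketch {

    // Separator chosen for this sketch only.
    static final char SEP = '\u001F';

    // Prefix the value with its dimension so both can live in the same field.
    static String encode(String dimension, String value) {
        return dimension + SEP + value;
    }

    // Count only the values belonging to one dimension of the shared field.
    static Map<String, Integer> countDimension(List<String> sharedFieldValues, String dimension) {
        String prefix = dimension + SEP;
        Map<String, Integer> counts = new TreeMap<>();
        for (String encoded : sharedFieldValues) {
            if (encoded.startsWith(prefix)) {
                counts.merge(encoded.substring(prefix.length()), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> shared = Arrays.asList(
                encode("author", "Adams"),
                encode("author", "Adams"),
                encode("publisher", "Pan"));
        System.out.println(countDimension(shared, "author"));
        System.out.println(countDimension(shared, "publisher"));
    }
}
```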

At search time, here is the code we would use to gather the author facets:

Here, the DefaultSortedSetDocValuesReaderState will be responsible for loading all the dimensions from the specified DocValue (facet_author). Note that this “state” object is costly to build, so it should be re-used if possible. Then, SortedSetDocValuesFacetCounts will be able to load the values of a specific dimension using the “state” object and to compute the count for each distinct value.

You can find more code examples in the file SimpleSortedSetFacetsExample.java in the Lucene sources.

Numeric Range Facet

This next facet implementation is to be used with numbers to build range facets. For example, it would group documents of the same price range together.

When indexing, you have to add a numeric DocValue to each document, like this:

In that case, we only need to use a standard NumericDocValuesField and not a specialized FacetField.

When searching, we need to first define the set of ranges that we want. Here is how it could be built:

With those ranges, we can build the facets:

Lucene will calculate the count for each range.
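Conceptually, the counting boils down to checking each matching document's numeric value against every range. Here is a plain-Java sketch of that idea, with made-up class names and inclusive bounds assumed; it is not the Lucene implementation, which is more optimized.

```java
// Hypothetical model of range facet counting; not Lucene code.
final class RangeCountSketch {

    static final class Range {
        final String label;
        final long min; // inclusive
        final long max; // inclusive

        Range(String label, long min, long max) {
            this.label = label;
            this.min = min;
            this.max = max;
        }

        boolean accept(long value) {
            return value >= min && value <= max;
        }
    }

    // valueByDoc plays the role of the numeric DocValue: one value per document id.
    static int[] count(Range[] ranges, long[] valueByDoc, int[] matchingDocs) {
        int[] counts = new int[ranges.length];
        for (int docId : matchingDocs) {
            long value = valueByDoc[docId];
            for (int i = 0; i < ranges.length; i++) {
                if (ranges[i].accept(value)) {
                    counts[i]++; // ranges may overlap, so a document can count more than once
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Range[] ranges = {
            new Range("0-9", 0, 9),
            new Range("10-99", 10, 99),
            new Range("100+", 100, Long.MAX_VALUE)
        };
        long[] prices = {5, 50, 500, 7};
        int[] counts = count(ranges, prices, new int[] {0, 1, 3});
        for (int i = 0; i < ranges.length; i++) {
            System.out.println(ranges[i].label + ": " + counts[i]);
        }
    }
}
```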

For a code sample, see RangeFacetsExample.java in the Lucene sources.

Taxonomy Facet

This was the first facet implementation, and it was actually available before Lucene 4. This implementation differs from the others in several respects. First, all the unique values for a taxonomy facet are stored in a separate Lucene index (often called the sidecar index). Second, this implementation supports hierarchical facets.

For example, imagine a “path” facet where “path” represents where a file is located on a filesystem (or the Web). Imagine the file “/home/mike/work/report.txt”. If we were to store the path (“/home/mike/work”) as a taxonomy facet, it would actually be split into 3 unique values: “/home”, “/home/mike” and “/home/mike/work”. Those 3 values are stored in the sidecar index, each being assigned a unique ID. In the main index, a binary DocValue is created so that each document is assigned the ID of its corresponding path value (the ID from the sidecar index). In this example, if “/home/mike/work” was assigned ID 3 in the sidecar index, the DocValue for the document “/home/mike/work/report.txt” would be 3 in the main index. In the sidecar index, all values are linked together, so it is easy later on to retrieve the parents and children of each value. For example, “/home” would be the parent of “/home/mike”, which would be the parent of “/home/mike/work”. We’ll see how this information is used.
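The bookkeeping done by the sidecar index can be pictured with a small plain-Java sketch. Everything here is hypothetical (class name, in-memory maps, ID 0 reserved for a root entry); it only mirrors the description above, not Lucene's taxonomy writer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a taxonomy (sidecar) index; not Lucene code.
final class TaxonomySketch {

    private final Map<String, Integer> idByPath = new HashMap<>();
    private final List<String> pathById = new ArrayList<>();
    private final List<Integer> parentById = new ArrayList<>();

    TaxonomySketch() {
        // This sketch reserves ID 0 for a root entry that has no parent.
        idByPath.put("/", 0);
        pathById.add("/");
        parentById.add(-1);
    }

    // Registers every prefix of the path and returns the ID of the full path.
    int addPath(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            path.append('/').append(component);
            String key = path.toString();
            Integer id = idByPath.get(key);
            if (id == null) {
                id = pathById.size();
                idByPath.put(key, id);
                pathById.add(key);
                parentById.add(parent);
            }
            parent = id;
        }
        return parent;
    }

    String path(int id) {
        return pathById.get(id);
    }

    int parent(int id) {
        return parentById.get(id);
    }

    public static void main(String[] args) {
        TaxonomySketch taxonomy = new TaxonomySketch();
        int id = taxonomy.addPath("home", "mike", "work");
        // The main index would only store this ID for the document.
        System.out.println(id + " -> " + taxonomy.path(id));
        System.out.println("parent: " + taxonomy.path(taxonomy.parent(id)));
    }
}
```

The main index would store only the returned ID per document, while the path strings and parent links stay in the taxonomy structure.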

Here is some code to index the path facet of a file under “/home/mike/work”:

Notice here that we need to create a taxonomy writer, which is used to write to the sidecar index. After that, we can add the actual facets. As with SortedSetDocValuesFacetField, we need to define the configuration of the facet field (dimension name and field name). We also have to indicate that the facets will be hierarchical. Once that is set, we can use FacetField with the dimension name and the full hierarchy of values for the facet. Finally, we add the document to the main index (via the writer object), but we also need to pass the taxonomy writer object so that the sidecar index is updated as well.

Here is some code to retrieve those facets:

For each matching document, the ID of the facet value is retrieved (via the DocValues). Lucene computes the count for each unique facet value by counting how many documents are assigned to each ID. After that, it can fetch the actual facet values from the sidecar index using those IDs.

In the last example, we did not specify a path, so all facets for all paths are returned (including all child paths). But we could restrict the request to a more specific path to get only the facets underneath it, for example “/home/mike/work”:

This is where the hierarchical aspect of the taxonomy facets gets interesting. Because of the relations kept between the facets in the sidecar index, Lucene is able to count the documents for the facets at different levels in the hierarchy.
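One way to picture counting at several levels is to walk up the parent links for every matching document. The plain-Java sketch below does exactly that, using hypothetical arrays for the parent links and the per-document IDs; Lucene may well organize this computation differently.

```java
// Hypothetical model of hierarchical facet counting; not Lucene code.
final class HierarchicalCountSketch {

    // parentById[id] is the parent's ID, or -1 for the root.
    // idByDoc[docId] is the facet ID stored for that document.
    static int[] countWithAncestors(int[] parentById, int[] idByDoc, int[] matchingDocs) {
        int[] counts = new int[parentById.length];
        for (int docId : matchingDocs) {
            // Credit the document's own value and every ancestor above it.
            for (int id = idByDoc[docId]; id != -1; id = parentById[id]) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // IDs: 0 = root, 1 = /home, 2 = /home/mike, 3 = /home/mike/work, 4 = /home/anna
        String[] paths = {"/", "/home", "/home/mike", "/home/mike/work", "/home/anna"};
        int[] parentById = {-1, 0, 1, 2, 1};
        int[] idByDoc = {3, 3, 2, 4};
        int[] counts = countWithAncestors(parentById, idByDoc, new int[] {0, 1, 2, 3});
        for (int id = 0; id < counts.length; id++) {
            System.out.println(paths[id] + ": " + counts[id]);
        }
    }
}
```

With this data, a document filed under “/home/mike/work” also contributes to the counts of “/home/mike” and “/home”, which is what makes drilling down through the levels possible.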

Again, for more code examples of taxonomy facets, see MultiCategoryListsFacetsExample.java in the Lucene sources.


So we’ve seen that Lucene offers facet implementations out of the box. A lot of interesting features can be built on top of them! For more info, refer to the Lucene sources and javadoc.


  • fm

    LabelAndValue lv = facets.labelValues[i];

    it is wrong here.

    • Pascal Dimassimo

      Thanks for spotting this! It should be fixed now.

  • Andreas

    I use a wildcard query for FacetsCollector.search. Looking at the documents returned by FacetsCollector.search, this works fine. However, when it comes to facets.getTopChildren, I also get results that do not match the wildcard query passed to FacetsCollector.search.

    For example, if my wildcard is ‘da*’, FacetsCollector.search would return documents containing the author fields { “Daeniken”, “Dalmas”, “Daley” }
    but facets.getTopChildren would also return authors and values not matched by the previous query, like { “Daeniken (20)”, “Murakami (4)”, “Robertson (2)”, “Dalmas (1)”, “Daley (1)” }

    What am I missing?
    By the way: basically I’m using your string facet example. The only additions that I made are having multiple facets fields and allowing multivalued for each of the facets.

    • Pascal Dimassimo

      Are you querying on the author field? Did you create another indexed field for this purpose? I don’t think you can search on the indexed field that is created by Lucene with the DocValues. You should have another “normal” indexed field for that.

      • Andreas

        Thanks for your reply Pascal!

        I do use another non-DocValue indexed field for the query. It does indeed not work with DocValue indexed fields.
        Here is what I’m doing when I create the index:

        doc.add(new StringField("author", author, Field.Store.YES));
        doc.add(new SortedSetDocValuesFacetField("author_dim", author));

        FacetsConfig config = new FacetsConfig();
        config.setMultiValued("author_dim", true);

        and this is my query:

        new WildcardQuery(new Term("author", "Da*"));

        • Pascal Dimassimo

          I see that you built your query yourself (new WildcardQuery(new Term(…). When doing that, you need to make sure that the terms you are passing match the indexed terms. So if you used an analyzer for your “author” field that changed all the terms to lowercase (like the StandardAnalyzer), you must make sure to use lowercase terms when querying (new Term("author", "da*")). An easy way to take care of this is to use Lucene’s QueryParser with the same analyzer that you used to index, and build your query using that QueryParser.

          • Andreas

            Thanks Pascal,

            the searched and indexed terms are matching. I do not use any Analyzer for this field and FacetsCollector.search does return the items that I am searching for.
            So I would assume SortedSetDocValuesFacetCounts would only make use of the items available in the Collector (collected by FacetsCollector.search). But it doesn’t.

          • Pascal Dimassimo

            I don’t understand what’s happening on your side. Using this code:

            WildcardQuery query = new WildcardQuery(new Term("author", "da*"));

            SortedSetDocValuesReaderState state =
                new DefaultSortedSetDocValuesReaderState(reader, "author_facet");

            FacetsCollector fc = new FacetsCollector();
            FacetsCollector.search(searcher, query, 10, fc);
            Facets facets = new SortedSetDocValuesFacetCounts(state, fc);
            FacetResult result = facets.getTopChildren(10, "author_dim");

            The result only contains facets starting with “da”. If I instead use “dal*” in the wildcard query, I only get facets starting with “dal”.

  • Hardy Ferentschik

    Hi Pascal, very nice post. I in fact implemented dynamic faceting the way you described and it works nicely. One thing I am trying to figure out, though, is if and how multi-valued numeric faceting is possible. For strings one can use
    SortedSetDocValuesFacetField when indexing the values and SortedSetDocValuesFacetCounts
    for the actual facet query. This way I can add multiple values for the same field. However, there seems to be no equivalent for numbers.

    NumericDocValuesField in combination with LongRangeFacetCounts works for single value fields. There is a SortedNumericDocValuesField for indexing, but there seems to be no matching facet counter.
    Any ideas? I posted a question on the Lucene forum – http://mail-archives.apache.org/mod_mbox/lucene-java-user/201507.mbox/browser, but did not get an answer so far.

    • Pascal Dimassimo

      From what I can see in the source code, it is not implemented. I can’t think of a reason why it couldn’t be implemented, but I might be wrong. Maybe you could bump your question on the mailing list to check again if someone has any idea about this?

      Sorry for the late response.

      • fere0010

        Ok, thanks. As you say, it feels like there is just something missing. It should be doable, unless the fact that numbers are encoded differently changes things. I’ll try to “bump” my question. Maybe even ask on the dev mailing list instead.