Search Relevancy: Art or Science? (Part 2)

In Search Relevancy: Art or Science? (Part 1) I discussed several problem areas relating to the difficulties of improving search result relevancy. In this post I’ll discuss some of the different ways these problem areas can be addressed.

Relevancy Tuning Options

There are many ways to improve search result relevancy within your enterprise search environment, thereby greatly increasing the odds of your users finding the documents they’re looking for.

Advanced Search

Providing users with advanced search capabilities through a dedicated advanced search area of your search system can greatly improve search relevancy. We recommend building an advanced search feature with care, making sure not to sacrifice good search result relevancy or guided navigation for its implementation. Advanced search capabilities can be useful in many areas. For example, instead of having a free text search where some of the terms should be stemmed and some not, an advanced search could offer the ability to search on specific fields mapped to content metadata fields. Think about “code” based content categories (product code, ISBN, etc.). While your standard search query box should accommodate these codes, it can be challenging to implement a search that handles codes mixed in with free form text. A dedicated search field on an advanced search screen lets you perform precise matching. Free text searching across meta-data fields such as title or author from an advanced search screen can also provide more relevant results if the user chooses to search only on those areas of information. Make sure you don’t go overboard on the number of specific fields you add to an advanced search, as it can quickly become intimidating to users. A good rule of thumb: if something can easily be represented through guided navigation (e.g. facets), it probably doesn’t make sense to include it in your advanced search.

Data Clean Up

While not always practical or even possible within an enterprise environment, having clean data can go a long way towards improving search result relevancy. The moment you start allowing field based searching, or start adjusting search result weighting based on which fields were matched, it becomes quite important to make sure your indexed content fields have been populated correctly. For a CMS, a good practice to follow is to force some field data-entry to occur before a page can be approved. Having clean data also means indexing only content that is relevant. In search systems that have indexed web based content, it can be quite common to see login pages, status pages, printer-friendly pages, etc. in search results. Great care should be put into making sure these types of pages are eliminated from your index. It is a good idea to prevent these types of pages from being indexed in the first place, through robot restrictions, crawler rules, etc. Short of being able to clean all of your data to ensure consistent quality, you might want to consider adopting new publishing practices to help keep your data clean. If one of the reasons your organization purchased an enterprise search engine is because it’s impossible to extract logical information out of your free form documents, then this option is likely not for you.
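As a rough illustration of keeping unwanted pages out of the index, the sketch below filters URLs against a list of exclusion patterns before they are indexed. The patterns, the `should_index` name, and the example URLs are all assumptions made for this example; most crawlers offer their own configuration for this kind of rule.

```python
import re

# Hypothetical exclusion patterns; adjust them to the URL scheme of your own site.
EXCLUDE_PATTERNS = [
    re.compile(r"/login\b"),        # login pages
    re.compile(r"/status\b"),       # status pages
    re.compile(r"[?&]print=1\b"),   # printer-friendly variants
]


def should_index(url):
    """Return True when the URL matches none of the exclusion patterns."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


urls = [
    "https://example.com/products/42",
    "https://example.com/login",
    "https://example.com/article?id=3&print=1",
]
# Keep only the URLs that are worth indexing.
indexable = [url for url in urls if should_index(url)]
```

Filtering before indexing is usually cheaper than deleting unwanted documents afterwards, but the same check could also be used to audit an existing index.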

Data Decoration

When data clean up has reached its limit, there are other options to consider that can help ensure high quality data in your search index. One of these options involves “decorating” your documents. There are many ways document decoration can be performed, but it is usually an automated task performed at indexing time. A typical approach is to analyze incoming document content and try to match it against a set of predefined rules (regular expressions, query language, etc.). Each of these rules has a key/value pair associated with it. When a rule matches, its key/value pair gets added to the incoming document as a new searchable document field. This is a great way to fill in holes when some of your documents are missing specific meta-data fields. Data decoration can help add consistency to the way your documents are represented, and it lets you offer more field based searching and facet options. Both can significantly increase your search result relevancy.
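The rule-matching approach described above could be sketched as follows. This is a minimal example under stated assumptions: the rules, the field names (`body`, `content_type`, `department`), and the `decorate` function are hypothetical, and a real search platform would typically run this kind of logic inside its indexing pipeline.

```python
import re

# Hypothetical rules: (pattern, field name, field value).
RULES = [
    (re.compile(r"\bISBN\b", re.IGNORECASE), "content_type", "book"),
    (re.compile(r"\binvoice\b", re.IGNORECASE), "department", "finance"),
]


def decorate(doc, rules=RULES):
    """Return a copy of doc with one field added per matching rule."""
    decorated = dict(doc)
    body = doc.get("body", "")
    for pattern, key, value in rules:
        if pattern.search(body):
            # Fill in holes only: never overwrite metadata the document already has.
            decorated.setdefault(key, value)
    return decorated


doc = {"id": "42", "body": "Please find invoice #1001 attached."}
decorated_doc = decorate(doc)
```

Using `setdefault` reflects one possible design choice, which is to trust existing metadata over inferred values. If your inferred values are more reliable than what authors enter, you may prefer to overwrite instead.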

Data Monitoring

Within live search systems, data grows and evolves. Parts of your content may one day disappear or change drastically, and new content will get added in. In an effort to ensure that your search index stays current and relevant, you should consider continuously monitoring what goes in and out of your index. Orphan documents should be deleted from your index. Meta data fields on modified content may need to be indexed differently. New content may not be automatically picked up and may need to be explicitly added. Data monitoring will lead to frequent fine-tuning of your data repository “connectors” or “crawlers”. Not keeping an eye on the health of your search index will lead to the gradual decrease of its relevance, and ultimately provide a bad search experience for your users.
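One simple way to spot drift between a repository and its index is to compare document identifiers on both sides. The sketch below assumes you can list the IDs from your source and from your index; the `index_drift` function and the sample IDs are hypothetical.

```python
def index_drift(source_ids, indexed_ids):
    """Compare source and index IDs and report what is out of sync."""
    source, indexed = set(source_ids), set(indexed_ids)
    return {
        # In the index but no longer in the source: candidates for deletion.
        "orphans": sorted(indexed - source),
        # In the source but not in the index: candidates for (re)indexing.
        "missing": sorted(source - indexed),
    }


report = index_drift(
    source_ids=["doc-1", "doc-2", "doc-4"],
    indexed_ids=["doc-1", "doc-2", "doc-3"],
)
```

Running a comparison like this on a schedule gives you an early signal that a connector or crawler needs tuning. It does not detect modified content, which would require comparing timestamps or checksums as well.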

Data Normalization

When mixing multiple data sources, resist the temptation to turn every meta-data field found in your content into a field in your search index. Instead of having complex queries searching on multiple fields every time, consider mapping your remote data source fields into normalized search fields. This can improve relevancy when you deal with field weighting. Data normalization can also help provide a constant set of values for differentiating your documents. For example, if you want to offer a facet on category names, data normalization can help ensure that the category name stored in your search index is always spelled the same way for the same category. This task is similar in concept to data clean up, but is specifically targeted at your search index.
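To make the mapping idea concrete, here is a small sketch that maps fields from two imaginary sources into shared index fields and gives category names one canonical spelling. The source names (`cms`, `wiki`), field names, and alias table are assumptions for illustration only.

```python
# Hypothetical mapping from each source's field names to normalized index fields.
FIELD_MAP = {
    "cms": {"headline": "title", "writer": "author", "section": "category"},
    "wiki": {"page_title": "title", "last_editor": "author", "topic": "category"},
}

# Hypothetical aliases so each category is always spelled the same way.
CATEGORY_ALIASES = {
    "hr": "Human Resources",
    "human resources": "Human Resources",
}


def normalize(source, record):
    """Map a source record to normalized fields, dropping unmapped fields."""
    mapping = FIELD_MAP[source]
    doc = {mapping[field]: value for field, value in record.items() if field in mapping}
    if "category" in doc:
        cleaned = doc["category"].strip()
        doc["category"] = CATEGORY_ALIASES.get(cleaned.lower(), cleaned)
    return doc


cms_doc = normalize("cms", {"headline": "Benefits", "section": " HR ", "internal_id": 7})
wiki_doc = normalize("wiki", {"page_title": "Benefits FAQ", "topic": "human resources"})
```

Both documents end up with the same `category` value, so a facet on that field would group them together.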

Facets

Facets (here we’re including guided navigation, parametric refinement, taxonomy based searching, etc.) are one of the most effective ways for a search user to quickly find what they’re looking for. Facets keep users from having to enter precise search terms, or from having to use a set of intimidating advanced search fields that, when combined, can lead to no search results. Every important aspect of your content should be broken down into facets. Examples could include category, publication date ranges, authors, department, etc. Presenting these facets after a simple search shows your users where the results were found on your site (with an associated document count), and naturally helps direct them to exactly what they’re looking for. Facets can be generated in numerous ways, such as from related concepts extracted from search results, or hierarchically, based on category positions within a taxonomy.
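The document counts shown next to facet values can be illustrated with a short sketch. Search engines normally compute these counts for you; the code below only shows the idea, and the `facet_counts` function and sample results are hypothetical.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)


results = [
    {"title": "Vacation policy", "category": "Human Resources"},
    {"title": "Expense reports", "category": "Finance"},
    {"title": "Parental leave", "category": "Human Resources"},
    {"title": "Untitled draft"},  # no category: excluded from the facet
]
category_facet = facet_counts(results, "category")
# most_common() orders facet values by document count, largest first.
ordered = category_facet.most_common()
```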

Weight Adjustments

Most modern search engines support the idea of using “weight” values to influence relevancy. Here are a few examples of weight based adjustments supported by some search engines.

Field Weight: If you have confidence in the quality of your meta-data fields, this can make a significant difference to relevance. For example, on a website it is common to see search engines give more weight to documents where matches occur in “title” or “keyword” fields. You can extend this to other fields that are relevant to your organization.
Term Weight: If you have a set of key terms you know best represent some of your documents, you may give more weight to these terms when they are provided in a user’s search query. When using synonyms, it may be a good idea to boost the term variation you know is more representative. Alternatively, you may want to decrease the weight of terms you know might be found in almost all documents in the index.
Database Weight: Let’s say your organization provides product information, product news, and promotion details through its search. These three types of information are stored and managed in your index as separate databases or collections. You want your users to be able to search across all databases, but you can boost the database weight so that promotion results are weighted higher, followed by news, and finally product information. Boosting a document’s relevancy score based on which database the document is found in can greatly affect relevancy.
Document Weight: Sometimes you just need to give an arbitrary weight to a document. For example, let’s consider a product home page. By attributing a score to a document, either via decoration, a set of rules, or your content publishing system, you can influence where your documents should appear in a result list.
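The following sketch shows how field, database, and document weights might combine into a single score. It is a deliberately naive term-count model with made-up weight values, not how any particular search engine scores documents; all names here are assumptions.

```python
# Hypothetical weights; real values would come from testing against your own content.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}
COLLECTION_WEIGHTS = {"promotions": 1.5, "news": 1.2, "products": 1.0}


def score(doc, query_terms):
    """Weighted term count, scaled by collection weight and a per-document boost."""
    terms = [term.lower() for term in query_terms]
    base = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, "").lower().split()
        # Field weight: a match in "title" counts more than a match in "body".
        base += weight * sum(tokens.count(term) for term in terms)
    # Database weight: favor some collections over others.
    collection_weight = COLLECTION_WEIGHTS.get(doc.get("collection"), 1.0)
    # Document weight: an arbitrary boost stored on the document itself.
    return base * collection_weight * doc.get("boost", 1.0)


promo = {"title": "Red widget", "body": "a widget for sale", "collection": "promotions"}
product = {"title": "Red widget", "body": "a widget for sale", "collection": "products"}
ranked = sorted([product, promo], key=lambda d: score(d, ["widget"]), reverse=True)
```

With identical text in both documents, the promotion ranks first purely because of its collection weight, which mirrors the product, news, and promotion example above.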

Multi-Search

If your search engine uses distinct collections of a different nature, you may want to revisit how you present results to your users. Let’s say you have a People database and a Products database. You may want to present content from both of these databases to your search users while clearly showing that the results are coming from two separate information sources. You could consider having two results columns, having multiple results tabs, or showing only the top 2 or 3 hits of each result set and offering to expand the search to a specific topic of interest. Considering the above example, a search on a name could bring back a list of individuals and a list of products with that name, with the results listed on top of each other or beside each other, alongside a message stating: “X documents were found in this database, click here to see the complete list”. People only interested in results from one of the databases may get the result they want quickly, without having had to choose up front which database they wanted to target.
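A possible way to build the “top hits per database” view is to group hits by collection and keep the total alongside the first few results. The `multi_search_preview` function and the hit structure (`collection`, `score`, `title`) are assumptions for this sketch.

```python
def multi_search_preview(hits, top_n=3):
    """Group hits by collection, keeping the total count and the top_n best hits."""
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit["collection"], []).append(hit)
    preview = {}
    for collection, items in grouped.items():
        ranked = sorted(items, key=lambda h: h["score"], reverse=True)
        # "total" supports the "X documents were found in this database" message.
        preview[collection] = {"total": len(ranked), "top": ranked[:top_n]}
    return preview


hits = [
    {"collection": "people", "title": "Jordan Baker", "score": 0.9},
    {"collection": "products", "title": "Baker oven mitt", "score": 0.7},
    {"collection": "people", "title": "Sam Baker", "score": 0.8},
    {"collection": "people", "title": "Alex Bakerfield", "score": 0.4},
]
preview = multi_search_preview(hits, top_n=2)
```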

Thesaurus

By thesaurus I mean any kind of variation on user provided terms (synonyms, token ring, acronyms, etc). Providing term variations can effectively assist in helping your users find what they are looking for. If your organization uses controlled vocabulary throughout your published content, a good approach to consider is replacing user terms with proper terms, for searching purposes. If the vocabulary used is open, you may want to add variations to the user terms. Depending on your logic for finding equivalent terms, and how many terms get used, this could generate very long queries that affect search performance. Short of applying thesaurus functionality to a limited set of terms, you may want to consider a document decoration approach to add alternate terms as part of your documents, as they’re indexed. This way user search queries would no longer have to be modified to match term variations. Keep in mind that although a document decoration approach may prove to be more efficient, it might be more difficult to apply thesaurus list changes quickly. When considering thesaurus functionality, it’s a good idea to mix in the use of search statistics to help decide which synonyms could really help your users.
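Query-time term expansion, with a cap to keep queries from growing too long, might look like the sketch below. The synonym table, the `expand_query` function, and the `max_terms` limit are hypothetical choices for this example.

```python
# Hypothetical thesaurus: each term maps to the variations worth searching for.
SYNONYMS = {
    "laptop": ["notebook"],
    "tv": ["television"],
}


def expand_query(terms, synonyms=SYNONYMS, max_terms=10):
    """Add known variations to the user's terms, up to max_terms in total."""
    # Keep the user's own terms first, de-duplicated, in their original order.
    expanded = list(dict.fromkeys(terms))
    for term in terms:
        for variant in synonyms.get(term.lower(), []):
            if len(expanded) >= max_terms:
                return expanded  # cap reached: stop adding variations
            if variant not in expanded:
                expanded.append(variant)
    return expanded


query = expand_query(["cheap", "laptop"])
capped = expand_query(["tv", "laptop"], max_terms=3)
```

The same table could instead be applied at indexing time through document decoration, which trades shorter queries for slower rollout of thesaurus changes, as noted above.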

User Profiles

With the help of user authentication, or via cookies, having a way to identify your users also gives you a way to identify their preferences. Keeping track of the concepts or categories a user selects often could be used to influence his or her future searches. Based on a user’s past search experience, you may decide to make pre-selections for that user (with the option to undo these pre-selections). Another way to speed up information retrieval for a user is to let that user save queries for future use (or to keep a user accessible search history). While these options may offer little value the first few times a search is used, they will often show users that your search system is getting better over time.

Other Search Engine Specific Tuning Options

All search vendors offer specific ways to deal with relevancy issues. It is always a good idea to read up on the relevancy tuning options offered by your vendor.

Conclusion

While the above list of relevancy tuning options may seem large, it is by no means exhaustive. Which relevancy tuning tips and techniques will work best for you? That’s the tricky part, and it is where some may consider relevancy tuning more of an art than a science. A mix of the above, along with very good search analytics, can take you a long way towards improving end user search experiences. One thing to keep in mind is that good search result relevancy is part of the solution, but not the solution itself. There are several other complementary search options that can help make your enterprise search shine and keep your users coming back happily.

Pascal Essiembre

Pascal Essiembre was a successful Enterprise Application Developer for several years before founding Norconex in 2007, and he remains its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. He also heads the Product Division of Norconex and leads Norconex Open-Source initiatives.