Norconex Importer 2.4.0 released

Posted on November 2, 2015 in Latest Releases

Norconex is proud to release version 2.4.0 of its Norconex Importer open-source product.  In addition to the usual bug fixes and stability enhancements, this release provides more possibilities for parsing and enriching your documents.  Most significantly, Importer 2.4.0 allows for scripting and DOM navigation.  Keep reading for more details and usage samples.


Scripting

It has always been possible to extend the importer with your own document processing logic. Now you can also inject that logic into the importer through configuration, using a scripting language. The following new handlers let you manipulate documents with scripts: ScriptFilter, ScriptTagger, and ScriptTransformer.

These classes use the “JavaScript” script engine, which is already present as part of your Java installation. In Oracle Java 6 and 7, that engine is based on Mozilla Rhino; Java 8 replaces it with Nashorn. You can find extensive JavaScript documentation on the Mozilla Rhino site.

Java developers can extend the importer to add support for other scripting languages. The new classes rely on the JSR 223 API, which lets you plug in any compatible script engine to support your favorite scripting language.
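To give a feel for what JSR 223 provides, here is a minimal sketch that uses only the JDK's javax.script package. It is not the importer's own code: the class and method names (ScriptEngineDemo, availableEngineNames, evalWithValue) and the bound variable name "value" are made up for illustration. It lists the script engines registered on the current JVM, then evaluates a short script against a value bound from Java. Some JVMs do not bundle a JavaScript engine, in which case the lookup returns null and the sketch simply reports that.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical demo class; not part of Norconex Importer.
class ScriptEngineDemo {

    /** Collects every name under which an installed JSR 223 engine is registered. */
    static List<String> availableEngineNames() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            names.addAll(factory.getNames());
        }
        return names;
    }

    /**
     * Evaluates a script with one variable ("value") bound from Java.
     * Returns null when no engine is registered under the given name.
     */
    static Object evalWithValue(String engineName, String script, String value) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return null; // engine not installed on this JVM
        }
        engine.put("value", value); // expose the Java value to the script
        try {
            return engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed: " + script, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Installed engines: " + availableEngineNames());
        Object result = evalWithValue("JavaScript", "value.toUpperCase()", "hello");
        System.out.println(result == null ? "No JavaScript engine found" : result);
    }
}
```

Supporting another language then comes down to putting a JSR 223 compliant engine on the classpath and asking for it by name.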


DOM navigation

It is now possible to reference elements of an HTML or XML document using friendly CSS or jQuery-like syntax to navigate its Document Object Model (DOM). The jsoup parser is used to load document content into a DOM tree.

The new DOMContentFilter can be used to reject documents containing a specific HTML/XML path or element. The DOMSplitter can be used to break HTML/XML with “list” elements into different documents. Finally, the DOMTagger allows you to extract specific HTML/XML tag values or attributes and store them in your own fields (e.g., extract <h1> tags into a “title” field).
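The idea behind extracting values from a DOM tree can be sketched in a few lines of Java. To stay self-contained, this sketch does not use jsoup or the importer's handlers. It swaps in the JDK's built-in XML parser and XPath expressions in place of CSS selectors, so it only handles well-formed markup. The class and method names (DomExtractDemo, extract) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses JDK XPath as a stand-in for jsoup CSS selectors.
class DomExtractDemo {

    /** Returns the trimmed text of every node matched by the XPath expression. */
    static List<String> extract(String xml, String xpath) {
        try {
            // Load the markup into a DOM tree (requires well-formed XML/XHTML).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Select the matching nodes.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Release notes</h1><p>Details</p></body></html>";
        // Similar in spirit to storing <h1> values in a "title" field.
        System.out.println("title = " + extract(page, "//h1")); // prints title = [Release notes]
    }
}
```

Selecting repeated elements such as "//li" returns one value per list item, which is roughly the notion the DOMSplitter builds on when it turns each matching element into its own document.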


Other features

This release features several other helpful and interesting changes and additions.  For instance, CharacterCaseTagger can now be used to adjust the character case of field names (in addition to values). A few additional file formats are also supported.  For a complete list of changes, see the release notes.


Pascal Essiembre was a successful Enterprise Application Developer for several years before founding Norconex in 2007, and he remains its president to this day. He has been responsible for several successful Norconex enterprise search projects across North America. He also heads the Product Division of Norconex and leads Norconex Open-Source initiatives.