Norconex Importer 2.2.0 released

This release of Norconex Importer brings many fixes, increased stability, and nice new features. The following highlights some of the additions with XML configuration or Java code samples.

Retrieve a document Length

[ezcol_1half]

Thanks to the new DocumentLengthTagger, you can now store a document byte length in a metadata field of your choice. The length can be obtained at any document processing stage.  For instance, it can be obtained before any transformation took place, or after it was parsed.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DocumentLengthTagger"
  field="doc-length" overwrite="true" >
</tagger>

 [/ezcol_1half_end]

Add the current date to a document

[ezcol_1half]

The new CurrentDateTagger allows to add the current date to a metadata field and date format of your choice. This can be useful to indicate when a document was actually processed by the Importer.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
  field="date-imported" format="yyyy-MM-dd" />

 [/ezcol_1half_end]

Filter documents on numeric or date range

[ezcol_1half]

NumericMetadataFilter and DateMetadataFilter now allow you to filter documents based on metadata field numeric or date values, respectively. You can define both closed ranges and open-ended ranges.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Numeric range filter -->
<filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
      onMatch="include" field="age" >
  <condition operator="ge" number="20" />
  <condition operator="lt" number="30" />
</filter>

<!-- Date range filter -->
<filter class="com.norconex.importer.handler.filter.impl.DateMetadataFilter"
      onMatch="include" field="publish_date" >
  <condition operator="ge" date="TODAY-7" />
  <condition operator="lt" date="TODAY" />
</filter>

 [/ezcol_1half_end]

Use external parsers

[ezcol_1half]

Wrapping a Tika class of the same name, the new ExternalParser allows Java programmers to point to external command-line applications to parse documents. One example can be for using “pdftotext” to parse PDFs instead of the default PDF parser based on PDFBox, which is much slower (but does a better job overall).

[/ezcol_1half]

[ezcol_1half_end]

import java.util.Map;

import com.norconex.commons.lang.file.ContentType;
import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.ExternalParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {

    @Override
    protected Map<ContentType, IDocumentParser> createNamedParsers() {
        Map<ContentType, IDocumentParser> parsers = super.createNamedParsers();

        ExternalParser pdfParser = new ExternalParser();
        pdfParser.setCommand(
                // Replace this with your own executable path
                "C:\\Apps\\pdftotext.exe", 
                "-enc", "UTF-8", "-raw", "-q", "-eol", "unix",                 
                ExternalParser.INPUT_FILE_TOKEN, 
                ExternalParser.OUTPUT_FILE_TOKEN);
        parsers.put(ContentType.PDF, pdfParser);
        return parsers;
    }
}

  [/ezcol_1half_end]

Other improvements

There are more changes under the hood, like upgrading to Apache Tika 1.8, as well as the fixing of OutOfMemory errors and document parsing sometimes never returning. You can find the complete list of changes in the release notes.

Several of these improvements were made possible thanks to the great feedback of the open-source community. Keep doing so: you make a difference.

Useful links

 

Pascal Essiembre has been a successful Enterprise Application Developer for several years before founding Norconex in 2007 and remaining its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. Pascal is also heading the Product Division of Norconex and leading Norconex Open-Source initiatives.