Norconex HTTP Collector 2.8.0 released

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0.  This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

 

Extract a “Featured Image” from web pages

[ezcol_1half]

In addition to taking screenshots of webpages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (first one encountered matching a minimum site or a given pattern). You also have the option to store the image on file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions) or simply store a reference to it.

[/ezcol_1half]

[ezcol_1half_end]

<preImportProcessors>
  <processor class="com.norconex.collector.http.processor.impl.FeaturedImageProcessor">
    <minDimensions>300x400</minDimensions>
    <scaleDimensions>50</scaleDimensions>
    <imageFormat>jpg</imageFormat>
    <scaleQuality>max</scaleQuality>  	
    <storage>inline</storage>
  </processor>
</preImportProcessors>

[/ezcol_1half_end]

Limit link extraction to specific page portions

[ezcol_1half]

The GenericLinkExtractor now makes it possible to only extract links to be followed found within one or more specific sections of a web page. For instance, you may want to only extract links found in navigation menus and not those found in content areas in case the links usually point to other sites you do not want to crawl.

[/ezcol_1half]

[ezcol_1half_end]

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
 
  <extractBetween>
    <start><![CDATA[<!-- BEGIN NAV LINKS -->]]></start>
    <end><![CDATA[<!-- END NAV LINKS -->]]></end>
  </extractBetween>
 
  <noExtractBetween>
    <start><![CDATA[<!-- BEGIN EXTERNAL SITES -->]]></start>
    <end><![CDATA[<!-- END EXTERNAL SITES -->]]></end>
  </noExtractBetween>
 
</extractor>

[/ezcol_1half_end]

Truncate long field values

[ezcol_1half]

The new TruncateTagger offers the ability to truncate long values and the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) and field length limitations.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.TruncateTagger"
    fromField="mySuperLongField"
    maxLength="500"
    toField="myTruncatedField"
    overwrite="true"
    appendHash="true"
    suffix="!" />

[/ezcol_1half_end]

Add metadata to a document using an external application

[ezcol_1half]

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (json, xml, properties) or free-form combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTranformer, which has also been updated to support metadata.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <command>
    /app/addressExtractor ${INPUT} ${INPUT_META} ${REFERENCE}
  </command>
  <metadata inputFormat="json">
    <pattern field="address" valueGroup="1">
      ^address=(.*)$
    </pattern>
  </metadata>
</tagger>

[/ezcol_1half_end]

Other improvements

This release includes many more new features and enhancements:

  • To create a document checksum, you can now combine metadata with content.
  • The TextPatternTagger can now extract field names dynamically in addition to values.
  • The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
  • There are new configuration options on the GenericHttpClientFactory:
    • “authFormParams” to add arbitrary parameters to authentication forms.
    • “authPreemptive” to use preemptive authentication with BASIC authentication.
  • The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that do not meet product limitations.

For the complete list of changes, refer to these product release notes:

Useful links

Pascal Essiembre has been a successful Enterprise Application Developer for several years before founding Norconex in 2007 and remaining its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. Pascal is also heading the Product Division of Norconex and leading Norconex Open-Source initiatives.