Norconex HTTP and Filesystem Collector 2.7.0 released

Posted on April 26, 2017 by Pascal Essiembre in Latest Releases

Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector.  This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector.  For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages

The new alternative document fetcher, PhantomJSDocumentFetcher, makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of the web pages you crawl.
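
As a minimal sketch, here is what plugging that fetcher into a crawler section might look like; the exePath value is an assumption pointing to a local PhantomJS install, and option names are worth verifying against the PhantomJSDocumentFetcher documentation:

    <!-- Inside a <crawler> section: replace the default fetcher with PhantomJS. -->
    <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
      <!-- Path to your local PhantomJS executable (illustrative). -->
      <exePath>/usr/local/bin/phantomjs</exePath>
    </documentFetcher>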


More ways to extract links

This release introduces two new link extractors.  You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.
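
A sketch of how both extractors might be declared in a crawler section (class names are the expected ones for this release; the RegexLinkExtractor pattern configuration is omitted and should be taken from its documentation):

    <linkExtractors>
      <!-- Extract links found in RSS/Atom feeds. -->
      <extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor"/>
      <!-- Extract links matching regular expressions (nested pattern elements
           are documented on the class itself and omitted here). -->
      <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor"/>
    </linkExtractors>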


Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation

You no longer have to hunt for misconfigurations.  Schema-based XML configuration validation has been added, so you now get errors reported when the XML for any configuration option is invalid.  This validation can be triggered from the command line with the new -k (or --checkcfg) flag.
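
As a usage sketch, assuming the standard HTTP Collector launch script and an illustrative configuration path:

    # Check the configuration using the new flag (script name and path are illustrative).
    collector-http.sh -c /path/to/my-config.xml -k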


Enter durations in human-readable format

Having to convert durations to milliseconds is not the most user-friendly. Anywhere a duration is expected in your XML configuration, you can now use a human-readable representation (English only) as an alternative.
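
For instance, using the crawler delay as a hypothetical example of a duration-taking option, both of the following would be equivalent:

    <!-- Duration expressed in milliseconds. -->
    <delay default="5000"/>
    <!-- Same duration, human-readable. -->
    <delay default="5 seconds"/>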


Lua scripting language

Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer.  This gives you one more out-of-the-box scripting option besides JavaScript/ECMAScript.
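
As a sketch, a Lua-based ScriptTagger might look like the following in the Importer configuration; the metadata:addString call and field name are assumptions based on the scripting handlers' bindings and should be verified against the class documentation:

    <!-- Inside <importer> handlers: add a constant field using a Lua script. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger" engineName="lua">
      <script><![CDATA[
        -- "metadata" is the document metadata object exposed to the script.
        metadata:addString('crawled_by', {'norconex'});
      ]]></script>
    </tagger>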


Modify documents using an external application

With the new ExternalTransformer, you can now use an external application to perform document transformation.  This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.
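
A minimal sketch, assuming a made-up command-line tool named clean-html that reads an input file and writes a transformed output file; the ${INPUT} and ${OUTPUT} tokens are the placeholders the class is expected to replace with temporary file paths (verify against the ExternalTransformer documentation):

    <!-- Inside <importer> handlers: pipe each document through an external program. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
      <!-- "clean-html" is hypothetical; ${INPUT} and ${OUTPUT} become file paths. -->
      <command>/usr/local/bin/clean-html ${INPUT} ${OUTPUT}</command>
    </transformer>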


Combine document fields

The new MergeTagger can be used to combine multiple fields into one. The target field can be either multi-valued or single-valued, with values separated by the character of your choice.
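
A hypothetical sketch (the field names are made up, and the exact nested element and attribute names should be checked against the MergeTagger documentation):

    <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
      <!-- Combine "title" and "subtitle" into a single-valued "fullTitle" field. -->
      <merge toField="fullTitle" singleValue="true" singleValueSeparator=" - ">
        <fromFields>title, subtitle</fromFields>
      </merge>
    </tagger>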


New Committers

Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at crawl time, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers whose output can easily be consumed by your own processes: XMLFileCommitter and JSONFileCommitter. All available Committers can be found here.
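
For instance, pointing a crawler at the XML file Committer might look like this (the directory path is illustrative and the option name should be confirmed against the committer-core documentation):

    <!-- Inside a <crawler> section: write committed documents as XML files on disk. -->
    <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
      <directory>/path/to/committed-files</directory>
    </committer>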


More

Several additional features or changes can be found in the latest Collector releases.  Among them:

  • New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
  • New SubstringTransformer for truncating content.
  • New UUIDTagger for giving a unique id to each document.
  • CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize the beginning of a string, respectively.
  • ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
  • Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
  • Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesystem Collector).
  • New ICollectorLifeCycleListener interface that can be added to the collector configuration to be notified and take action when the collector starts and stops.
  • Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector); see the sketch after this list.
  • New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detecting them instead (HTTP Collector).
  • Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).
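
To illustrate the removeTrailingHash item above, enabling it might look like the following; the other normalization names shown are pre-existing ones included only for context, and all names should be verified against the GenericURLNormalizer documentation:

    <!-- Inside a <crawler> section: URL normalization rules, including the new one. -->
    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      <normalizations>
        removeFragment, lowerCaseSchemeHost, removeDefaultPort, removeTrailingHash
      </normalizations>
    </urlNormalizer>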

To get the complete list of changes, refer to the HTTP Collector release notes, Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.

Download

Pascal Essiembre was a successful enterprise application developer for several years before founding Norconex in 2007, and he remains its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. He also heads the Product Division of Norconex and leads Norconex open-source initiatives.


Comments

  • radomirml

    Hi Pascal, Congratulations on v2.7! I like the idea of JSONFileCommitter but have a question – sorry if I missed it in the docs… Is there support for pushing data from one committer (e.g. JSONFileCommitter data files) to another (e.g. Solr) without crawling and parsing files again?

    • Thank you @radomirml. No, there is no built-in way of sending one committer’s output to another, but maybe there is a way to do what you want. What would be the use case? You can use the MultiCommitter to send the data to two or more Committers at once if that’s what you are after. If you would rather use the JSONFileCommitter as a way to “back up” data and commit it again later, there is currently nothing for that. But you may be able to do so by using the Filesystem Collector with your Solr committer, with a custom document splitter (from the Importer module) to split the JSON file into individual records.

      • radomirml

        Thanks Pascal. I don’t have a use case at the moment but had one in the past. I was thinking along the lines of backing up crawled data. What would be even more useful, I think, is to use a local cache of a crawled site (assuming the original document was saved) to quickly re-parse and re-tag all documents without the need to crawl everything again.

        • You can actually save documents with the “keepDownloads” flag. If you can survive not having the original URL somehow, you can use the Filesystem Collector to re-crawl those at will. The importer section can be a shared include file between your HTTP and Filesystem Collectors. This can indeed be very useful, at least during the development/testing phase.

          If you think of an approach that would make things even more convenient, feel free to open a ticket on Github (feature request).