
Optical character recognition (OCR), content translation, title generation, and detection and text extraction from more file formats are among the new features now part of your favorite crawlers: Norconex HTTP Collector 2.1.0 and Norconex Filesystem Collector 2.1.0. Both are available now and can be downloaded for free. Both ship with and use the latest version of the Norconex Importer module, which is largely responsible for many of these new features.

For more details and usage examples, check this article.

These two Collector releases also include bug fixes and stability improvements. We recommend that existing users upgrade.

Get your copy

Download Norconex HTTP Collector

Download Norconex Filesystem Collector

GATINEAU, QC, CANADA – Monday, December 1, 2014 – Norconex announces the launch of its Google Search Appliance (GSA) Committer module for its Norconex Collectors Crawler Suite. Enterprise search developers and enthusiasts now have a flexible and extensible option for feeding documents to their GSA infrastructure. GSA is a target repository for crawled documents released by Norconex HTTP Collector, Norconex Filesystem Collector, and any future Collector released by Norconex. These Collectors can reside on any server (like remote filesystems) and send discovered documents across the network to a GSA installation. The GSA Committer is the latest addition to the growing list of Committers already available to Norconex Collector users: Apache Solr, Elasticsearch, HP IDOL, and Lucidworks.

“The increasing popularity of our universal crawlers motivates us to provide support for more search engines. Search engines come and go in an organization, but your investment in your crawling infrastructure can be protected by having re-usable crawler setups that can outlast any search engine installation,” said Norconex President Pascal Essiembre.

GSA Committer Availability

GSA Committer is part of Norconex’s commitment to delivering quality open-source products backed by community or commercial support. GSA Committer is available for immediate download at /collectors/committer-gsa.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help process and analyze structured and unstructured data.

For more information on GSA Committer:

GSA Committer Website: /collectors/committer-gsa
Norconex Collectors: /collectors
Email: info@norconex.com

Norconex just released major upgrades to all its Norconex Collectors and related projects. That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available Committers (Solr, Elasticsearch, HP IDOL, etc.), were upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website. At a quick glance you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

  • Can now split a document into multiple documents.
  • Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).
  • Language detection (50+ languages).
  • Parsing and formatting of dates from/to any format.
  • Character case modifiers.
  • Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc.).
  • Can now supply a “seed file” for listing start URLs or start paths to your crawler.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used.  This reduces I/O and improves performance.
  • New event model where listeners can listen for any type of crawler events.
  • Can now ignore parsing of specific content types.
  • Can filter documents based on arbitrary regular expressions performed on the document content.
  • Enhanced debugging options, where you can print out the content of specific fields as they are being processed.
  • HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).
  • More…
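
To illustrate the memory-then-filesystem buffering idea from the list above, here is a minimal Python sketch. It is not Norconex code and does not use any Norconex API: the function name `buffer_content` and the `max_memory_bytes` parameter are hypothetical, and the example simply relies on the Python standard library's `tempfile.SpooledTemporaryFile`, which keeps data in memory until a size threshold is exceeded and then transparently moves it to a temporary file on disk.

```python
import tempfile


def buffer_content(chunks, max_memory_bytes=1024):
    """Collect document content in a buffer that stays in memory
    until it grows past max_memory_bytes, then spills to disk.

    `chunks` is any iterable of bytes objects (hypothetical input,
    e.g. pieces of a downloaded document).
    """
    # SpooledTemporaryFile holds data in memory and rolls over to a
    # real temporary file once its size exceeds max_size.
    buf = tempfile.SpooledTemporaryFile(max_size=max_memory_bytes)
    for chunk in chunks:
        buf.write(chunk)
    # Rewind so the caller can read the content from the start.
    buf.seek(0)
    return buf


# A small document never touches the filesystem.
small = buffer_content([b"hello ", b"world"], max_memory_bytes=1024)
small_bytes = small.read()

# A larger document exceeds the threshold and is moved to disk,
# but is read back through the same file-like interface.
large = buffer_content([b"x" * 600, b"y" * 600], max_memory_bytes=1024)
large_bytes = large.read()
```

The caller reads both buffers the same way, so the threshold can be tuned without changing the surrounding code. Small documents avoid disk I/O, and large ones avoid exhausting memory.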

Another significant change is that all Norconex open-source projects are now licensed under the Apache License 2.0. We hope this will facilitate adoption with third-party commercial offerings.

It is important to note that the 2.0.0 versions are not compatible with their previous 1.x versions. The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0. Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

GATINEAU, QC, CANADA – Thursday, August 25, 2014 – Norconex is announcing the launch of Norconex Filesystem Collector, providing organizations with a free “universal” filesystem crawler. The Norconex Filesystem Collector enables document indexing into target repositories of choice, such as enterprise search engines.

Following the success of the Norconex HTTP Collector web crawler, Norconex Filesystem Collector is the second open-source crawler contribution to the Norconex “Collector” suite. Norconex believes this crawler allows customers to adopt a full-featured, enterprise-class local or remote filesystem crawling solution that outlasts their enterprise search solution or other data repository.

“This not only facilitates any future migrations but also allows customer addition of their own ETL logic into a very flexible crawling architecture, whether using Autonomy, Solr/LucidWorks, ElasticSearch, or any other data repository,” said Norconex President Pascal Essiembre.

Norconex Filesystem Collector Availability

Norconex Filesystem Collector is part of Norconex’s commitment to deliver quality open-source products, backed by community or commercial support. Norconex Filesystem Collector is available for immediate download at /collectors/collector-filesystem/download.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help with the processing and analyzing of structured and unstructured data.

For more information on Norconex Filesystem Collector:

Website: /collectors/collector-filesystem

Email: info@norconex.com

###