GATINEAU, QC, CANADA – Monday, December 1, 2014 – Norconex announces the launch of its Google Search Appliance (GSA) Committer module for its Norconex Collectors crawler suite. Enterprise search developers and enthusiasts now have a flexible and extensible option for feeding documents to their GSA infrastructure. GSA becomes a new target repository for documents crawled by Norconex HTTP Collector, Norconex Filesystem Collector, and any future Collector released by Norconex. These Collectors can reside on any server (including remote filesystems) and send discovered documents across the network to a GSA installation. The GSA Committer is the latest addition to the growing list of Committers already available to Norconex Collector users, alongside those for Apache Solr, Elasticsearch, HP IDOL, and Lucidworks.

“The increasing popularity of our universal crawlers motivates us to provide support for more search engines. Search engines come and go in an organization, but your investment in your crawling infrastructure can be protected by having re-usable crawler setups that can outlast any search engine installation,” said Norconex President Pascal Essiembre.

GSA Committer Availability

GSA Committer is part of Norconex’s commitment to delivering quality open-source products backed by community or commercial support. GSA Committer is available for immediate download at /collectors/committer-gsa.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help process and analyze structured and unstructured data.

For more information on GSA Committer:

GSA Committer Website: /collectors/committer-gsa
Norconex Collectors: /collectors
Email: info@norconex.com

During a recent client project, I was required to crawl several websites, each with its own specific requirements. For example, one of the websites required:

  • a meta tag's content to be used as a replacement for the actual page URL,
  • the header, footer, and any other repetitive content to be excluded from each page,
  • robots.txt to be ignored, since it is meant for external crawlers only (Google, Bing, etc.), and
  • the content to be indexed in LucidWorks.

LucidWorks' built-in web crawler is based on Aperture. It is great for basic web crawls, but it could not provide the more advanced features I needed. I had to pair LucidWorks with an external crawler that offered more advanced built-in capabilities and could be extended with new functionality.
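
To illustrate how requirements like these map onto an external crawler's configuration, here is a minimal sketch of a Norconex HTTP Collector XML setup covering two of the points above: ignoring robots.txt and stripping repetitive regions such as the header and footer. This is an assumption-laden sketch rather than the actual project configuration: the start URL and HTML markers are placeholders, the class name follows the Norconex Importer 2.x package layout, and the meta-tag URL replacement is only noted in a comment because it would likely call for custom code rather than plain configuration.

    <httpcollector id="client-project-collector">
      <crawlers>
        <crawler id="client-site-crawler">

          <!-- Placeholder start URL for the client website. -->
          <startURLs>
            <url>http://www.example.com/</url>
          </startURLs>

          <!-- Requirement: ignore robots.txt, which targets external crawlers only. -->
          <robotsTxt ignore="true" />

          <importer>
            <preParseHandlers>

              <!-- Requirement: exclude the header, footer, and other repetitive
                   content from each page. Start/end markers are placeholders. -->
              <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
                  inclusive="true">
                <stripBetween>
                  <start><![CDATA[<div id="header">]]></start>
                  <end><![CDATA[<!-- end of header -->]]></end>
                </stripBetween>
                <stripBetween>
                  <start><![CDATA[<div id="footer">]]></start>
                  <end><![CDATA[<!-- end of footer -->]]></end>
                </stripBetween>
              </transformer>

              <!-- Requirement: use a meta tag's content as the document URL.
                   Not shown here; this would likely need a custom tagger or
                   document processor rather than configuration alone. -->

            </preParseHandlers>
          </importer>

        </crawler>
      </crawlers>
    </httpcollector>

The indexing requirement itself would then be handled by a Committer, such as the LucidWorks Committer mentioned above, configured within the same crawler setup.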
