Norconex HTTP Collector

Open-Source Enterprise Web Crawler


Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and is packaged with sample crawls that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors.

More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text from many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports different frequencies for re-crawling certain pages.
  • Supports various web site authentication schemes.
  • Supports sitemap.xml (including "lastmod" and "changefreq").
  • Supports robot rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • Reference XML/HTML elements using simple DOM tree navigation.
  • Many others.
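
Two of the features listed above, URL normalization and detection of modified and deleted documents, can be illustrated with a short conceptual sketch. The Python below is not the collector's own code or API; the function names (normalize_url, content_checksum, classify_changes) and the specific normalization rules are assumptions chosen for illustration only.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so equivalent URLs compare equal.

    Illustrative rules (assumed, not the collector's actual rule set):
    lower-case scheme and host, drop default port, drop fragment,
    use "/" for an empty path, and sort query parameters.
    Any user:password part of the URL is discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))


def content_checksum(content: bytes) -> str:
    """Fingerprint document content so changes can be detected cheaply."""
    return hashlib.sha256(content).hexdigest()


def classify_changes(previous: dict, current: dict) -> dict:
    """Compare two crawl snapshots, each mapping URL -> checksum.

    Returns a mapping URL -> "new", "modified", "unmodified", or "deleted".
    """
    status = {}
    for url, checksum in current.items():
        if url not in previous:
            status[url] = "new"
        elif previous[url] != checksum:
            status[url] = "modified"
        else:
            status[url] = "unmodified"
    for url in previous:
        if url not in current:
            status[url] = "deleted"
    return status
```

In this sketch, a crawler would normalize each URL before storing it, keep the checksum from the previous crawl, and call classify_changes after a re-crawl to decide which documents to send to, or delete from, the target repository.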

Latest news

Norconex CloudSearch Committer 1.1.0 released
Now ships with CloudSearch SDK 1.11.29 and new committer install script.

Norconex Elasticsearch Committer 2.1.0 released
Now ships with Elasticsearch API 2.3.5 and new committer install script.

Norconex HP IDOL Committer 2.1.0 released
New committer install script.

Norconex Solr Committer 2.2.0 released
Now ships with SolrJ 6.1 and new committer install script.

Norconex HTTP Collector 2.6.0 released
New document parsing and manipulation capabilities, and more.

Copyright © 2013-2016 Norconex Inc.