Norconex HTTP Collector

Open-Source Enterprise Web Crawler

Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and is packaged with sample crawl examples that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors. Find out about those here.

More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text out of many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports various web site authentication schemes.
  • Supports sitemap.xml and robots rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Provides date parsers and formatters to match your source and target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • Many others.

Latest news

Norconex HTTP Collector 2.2.1 released
Bug fix release.

Norconex Collector Core 1.2.1 released
Maintenance release. Minor updates.

Norconex Importer 2.3.1 released
Maintenance release. Maven dependency updates.

Norconex Committer Core 2.0.2 released
Maintenance release. Maven dependency updates.

Norconex HTTP Collector 2.2.0 released
Additional content transformation capabilities, canonical URL support, increased stability, and more.

Copyright © 2013-2015 Norconex Inc.