Norconex HTTP Collector

Open-Source Enterprise Web Crawler


Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and comes packaged with sample crawl examples that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors.


More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text out of many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports various web site authentication schemes.
  • Supports sitemap.xml and robot rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • Reference XML/HTML elements using simple DOM tree navigation.
  • Many others.
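
To make one of the features above more concrete, here is a minimal sketch of what URL normalization generally involves: rewriting equivalent URLs into a single canonical form so a crawler does not fetch the same page twice. This is not the Norconex API or its configuration; the function name `normalize_url` and the specific rules chosen (lowercasing scheme and host, dropping default ports and fragments, treating an empty path as `/`) are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` (hypothetical helper, illustration only).

    Limitations of this sketch: user info (user:pass@) is discarded and
    IPv6 host literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # `hostname` is already lowercased by urlsplit; guard against None.
    host = parts.hostname or ""
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # An empty path is equivalent to the root path.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def unique_urls(urls):
    """Yield each URL's normalized form once, in first-seen order."""
    seen = set()
    for url in urls:
        normalized = normalize_url(url)
        if normalized not in seen:
            seen.add(normalized)
            yield normalized
```

With rules like these, `HTTP://Example.COM:80/a/b?x=1#top` and `http://example.com/a/b?x=1` collapse to the same entry in a crawl queue. A real crawler would typically let you choose which normalization rules apply, since some (such as removing query parameters) can change the page that is returned.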

Latest news

Norconex Elasticsearch Committer 2.0.2 released
2015-12-23
Maintenance release. Upgraded to Elasticsearch 1.7.4.

Norconex HTTP Collector 2.3.0 released
2015-11-06
Use sitemaps as start URLs, add custom headers to HTTP requests, use scripting, DOM navigation, and more.

Norconex Filesystem Collector 2.3.0 released
2015-11-06
Takes advantage of new features from Collector Core 1.3.0 and Importer 2.4.0.

Norconex Collector Core 1.3.0 released
2015-11-06
Maintenance release. Minor updates.

Norconex Committer Core 2.0.3 released
2015-11-03
Maintenance release. Minor updates.

Copyright © 2013-2016 Norconex Inc.