Norconex HTTP Collector

Open-Source Enterprise Web Crawler


Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and is packaged with sample crawl examples that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors.


More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text out of many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports various website authentication schemes.
  • Supports sitemap.xml and robot rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Many others.
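
To make one item from the list above concrete, here is a minimal sketch of what URL normalization typically involves, so that equivalent URLs are only crawled once. It is a generic illustration in Python, not the Norconex implementation or API: the function name normalize_url and the specific rules chosen (lowercase scheme and host, drop default ports, drop fragments, sort query parameters) are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` (hypothetical helper, illustration only).

    Simplifications: user info (user:pass@) is discarded and IPv6 host
    literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # An empty path and "/" point to the same resource.
    path = parts.path or "/"

    # Sort query parameters so that ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # Fragments are client-side only, so they are dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


if __name__ == "__main__":
    print(normalize_url("HTTP://Example.com:80/a?b=2&a=1#top"))
```

A crawler would usually apply a step like this before checking whether a URL has already been seen. Which rules are safe depends on the target site; sorting query parameters, for instance, can change behavior on sites that are sensitive to parameter order.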

Latest news

Norconex HTTP Collector 2.2.0 released
2015-07-22
Additional content transformation capabilities, canonical URL support, increased stability, and more.

Norconex Filesystem Collector 2.2.0 released
2015-07-22
Takes advantage of new features from Collector Core 1.2.0 and Importer 2.3, along with minor improvements.

Norconex Collector Core 1.2.0 released
2015-07-22
Decide how to handle references turned bad, plus more features and changes.

Norconex Importer 2.3.0 released
2015-07-22
New TextPatternTagger for extracting text matching regular expressions.

Norconex Importer 2.2.0 released
2015-06-15
Numeric and date filters, document length and current date taggers, as well as fixes and stability improvements.

Copyright © 2013-2015 Norconex Inc.