Norconex HTTP Collector

Open-Source Enterprise Web Crawler

Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and ships with sample crawl configurations that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors. Find out about those here.


More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text from many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports various website authentication schemes.
  • Supports sitemap.xml and robots rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Many others.
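
To make one of the listed features concrete, here is a minimal sketch of what URL normalization can look like in general. It is not the collector's own implementation or API: the function name `normalize_url` and the specific rules chosen (lowercase scheme and host, drop default ports, drop fragments, treat an empty path as "/") are assumptions for illustration only. The sketch also ignores user-info and IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so equivalent URLs compare equal.

    Hypothetical helper for illustration; not part of Norconex HTTP Collector.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # An empty path and "/" address the same resource.
    path = parts.path or "/"

    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    seen = set()
    for candidate in ("HTTP://Example.COM:80/a?b=1#top", "http://example.com/a?b=1"):
        seen.add(normalize_url(candidate))
    # Both spellings collapse to a single entry, so the page is fetched once.
    print(seen)
```

Normalizing before storing a URL lets a crawler recognize that two differently written links point to the same page, which avoids fetching and indexing duplicates.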

Latest news

Norconex Importer 2.2.0 released
2015-06-15
Numeric and date filters, document length and current date taggers, as well as fixes and stability improvements.

Norconex HTTP Collector 2.1.0 released
2015-04-08
Takes advantage of new features from Importer 2.1, along with fixes and improvements.

Norconex Filesystem Collector 2.1.0 released
2015-04-08
Takes advantage of new features from Importer 2.1, along with fixes and improvements.

Norconex Collector Core 1.1.0 released
2015-04-08
Now ships with Importer 2.1.1. Includes fixes and improvements as well.

Norconex Importer 2.1.0 released
2015-04-01
Document OCR, translation, and title generation are among the new features introduced in this release.

Copyright © 2013-2015 Norconex Inc.