Norconex HTTP Collector

Open-Source Enterprise Web Crawler


Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and comes packaged with sample crawl examples that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors. Find out about those here.

More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text from many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports different frequencies for re-crawling certain pages.
  • Supports various web site authentication schemes.
  • Supports sitemap.xml (including "lastmod" and "changefreq").
  • Supports robot rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • Reference XML/HTML elements using simple DOM tree navigation.
  • Many others.

Latest news

Norconex Amazon CloudSearch Committer 1.0.0 released
You can now use Norconex Collectors with Amazon CloudSearch.

Norconex HTTP Collector 2.4.0 released
Better redirect handling, passwords can now be encrypted, several fixes, and more.

Norconex Filesystem Collector 2.4.0 released
Now takes relative local paths as start paths. Takes advantage of new features from Collector Core 1.4.0 and Importer 2.5.0.

Norconex Collector Core 1.4.0 released
Maintenance release. Minor updates.

Norconex Importer 2.5.0 released
Improved character encoding support, DOMTagger and DOMFilter improvements, and more.

Copyright © 2013-2016 Norconex Inc.