Norconex HTTP Collector

Open-Source Enterprise Web Crawler

Crawl web content

Collect website content for your search engine or any other data repository. Run this full-featured collector on its own, or embed it in your own application. It works on any operating system, is fully documented, and is packaged with sample crawl configurations that run out of the box to get you started quickly.

Norconex HTTP Collector shares common features with other Norconex Collectors.

More features

The following is a non-exhaustive list of features supported by the Norconex HTTP Collector:

  • Multi-threaded.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Language detection.
  • Configurable crawling speed.
  • Offers URL normalization.
  • Detects modified and deleted documents.
  • Supports various website authentication schemes.
  • Supports sitemap.xml and robot rules.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can reprocess or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
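
To illustrate what URL normalization means in a crawler, here is a minimal Python sketch. It is not the Norconex implementation or its API; the function name `normalize_url` and the specific rules chosen (lowercasing scheme and host, dropping default ports, sorting query parameters, removing fragments) are assumptions picked for illustration only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so equivalent URLs compare equal.

    Hypothetical helper for illustration. It ignores user credentials and
    does not handle IPv6 literal hosts.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # An empty path and "/" point to the same resource.
    path = parts.path or "/"

    # Sort query parameters so their order does not create duplicates.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # Fragments are client-side only, so they are dropped.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With these rules, `HTTP://Example.com:80/a?b=2&a=1#top` and `http://example.com/a?a=1&b=2` normalize to the same string, so a crawler would fetch the page only once.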

Latest news

How to crawl Facebook
See how Norconex HTTP Collector can be used to crawl Facebook posts.

Norconex HTTP and Filesystem Collectors 2.0.2 released
Fixes a problem with stopping the collector and improves handling of child document deletion.

Monitor your crawler’s progress with JEF Monitor
New tutorial showing how to integrate JEF Monitor with Norconex Collectors.

Norconex HTTP and Filesystem Collectors 2.0.1 released
Fixes a file name collision issue when keeping downloads.

Norconex GSA Committer 1.0.0 released
Norconex officially released its Google Search Appliance Committer.

Copyright © 2013-2015 Norconex Inc.