Crawl web content
Collect web site content for your search engine or any other data
repository. Run this full-featured crawler on its own, or embed it in
your own application. It works on any operating system, is fully
documented, and is packaged with sample crawl configurations that run
out-of-the-box to get you started quickly.
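The sample configurations are plain XML files. As a rough illustration only (element names assume the 2.x configuration schema and may differ in your version, and `example.com` is a placeholder start URL), a minimal crawl definition might look like:

```xml
<httpcollector id="Minimal HTTP Collector">
  <crawlers>
    <crawler id="My First Crawler">
      <!-- Where crawling begins -->
      <startURLs>
        <url>https://example.com/</url>
      </startURLs>
      <!-- Limit how many link "hops" away from the start URL to follow -->
      <maxDepth>1</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```

A file like this is typically passed to the launch script shipped with the distribution; refer to the Norconex HTTP Collector documentation for the exact command and options for your release.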
Norconex HTTP Collector shares common features with other Norconex
Collectors. Find out about those shared features in the Norconex
Collectors documentation.
The following is a non-exhaustive list of features supported by
the Norconex HTTP Collector:
- Supports different hit intervals according to different schedules.
- Can crawl millions of pages on a single server of average capacity.
- Extract text out of many file formats (HTML, PDF, Word, etc.).
- Extract metadata associated with documents.
- Language detection.
- Many content and metadata manipulation options.
- OCR support on images and PDFs.
- Page screenshots.
- Extract page "featured" image.
- Translation support.
- Dynamic title generation.
- Configurable crawling speed.
- URL normalization.
- Detects modified and deleted documents.
- Supports different frequencies for re-crawling certain pages.
- Supports various web site authentication schemes.
- Supports sitemap.xml (including "lastmod" and "changefreq").
- Supports robot rules.
- Supports canonical URLs.
- Can filter documents based on URL, HTTP headers, content, or metadata.
- Can treat embedded documents as distinct documents.
- Can split a document into multiple documents.
- Can store crawled URLs in different database engines.
- Can re-process or delete URLs no longer linked by other crawled pages.
- Supports different URL extraction strategies for different content types.
- Fires more than 20 crawler event types for custom event listeners.
- Date parsers/formatters to match your source/target repository dates.
- Can create hierarchical fields.
- Supports scripting languages for manipulating documents.
- Reference XML/HTML elements using simple DOM tree navigation.
- Supports external commands to parse or manipulate documents.
- Many others.