Crawl web content
Collect website content for your
search engine or any other data repository. Run this full-featured
collector on its own, or embed it in your own application.
It works on any operating system,
is fully documented, and is packaged with sample crawl examples that run
out of the box to get you started quickly.
Norconex HTTP Collector shares common features with other Norconex
Collectors. Find out about those
The following is a non-exhaustive list of features supported by
the Norconex HTTP Collector:
- Supports different hit intervals according to different schedules.
- Can crawl millions of documents on a single server of average capacity.
- Extracts text out of many file formats (HTML, PDF, Word, etc.).
- Extracts metadata associated with documents.
- Language detection.
- OCR support on images and PDFs.
- Translation support.
- Dynamic title generation.
- Configurable crawling speed.
- Offers URL normalization.
- Detects modified and deleted documents.
- Supports various website authentication schemes.
- Supports sitemap.xml and robot rules.
- Can filter documents based on URL, HTTP headers, content, or metadata.
- Can treat embedded documents as distinct documents.
- Can split a formatted document into multiple documents.
- Can store crawled URLs in different database engines.
- Can reprocess or delete URLs no longer linked by other crawled pages.
- Supports different URL extraction strategies for different content types.
- Fires more than 20 crawler event types for custom listeners.
- Date parsers/formatters to match your source/target repository dates.
- Can create hierarchical fields.
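
To illustrate what URL normalization (listed above) means in practice, here is a minimal, generic sketch in Python. It is not the Norconex HTTP Collector API or its configuration; the function name `normalize_url` and the specific rules chosen (lowercase scheme and host, drop default ports, drop fragments, default an empty path to `/`) are assumptions made for this example. It also assumes plain host names, so user info and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be omitted without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of an absolute http(s) URL (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops any user info and port

    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # An empty path and "/" point to the same resource.
    path = parts.path or "/"

    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With rules like these, `HTTP://Example.COM:80/a/b?x=1#top` and `http://example.com/a/b?x=1` map to the same string, so a crawler would fetch and store the page only once.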