Welcome to Norconex HTTP Collector
Norconex HTTP Collector is a web crawler that aims to make the lives of Enterprise Search integrators and developers easier.
The current stable version is 1.1.1.
Another crawler... really?
Pretty much all commercial Enterprise Search vendors sell their own HTTP crawler, and there are several open-source offerings already, so this raises the question: why another crawler?
The answer is simple: none were entirely satisfying to us. At Norconex, we have extensive experience with different crawler implementations as Enterprise Search integrators. While they all have their strengths and weaknesses, we always wished we could get our hands on one that combines all the things we like while minimizing the recurring pain points we kept experiencing. After years of waiting, we took matters into our own hands, and the result is here. While at first its main goal was to facilitate our own work, we now hope it can benefit you too. Please be vocal about things you would like to see included in future releases.
We made sure the minimum set of features we believe a crawler should have was part of the original release of Norconex HTTP Collector. While you will find many of these features in several HTTP crawlers out there, we never found one that supported all of them out-of-the-box to our satisfaction (we are picky!). Features and qualities are:
- Easy deployment
Deploy your configuration to different locations (staging, production, etc.) while environment-specific settings (IPs, URLs, etc.) are kept in separate, environment-specific variable files.
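For example, production-only values could live in their own file and be referenced from the crawler configuration through variable substitution (the file name, variable names, and elements below are made up for illustration; consult the documentation for the exact variable-file conventions):

    # prod.variables -- hypothetical file holding production-only values
    startURL = http://www.example.com/
    workDir = /opt/collector/workdir

    <!-- Configuration fragment referencing the variables above -->
    <startURLs>
      <url>${startURL}</url>
    </startURLs>

The same configuration file can then be promoted from staging to production unchanged; only the variables file differs per environment.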
- Highly customizable
Almost every feature can be replaced by custom code or extended.
- Reusable configuration
Break down your configuration into fragments as you see fit, and share entire configuration sections between many crawler instances.
- Ease of maintenance
Only one central copy of the crawler binaries needs to be installed and updated. No more duplication of binaries or identical configuration values across several crawlers.
- Easy for developers to extend
Many insertion points. Plain Java, with no custom framework to learn. Interfaces are clearly defined, and to use your own implementation, you simply write your class name where appropriate in your configuration file. The HTTP Collector is expected to be extended very often by its users.
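As a sketch of the pattern (the interface below is hypothetical; the actual extension points are listed in the Javadocs), a custom component is a plain Java class implementing a documented interface:

    // Illustrative only: the real extension interfaces are listed in the
    // Javadocs; this sketch shows the general pattern of a custom component.
    interface IURLFilter {                         // hypothetical interface
        boolean acceptURL(String url);
    }

    public class NoPrintPagesFilter implements IURLFilter {
        @Override
        public boolean acceptURL(String url) {
            // Skip printer-friendly duplicates of regular pages.
            return !url.contains("printable=true");
        }
    }

You would then point the relevant configuration element at your class, e.g. class="com.example.NoPrintPagesFilter" (a hypothetical name).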
- Supports different hit intervals according to different schedules
For long-running crawls, if your customer's website traffic is high during the day but low at night or on weekends, you can define a different delay between each URL crawled for different time periods.
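As a sketch, the idea looks like this in configuration (element names are illustrative of the product's XML style; the exact delay options are documented in the Javadocs):

    <!-- Illustrative: be gentle during business hours, faster off-hours. -->
    <delay default="3000">
      <schedule dayOfWeek="from Monday to Friday"
                time="from 8:00 to 18:00">10000</schedule>
      <schedule dayOfWeek="from Saturday to Sunday"
                time="from 0:00 to 23:59">1000</schedule>
    </delay>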
- Easy for non-technical people to use
While developers will see more benefits, the out-of-the-box features remain well-documented and easy to configure for non-developers.
- Good documentation
Besides the examples provided on this site and in the distributed package, up-to-date Javadocs are automatically generated with every release, and they constitute most of the documentation. This prevents having to maintain two sets of documentation and avoids discrepancies over time.
- Community support
Get free help from other users of the HTTP Collector or through the free GitHub issue system. No need to be a paying customer to get basic help. Share your tips and tricks!
- Commercial Support
Norconex provides professional support for all its open-source products (email@example.com).
- Free/Open Source
Free is good. Open source: if you do not like something, fix it and never get stuck again!
- No vendor lock-in
Being open-source, you are not tied to a vendor's agenda, or even to a vendor's existence.
- Cross vendors
Not tied to a specific search engine. You can switch to another search vendor and still use this collector.
- More than a crawler
Can do more than just crawling. Allows for any kind of document manipulation, before or after text is extracted.
- Cross-platform
100% Java-based. As long as you can get Java on your operating system, you are good to go. For instance, the same HTTP Collector installation can run on Windows or Linux.
- Obtain and manipulate document metadata
All metadata found in documents is extracted by default (both from document properties and HTTP headers). You can easily add, modify, remove, or rename metadata fields.
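For instance, an import step could rename an HTTP header field and drop an unwanted one before indexing (the tagger class names and XML layout below are illustrative of the Norconex Importer style, not an exact reference):

    <!-- Illustrative import fragment: class names are hypothetical. -->
    <importer>
      <tagger class="com.norconex.importer.tagger.RenameTagger">
        <rename fromField="Last-Modified" toField="lastModified"/>
      </tagger>
      <tagger class="com.norconex.importer.tagger.DeleteTagger"
              fields="X-Powered-By"/>
    </importer>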
- Handles deleted documents
The HTTP Collector will detect when a URL no longer exists and perform the appropriate action (e.g. remove it from the search index).
- Robots.txt and in-page Robots meta tag support
Can easily be turned off for your own sites.
- Sitemap.xml support
Includes support for sitemap indexes, in plain text or gzip format.
- Embeddable
Can be integrated and run from other Java applications (with or without file-based configuration).
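A minimal embedding sketch (the constructor and method shown are hypothetical; consult the Javadocs for the actual entry point):

    // Hypothetical sketch: the constructor and method names here are
    // illustrative; see the Javadocs for the actual embedding API.
    public class CrawlLauncher {
        public static void main(String[] args) {
            // Load the same XML configuration used from the command line.
            HttpCollector collector = new HttpCollector("collector-config.xml");
            // Start a fresh (non-resumed) crawl from within your application.
            collector.crawl(false);
        }
    }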
- Easy to create reports
Offers listeners for most crawling events, allowing you to build your own reports.
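As a sketch of the listener pattern (the interface and event names below are hypothetical; the actual listener interfaces are in the Javadocs):

    // Illustrative only: shows the general shape of a crawl-event listener
    // used to accumulate statistics for a custom report.
    interface CrawlerEventListener {               // hypothetical interface
        void crawlerEvent(String eventType, String url);
    }

    public class RejectionCounter implements CrawlerEventListener {
        private int rejectedCount = 0;

        @Override
        public void crawlerEvent(String eventType, String url) {
            // Tally rejected documents to feed into a custom crawl report.
            if (eventType.startsWith("REJECTED")) {
                rejectedCount++;
            }
        }

        public int getRejectedCount() {
            return rejectedCount;
        }
    }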
- Logs are meaningful and verbose
Can log every document failure with the exact cause of failure. Uses Log4j to control log levels.
- Resumable upon failure
Upon a major crash (e.g. server shutdown), you can resume the HTTP Collector and it will pick up exactly where it left off. You can also stop and resume it yourself.
- Can indicate progress
Uses Norconex JEF, which automatically stores its progress to file. With tools such as Norconex JEF Monitor, you can visualize crawl progress.
- XML Configurable
Configured via simple XML.
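A stripped-down sketch of what a configuration file can look like (only a few elements are shown, and names should be checked against the documentation):

    <httpcollector id="Minimal Collector">
      <crawlers>
        <crawler id="My First Crawler">
          <startURLs>
            <url>http://www.example.com/</url>
          </startURLs>
          <maxDepth>2</maxDepth>
        </crawler>
      </crawlers>
    </httpcollector>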
- Supports various website authentication schemes
Supports FORM, BASIC, and DIGEST implementations out-of-the-box. If you use a custom authentication mechanism, you can plug in your own authentication logic.
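A sketch of what form-based authentication settings can look like (element names are illustrative; refer to the HTTP client configuration options in the Javadocs for the exact syntax):

    <!-- Illustrative FORM authentication fragment. -->
    <httpClientFactory>
      <authMethod>form</authMethod>
      <authURL>https://www.example.com/login</authURL>
      <authUsername>crawler</authUsername>
      <authPassword>secret</authPassword>
      <authUsernameField>username</authUsernameField>
      <authPasswordField>password</authPasswordField>
    </httpClientFactory>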
- Broken down into reusable modules
The HTTP Collector uses Norconex Importer and Norconex Committer, which can both be used on their own for your own purposes (e.g. to build your own collectors).
- Supports cookies
- Relies on proven technology
Many of the HTTP Collector's core features rely on proven technology such as Apache Tika and Apache Derby.
- Detects modified documents
Detects whether documents have been modified since the previous run.
- Crawling speed control
Can control the crawling speed. Also offers to skip downloading unmodified documents, based on HTTP headers only (for subsequent runs).
- Multi-threaded
Can use multiple threads to crawl websites faster, while respecting the minimum delay between each page crawled.
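In configuration, that combination might look like this (element names illustrative):

    <!-- Illustrative: four crawler threads, at least one second between hits. -->
    <numThreads>4</numThreads>
    <delay default="1000"/>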
- Can normalize URLs
Offers URL normalization to treat URL variations as one.
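For example, a normalizer could lowercase host names, strip fragments, and remove session IDs so that URL variations resolve to a single entry (the class and rule names below are illustrative; the available normalization rules are listed in the Javadocs):

    <!-- Illustrative URL normalization fragment. -->
    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      <normalizations>
        lowerCaseSchemeHost, removeFragment, removeDotSegments, removeSessionIds
      </normalizations>
    </urlNormalizer>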
- Tested with millions of URLs
Single or multiple crawler instances have been tested with millions of web pages and documents without issues.
There is more than one way to do most things. Experiment and have fun!
Does it meet every web crawling need?
No. The original focus of the Norconex HTTP Collector is Enterprise Search, especially when several crawlers are required to fulfill all kinds of customer-specific requests (even weird ones) and you wish to avoid maintenance and support headaches. There are things it is not optimized for, even though it may be one day (whether provided by Norconex or the community). For example:
- Does not focus on Big Data (but it has been tested with millions of documents -- see the feature list above).
- Does not focus on preserving document structure.
While we hope you use and like this HTTP Collector, you are still encouraged to use whichever HTTP crawler best fits your needs. Let us know how the Norconex HTTP Collector can be improved to better suit your needs.