Tested on many sites with millions of pages. Crawl what you want, where you want. Manipulate extracted content at will.
Licensed under the GNU General Public License v3.0 and maintained on GitHub.
Community and commercial support available.
Easy to Run
Fully documented and works out of the box with sample configurations you can modify to suit your needs.
Integrates with virtually any search engine or other system. Easily add new features or replace existing ones.
100% Java-based. Runs anywhere. Test on one operating system (e.g. Windows) and deploy to another (e.g. Linux).
From supporting robot rules to detecting document deletions, you will be pleased with its long list of features.