Norconex HTTP Collector

If you do not find answers to your questions here, please ask your question on GitHub and it may find its way here.

All Collectors

What file formats are supported?

The parsing of downloaded files is performed by the Norconex Importer. The full list of supported file formats is available on its web site.

How to prevent fields from being added to a document

DeleteTagger and KeepOnlyTagger will help you produce just the fields you want. The latter is probably the one you want in most cases. The following shows how to eliminate all fields from a document, except for the document reference, keywords, and description fields:

        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference, keywords, description</fields>
        </tagger>
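
For the opposite case, where most fields should stay and only a few should go, DeleteTagger is likely the better fit. The following is a minimal sketch that assumes DeleteTagger accepts the same comma-separated fields element as KeepOnlyTagger; the field names are purely illustrative and are not taken from any particular document:

        <!-- Sketch only: removes the listed fields and keeps every other field. -->
        <!-- "X-Parsed-By" and "Content-Encoding" are hypothetical example field names. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
            <fields>X-Parsed-By, Content-Encoding</fields>
        </tagger>

Check the Importer documentation for your version to confirm the exact options before relying on this.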

How to choose the right Crawl Database implementation

Norconex Collectors need a database to store key reference information about each collected document (URL, path, etc.). Four implementations are offered out of the box: MVStore, MapDB, MongoDB, and JDBC (Derby or H2). Prior to version 2.5.0 of both the HTTP and Filesystem Collectors, MapDB was the default implementation. Since version 2.5.0 of these collectors, MVStore is the default. Using the default implementation does not require explicit configuration. The following will help you decide which one is right for you: