Norconex Elasticsearch Committer

Configuration

When used with a Norconex Crawler, configure Elasticsearch as the committer by placing the following XML in the <committer> section of your crawler configuration:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>...</nodes>
    <indexName>...</indexName>
    <typeName>...</typeName>
    <ignoreResponseErrors>[false|true]</ignoreResponseErrors>
    <discoverNodes>[false|true]</discoverNodes>
    <dotReplacement>...</dotReplacement>
    <jsonFieldsPattern>...</jsonFieldsPattern>
    <connectionTimeout>(milliseconds)</connectionTimeout>
    <socketTimeout>(milliseconds)</socketTimeout>
    <maxRetryTimeout>(milliseconds)</maxRetryTimeout>
    <fixBadIds>[false|true]</fixBadIds>
    <username>...</username>
    <password>...</password>
    <passwordKey>...</passwordKey>
    <passwordKeySource>[key|file|environment|property]</passwordKeySource>
    <sourceReferenceField keep="[false|true]">...</sourceReferenceField>
    <sourceContentField keep="[false|true]">...</sourceContentField>
    <targetContentField>...</targetContentField>
    <queueDir>...</queueDir>
    <queueSize>...</queueSize>
    <commitBatchSize>...</commitBatchSize>
    <maxRetries>...</maxRetries>
    <maxRetryWait>...</maxRetryWait>
</committer>
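
As a minimal sketch, tags with a documented default can be left out so that the default applies. The example below assumes a single local node and uses hypothetical index and type names (my-index, my-type); replace them with values that match your own cluster.

```xml
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <!-- Hypothetical values for illustration only. -->
    <nodes>http://localhost:9200</nodes>
    <indexName>my-index</indexName>
    <typeName>my-type</typeName>
</committer>
```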

Tag descriptions:

Tag Description
nodes Comma-delimited list of Elasticsearch node URLs to connect to. Default is http://localhost:9200.
indexName Index name to use when committing documents to Elasticsearch.
typeName Type name to use when committing documents to Elasticsearch.
ignoreResponseErrors Optionally ignore errors in Elasticsearch responses. When ignored, errors are logged instead of throwing an exception. Default is false.
discoverNodes Optionally enable automatic discovery of cluster nodes beyond the configured ones. Default is false.
dotReplacement Optional string that replaces dots in field names. Default is null (dots are not replaced).
jsonFieldsPattern Optional regular expression to identify fields containing JSON objects instead of regular strings.
connectionTimeout Elasticsearch connection timeout, in milliseconds. Default is 1000 (1 second).
socketTimeout Elasticsearch socket timeout, in milliseconds. Default is 30000 (30 seconds).
maxRetryTimeout Maximum amount of time, in milliseconds, to wait before retrying a failing Elasticsearch host. Default is 30000 (30 seconds).
fixBadIds Whether to fix document IDs that do not meet Elasticsearch ID limitations.
username Basic authentication user name.
password Basic authentication password.
passwordKey Reference to password key (or actual key) for encrypted passwords. See the API Documentation for encryption instructions.
passwordKeySource Source of password key for encrypted passwords. See the API Documentation for encryption instructions.
sourceReferenceField Name of the source field that will be mapped to the Elasticsearch ID field. Default is the document reference, which the Committer stores as document.reference. Once mapped, the metadata source field is deleted unless keep is set to true.
sourceContentField Name of the source field holding the document content/body. By default, the document body content is used rather than a field. Once mapped, the metadata source field is deleted unless keep is set to true.
targetContentField Target field name for a document content/body. Default is: content.
queueDir Optional path to the directory where files are queued before being sent to Elasticsearch. Default is: ./committer-queue.
queueSize Optional maximum number of documents to queue before sending them to Elasticsearch. Default is: 1000.
commitBatchSize Optional maximum number of documents to send to Elasticsearch at once. Default is: 100.
maxRetries Maximum retries upon commit failures. Default is 0 (no retry).
maxRetryWait Maximum delay, in milliseconds, between retries. Default is 0 (no delay).
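
To show how several of the optional tags described above combine, here is a fuller sketch. Every value in it is an assumption chosen for illustration: the host names, index and type names, credentials, and the my.reference.field metadata field are hypothetical, and the numeric settings are arbitrary examples rather than recommendations.

```xml
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <!-- Two hypothetical nodes, comma-delimited. -->
    <nodes>http://es1.example.com:9200,http://es2.example.com:9200</nodes>
    <indexName>my-index</indexName>
    <typeName>my-type</typeName>

    <!-- Timeouts are expressed in milliseconds. -->
    <connectionTimeout>2000</connectionTimeout>
    <socketTimeout>60000</socketTimeout>

    <!-- Basic authentication; placeholder credentials. -->
    <username>crawler</username>
    <password>changeme</password>

    <!-- Use a hypothetical metadata field as the Elasticsearch ID
         and keep that field on the document after mapping. -->
    <sourceReferenceField keep="true">my.reference.field</sourceReferenceField>

    <!-- Queue up to 500 documents, send them 50 at a time, and retry a
         failed commit up to 3 times, waiting at most 5 seconds between tries. -->
    <queueSize>500</queueSize>
    <commitBatchSize>50</commitBatchSize>
    <maxRetries>3</maxRetries>
    <maxRetryWait>5000</maxRetryWait>
</committer>
```

The password here is shown in plain text for brevity. If you store an encrypted password instead, passwordKey and passwordKeySource would also need to be set, following the encryption instructions in the API Documentation.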