Norconex Importer

Open-source document text extractor and transformer


Content importer

Norconex Importer is a Java library and command-line application that parses a file in its native format (HTML, PDF, Word, etc.) and extracts its content as plain text. It also lets you manipulate the extracted text before importing or using it in your own service or application.
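
To make the extract-then-transform idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the Norconex Importer API: the class ExtractSketch and its methods extractText and transform are hypothetical names made up for this illustration, and the regex-based tag stripping is only a crude stand-in for a real format parser.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Illustration only: NOT Norconex Importer code. All names are hypothetical.
final class ExtractSketch {

    // Crude stand-in for a format parser: replaces HTML tags with spaces,
    // then collapses runs of whitespace into single spaces.
    static String extractText(String html) {
        String noTags = html.replaceAll("<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Applies each transformation step, in order, to the extracted text.
    static String transform(String text, List<UnaryOperator<String>> steps) {
        String result = text;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>World</b></p></body></html>";

        // Step 1: extract plain text from the native format.
        String text = extractText(html);

        // Step 2: manipulate the text before handing it to an application.
        List<UnaryOperator<String>> steps = List.of(
                s -> s.toLowerCase(),
                s -> s.replace(' ', '_'));
        System.out.println(transform(text, steps));
    }
}
```

Keeping extraction and transformation as separate stages means the list of steps can change without touching the parsing code, which is the general shape the paragraph above describes.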

A typical use, though not the only one, is to import crawled content for use by a search engine. For that purpose, consider one of the Norconex Collectors, which rely on Norconex Importer.

Have a look at the supported file formats.

Latest news

Norconex CloudSearch Committer 1.4.0 released
New proxy support.

Norconex HTTP Collector 2.8.0 released
Featured image extractor, more site authentication options, etc.

Norconex Filesystem Collector 2.8.0 released
New features through dependency updates.

Norconex Importer 2.8.0 released
New TruncateTagger, ExternalTagger, etc.

Norconex Collector Core 1.9.0 released
More checksum options and other minor updates and fixes.

Copyright © 2013-2018 Norconex Inc.