
HTTP Collector 2.6

Norconex has released version 2.6.0 of its HTTP Collector web crawler! Among the new features, an upgrade of its Importer module brings new document parsing and manipulation capabilities. Some of the changes highlighted here also benefit the Norconex Filesystem Collector.

New URL normalization to remove trailing slashes

[ezcol_1half]

The GenericURLNormalizer has a new predefined normalization rule: “removeTrailingSlash”. When enabled, it removes any forward slash (/) found at the end of a URL so that such URLs are treated the same as their counterparts without the trailing slash. For example:

  • https://norconex.com/ will become https://norconex.com
  • https://norconex.com/blah/ will become https://norconex.com/blah

It can be used with the 20 other normalization rules offered, and you can still provide your own.

[/ezcol_1half]

[ezcol_1half_end]

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, removeTrailingSlash
  </normalizations>
</urlNormalizer>

[/ezcol_1half_end]
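If you want to sanity-check what this rule does, its effect can be sketched in a few lines of plain Java (illustrative only; this is not the GenericURLNormalizer implementation, which handles many more cases):

```java
// Illustrative sketch of the "removeTrailingSlash" rule's effect.
// Not the GenericURLNormalizer implementation.
public class TrailingSlashSketch {

    static String removeTrailingSlash(String url) {
        // Strip a single trailing "/" so both URL forms compare equal.
        return url.endsWith("/")
                ? url.substring(0, url.length() - 1)
                : url;
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingSlash("https://norconex.com/blah/"));
    }
}
```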

Prevent sitemap detection attempts

[ezcol_1half]

By default, StandardSitemapResolverFactory is enabled and tries to detect whether a sitemap file exists at the “/sitemap.xml” or “/sitemap_index.xml” URL path. For websites without sitemap files at these locations, this creates unnecessary HTTP request failures. It is now possible to specify an empty “path” so that no such discovery takes place. In that case, the crawler relies on sitemap URLs explicitly provided as “start URLs” or on sitemaps defined in “robots.txt” files.

[/ezcol_1half]

[ezcol_1half_end]

<sitemapResolverFactory>
  <path/>
</sitemapResolverFactory>

[/ezcol_1half_end]

Count occurrences of matching text

[ezcol_1half]

Thanks to the new CountMatchesTagger, it is now possible to count the number of times a piece of text or a regular expression occurs in a document’s content or in one of its fields. A sample use case is using the obtained count as a relevancy factor in search engines. For instance, one may use this new feature to find out how many segments a document URL contains, giving less importance to documents with many segments.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.CountMatchesTagger"> 
  <countMatches 
      fromField="document.reference"
      toField="urlSegmentCount" 
      regex="true">
    /[^/]+
  </countMatches>
</tagger>

[/ezcol_1half_end]
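To see what the sample expression above actually counts, here is a small stand-alone Java check of the “/[^/]+” regex (an illustration of the idea only, not the CountMatchesTagger internals):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts occurrences of a regex in a string, the way the sample
// "/[^/]+" expression counts URL segments.
public class SegmentCountSketch {

    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Matches "/norconex.com", "/blah", and "/page.html".
        // Note the host portion counts as one segment too.
        System.out.println(countMatches(
                "/[^/]+", "https://norconex.com/blah/page.html"));
    }
}
```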

Multiple date formats

[ezcol_1half]

DateFormatTagger now accepts multiple source formats when attempting to convert dates from one format to another. This is particularly useful when the date formats found in documents or web pages are not consistent. Some products, such as Apache Solr, usually expect dates to be of a specific format only.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="Last-Modified"
    toField="solr_date"
    toFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'">
  <fromFormat>EEE, dd MMM yyyy HH:mm:ss zzz</fromFormat>
  <fromFormat>EPOCH</fromFormat>
</tagger>

[/ezcol_1half_end]
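The “try each source format until one parses” behavior can be sketched in plain Java as follows (illustrative only; DateFormatTagger also understands the special “EPOCH” format shown above, which this sketch does not):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Tries each candidate source format in order and returns the first
// successful parse, mirroring the multiple <fromFormat> idea above.
public class MultiFormatDateSketch {

    static Date parseFirstMatching(String value, String... formats) {
        for (String format : formats) {
            try {
                return new SimpleDateFormat(format, Locale.ENGLISH)
                        .parse(value);
            } catch (ParseException e) {
                // Not this format; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Date date = parseFirstMatching(
                "Tue, 21 Jun 2016 10:00:00 GMT",
                "yyyy-MM-dd",
                "EEE, dd MMM yyyy HH:mm:ss zzz");

        // Re-format the parsed date into a Solr-friendly format.
        SimpleDateFormat solr = new SimpleDateFormat(
                "yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH);
        solr.setTimeZone(TimeZone.getTimeZone("GMT"));
        System.out.println(solr.format(date));
    }
}
```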

DOM enhancements

[ezcol_1half]

DOM-related features just got better. First, the DOMTagger, which lets you extract values from an XML/HTML document using a DOM-like structure, now supports an optional “fromField” to read the markup from a field instead of the document content. It also supports a new “defaultValue” attribute to store a value of your choice when your DOM selector yields no matches. In addition, both DOMContentFilter and DOMTagger now support many more selector extraction options: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="div.contact" toField="htmlContacts" extract="html" />
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="htmlContacts">
  <dom selector="div.firstName" toField="firstNames" 
       extract="ownText" defaultValue="NO_FIRST_NAME" />
  <dom selector="div.lastName"  toField="lastNames" 
       extract="ownText" defaultValue="NO_LAST_NAME" />
</tagger>

[/ezcol_1half_end]

More control of embedded documents parsing

[ezcol_1half]

GenericDocumentParserFactory now gives you more control over embedded documents. You can specify which embedded documents you do not want extracted from their container (e.g., skip embedded images). You can also specify container document types whose embedded documents should not be extracted at all (e.g., skip documents embedded in MS Office files). Finally, you can specify, via regular expression, which content types should have their embedded documents “split” into separate files, as if they were standalone documents (e.g., documents contained in a zip file).

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <embedded>
    <splitContentTypes>application/zip</splitContentTypes>
    <noExtractEmbeddedContentTypes>image/.*</noExtractEmbeddedContentTypes>
    <noExtractContainerContentTypes>
      application/(msword|vnd\.ms-.*|vnd\.openxmlformats-officedocument\..*)
    </noExtractContainerContentTypes>
  </embedded>
</documentParserFactory>

[/ezcol_1half_end]

Document parsers now XML configurable

[ezcol_1half]

GenericDocumentParserFactory now makes it possible to overwrite one or more of the parsers the Importer module uses by default, via regular XML configuration. For any content type, you can specify your own custom parser, including an external parser.

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <parsers>
    <parser contentType="text/html" 
        class="com.example.MyCustomHTMLParser" />
    <parser contentType="application/pdf" 
        class="com.norconex.importer.parser.impl.ExternalParser">
      <command>java -jar c:\Apps\pdfbox-app-2.0.2.jar ExtractText ${INPUT} ${OUTPUT}</command>
    </parser>
  </parsers>
</documentParserFactory>

[/ezcol_1half_end]

More languages detected

[ezcol_1half]

LanguageTagger now uses Tika language detection, which supports at least 70 languages.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger">
  <languages>en, fr</languages>
</tagger>

[/ezcol_1half_end]

What else?

Other changes and stability improvements were made to this release. A few examples:

  • New “checkcfg” launch action that helps detect configuration issues before an actual launch.
  • Can now specify “notFoundStatusCodes” on GenericMetadataFetcher.
  • GenericLinkExtractor no longer extracts URL from HTML/XML comments by default.
  • URL referrer data is now always preserved by default.
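For the second item, a metadata-fetcher snippet could look like the following (element names are inferred from the release-notes wording and should be verified against the GenericMetadataFetcher documentation):

```xml
<metadataFetcher
    class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <notFoundStatusCodes>404,410</notFoundStatusCodes>
</metadataFetcher>
```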

To get the complete list of changes, refer to the HTTP Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.

Useful links

HTTP Collector 2.5

Norconex has released Norconex HTTP Collector version 2.5.0! This new version of our open source web crawler was released to help minimize your re-crawling frequencies and download delays, and it allows you to specify a locale for date parsing/formatting. The following highlights these key changes and additions:

Minimum re-crawl frequency

[ezcol_1half]

Not all web pages and documents are updated with the same frequency. Nor is it equally important to capture every update right away for all types of content. Re-crawling every page every time to find out whether it changed can be time consuming (and sometimes taxing) on larger sites. For instance, you may want to re-crawl news pages more regularly than other types of pages on a given site. Luckily, some websites provide sitemaps, which give crawlers pointers to their document update frequencies.

This release introduces “recrawlable resolvers” to help control the frequency of document re-crawls. You can now specify a minimum re-crawl delay based on a document’s content type or reference pattern. The default implementation is GenericRecrawlableResolver, which supports sitemap “lastmod” and “changefreq”, in addition to custom re-crawl frequencies.

[/ezcol_1half]

[ezcol_1half_end]

<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last" >
  <minFrequency applyTo="contentType" value="monthly">application/pdf</minFrequency>
  <minFrequency applyTo="reference" value="1800000">.*latest-news.*\.html</minFrequency>
</recrawlableResolver>

[/ezcol_1half_end]

Download delays based on document URL

[ezcol_1half]

ReferenceDelayResolver is a new “delay resolver” that controls the delay between document downloads. It allows you to define different delays for different URL patterns. This can be useful for more fragile websites negatively impacted by the rapid download of several large documents (e.g., PDFs). In such cases, introducing a delay between certain types of downloads can help keep the crawled website’s performance intact.

[/ezcol_1half]

[ezcol_1half_end]

<delay class="com.norconex.collector.http.delay.impl.ReferenceDelayResolver"
    default="2000"
    ignoreRobotsCrawlDelay="true"
    scope="crawler" >
  <pattern delay="10000">.*\.pdf$</pattern>
</delay>

[/ezcol_1half_end]

Specify a locale in date parsing/formatting

[ezcol_1half]

Thanks to the Norconex Importer 2.5.2 dependency update, it is now possible to specify a locale when parsing/formatting dates with CurrentDateTagger and DateFormatTagger.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="date"
    fromFormat="EEE, dd MMM yyyy HH:mm:ss 'GMT'"
    fromLocale="fr"
    toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"
    keepBadDates="false"
    overwrite="true" />

[/ezcol_1half_end]

 

Useful links

  • Download Norconex HTTP Collector
  • Get started with Norconex HTTP Collector
  • Report your issues and questions on Github
  • Norconex HTTP Collector Release Notes

 

Norconex just released an Amazon CloudSearch Committer module for its open-source crawlers (Norconex “Collectors”). This is an especially useful contribution to CloudSearch users given that CloudSearch does not have its own crawlers.

If you’re not yet familiar with Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.
Assuming you’re already familiar with Norconex Collectors, you can enable CloudSearch as your crawler’s target search engine by following these steps:

  1. Download the CloudSearch Committer.
  2. Extract the zip, and copy the content of the “lib” folder to the “lib” folder of your existing Collector installation.
  3. Add this minimum required configuration snippet to your Collector configuration file:
    <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
      <serviceEndpoint>(CloudSearch service endpoint)</serviceEndpoint>
      <accessKey>
         (Optional CloudSearch access key. Will be taken from environment when blank.)
      </accessKey>
      <secretKey>
         (Optional CloudSearch secret key. Will be taken from environment when blank.)
      </secretKey>
    </committer>
  4. The document endpoint represents the CloudSearch domain you’ll want to use to store your crawled documents. It can be obtained from your CloudSearch domain’s main page.

CloudSearch main page

As for the AWS access and secret keys, they can also be stored outside the configuration file using one of the methods described here.
The complete list of configuration options is available here.
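For example, when the keys are left blank in the configuration, exporting the standard AWS environment variables before launching your Collector is one common approach (these variable names are the ones used by the AWS SDK’s default credential chain; the values below are placeholders):

```shell
# Placeholder values -- substitute your own AWS credentials.
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
```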

For further information:

Norconex HTTP Collector 2.3.0

Norconex is proud to release version 2.3.0 of its Norconex HTTP Collector open-source web crawler.  Thanks to incredible community feedback and efforts, we have implemented several feature requests, and your favorite crawler is now more stable than ever. The following describes only a handful of these new features with a focus on XML configuration. Refer to the product release notes for a complete list of changes.

Restrict crawling to a specific site

[ezcol_1half]

Up until now, you could restrict crawling to a specific domain, protocol, and port using one or more reference filters (e.g., RegexReferenceFilter). Norconex HTTP Collector 2.3.0 features new configuration options to “stay on a site”: stayOnProtocol, stayOnDomain, and stayOnPort. These new settings can be applied to the <startURLs> tag of your XML configuration. They are particularly useful when you have many “start URLs” defined and do not want to create many reference filters to stay on those sites.

[/ezcol_1half]

[ezcol_1half_end]

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>http://mysite.com</url>
</startURLs>

[/ezcol_1half_end]

 

Add HTTP request headers

[ezcol_1half]

GenericHttpClientFactory now allows you to set HTTP request headers on every HTTP call a crawler makes. This new feature can save the day for sites that expect certain header values to be present in order to render properly. For instance, some sites may rely on the “Accept-Language” request header to decide which language to use when rendering a page.

[/ezcol_1half]

[ezcol_1half_end]

<httpClientFactory>
  <headers>
    <header name="Accept-Language">fr</header>
    <header name="From">john@smith.com</header>
  </headers>
</httpClientFactory>

[/ezcol_1half_end]

Specify a sitemap as a start URL

[ezcol_1half]

It is now possible to specify one or more sitemap URLs as “start URLs.” This is in addition to the crawler attempting to detect sitemaps at standard locations. To use only the sitemap URLs provided as start URLs, you can disable the sitemap discovery process by adding ignore="true" to <sitemapResolverFactory>, as shown in the code sample. To crawl only the pages listed in sitemap files without following links found in those pages, remember to set <maxDepth> to zero.

[/ezcol_1half]

[ezcol_1half_end]

<startURLs>
  <sitemap>http://mysite.com/sitemap.xml</sitemap>
</startURLs>
<sitemapResolverFactory ignore="true" />

[/ezcol_1half_end]

Basic URL normalization always performed

[ezcol_1half]

URL normalization is now in effect by default using GenericURLNormalizer. The following are the default normalization rules applied:

  • Removing the URL fragment (the “#” character and everything after)
  • Converting the scheme and host to lower case
  • Capitalizing letters in escape sequences
  • Decoding percent-encoded unreserved characters
  • Removing the default port
  • Encoding non-URI characters

You can always overwrite the default normalization settings or turn off normalization altogether by adding the disabled="true" attribute to the <urlNormalizer> tag.

[/ezcol_1half]

[ezcol_1half_end]

<urlNormalizer>
  <normalizations>
    lowerCaseSchemeHost, upperCaseEscapeSequence, removeDefaultPort, 
    removeDotSegments, removeDirectoryIndex, removeFragment, addWWW 
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
       <match>(&amp;type=)(summary)</match>
       <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>

[/ezcol_1half_end]

Scripting Language and DOM navigation

We introduced additional features when we upgraded the Norconex Importer dependency to its latest version (2.4.0). You can now use scripting languages to insert your own document processing logic, or reference DOM elements of an XML or HTML file using a friendly syntax. Refer to the Importer 2.4.0 release announcement for more details.

Useful links

There is so much more offered by this release. Use the following links to find out more about Norconex HTTP Collector.

Norconex is proud to release version 2.4.0 of its Norconex Importer open-source product.  In addition to the usual bug fixes and stability enhancements, this release provides more possibilities for parsing and enriching your documents.  Most significantly, Importer 2.4.0 allows for scripting and DOM navigation.  Keep reading for more details and usage samples.

Scripting

[ezcol_1half]

While it has always been possible to extend the Importer to implement your own document processing logic, you can now inject that logic via configuration using a scripting language. The following new handlers enable the use of scripting languages to manipulate documents: ScriptFilter, ScriptTagger, and ScriptTransformer.

These classes use the “JavaScript” script engine already present as part of your Java installation. The JavaScript engine used by the Oracle implementation of Java is based on Mozilla Rhino, and you can find extensive JavaScript documentation on the Mozilla Rhino site.

Java developers can extend the Importer to add support for additional scripting languages. These new classes rely on the JSR 223 API, which allows you to “plug in” any script engine to support your favorite scripting language.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Reject documents that are not about "apple". -->
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
  <script><![CDATA[
      isAppleDoc = metadata.getString('fruit') == 'apple'
              || content.indexOf('Apple') > -1;
      /*return*/ isAppleDoc;
  ]]></script>
</filter>

<!-- Add a "fruit" metadata field with the value "apple". --> 
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
  <script><![CDATA[
      metadata.addString('fruit', 'apple');
  ]]></script>
</tagger>

<!-- Replace all occurrences of "Alice" with "Roger". -->
<transformer 
    class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
  <script><![CDATA[
      modifiedContent = content.replace(/Alice/g, 'Roger');
      /*return*/ modifiedContent;
  ]]></script>
</transformer>

 [/ezcol_1half_end]

DOM navigation

[ezcol_1half]

It is now possible to reference elements of an HTML or XML document using friendly CSS or jQuery-like syntax to navigate its Document Object Model (DOM). The jsoup parser is used to load document content into a DOM tree.

The new DOMContentFilter can be used to reject documents containing a specific HTML/XML path or element. The DOMSplitter can be used to break HTML/XML with “list” elements into different documents. Finally, the DOMTagger allows you to extract specific HTML/XML tag values or attributes and store them in your own fields (e.g., extract <h1> tags into a “title” field).

[/ezcol_1half]

[ezcol_1half_end]

<!-- Exclude documents containing GIF images. -->
<filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
      selector="img[src$=.gif]" onMatch="exclude" />

<!-- Store H1 tags in a title field. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="h1" toField="title" overwrite="false" />
</tagger>

<!-- Create a new contact document for each occurrence of the "contact" tag. -->
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="contact" />

 [/ezcol_1half_end]

Other features

[ezcol_1half]

This release features several other helpful and interesting changes and additions.  For instance, CharacterCaseTagger can now be used to adjust the character case of field names (in addition to values). A few additional file formats are also supported.  For a complete list of changes, see the release notes.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Make every instance of "title" field name lowercase. -->
<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
  <characterCase fieldName="title" type="lower" applyTo="field" />
</tagger>

 [/ezcol_1half_end]

Useful links

The latest release of Norconex HTTP Collector provides more content transformation capabilities, canonical URL support, increased stability, and more additional features.  

Norconex HTTP Collector 2.2 now available

As the Internet grows, so does the demand for better ways to extract and process web data. Several commercial and open-source/free web crawling solutions have been available for years now. Unfortunately, most are limited by one or more of the following:

  • Feature set is too limited
  • Unfriendly and complex to setup
  • Poorly documented
  • Require strong programming skills
  • No longer supported or active
  • Integrates with a single search engine or repository
  • Geared solely toward big data solutions (like the popular Apache Nutch has become)
  • Difficult to extend with your own features
  • High cost of ownership

Norconex is changing this with its full-featured, enterprise-class, open-source web crawler solution. Norconex HTTP Collector is entirely configurable using simple XML, yet offers many extension points for adventurous Java programmers. It integrates with virtually any repository or search engine (Solr, Elasticsearch, IDOL, GSA, etc.). You will find it is thoroughly documented in a single location, with sample configuration files that work out of the box on any operating system.

The latest release builds upon the great community requests and feedback to provide the following highlights:

Canonical Links Detector

[ezcol_1half]

Canonical links are a way for the webmaster to help crawlers avoid duplicates by indicating the preferred URL for accessing a web page. The HTTP Collector now detects canonical links found in both HTML and HTTP headers.

The GenericCanonicalLinkDetector looks within the HTML <head> tags for a <link> tag following this pattern:

<link rel="canonical" href="https://norconex.com/sample" />

It also looks for an HTTP response header field named “Link” with a value following this pattern:

<https://norconex.com/sample.pdf> rel="canonical"

The advantage for webmasters of defining canonical URLs in the HTTP response header rather than in the HTML page is twofold. First, it allows web crawlers to reject non-canonical pages before they are downloaded (saving bandwidth). Second, it can apply to any content type, not just HTML pages.

[/ezcol_1half]

[ezcol_1half_end]

<canonicalLinkDetector
    class="com.norconex.collector.http.url.impl.GenericCanonicalLinkDetector"
    ignore="false">
</canonicalLinkDetector>

[/ezcol_1half_end]

URL Reports Creation

[ezcol_1half]

URLStatusCrawlerEventListener is a new crawler event listener that can produce spreadsheet-friendly reports on fetched URLs and their statuses. Among other things, it can be useful for finding broken links on a site being crawled.

[/ezcol_1half]

[ezcol_1half_end]

<listener
    class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
  <statusCodes>404</statusCodes>
  <outputDir>/a/path/broken-links.tsv</outputDir>
</listener>

[/ezcol_1half_end]

Spoiled State Resolver

[ezcol_1half]

A new class called GenericSpoiledReferenceStrategizer allows you to specify how to handle URLs that were once valid but turned “bad” on a subsequent crawl. You can choose to delete them from your repository, give them a single chance to recover on the next crawl, or simply ignore them.

[/ezcol_1half]

[ezcol_1half_end]

<spoiledReferenceStrategizer 
    class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
    fallbackStrategy="IGNORE">
  <mapping state="NOT_FOUND" strategy="DELETE" />
  <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
  <mapping state="ERROR" strategy="IGNORE" />
</spoiledReferenceStrategizer>

[/ezcol_1half_end]

Extra Filtering and Data Manipulation Options

Norconex HTTP Collector internally relies on the Norconex Importer library for parsing documents and manipulating text and metadata. The latest release of the Importer brings you several new options, such as:

  • CurrentDateTagger: Add the current date to a document.
  • DateMetadataFilter: Accepts or rejects a document based on the date value of a metadata field.
  • NumericMetadataFilter: Accepts or rejects a document based on the numeric value of a metadata field.
  • TextPatternTagger: Extracts and adds all text values matching the regular expression provided to a metadata field.

Want to crawl a filesystem instead?

Whether you are interested in crawling a local drive, a network drive, an FTP site, WebDAV, or any other type of filesystem, Norconex Filesystem Collector is for you; it was recently upgraded to version 2.2.0 as well. Check its release notes for details.

Useful Links

This release of Norconex Importer brings many fixes, increased stability, and nice new features. The following highlights some of the additions with XML configuration or Java code samples.

Retrieve a document's length

[ezcol_1half]

Thanks to the new DocumentLengthTagger, you can now store a document's byte length in a metadata field of your choice. The length can be obtained at any document processing stage. For instance, it can be obtained before any transformation takes place, or after the document is parsed.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DocumentLengthTagger"
  field="doc-length" overwrite="true" >
</tagger>

 [/ezcol_1half_end]

Add the current date to a document

[ezcol_1half]

The new CurrentDateTagger allows you to add the current date to a metadata field, in the date format of your choice. This can be useful to indicate when a document was actually processed by the Importer.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
  field="date-imported" format="yyyy-MM-dd" />

 [/ezcol_1half_end]

Filter documents on numeric or date range

[ezcol_1half]

NumericMetadataFilter and DateMetadataFilter now allow you to filter documents based on metadata field numeric or date values, respectively. You can define both closed ranges and open-ended ranges.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Numeric range filter -->
<filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
      onMatch="include" field="age" >
  <condition operator="ge" number="20" />
  <condition operator="lt" number="30" />
</filter>

<!-- Date range filter -->
<filter class="com.norconex.importer.handler.filter.impl.DateMetadataFilter"
      onMatch="include" field="publish_date" >
  <condition operator="ge" date="TODAY-7" />
  <condition operator="lt" date="TODAY" />
</filter>

 [/ezcol_1half_end]

Use external parsers

[ezcol_1half]

Wrapping a Tika class of the same name, the new ExternalParser allows Java programmers to point to external command-line applications to parse documents. One example is using “pdftotext” to parse PDFs instead of the default PDFBox-based PDF parser, which is much slower (but does a better job overall).

[/ezcol_1half]

[ezcol_1half_end]

import java.util.Map;

import com.norconex.commons.lang.file.ContentType;
import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.ExternalParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {

    @Override
    protected Map<ContentType, IDocumentParser> createNamedParsers() {
        Map<ContentType, IDocumentParser> parsers = super.createNamedParsers();

        ExternalParser pdfParser = new ExternalParser();
        pdfParser.setCommand(
                // Replace this with your own executable path
                "C:\\Apps\\pdftotext.exe", 
                "-enc", "UTF-8", "-raw", "-q", "-eol", "unix",                 
                ExternalParser.INPUT_FILE_TOKEN, 
                ExternalParser.OUTPUT_FILE_TOKEN);
        parsers.put(ContentType.PDF, pdfParser);
        return parsers;
    }
}

  [/ezcol_1half_end]

Other improvements

There are more changes under the hood, like upgrading to Apache Tika 1.8, as well as the fixing of OutOfMemory errors and document parsing sometimes never returning. You can find the complete list of changes in the release notes.

Several of these improvements were made possible thanks to the great feedback of the open-source community. Keep doing so: you make a difference.

Useful links

 

Optical character recognition (OCR), content translation, title generation, and text detection and extraction from more file formats are among the new features now part of your favorite crawlers: Norconex HTTP Collector 2.1.0 and Norconex Filesystem Collector 2.1.0. Both are available now and can be downloaded for free. They both ship with and use the latest version of the Norconex Importer module, which is in large part responsible for many of these new features.

For more details and usage examples, check this article.

These two Collector releases also include bug fixes and stability improvements. We recommend that existing users upgrade.

Get your copy

Download Norconex HTTP Collector

Download Norconex Filesystem Collector

This feature release of Norconex Importer brings bug fixes, enhancements, and great new features, such as OCR and translation support.  Keep reading for all the details on some of this release’s most interesting changes. While Java can be used to configure and use the Importer, XML configuration is used here for demonstration purposes.  You can find all Importer configuration options here.

About Norconex Importer

Norconex Importer is an open-source product for extracting and manipulating text and metadata from files of various formats.  It works for stand-alone use or as a Java library.  It’s an essential component of Norconex Collectors for processing crawled documents.  You can make Norconex Importer an essential piece of your ETL pipeline.

OCR support

[ezcol_1half]

Norconex Importer now leverages Apache Tika 1.7’s newly introduced OCR capability. To convert popular image formats (PNG, TIFF, JPEG, etc.) to text, download a copy of Tesseract OCR for your operating system and reference its install location in your Importer configuration. When enabled, OCR will also process embedded images (e.g., a PDF with images for text). The class to configure to enable OCR support is GenericDocumentParserFactory.

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory 
    class="com.norconex.importer.parser.GenericDocumentParserFactory" >
  <ocr path="(path to Tesseract OCR software install)">
    <languages>eng,fra</languages>
  </ocr>
</documentParserFactory>

 [/ezcol_1half_end]

Translation support

[ezcol_1half]

With the new TranslatorSplitter class, it’s now possible to hook Norconex Importer with a translation API.  The Apache Tika API has been extended to provide the ability to translate a mix of document content or specific document fields.  The translation APIs supported out-of-the-box are Microsoft, Google, Lingo24, and Moses.

[/ezcol_1half]

[ezcol_1half_end]

<postParseHandlers>
  <splitter
      class="com.norconex.importer.handler.splitter.impl.TranslatorSplitter"
      api="microsoft">
    <clientId>YOUR_CLIENT_ID</clientId>
    <secretId>YOUR_SECRET_ID</secretId>
  </splitter>
</postParseHandlers>

 [/ezcol_1half_end]

Dynamic title creation

[ezcol_1half]

Too many documents do not have a valid title, when they have a title at all. What if you need a title to represent each document? Do you take the file name as the title? Not so nice. Do you take the document property called “title”? Not reliable. You now have a new option with the TitleGeneratorTagger. It will try to detect a decent title from your document. In cases where it can't, it offers a few alternate options. You always get something back.

[/ezcol_1half]

[ezcol_1half_end]

<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
          toField="generated_title"
          fallbackMaxLength="250"
          detectHeading="true"
          detectHeadingMinLength="10"
          detectHeadingMaxLength="500" />
</postParseHandlers>

 [/ezcol_1half_end]

Saving of parsing errors

[ezcol_1half]

A new top-level configuration option was introduced so that every file generating a parsing error gets saved in a location of your choice.  These files are saved together with the metadata obtained so far (if any) and the Java exception that was thrown. This is a great addition to help troubleshoot parsing failures.

[/ezcol_1half]

[ezcol_1half_end]

<importer>
  <parseErrorsSaveDir>/path/to/store/bad/files</parseErrorsSaveDir>
</importer>

 [/ezcol_1half_end]

Document parsing improvements

The content type detection accuracy and performance were improved with this release.  In addition, document parsing features the following additions and improvements:

  • Better PDF support, with the addition of PDF XFA (dynamic forms) text extraction and improved space detection (eliminating many space-stripping issues).  PDFs containing JBIG2 and JPEG 2000 images are now parsed properly.
  • New XFDL parser (PureEdge Extensible Forms Description Language), supporting both Gzipped/Base64-encoded and plain-text versions.
  • New, much improved WordPerfect parser that parses WordPerfect documents according to the WordPerfect file specifications.
  • New Quattro Pro parser that parses Quattro Pro documents according to the Quattro Pro file specifications.
  • The JBIG2 and JPEG 2000 image formats are now recognized.

You want more?

The list of changes and improvements doesn’t stop here.  Read the product release notes for a complete list of changes.

Unfamiliar with this product? No sweat — read this “Getting Started” page.

If not already out when you read this, the next feature releases of Norconex HTTP Collector and Norconex Filesystem Collector will both ship with this version of the Importer.  Can’t wait for those releases? You can manually upgrade the Norconex Importer library to take advantage of these new features in your favorite crawler.
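For Maven users, a manual upgrade typically means bumping the Importer dependency version. The coordinates below are an assumption based on Norconex’s usual naming and should be verified against Maven Central before use:

```xml
<!-- Assumed Maven coordinates for Norconex Importer 2.1.0; verify before use. -->
<dependency>
  <groupId>com.norconex.collectors</groupId>
  <artifactId>norconex-importer</artifactId>
  <version>2.1.0</version>
</dependency>
```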

Download Norconex Importer 2.1.0.

Release 1.6.0 of Norconex Commons Lang provides new Java utility classes and enhancements to existing ones:

New Classes

TimeIdGenerator

[ezcol_1half]

Use TimeIdGenerator when you need to generate numeric IDs that are unique within a JVM. It generates Java long values that are guaranteed to be in order (though they can have gaps), and it can generate up to one million unique IDs per millisecond. Read the Javadoc.

[/ezcol_1half]

[ezcol_1half_end]

long id = 0;

id = TimeIdGenerator.next();
System.out.println(id); // prints 1427256596604000000

id = TimeIdGenerator.last();
System.out.println(id); // prints 1427256596604000000

id = TimeIdGenerator.next();
System.out.println(id); // prints 1427256596604000001

[/ezcol_1half_end]

TextReader

[ezcol_1half]

A new class for reading large text, one chunk at a time, based on a specified maximum read size. When the text is too large, it tries to split it wisely at paragraph, sentence, or word boundaries (whichever is possible). Read the Javadoc.

[/ezcol_1half]

[ezcol_1half_end]

// Process maximum 500KB at a time
TextReader reader = new TextReader(originalReader, 500 * 1024);
String textChunk = null;
while ((textChunk = reader.readText()) != null) {
    // do something with textChunk
}
reader.close();

[/ezcol_1half_end]

ByteArrayOutputStream

[ezcol_1half]

An alternate version of the Java and Apache Commons ByteArrayOutputStream classes. Like the Apache version, it is faster than the Java ByteArrayOutputStream. In addition, it provides methods for obtaining any subset of the bytes written so far. Read the Javadoc.

[/ezcol_1half]

[ezcol_1half_end]

ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write("ABCDE".getBytes());        
out.write("FGHIJKLMNOPQRSTUVWXYZ".getBytes());        

byte[] b = new byte[10];
out.getBytes(b, 0);
System.out.println(new String(b)); // prints ABCDEFGHIJ
System.out.println((char) out.getByte(15)); // prints P

[/ezcol_1half_end]

Enhancements

IOUtil enhancements

The following utility methods were added to the IOUtil class:

Other improvements

Get your copy

Download Norconex Commons Lang 1.6.0.

You can also view the release notes for a complete list of changes.