Open-Source – Page 4 – Norconex Inc

Norconex just released major upgrades to all its Norconex Collectors and related projects. That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available committers (Solr, Elasticsearch, HP IDOL, etc), were all upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website. At a quick glance you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

Can now split a document into multiple documents.

Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).

Language detection (50+ languages).

Parsing and formatting of dates from/to any format.

Character case modifiers.

Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc.).

Can now supply a “seed file” for listing start URLs or start paths to your crawler.

Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used. This reduces I/O and improves performance.

New event model where listeners can listen for any type of crawler events.

Can now ignore parsing of specific content types.

Can filter documents based on arbitrary regular expressions performed on the document content.

Enhanced debugging options, where you can print out the content of specific fields as they are being processed.

HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).

More…
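
To give a feel for the date parsing and formatting item above, here is a minimal sketch in plain Java of converting a date string from one pattern to another. It is not the Importer's own code: the class name DateReformatSketch and the reformat method are made up for this illustration, and it only assumes the standard java.time API.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical illustration only, not part of Norconex Importer.
class DateReformatSketch {

    // Parses a date using one pattern and writes it back using another.
    static String reformat(String value, String fromPattern, String toPattern) {
        DateTimeFormatter from =
                DateTimeFormatter.ofPattern(fromPattern, Locale.ENGLISH);
        DateTimeFormatter to =
                DateTimeFormatter.ofPattern(toPattern, Locale.ENGLISH);
        return LocalDate.parse(value, from).format(to);
    }

    public static void main(String[] args) {
        // Prints: 25/08/2014
        System.out.println(reformat("2014-08-25", "yyyy-MM-dd", "dd/MM/yyyy"));
    }
}
```

The sketch handles dates without a time component only; values carrying a time or a time zone would need LocalDateTime or ZonedDateTime instead.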

Another significant change is that all Norconex open-source projects are now licensed under The Apache License 2.0. We hope this will facilitate adoption with third-party commercial offerings.

It is important to note that the 2.0.0 versions are not compatible with their previous 1.x versions. The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0. Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

This feature release brings the following additions…

Simple Pipeline

Useful if you want to quickly assemble multiple tasks to be run into a single “pipeline” while keeping it ultra simple. The following example does it all in a single class only to keep it short.

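import com.norconex.commons.lang.pipeline.IPipelineStage;
import com.norconex.commons.lang.pipeline.Pipeline;

public class MyPipeline extends Pipeline<String> {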

    public MyPipeline() {
        addStage(new MyTask1());
        addStage(new MyTask2());
    }
    
    // First stage: prints the context it receives
    private class MyTask1 implements IPipelineStage<String> {
        @Override
        public boolean execute(String context) {
            System.out.println("Task 1 executed: " + context);
            return true;
        }
    }  

    // Second stage: prints the context it receives
    private class MyTask2 implements IPipelineStage<String> {
        @Override
        public boolean execute(String context) {
            System.out.println("Task 2 executed: " + context);
            return true;
        }
    }  
    
    public static void main(String[] args) {
        new MyPipeline().execute("hello");
        
        // Will print out:
        //     Task 1 executed: hello
        //     Task 2 executed: hello
    }
}

Cacheable Streams

There are several excellent object caching mechanisms available to Java already if you need something sophisticated. This release offers a very lightweight cache implementation that can make InputStream and OutputStream reusable. It stores the stream content in memory until a configurable threshold is reached, after which it switches to fast file lookup. A CachedStreamFactory is used to obtain cached streams sharing the same pool of memory.

        int size10mb = 10 * 1024 * 1024;
        int size1mb  = 1024 * 1024;
        InputStream is = null; // <-- your original input stream
        OutputStream os = null; // <-- your original output stream
        
        CachedStreamFactory streamFactory = new CachedStreamFactory(size10mb, size1mb);
        
        //--- Reuse the input stream ---
        CachedInputStream cachedInput = streamFactory.newInputStream(is);
        
        // Read the input stream the first time
        System.out.println(IOUtils.toString(cachedInput));
        // Read the input stream a second time
        System.out.println(IOUtils.toString(cachedInput));
        
        // Release the cached data, preventing further reuse
        cachedInput.dispose();

        //--- Reuse the output stream ---
        CachedOutputStream cachedOutput = streamFactory.newOuputStream(os);
        
        IOUtils.write("lots of data", cachedOutput);
        
        // Obtain a new input stream from the output
        CachedInputStream newInputStream = cachedOutput.getInputStream();
        
        // Do what you want with this input stream
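
If you are curious about the idea behind the threshold, the following is a minimal sketch in plain Java of a buffer that keeps content in memory until a maximum size is reached and then moves it to a temporary file, while still allowing the content to be read more than once. It is only an illustration of the general technique, not how CachedStreamFactory is implemented: the SpillingBufferSketch class and all its methods are hypothetical, and unlike the factory it does not share a memory pool between streams.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only, not the Norconex Commons Lang implementation.
class SpillingBufferSketch {

    private final int maxMemoryBytes;
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path file; // stays null for as long as the content fits in memory

    SpillingBufferSketch(int maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    // Appends data, moving everything to a temp file once the threshold is crossed.
    void write(byte[] data) {
        try {
            if (file == null && memory.size() + data.length > maxMemoryBytes) {
                file = Files.createTempFile("spill-sketch", ".tmp");
                Files.write(file, memory.toByteArray());
                memory.reset();
            }
            if (file != null) {
                Files.write(file, data, StandardOpenOption.APPEND);
            } else {
                memory.write(data, 0, data.length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the full content; can be called as many times as needed.
    byte[] read() {
        try {
            return file != null ? Files.readAllBytes(file) : memory.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isInMemory() {
        return file == null;
    }

    // Releases the buffered data and deletes the temp file, if any.
    void dispose() {
        try {
            memory.reset();
            if (file != null) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SpillingBufferSketch buffer = new SpillingBufferSketch(8);
        buffer.write("lots of data".getBytes(StandardCharsets.UTF_8));
        // 12 bytes exceed the 8-byte threshold, so the content is on disk
        System.out.println("In memory: " + buffer.isInMemory());
        System.out.println(new String(buffer.read(), StandardCharsets.UTF_8));
        System.out.println(new String(buffer.read(), StandardCharsets.UTF_8));
        buffer.dispose();
    }
}
```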

Enhanced XML Writing

The Java XMLStreamWriter is a useful class, but it is a bit annoying to use when you are not always writing strings. The EnhancedXMLStreamWriter adds convenience methods for primitive types and others.

        Writer out = null; // <-- your target writer
        
        EnhancedXMLStreamWriter xml = new EnhancedXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("item");
        
        xml.writeElementInteger("quantity", 23);
        
        xml.writeElementString("name", "something");
        
        xml.writeStartElement("size");
        xml.writeAttributeInteger("height", 24);
        xml.writeAttributeInteger("width", 26);
        xml.writeEndElement();

        xml.writeElementBoolean("sealwrapped", true);

        xml.writeEndElement();
        xml.writeEndDocument();
        
        /* Will write:
          
          <?xml version="1.0" encoding="UTF-8"?>
          <item>
              <quantity>23</quantity>
              <name>something</name>
              <size height="24" width="26" />
              <sealwrapped>true</sealwrapped>
          </item>
         */

More Equality Checks

More methods were added to EqualsUtil:

        EqualsUtil.equalsAnyIgnoreCase("toMatch", "candidate1", "candidate2");
        EqualsUtil.equalsAllIgnoreCase("toMatch", "candidate1", "candidate2");
        EqualsUtil.equalsNoneIgnoreCase("toMatch", "candidate1", "candidate2");

Discover More Features

A few more features and updates were made to the Norconex Commons Lang library. For more information, check out the full release notes.

Download your copy now.

GATINEAU, QC, CANADA – Thursday, August 25, 2014 – Norconex is announcing the launch of Norconex Filesystem Collector, providing organizations with a free “universal” filesystem crawler. The Norconex Filesystem Collector enables document indexing into target repositories of choice, such as enterprise search engines.

Following the success of the Norconex HTTP Collector web crawler, Norconex Filesystem Collector is the second open-source crawler contribution to the Norconex “Collector” suite. Norconex believes this crawler allows customers to adopt a full-featured enterprise-class local or remote file system crawling solution that outlasts their enterprise search solution or other data repository.

“This not only facilitates any future migrations but also allows customer addition of their own ETL logic into a very flexible crawling architecture, whether using Autonomy, Solr/LucidWorks, ElasticSearch, or any other data repository,” said Norconex President Pascal Essiembre.

Norconex Filesystem Collector Availability

Norconex Filesystem Collector is part of Norconex’s commitment to deliver quality open-source products, backed by community or commercial support. Norconex Filesystem Collector is available for immediate download at /collectors/collector-filesystem/download.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help with the processing and analyzing of structured and unstructured data.

For more information on Norconex Filesystem Collector:

Website: /collectors/collector-filesystem

Email: info@norconex.com

###

Release 1.3.0 of Norconex Importer is now available. Release overview:

Now stores the content “family” for each document as “importer.contentFamily”.
New SplitTagger: splits values into multiple values using a separator of choice.
New CopyTagger: copies document metadata fields to other fields.
New HierarchyTagger: splits a field string into multiple segments representing each node of a hierarchical branch.
ReplaceTagger now supports regular expressions.
Improved MIME type detection.
More…
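
To illustrate what splitting a value into hierarchical nodes can look like, here is a minimal sketch in plain Java. It reflects our reading of the HierarchyTagger description above and is not the tagger's actual code or configuration: the HierarchySplitSketch class, the split method, and the sample value are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not the HierarchyTagger implementation.
class HierarchySplitSketch {

    // Returns one entry per hierarchy level, each including its ancestors.
    static List<String> split(String value, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> nodes = new ArrayList<>();
        int from = 0;
        while (true) {
            int idx = value.indexOf(separator, from);
            // Everything before the current separator is one node
            String node = idx < 0 ? value : value.substring(0, idx);
            if (!node.isEmpty()) {
                nodes.add(node);
            }
            if (idx < 0) {
                break;
            }
            from = idx + separator.length();
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Prints: [/vegetable, /vegetable/potato, /vegetable/potato/sweet]
        System.out.println(split("/vegetable/potato/sweet", "/"));
    }
}
```

Storing every level this way is what typically makes it possible to filter or facet on any node of the branch, not just the full path.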

Download it now.

Web site: /collectors/importer/

During the development of our latest product, Norconex Content Analytics, we decided to add facets to the search interface. They allow for exploring the indexed content easily. Solr and Elasticsearch both have facet implementations that work on top of Lucene. But Lucene also offers simple facet implementations that can be picked out of the box. And because Norconex Content Analytics is based on Lucene, we decided to go with those implementations.

We’ll look at those facet implementations in this blog post, but first, let’s talk about a new feature of Lucene 4 that is used by all of them.

(more…)

Norconex Commons Lang 1.4.0 was just released.

New features:

New DataUnit class to perform data unit (KB, MB, GB, etc.) conversions much like the Java TimeUnit class.
New DataUnitFormatter to format any data unit to a human-readable format, taking into account locale and decimals.
New percentage formatter.
New ContentType class to represent a file media/MIME type and obtain its usual name, content family, and file extension(s).
New ContentFamily class to represent a group of files of similar content types. Useful for content categorization.
New ObservableMap class.
More…
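
For readers unfamiliar with the TimeUnit style of conversion that DataUnit is modeled on, here is a minimal sketch in plain Java. It is not the DataUnit class itself: the DataUnitSketch enum and its convert method are hypothetical, and the sketch assumes 1024-based units and ignores arithmetic overflow.

```java
// Hypothetical illustration only, not the Norconex Commons Lang DataUnit class.
enum DataUnitSketch {
    B(1L),
    KB(1024L),
    MB(1024L * 1024),
    GB(1024L * 1024 * 1024);

    private final long bytes;

    DataUnitSketch(long bytes) {
        this.bytes = bytes;
    }

    // Converts an amount expressed in sourceUnit to this unit, truncating
    // any remainder, in the same spirit as TimeUnit.convert(long, TimeUnit).
    long convert(long amount, DataUnitSketch sourceUnit) {
        return amount * sourceUnit.bytes / this.bytes;
    }

    public static void main(String[] args) {
        System.out.println(GB.convert(2048, MB)); // prints 2
        System.out.println(KB.convert(3, MB));    // prints 3072
        System.out.println(KB.convert(1500, B));  // prints 1 (truncated)
    }
}
```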

Download it now.

Web site: /product/commons-lang/

Release 1.3 of Norconex HTTP Collector is now available. Among new features added to our open-source web crawler, you can expect the following:

Now supports NTLM authentication. Experimental support added for SPNEGO and Kerberos.
Document checksums are added to each document metadata.
Refactoring of HTTPClient creation with many new configuration options added (connection timeout, charset, maximum redirects, and several more).
Can now optionally trust all SSL certificates.
Integrates new features of Norconex Importer 1.2.0 such as support for WordPerfect document parsing, new filters and transformers, etc.
Integrates new features of Norconex Committer 1.2.0 such as defining multiple committers, retrying upon commit failure, etc.
Other third-party library upgrades.

Download it now!

Norconex Importer 1.2.0 was just released along with a new website for it.

New features:

Now supports text extraction from WordPerfect documents.
New transformer to reduce consecutive instances of the same string to only one instance.
New transformer to perform search and replace on document content using regular expressions.
New filter to exclude/include documents with no data for one or more specified metadata properties.
Now attempts to detect the character encoding from a character stream by looking at a Content-Type metadata. If none is present, defaults to UTF-8.
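
As a rough idea of what reducing consecutive instances of a string means in practice, here is a minimal sketch in plain Java using a regular expression. It is not the transformer's code or configuration: the ReduceConsecutivesSketch class and the reduce method are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only, not the Norconex Importer transformer.
class ReduceConsecutivesSketch {

    // Collapses two or more back-to-back occurrences of "repeated" into one.
    static String reduce(String content, String repeated) {
        if (repeated.isEmpty()) {
            return content;
        }
        // Quote the string so regex metacharacters are matched literally
        String regex = "(?:" + Pattern.quote(repeated) + "){2,}";
        return content.replaceAll(regex, Matcher.quoteReplacement(repeated));
    }

    public static void main(String[] args) {
        // Prints: a b c
        System.out.println(reduce("a  b     c", " "));
        // Prints: hello world
        System.out.println(reduce("hellohellohello world", "hello"));
    }
}
```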

Download it now!

Web site: /collectors/importer/

Norconex Committer and all its current concrete implementations (Solr, Elasticsearch, IDOL) have been upgraded, and their web sites have been redesigned. Committers are libraries responsible for posting data to various repositories (typically search engines). They are used in other products or projects, such as Norconex HTTP Collector. (more…)

Norconex Commons Lang 1.3.0 was just released along with a new website for it.

New features:

New YearMonthDay class for a local date without time.
New YearMonthDayInterval class for a local date range without time.
New FileMonitor and IFileChangeListener to be notified of file changes.
New methods on FileUtil to visit empty directories or delete empty directories older than a date.

Grab it while it is still warm!

Web site: /product/commons-lang/

Happy coding!