
Norconex HTTP Collector 2.3.0

Norconex is proud to release version 2.3.0 of its Norconex HTTP Collector open-source web crawler.  Thanks to incredible community feedback and efforts, we have implemented several feature requests, and your favorite crawler is now more stable than ever. The following describes only a handful of these new features with a focus on XML configuration. Refer to the product release notes for a complete list of changes.

Restrict crawling to a specific site


Up until now, you could restrict crawling to a specific domain, protocol, and port using one or more reference filters (e.g., RegexReferenceFilter). Norconex HTTP Collector 2.3.0 features new configuration options to “stay on a site”, called stayOnProtocol, stayOnDomain, and stayOnPort.  These new settings can be applied to the <startURLs> tag of your XML configuration.  They are particularly useful when you have many “start URLs” defined and you do not want to create many reference filters to stay on those sites.



<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>http://mysite.com</url>
</startURLs>


Add HTTP request headers


GenericHttpClientFactory now allows you to set HTTP request headers on every HTTP call a crawler makes. This new feature can save the day for sites expecting certain header values to be present to render properly. For instance, some sites may rely on the "Accept-Language" request header to decide which language to pick when rendering a page.



<httpClientFactory>
  <headers>
    <header name="Accept-Language">fr</header>
    <header name="From">john@smith.com</header>
  </headers>
</httpClientFactory>


Specify a sitemap as a start URL


It is now possible to specify one or more sitemap URLs as "start URLs." This is in addition to the crawler attempting to detect sitemaps at standard locations. To use only the sitemap URLs you provide as start URLs, disable the sitemap discovery process by adding ignore="true" to <sitemapResolverFactory>, as shown in the code sample. To crawl only the pages listed in sitemap files, without following links found in those pages, remember to set <maxDepth> to zero, as in the combined sample below.



<startURLs>
  <sitemap>http://mysite.com/sitemap.xml</sitemap>
</startURLs>
<sitemapResolverFactory ignore="true" />

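Putting these together, a crawler that processes only the pages listed in a sitemap might combine the settings as follows (element ordering within the crawler configuration may vary):

<startURLs>
  <sitemap>http://mysite.com/sitemap.xml</sitemap>
</startURLs>
<maxDepth>0</maxDepth>
<sitemapResolverFactory ignore="true" />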

Basic URL normalization always performed


URL normalization is now in effect by default using GenericURLNormalizer. The following default normalization rules are applied:

  • Removing the URL fragment (the “#” character and everything after)
  • Converting the scheme and host to lower case
  • Capitalizing letters in escape sequences
  • Decoding percent-encoded unreserved characters
  • Removing the default port
  • Encoding non-URI characters

You can always override the default normalization settings, or turn off normalization altogether by adding the disabled="true" attribute to the <urlNormalizer> tag.



<urlNormalizer>
  <normalizations>
    lowerCaseSchemeHost, upperCaseEscapeSequence, removeDefaultPort, 
    removeDotSegments, removeDirectoryIndex, removeFragment, addWWW 
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
       <match>(&amp;type=)(summary)</match>
       <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>

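Conversely, if you prefer the pre-2.3.0 behavior, normalization can be turned off entirely with the attribute described above:

<urlNormalizer disabled="true" />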

Scripting Language and DOM navigation

We introduced additional features when we upgraded the Norconex Importer dependency to its latest version (2.4.0). You can now use scripting languages to insert your own document processing logic, or reference DOM elements of an XML or HTML file using a friendly syntax. Refer to the Importer 2.4.0 release announcement for more details.
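
As a brief illustration, here is a minimal Importer snippet assuming the new DOMTagger handler; the selector and field names are placeholders for your own values:

<importer>
  <preParseHandlers>
    <!-- Store the content matching a CSS-like selector into a metadata field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.title" toField="page.title" />
    </tagger>
  </preParseHandlers>
</importer>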

Useful links

There is so much more offered by this release. Use the following links to find out more about Norconex HTTP Collector.

Optical character recognition (OCR), content translation, title generation, and detection and text extraction from more file formats are among the new features now part of your favorite crawlers: Norconex HTTP Collector 2.1.0 and Norconex Filesystem Collector 2.1.0. Both are available now and can be downloaded for free. They both ship with and use the latest version of the Norconex Importer module, which is in large part responsible for many of these new features.

For more details and usage examples, check this article.

These two Collector releases also include bug fixes and stability improvements. We recommend that existing users upgrade.

Get your copy

Download Norconex HTTP Collector

Download Norconex Filesystem Collector

This tutorial will show you how to extend Norconex HTTP Collector using Java to create a link checker that ensures all URLs in your web pages are valid. The link checker will crawl your target site(s) and create a report file of bad URLs. It can be used with any existing HTTP Collector configuration (i.e., crawl a website to extract its content while simultaneously reporting on its broken links). If you are not already familiar with Norconex HTTP Collector, you can refer to our Getting Started guide.

The link checker we will create will record:

  • URLs that were not found (404 HTTP status code)
  • URLs that generated other invalid HTTP status codes
  • URLs that generated an error from the HTTP Collector

The links will be stored in a tab-delimited format, where the first row holds the column headers. The columns will be:

  • Referrer: the page containing the bad URL
  • Bad URL: the culprit
  • Cause: one of “Not found,” “Bad status,” or “Crawler error”
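
For illustration only, a generated report might look like this, with tab characters shown as <TAB> (the URLs are made up):

Referrer<TAB>Bad URL<TAB>Cause
http://mysite.com/index.html<TAB>http://mysite.com/missing-page.html<TAB>Not found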

One of the goals of this tutorial is to hopefully show you how easy it is to add your own code to the Norconex HTTP Collector. You can download the files used to create this tutorial at the bottom of this page. You can jump right there if you are already familiar with Norconex HTTP Collector. Otherwise, keep reading for more information.

Get your workspace setup

To perform this tutorial in your own environment, you have two main choices. If you are a seasoned Java developer and an Apache Maven enthusiast, you can create a new Maven project including Norconex HTTP Collector as a dependency. You can find the dependency information at the bottom of its download page.

If you want a simpler option, first download the latest version of Norconex HTTP Collector and unzip the file to a location of your choice. Then create a Java project in your favorite IDE. At this point, you will need to add all the JAR files found in the “lib” folder under your install location to your project classpath. To avoid copying compiled files manually every time you change them, you can change the compile output directory of your project to be the “classes” folder found under your install location. That way, the collector will automatically detect your compiled code when you start it.

You are now ready to code your link checker.

Listen to crawler events

There are several interfaces offered by the Norconex HTTP Collector that we could implement to achieve the functionality we seek. One of the easiest approaches in this case is probably to listen for crawler events. The collector provides an interface for this called ICrawlerEventListener. You can have any number of event listeners for your crawler, but we only need to create one. We can implement this interface with our link checking logic:

package com.norconex.blog.linkchecker;

// Imports below assume the 2.x package layout of Norconex Collector Core
// and Norconex HTTP Collector.
import java.io.FileWriter;
import java.io.IOException;

import com.norconex.collector.core.CollectorException;
import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;
import com.norconex.collector.http.data.HttpCrawlData;
import com.norconex.commons.lang.config.IXMLConfigurable;

public class LinkCheckerCrawlerEventListener 
        implements ICrawlerEventListener, IXMLConfigurable {

    private String outputFile;

    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        String type = event.getEventType();
        
        // Create new file on crawler start
        if (CrawlerEvent.CRAWLER_STARTED.equals(type)) {
            writeLine("Referrer", "Bad URL", "Cause", false);
            return;
        }

        // Only keep if a bad URL
        String cause = null;
        if (CrawlerEvent.REJECTED_NOTFOUND.equals(type)) {
            cause = "Not found";
        } else if (CrawlerEvent.REJECTED_BAD_STATUS.equals(type)) {
            cause = "Bad status";
        } else if (CrawlerEvent.REJECTED_ERROR.equals(type)) {
            cause = "Crawler error";
        } else {
            return;
        }

        // Write bad URL to file
        HttpCrawlData httpData = (HttpCrawlData) event.getCrawlData();
        writeLine(httpData.getReferrerReference(), 
                httpData.getReference(), cause, true);
    }

    private void writeLine(
            String referrer, String badURL, String cause, boolean append) {
        try (FileWriter out = new FileWriter(outputFile, append)) {
            out.write(referrer);
            out.write('\t');
            out.write(badURL);
            out.write('\t');
            out.write(cause);
            out.write('\n');
        } catch (IOException e) {
            throw new CollectorException("Cannot write bad link to file.", e);
        }
    }

    // More code exists: download source files
}

As you can see, the previous code focuses only on the crawler events we are interested in and stores URL information associated with these events. We do not have to worry about other aspects of web crawling in that implementation. The above code is all the Java we need to write for our link checker.

Configure your crawler

If you have not seen a Norconex HTTP Collector configuration file before, you can find sample ones for download, along with all options available, on the product configuration page.

This is how we reference the link checker we created:

<crawlerListeners>
  <listener class="com.norconex.blog.linkchecker.LinkCheckerCrawlerEventListener">
    <outputFile>${workdir}/badlinks.tsv</outputFile>
  </listener>
</crawlerListeners>

By default, the Norconex HTTP Collector does not keep track of referring pages with every URL it extracts (to minimize information storage and increase performance). Because having a broken URL without knowing which page holds it is not very useful, we want to keep these referring pages. Luckily, this is just a flag to enable on an existing class:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor"
     keepReferrerData="true" />
</linkExtractors>

In addition to these configuration settings, you will want to apply more options, such as restricting your link checker's scope to your own site or a specific sub-section of your site, as sketched below. Use the configuration file sample at the bottom of this page as your starting point and modify it according to your needs.
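
For example, the "stay on site" options described at the top of this page offer a simple way to keep the link checker within your own domain (the URL below is a placeholder):

<startURLs stayOnDomain="true" stayOnProtocol="true">
  <url>http://mysite.com/</url>
</startURLs>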

You are ready

Once you have your configuration file ready and the compiled Link Checker listener in place, you can give it a try (replace .bat with .sh on *nix platforms):

collector-http.bat -a start -c path/to/your/config.xml

The bad link report file will be written at the location you specified above.

Source files

Download the source files used to create this article

Despite all the “noise” on social media sites, we can’t deny how valuable information found on social media networks can be for some organizations. Somewhat less obvious is how to harvest that information for your own use. You can find many posts online asking about the best ways to crawl this or that social media service: Shall I write a custom web scraper? Should I purchase a product to do so?

This article will show you how to crawl Facebook posts.

In large environments, it's common to have many crawlers running at once, or at scheduled intervals, to keep your collected content up to date. For example, this is a typical requirement of search engine installations: they need their internal indices updated frequently to keep their search results relevant.

Keeping track of individual crawler executions can be challenging. How many are currently running? For how long? Have any of them failed? Sure, you can log in to the servers where these crawlers are running to get valuable insights: your operating system can list running processes, and you can analyze each crawler's logs. But what if your supervisor or a non-technical person wants to know the current crawl status? You can quickly become a bottleneck.

This approach is not ideal, to say the least.

Luckily, Norconex Collectors were designed to take advantage of the Norconex JEF (Job Execution Framework) library. As a result, all the Norconex Collector crawlers you have defined are just waiting to be monitored by Norconex JEF Monitor, a web-based progress and status monitoring application. Best of all, you do not need to change anything in your crawler configurations to get this monitoring.

 JEF Monitor Main Screen

If you already have a JEF Monitor installation up and running, feel free to scroll down to skip the JEF Monitor installation.

Install JEF Monitor

Download the latest stable copy of JEF Monitor (4.0 as of this writing).   Decompress the obtained zip file in a directory of your choice, on the same server where one or more Norconex Collectors are installed.

This will create the following files and directory structure:

norconex-jef-monitor-4.0.0/
     apidocs/
     classes/
     config/
     lib/
     third-party/
     jef-monitor.bat
     jef-monitor.sh
     LICENSE.TXT
     NOTICE.TXT

To start JEF Monitor, execute jef-monitor.bat or jef-monitor.sh, depending on whether you are on a Windows or *nix environment. Open your favorite browser, and access JEF Monitor using this URL:

               http://localhost:8080/

Replace localhost with the proper server name if your browser was not started from the same server where you installed JEF Monitor.

With version 4.0, the default port is 8080. To change that port or to have JEF Monitor accessible via https only, modify the config/setup.properties file accordingly before starting JEF Monitor.

First-time configuration

The first time JEF Monitor is accessed, you have to go through a few initial configuration screens:

JEF Monitor Introduction Screen

Hit “Let’s Go!”

JEF Monitor Installation Name

You can have several JEF Monitor installations. Any installation can report on other installations to give you a unified view of all your jobs (in this case, crawler jobs). For this reason, you need to give this installation a unique name. It can be anything you like.

This tutorial will pretend we are only monitoring crawlers found on a dedicated server. We’ll call this installation “Crawler Server”.

Norconex Collector Jobs to Monitor

This is where we tell JEF Monitor where our crawlers are running. For JEF, a Norconex Collector and its configured crawlers are treated as "jobs." When running, each configured Norconex Collector creates an .index file in a subdirectory of the collector progress directory called "latest". A collector progress directory can be configured using the <progressDir> configuration option.
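
For reference, here is a minimal sketch of where <progressDir> fits in a collector configuration; the collector id and path are placeholders:

<httpcollector id="Wikipedia Crawl">
  <progressDir>/opt/collectors/wikipedia/progress</progressDir>
  <!-- crawler definitions and other settings omitted -->
</httpcollector>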

We need to tell JEF Monitor about your Collector jobs. Click on Add Files…

In this tutorial, we’re pretending we have an HTTP Collector set up to crawl Wikipedia. We called it “Wikipedia Crawl” with two crawlers: “Wikipedia English” and “Wikipedia French” (to be shown in JEF Monitor later).

The index file can be found in this location:

[…]/wikipedia/progress/latest/Wikipedia_32_Crawl.index

Select your own index file, and click the “Choose” button.

You should see your selection in the list of jobs to monitor. If you have more than one Norconex Collector installation you want to monitor, repeat the exercise. Alternatively, if you have multiple progress files in a directory, have sub-directories, or have not yet executed your Norconex Collector installation, you can add a directory to be monitored.   Index files found under the selected directories will show up when they get created.

When you are done, click “Continue”.

With each JEF “job” being monitored, you can optionally perform “actions.”   With the default installation of JEF Monitor, two actions for viewing the logs in your browser are available and already configured.   Leave those there, and click “Continue”.

Happy Monitoring!

Launch your Norconex Collector as you normally do, and you should eventually see its progress automatically updated.

To monitor additional Norconex Collector installations, click on “Monitored Jobs” under the “Settings” menu. You will then be presented with the now familiar “Jobs to monitor” screen (similar to the one higher up).

More options are available in JEF Monitor, such as tracking remote JEF Monitor installations from this one.

Experiment and have fun.

GATINEAU, QC, CANADA – Monday, December 1, 2014 – Norconex announces the launch of its Google Search Appliance (GSA) Committer module for its Norconex Collectors Crawler Suite. Enterprise search developers and enthusiasts now have a flexible and extensible option for feeding documents to their GSA infrastructure. GSA becomes a target repository for documents crawled by Norconex HTTP Collector, Norconex Filesystem Collector, and any future Collector released by Norconex. These Collectors can reside on any server (like remote filesystems) and send discovered documents across the network to a GSA installation. The GSA Committer is the latest addition to the growing list of Committers already available to Norconex Collector users: Apache Solr, Elasticsearch, HP IDOL, and Lucidworks.

“The increasing popularity of our universal crawlers motivates us to provide support for more search engines. Search engines come and go in an organization, but your investment in your crawling infrastructure can be protected by having re-usable crawler setups that can outlast any search engine installation,” said Norconex President Pascal Essiembre.

GSA Committer Availability

GSA Committer is part of Norconex’s commitment to delivering quality open-source products backed by community or commercial support. GSA Committer is available for immediate download at /collectors/committer-gsa.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help process and analyze structured and unstructured data.

For more information on GSA Committer:

GSA Committer Website: /collectors/committer-gsa
Norconex Collectors: /collectors
Email: info@norconex.com

Norconex just released major upgrades to all its Norconex Collectors and related projects. That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available Committers (Solr, Elasticsearch, HP IDOL, etc.), were all upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website. At a quick glance, you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

  • Can now split a document into multiple documents.
  • Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).
  • Language detection (50+ languages).
  • Parsing and formatting of dates from/to any format.
  • Character case modifiers.
  • Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc.).
  • Can now supply a “seed file” for listing start URLs or start paths to your crawler.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used.  This reduces I/O and improves performance.
  • New event model where listeners can listen for any type of crawler events.
  • Can now ignore parsing of specific content types.
  • Can filter documents based on arbitrary regular expressions performed on the document content.
  • Enhanced debugging options, where you can print out specific field content as it is being processed.
  • HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).
  • More…

Another significant change: all Norconex open-source projects are now licensed under the Apache License 2.0. We hope this will facilitate adoption alongside third-party commercial offerings.

It is important to note that the 2.0.0 releases are not compatible with their previous 1.x versions. The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0. Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

GATINEAU, QC, CANADA – Thursday, August 25, 2014 – Norconex is announcing the launch of Norconex Filesystem Collector, providing organizations with a free "universal" filesystem crawler. The Norconex Filesystem Collector enables document indexing into target repositories of choice, such as enterprise search engines.

Following on the success of the Norconex HTTP Collector web crawler, Norconex Filesystem Collector is the second open-source crawler contribution to the Norconex "Collector" suite. Norconex believes this crawler allows customers to adopt a full-featured, enterprise-class local or remote filesystem crawling solution that outlasts their enterprise search solution or other data repository.

“This not only facilitates any future migrations but also allows customers to add their own ETL logic into a very flexible crawling architecture, whether using Autonomy, Solr/LucidWorks, ElasticSearch, or any other data repository,” said Norconex President Pascal Essiembre.

Norconex Filesystem Collector Availability

Norconex Filesystem Collector is part of Norconex’s commitment to deliver quality open-source products, backed by community or commercial support. Norconex Filesystem Collector is available for immediate download at /collectors/collector-filesystem/download.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help process and analyze structured and unstructured data.

For more information on Norconex Filesystem Collector:

Website: /collectors/collector-filesystem

Email: info@norconex.com

###

 

2013/06/05

Say hello to Norconex HTTP Collector! At Norconex, we have always recognized the value open-source brings to software development and, to a greater extent, the world. It benefits us when building custom solutions for our customers and ourselves. As long-time consumers of open-source software, we feel it is time for us to give back.

As a result, Norconex is proud to announce the open-sourcing of a handful of its libraries and products, so that the community can save time and money like it did for us. The Norconex HTTP Collector is an HTTP crawler meant to give the greatest flexibility possible to developers and integrators.