How to crawl Facebook

Posted on February 5, 2015 by Pascal Essiembre in Latest Articles

Despite all the “noise” on social media sites, we can’t deny how valuable information found on social media networks can be for some organizations. Somewhat less obvious is how to harvest that information for your own use. You can find many posts online asking about the best ways to crawl this or that social media service: Shall I write a custom web scraper? Should I purchase a product to do so?

This article will show you how to crawl Facebook posts using the Java-based, open-source crawler Norconex HTTP Collector. The same approach can be applied to crawl other social media sites such as Twitter or Google+. This article also serves as a tutorial on extending the Norconex HTTP Collector. At the bottom of this article, you can download the complete and fully functional files used to create the examples below.

Web Site vs API

When thinking about a web crawler, the first thing that comes to mind is crawling websites – that is, extracting all HTML pages from a site along with URLs to other pages that will also be extracted, as well as other referenced multimedia files (images, video, PDF, etc.). While this process is definitely a common usage, alternative crawling approaches are often more appropriate. For instance, whenever dealing with websites exposing structured or normalized “records” in HTML web pages, you will usually benefit from being able to obtain that data in a raw format instead. Crawling regular HTML web pages holding this information will generate a lot of noise you do not want (data labels, header, footer, side navigation, etc.). A typical solution to this problem is to extract the raw information from the page, stripping all markup and labels you do not want, using patterns you can identify in the page source. This is a flaky solution and should be avoided if possible: you do not want a simple UI change to break your crawler parsing logic. Luckily, many popular data-oriented websites offer HTTP APIs to access their content, free of any UI clutter. Such is the case with the free Facebook Graph API.

Facebook Graph API

Facebook offers a secure HTTP-based API. It allows developers to query public posts of specific users or organizations via authenticated HTTP calls.

To get started, you need a personal Facebook account. With that account, access the Facebook Developers website. From that site you will find, in the Tools & Support menu at the top, a very useful application called “Graph API Explorer”. Using this application is a great way to familiarize yourself with the Facebook Graph API before moving further. Extensive Graph API documentation is also available on this site.

Graph API Menu Option

It is important to note that, as of this writing, the Facebook Graph API simply will not work with Facebook business accounts. You can use the API for your business, but you’ll have to create a personal account for this purpose.

Norconex HTTP Collector

If you do not already have a copy, download the Norconex HTTP Collector. If you are a first-time user, you can familiarize yourself with the Collector by first running the sample crawler configurations as described on the product’s Getting Started page.

In order to factor in logic specific to the Facebook Graph API, we will cover appropriate configuration settings and provide custom implementations of certain crawler features. Java classes you write must end up in the “classes” folder of the Norconex HTTP Collector to be automatically picked up. You can find more information on extending the HTTP Collector here. You can always refer to the HTTP Collector Configuration page for additional information on what we’ll cover here.

Start URL

Let’s first establish what we want to crawl. For the purpose of this exercise, we will crawl Disney Facebook posts. Since we are using the Facebook Graph API, the URL obviously must be a valid Graph API URL. Obtaining this URL is fairly easy once you’ve familiarized yourself with the Graph API Explorer mentioned earlier. The URL can be found in the red rectangle in the image below. The reference to “disney” in the URL is the same reference you will see when accessing the Disney Facebook page on the web, ignoring character case: https://www.facebook.com/Disney.

Obtaining a Graph API URL

Let’s cut and paste that URL and make it the crawler “start URL”, prefixed with the Graph API base URL, as shown below.
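As an illustration (the Graph API version in the URL is an assumption; use whichever version is current), the start URL entry in the configuration could look like this:

    <startURLs>
      <url>https://graph.facebook.com/v2.2/disney/posts?limit=5</url>
    </startURLs>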

The above is the simplest URL we can create. In a “real-world” scenario, you probably want to get more than 5 posts at a time while also adding parameters to that URL such as explicitly listing fields to return (to limit your bandwidth usage and increase crawl performance).

Authentication

Using a plain URL like the one above won’t work without a little extra effort. Calls to the Facebook Graph API must be authenticated: each Graph URL you invoke must contain a valid access token. See the above image, which contains a sample access token. Unfortunately, tokens eventually expire, so we can’t simply use the one from the Graph API Explorer in our crawler. Because we do not want to manually provide a new token every time the crawler runs, a solution is to let the crawler fetch a new access token each time it runs, before attempting to download any posts.

Luckily, the Facebook Graph API offers an easy way to get a fresh access token in exchange for a permanent “app id” and “app secret”. Facebook requires that you first create a Facebook application to obtain these. There are several tutorials online to help you with this step. Here is one from Facebook itself: https://developers.facebook.com/docs/opengraph/getting-started. You only need to perform the steps described in the “Create a Facebook App ID” section of the tutorial. Once you have obtained the app id and app secret, store them safely; we’ll use them shortly.

Because the access token must be appended to every Graph API URL we invoke, we must modify the way the HTTP Collector invokes URLs – that’s the responsibility of the IHttpDocumentFetcher interface implementation. The default implementation is GenericDocumentFetcher, which already does a good job of downloading web content. We will reuse its existing logic by extending it, adding only the logic for appending the access token.
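The complete fetcher ships with the downloadable sources. As a self-contained sketch of the token logic only (the class and method names below are invented for illustration and omit the Norconex-specific fetcher plumbing), here is how a fresh application access token can be obtained and appended to each Graph URL:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    // Hypothetical helper illustrating the token logic a
    // GenericDocumentFetcher subclass would wrap.
    public class FacebookTokenHelper {

        private final String appId;
        private final String appSecret;
        private String accessToken;

        public FacebookTokenHelper(String appId, String appSecret) {
            this.appId = appId;
            this.appSecret = appSecret;
        }

        // Fetches a fresh application access token, once per crawler run.
        public synchronized String getAccessToken() throws IOException {
            if (accessToken == null) {
                URL url = new URL("https://graph.facebook.com/oauth/access_token"
                        + "?client_id=" + appId
                        + "&client_secret=" + appSecret
                        + "&grant_type=client_credentials");
                try (InputStream in = url.openStream();
                        Scanner s = new Scanner(in, StandardCharsets.UTF_8.name())
                                .useDelimiter("\\A")) {
                    String body = s.hasNext() ? s.next() : "";
                    // Older Graph API versions return "access_token=XYZ";
                    // newer ones return JSON: {"access_token":"XYZ",...}
                    accessToken = body.startsWith("{")
                            ? body.replaceAll(
                                    ".*\"access_token\"\\s*:\\s*\"([^\"]+)\".*", "$1")
                            : body.replaceFirst("^access_token=", "");
                }
            }
            return accessToken;
        }

        // Appends the access token to a Graph API URL before it is fetched.
        public String appendToken(String graphUrl) throws IOException {
            return graphUrl + (graphUrl.contains("?") ? "&" : "?")
                    + "access_token=" + getAccessToken();
        }
    }

The token endpoint and its client_id/client_secret/grant_type parameters are the standard Graph API application-token exchange; the sketch handles both response forms because the format changed over Graph API versions.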

You will notice an implementation of the loadFromXML method is provided so that the app id and app secret can be specified in your configuration file, as shown below.
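The element names below are illustrative (the exact tags come with the downloadable configuration file); the point is simply that the fetcher class exposes the two credentials:

    <documentFetcher class="com.norconex.blog.facebook.crawler.FacebookDocumentFetcher">
      <appId>YOUR_APP_ID</appId>
      <appSecret>YOUR_APP_SECRET</appSecret>
    </documentFetcher>

The package name matches the other classes from this article’s sources; the FacebookDocumentFetcher class name and the appId/appSecret tags are assumptions.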

Extracting links to follow

The URL sample we used only returns 5 posts. To retrieve more, you can simply increase the limit. To retrieve many more, you will hit a Graph API limit that forces you to use the Graph API “paging” feature. Paging URLs are already part of the JSON response containing a batch of posts.

To extract just the URL, we do not need to parse the whole response. A simple regular expression can do the job, applied in an implementation of the ILinkExtractor interface.
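The complete FacebookLinkExtractor is in the downloadable sources. As a sketch of the core idea only (the class and method names here are illustrative), a regular expression can pull the “next” paging URL out of the raw JSON:

    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative extraction of Graph API paging URLs from a JSON response.
    public final class PagingUrlExtractor {

        // Matches: "paging": { ..., "next": "https://graph.facebook.com/..." }
        private static final Pattern NEXT_URL =
                Pattern.compile("\"next\"\\s*:\\s*\"([^\"]+)\"");

        public static Set<String> extractPagingUrls(String json) {
            Set<String> urls = new LinkedHashSet<String>();
            Matcher m = NEXT_URL.matcher(json);
            while (m.find()) {
                // Unescape JSON-escaped forward slashes (\/) in the URL.
                urls.add(m.group(1).replace("\\/", "/"));
            }
            return urls;
        }
    }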

Extracted links will be followed just as regular href links are followed by crawlers in HTML pages. This code can be modified to extract additional types of URLs found in the JSON response, such as image URLs or the URL of a page described by the post. In your configuration, the extractor is referenced as shown below.
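The extractor class name is the one from this article’s sources (it also appears in the crawler logs); the enclosing element name is an assumption:

    <linkExtractors>
      <extractor class="com.norconex.blog.facebook.crawler.FacebookLinkExtractor" />
    </linkExtractors>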

Breaking posts into individual documents

Based on the “limit” you provide in your Graph API URL, you will often end up with many posts in a single JSON response. Chances are you want to treat these posts as individual documents in your own solution. The HTTP Collector has the capability to split documents into smaller ones, which is achieved by an IDocumentSplitter interface implementation from the Importer module. Instead of implementing one directly, we will extend the AbstractDocumentSplitter abstract class.

In our implementation, we use a JSON parser to hand-pick each field we want to keep in each of our “child” documents, and we reference that implementation in the importer section of the configuration, as shown after the sketch below.
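As a sketch of the hand-picking logic (using Gson, the JSON library the sample sources depend on; the fields kept here are a plausible subset of Graph API post fields, not necessarily the exact ones from the sources):

    import java.util.ArrayList;
    import java.util.List;

    import com.google.gson.JsonArray;
    import com.google.gson.JsonElement;
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;

    // Illustrative splitting of a Graph API response into one object per post.
    public final class PostSplitterSketch {

        public static List<JsonObject> splitPosts(String json) {
            List<JsonObject> posts = new ArrayList<JsonObject>();
            JsonObject root = new JsonParser().parse(json).getAsJsonObject();
            JsonArray data = root.getAsJsonArray("data");
            if (data == null) {
                return posts; // not a batch of posts
            }
            for (JsonElement element : data) {
                JsonObject post = element.getAsJsonObject();
                JsonObject child = new JsonObject();
                // Hand-pick the fields to keep in each "child" document.
                for (String field : new String[] {"id", "message", "created_time"}) {
                    if (post.has(field)) {
                        child.add(field, post.get(field));
                    }
                }
                posts.add(child);
            }
            return posts;
        }
    }

In the importer section of the configuration, the splitter is then referenced by its class name from this article’s sources (the surrounding element names are shown here as an assumption):

    <importer>
      <preParseHandlers>
        <splitter class="com.norconex.blog.facebook.crawler.FacebookDocumentSplitter" />
      </preParseHandlers>
    </importer>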

Wrap it up

Now that we have the Facebook-specific items “ironed out,” have a look at the many other configuration options offered by the Norconex HTTP Collector and the Norconex Importer module. You may find it interesting to add your own metadata to the mix, perform language detection, and more. In addition, you may want to provide a custom location to store crawled Facebook posts. The sample configuration file has its <committer> section configured to use the FileSystemCommitter. You can create your own Committer implementation instead, or use one of the existing committers available for free download (Solr, Elasticsearch, HP IDOL, GSA, etc.).
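For reference, a FileSystemCommitter entry along these lines is what the sample configuration uses (the directory value is the one found in the sample; treat the exact committer class path as an assumption):

    <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
      <directory>${workdir}/crawledFiles</directory>
    </committer>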

Ready, set, go!

Whether you reproduced the above examples or are using the sample files available for download below, you should now be in a position to try the crawler. Run it as usual, and monitor the logs for potential issues (Norconex JEF Monitor can also help you with the monitoring process).

If you use a similar approach with the Norconex HTTP Collector to crawl other social media sites, please share your comments below as we would like to hear about it!

Source files

Download the source files used to create this article:

Download HTTP Collector 2.0 compatible (2015-02-04)

Download HTTP Collector 2.3 compatible (2015-11-06)

Download HTTP Collector 2.6 compatible (2017-03-30)

Pascal Essiembre was a successful enterprise application developer for several years before founding Norconex in 2007; he remains its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. Pascal also heads the Product Division of Norconex and leads Norconex open-source initiatives.


Comments

  • Tom

    After using the example, it shows errors in Eclipse on “import com.norconex.collector.core.CollectorException;” and “import com.google.gson.JsonArray;”. The errors are: “The import com.google cannot be resolved”, “The import com.norconex.importer cannot be resolved”, and “The import com.norconex.collector cannot be resolved”. How do I fix this?

    • The code from this article is meant to be used with HTTP Collector and is
      not code you can run on its own. If you want to import it into
      Eclipse, you’ll need the dependencies as well. Check the “Java
      Integration” section of the Getting Started here:
      https://www.norconex.com/collectors/collector-http/getting-started

      • Tom

        Thank you. I have one question: can we use the Facebook API like the Twitter API to crawl Facebook data for one keyword, and then get any more data we want? Usually I find that crawl data from Facebook can only be obtained from one public account at a time, and then we need to change to another account, which seems inconvenient, so how can we use it more efficiently?

        • I am afraid I do not understand your question. Can you elaborate a bit more on what you are trying to accomplish?

          • Tom

            I mean: usually when getting the posts, we need to search for one page like “Coca-Cola”, then we can get the posts in that account, but I hope I can search by keyword: when I set a keyword like “CocaCola”, it would get all the posts about this topic….

          • I am not sure Facebook offers that possibility using their API. The search possibilities using the Graph API are described here: https://developers.facebook.com/docs/graph-api/using-graph-api/v2.4#search

            The closest might be to search for pages about Coca-Cola, using …/search?q=Coca-Cola&type=page

            You can also get all posts by Coca-Cola: …/cocacola?fields=id,name,posts

            Facebook supported searching on posts before but they removed this feature starting with version 2.0 of the Graph API. You can always try to use version 1.0, like this:

            https://graph.facebook.com/v1.0/search?q=Coca-Cola&type=post&access_token={access_token}
            I am not sure if version 1.0 is still supported though.

          • Tom

            Thank you. Also, I use Python to request the data, and I found this: g.request('search', {'q' : 'TaylorSwift', 'type' : 'page', 'limit' : 10})['data'][0]['ID']. So do you know how to modify it to search using Group, not Page?

          • According to the Graph API documentation, it would simply be a matter of changing the type to be “group” instead of “page”. See https://developers.facebook.com/docs/graph-api/using-graph-api/v2.4#search

  • James Fu

    After using this example, I cannot fetch successfully… I entered my own AppID and appsecret, but when I run Norconex it returns the following:

    INFO [CrawlerEventManager] REJECTED_BAD_STATUS: https://graph.facebook.com/v2.4/disney/posts?limit=10
    INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)…
    INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan references…
    INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.
    INFO [AbstractCrawler] Facebook Posts: 1 reference(s) processed.
    INFO [CrawlerEventManager] CRAWLER_FINISHED

    I downloaded the source file and built it with ‘mvn package’. After that, I put facebook-crawler-1.0.0.jar into the ‘lib’ folder. Then I modified facebook-config.xml with the correct appid and appsecret, and ran it with ./collector-http.sh -a start -c facebook-config.xml. Finally, I get the above ‘REJECTED_BAD_STATUS’… Is this an authentication-related issue? Thanks for replying.

    • I recommend you modify the log4j.properties file found in the “classes” folder. Change the log levels to DEBUG. You will probably get more information about the bad status. You may also add a few debug statements to the Facebook-specific classes from the example. Maybe they are not loaded properly for some reason.

      • James Fu

        Hi, I changed the level to DEBUG, and I still don’t know how to fix this issue 🙁 I use the package you provide, and the following is the message in the console:

        WARN [KeepOnlyTagger] Configuring fields to keep via the “fields” attribute is now deprecated. Now use the element instead.
        DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:3000,[],false,crawler
        DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider@506c4a08
        DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory@75707c77[sitemapLocations=,lenient=false]
        DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider@21943319
        INFO [HttpCrawlerConfig] Link extractor loaded: com.norconex.blog.facebook.crawler.FacebookLinkExtractor@4243a4d
        INFO [AbstractCollectorConfig] Configuration loaded: id=Facebook Collector; logsDir=/temp/facebook-crawler-example/logs;progressDir=/temp/facebook-crawler-example/progress
        INFO [AbstractCollector] Version: Norconex HTTP Collector 2.2.1 (Norconex Inc.)
        INFO [AbstractCollector] Version: Norconex Collector Core 1.2.1 (Norconex Inc.)
        INFO [AbstractCollector] Version: Norconex Importer 2.3.1 (Norconex Inc.)
        INFO [AbstractCollector] Version: Norconex JEF 4.0.6 (Norconex Inc.)
        INFO [AbstractCollector] Version: Norconex Committer Core 2.0.2 (Norconex Inc.)
        INFO [JobSuite] JEF work directory is: /temp/facebook-crawler-example/progress
        INFO [JobSuite] JEF log manager is : FileLogManager
        INFO [JobSuite] JEF job status store is : FileJobStatusStore
        INFO [AbstractCollector] Suite of 1 crawler jobs created.
        INFO [JobSuite] Initialization…
        DEBUG [FileJobStatusStore] Status serialization directory: /temp/facebook-crawler-example/progress
        DEBUG [FileJobStatusStore] Reading status file: /temp/facebook-crawler-example/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job
        DEBUG [FileJobStatusStore] Facebook Posts last active time: Thu Sep 17 16:31:47 CST 2015
        INFO [JobSuite] Previous execution detected.
        INFO [JobSuite] Backing up previous execution status and log files.
        DEBUG [FileJobStatusStore] Status serialization directory: /temp/facebook-crawler-example/progress
        DEBUG [FileLogManager] Log directory: /temp/facebook-crawler-example/logs
        INFO [JobSuite] Starting execution.
        INFO [JobSuite] Running Facebook Posts: BEGIN (Thu Sep 17 16:35:31 CST 2015)
        INFO [MapDBCrawlDataStore] Initializing reference store /temp/facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/
        INFO [MapDBCrawlDataStore] /temp/facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/: Done initializing databases.
        INFO [HttpCrawler] Facebook Posts: RobotsTxt support: false
        INFO [HttpCrawler] Facebook Posts: RobotsMeta support: false
        INFO [HttpCrawler] Facebook Posts: Sitemap support: false
        INFO [HttpCrawler] Facebook Posts: Canonical links support: true
        INFO [CrawlerEventManager] CRAWLER_STARTED
        INFO [AbstractCrawler] Facebook Posts: Crawling references…
        INFO [CrawlerEventManager] REJECTED_BAD_STATUS: https://graph.facebook.com/v2.4/disney/posts?limit=10
        DEBUG [Pipeline] Unsuccessful stage execution: com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage@1e7318
        DEBUG [FileJobStatusStore] Writing status file: /temp/facebook-crawler-example/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job
        DEBUG [FileJobStatusStore] Writing status file: /temp/facebook-crawler-example/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job
        INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)…
        INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan references…
        INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.
        INFO [AbstractCrawler] Facebook Posts: 1 reference(s) processed.
        INFO [CrawlerEventManager] CRAWLER_FINISHED
        INFO [AbstractCrawler] Facebook Posts: Crawler completed.
        INFO [AbstractCrawler] Facebook Posts: Crawler executed in 1 second.
        INFO [MapDBCrawlDataStore] Closing reference store: /temp/facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/
        DEBUG [FileJobStatusStore] Writing status file: /temp/facebook-crawler-example/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job
        INFO [JobSuite] Running Facebook Posts: END (Thu Sep 17 16:35:31 CST 2015)

        • Can you provide a copy of your config (blanking the secret info)?

          • James Fu

            I use the config you provided in src/main/config/facebook-config.xml, and with my appid and appsecret.

          • I just tried it again and it works without issues for me. To rule out an authentication or permission issue with the Facebook Graph API, try to generate a token and see if it works in the Facebook “Graph API Explorer”, available from this page: https://developers.facebook.com/tools-and-support/

          • Anky Panky

            I have the exact same problem, [CrawlerEventManager] REJECTED_BAD_STATUS, and I checked my appId and secret by manually entering “https://graph.facebook.com/oauth/access_token?client_id=%s&client_secret=%s&grant_type=client_credentials” in the URL bar, replacing %s and %s with the appId and secret. The authentication token is returned, so the problem is not there. Any idea where to look next? thx

          • You did not say which version of Norconex HTTP Collector you are using and which version of the Facebook Graph API. I just attached a new version of the code for this article, made to work with HTTP Collector 2.3.0 and supporting the Facebook Graph API v2.5. Give it a try.

          • Anky Panky

            Collector version was 2.3.0 and API v2.5, thank you Pascal

  • suryansh vicky

    Hi, I followed all the steps mentioned in this post. I replaced the credentials inside the facebook-config.xml file, built facebook-crawler-1.0.0.jar with Maven, and put it in the lib folder. Now when I try to run the following command I am getting a Collector Configuration error. Where can I find this Collector Configuration? Kindly help me, I am kind of new to this field…

    command line error…
    sh ./collector-http.sh -a start -c facebook-config.xml
    An ERROR occured:
    Collector Configuration cannot be null.
    Details of the error has been stored at: /Users/xxxxx/Documents/norconex-collector-http-2.2.1/./error-1442508558063.log

    And the log file has following information…

    java.lang.IllegalArgumentException: Collector Configuration cannot be null.
    at com.norconex.collector.core.AbstractCollector.<init>(AbstractCollector.java:68)
    at com.norconex.collector.http.HttpCollector.<init>(HttpCollector.java:55)
    at com.norconex.collector.http.HttpCollector$1.createCollector(HttpCollector.java:68)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:67)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

    • The collector configuration that comes with the article is in the zip file. Once extracted, it is in the facebook-crawler/src/main/config directory.

      The command you ran expects the facebook-config.xml file is in the same directory as the collector-http.sh script. Either move it there, or give the path to its location.

  • fusil

    Hi, I have found a base script cloned from the http://www.owloo.com analytics site (Facebook, Twitter, Instagram). This base script gets all data via PHP/cURL from what I can see; maybe it can help with something. This is the link to the source code: https://github.com/newbacknew/owloo.com.
    The important part is in the “wservice” folder, among the backup folders, where the crawler files are.

  • Colpain Charles

    This only works for public profiles; the Graph API doesn’t offer generic user profile crawling. HTML crawling will be the only choice to do that.

  • bo

    hi, I downloaded the files and used them with the Norconex crawler, and got the following output, without any files being retrieved.

    DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:3000,[],false,crawler

    DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider@1d7acb34

    DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory@1e4a7dd4[sitemapLocations=,lenient=false]

    DEBUG [ConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider@5e57643e

    INFO [HttpCrawlerConfig] Link extractor loaded: com.norconex.blog.facebook.crawler.FacebookLinkExtractor@133e16fd

    INFO [AbstractCollectorConfig] Configuration loaded: id=Facebook Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress

    INFO [AbstractCollector] Version: Norconex HTTP Collector 2.2.1 (Norconex Inc.)

    INFO [AbstractCollector] Version: Norconex Collector Core 1.2.1 (Norconex Inc.)

    INFO [AbstractCollector] Version: Norconex Importer 2.3.1 (Norconex Inc.)

    INFO [AbstractCollector] Version: Norconex JEF 4.0.6 (Norconex Inc.)

    INFO [AbstractCollector] Version: Norconex Committer Core 2.0.2 (Norconex Inc.)

    INFO [JobSuite] JEF work directory is: ./examples-output/minimum/progress

    INFO [JobSuite] JEF log manager is : FileLogManager

    INFO [JobSuite] JEF job status store is : FileJobStatusStore

    INFO [AbstractCollector] Suite of 1 crawler jobs created.

    INFO [JobSuite] Initialization…

    DEBUG [FileJobStatusStore] Status serialization directory: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/progress

    DEBUG [FileJobStatusStore] Reading status file: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job

    DEBUG [FileJobStatusStore] Facebook Posts last active time: Mon Oct 19 23:17:11 CEST 2015

    INFO [JobSuite] Previous execution detected.

    INFO [JobSuite] Backing up previous execution status and log files.

    DEBUG [FileJobStatusStore] Status serialization directory: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/progress

    DEBUG [FileLogManager] Log directory: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/logs

    INFO [JobSuite] Starting execution.

    INFO [JobSuite] Running Facebook Posts: BEGIN (Mon Oct 19 23:17:53 CEST 2015)

    INFO [MapDBCrawlDataStore] Initializing reference store ./examples-output/minimum/crawlstore/mapdb/Facebook_32_Posts/

    INFO [MapDBCrawlDataStore] ./examples-output/minimum/crawlstore/mapdb/Facebook_32_Posts/: Done initializing databases.

    INFO [HttpCrawler] Facebook Posts: RobotsTxt support: false

    INFO [HttpCrawler] Facebook Posts: RobotsMeta support: false

    INFO [HttpCrawler] Facebook Posts: Sitemap support: false

    INFO [HttpCrawler] Facebook Posts: Canonical links support: true

    INFO [CrawlerEventManager] CRAWLER_STARTED

    INFO [AbstractCrawler] Facebook Posts: Crawling references…

    INFO [CrawlerEventManager] REJECTED_BAD_STATUS: https://graph.facebook.com/v2.5/disney/posts?limit=10

    DEBUG [Pipeline] Unsuccessful stage execution: com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage@288c47

    DEBUG [FileJobStatusStore] Writing status file: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job

    DEBUG [FileJobStatusStore] Writing status file: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job

    INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)…

    INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan references…

    INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.

    INFO [AbstractCrawler] Facebook Posts: 1 reference(s) processed.

    INFO [CrawlerEventManager] CRAWLER_FINISHED

    INFO [AbstractCrawler] Facebook Posts: Crawler completed.

    INFO [AbstractCrawler] Facebook Posts: Crawler executed in 1 second.

    INFO [MapDBCrawlDataStore] Closing reference store: ./examples-output/minimum/crawlstore/mapdb/Facebook_32_Posts/

    DEBUG [FileJobStatusStore] Writing status file: /Users/bo/Downloads/norconex-collector-http-2.2.1/./examples-output/minimum/progress/latest/status/Facebook_32_Posts__Facebook_32_Posts.job

    INFO [JobSuite] Running Facebook Posts: END (Mon Oct 19 23:17:53 CEST 2015)

    any ideas?

    thank you very much for your time.

    • You are not getting any data because the examples in this article were made to work with the Facebook Graph API version 2.2 (the latest at the time). I just attached a new version of the code for this article, made to work with HTTP Collector 2.3.0 and supporting the Facebook Graph API v2.5. Give it a try.

  • Xenia

    Should I extract the sources to the “classes” folder? Because I still get an error that the system cannot find the classes.

  • chara mademly

    hi, when I used this example, I got the following error:
    com.norconex.collector.core.CollectorException: Cannot load crawler configurations.
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:93)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:183)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:76)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
    Caused by: com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: “com.norconex.blog.facebook.crawler.FacebookDocumentSplitter”.
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:190)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:333)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:265)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:115)
    at com.norconex.importer.ImporterConfig.loadImportHandlers(ImporterConfig.java:204)
    at com.norconex.importer.ImporterConfig.loadFromXML(ImporterConfig.java:171)
    at com.norconex.importer.ImporterConfigLoader.loadImporterConfig(ImporterConfigLoader.java:88)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:339)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:83)
    … 4 more
    Caused by: java.lang.ClassNotFoundException: com.norconex.blog.facebook.crawler.FacebookDocumentSplitter
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:188)
    … 13 more

    • The error indicates that the FacebookDocumentSplitter class obtained for this article is not in your Java classpath. Once you have downloaded the zip from this article, extract it to any location of your choice. Then from that location, if you just want to work with the binaries, copy the content of the “classes” folder into the “classes” folder of the Norconex HTTP Collector. This should fix the error you are getting. Let me know if that works for you.

      If you know your way around Java a bit, you are encouraged to modify the source code in this article to match your requirements.

  • chara mademly

    it works!!! Thank you for your help!!!

  • Manish Waran

    I tried to run the project after changing the APP-ID and SECRET-CODE, but I got a “main class not found” error. Please help me.

    • How did you try to run it? Did you download and try to run the Norconex HTTP Collector (https://www.norconex.com/collectors/collector-http/) using its launch script? The files in this article are meant to extend the collector functionalities. You need to drop the content of the “classes” folder into the “classes” folder of your Norconex HTTP Collector installation.

      • Manish Waran

        thank you pascal. I will try and tell you 🙂

  • Christof Bless

    Hi, I didn’t quite understand how to follow the links extracted by the “linkExtractor” class. I’m only getting 25 posts so far. How can I retrieve more?

    • You can change the limit argument in your Facebook URL to be more than 25. Otherwise, links should be “followed” and everything should be extracted up to a certain point (e.g., the max docs you specified in the crawler config, the max posts there are in total, etc.). Do you know that you have more than 25 posts?

  • Tony

    I ran the crawler but it threw a NoClassDefFound exception for JsonReader. I know that the gson package needs to be included in the project, but I think it was included before the classes were compiled, so why did it still throw this exception?

  • Tony

    I have solved the problem by downloading the gson jar and copying it to the lib folder of Norconex.

    • Larry

      Thank you!

  • Tony

    I can crawl the posts from public pages only. How can I crawl from private pages, for example from my friends’ pages?

    • Try grabbing their pages using the Facebook Graph API Explorer first. If you can do it there, it means you can also crawl them using the HTTP Collector. If the Graph API Explorer does not let you, it means Facebook does not offer that possibility via their API and you cannot crawl those friends’ pages using it.

  • Larry

    I have many URLs like this : http://en-us.facebook.com/people/-/100000054795235

    http://en-us.facebook.com/people/-/100000083638197

    How can I crawl profile fields like gender, education, and email?

    • You can use the Facebook Graph API Explorer to discover all fields Facebook makes available to you and decide which ones to add to your query to have them returned.

  • Larry

    This is my error information when I try to crawl the user information of user 100000054795235.
    Could anyone give me hints about it? Thank you in advance.
    Version: Norconex HTTP Collector 2.3.0 (Norconex Inc.)
    INFO [AbstractCollector] Version: Norconex Collector Core 1.3.0 (Norconex Inc.)
    INFO [AbstractCollector] Version: Norconex Importer 2.4.0 (Norconex Inc.)
    INFO [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
    INFO [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
    INFO [JobSuite] Running Facebook Posts: BEGIN (Fri Feb 12 16:55:18 EST 2016)
    INFO [MapDBCrawlDataStore] Initializing reference store ./examples-output/mycrawler/crawlstore/mapdb/Facebook_32_Posts/
    INFO [MapDBCrawlDataStore] ./examples-output/mycrawler/crawlstore/mapdb/Facebook_32_Posts/: Done initializing databases.
    INFO [HttpCrawler] Facebook Posts: RobotsTxt support: false
    INFO [HttpCrawler] Facebook Posts: RobotsMeta support: false
    INFO [HttpCrawler] Facebook Posts: Sitemap support: false
    INFO [HttpCrawler] Facebook Posts: Canonical links support: true
    INFO [HttpCrawler] Facebook Posts: User-Agent:
    INFO [HttpCrawler] 1 start URLs identified.
    INFO [CrawlerEventManager] CRAWLER_STARTED
    INFO [AbstractCrawler] Facebook Posts: Crawling references…
    INFO [CrawlerEventManager] REJECTED_BAD_STATUS: http://graph.facebook.com/v2.5/100000054795235
    INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)…
    INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan references…
    INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.
    INFO [AbstractCrawler] Facebook Posts: 1 reference(s) processed.
    INFO [CrawlerEventManager] CRAWLER_FINISHED

    • Can you change the log level to DEBUG in the log4j.properties and try again? The logs may then give you more details on the failure (invalid token, etc).

  • Tony

    Could anyone please tell me how to crawl some posts from my friends’ pages?

    • If those posts are public (accessible using Facebook Graph API), you should be able to access them the way described in this article by replacing “disney” with your friend’s Facebook id/name. What have you tried?

      • Tony

        The posts are not public, but I can access them from the FB Graph API. Using your crawler, I replaced “disney” with my friend’s ID (the same one I used in the Graph API), but it returned nothing. I think it is because when using the Graph API, I get the access token from my personal account and the API lets it access my friend’s posts. When using your crawler, it uses the access token from the APP account, so it can crawl the public posts only. Do you think that is the reason?

        • It is very well possible. You can probably confirm by trying to put your APP token into the Graph API Explorer. If you get nothing, that is probably the cause. Then check with Facebook if you can get your app to see those profiles, or maybe try to use your personal token (if possible).

          • Tony

            I have tried to use the APP access token in the Graph API but it also returns nothing. I don’t know how to use a personal access token in your crawler. Why do you have to use the APP access token in the crawler, and is it possible to use a personal access token? Because if we use a personal access token, we will be given more permissions to access the resources from our friends. The APP access token can be used to access only the public pages, so it is not so useful.

          • The app token is generated from your app ID and app secret in this case, which ensures it never expires (since it is generated each time the crawler runs). Your personal one may expire if you insert it directly. If that’s not a concern to you, then I recommend you give it a try (change the code to use your token directly).

          • Tony

            I changed the code to use my personal access token and it worked. However, as you have mentioned, the problem is we must manually replace the token and recompile each time we run the crawler!

          • Yes, that’s a common complaint people have with the Facebook Graph API. You could check Facebook documentation about creating a long-lived token, but even those expire after a while (60 days I think). I have not yet found a way to grant server-side applications the same privileges as users for access. If you do find out, please share.

  • mark5050

    I want to be able to crawl posts in closed and public groups that mention computer and/or laptop

    • The Unraveller

      Crawling in closed groups is not possible as it is against their terms of use and privacy policy.

  • The Unraveller

    Can we crawl without using any specific keyword like “disney”? I need to get a list of users, groups, and pages present on Facebook for my project.
    Crawling with a specific keyword might take too much time. I would also like to know how Google lists all the pages and user profiles; it must be crawling Facebook. So, is there a sitemap or publicly accessible database for Facebook? Thanks

    • Mark Truong

      I’m currently in the same situation, and my solution is:
      1. Hash the query in Google search so we can get all the results. There are 909,000 groups found.
      2. Crawl Facebook using the customized crawler.
      I’m wondering, did you resolve this?

    • Aaron

      Have you resolved this? I’m facing the same problem.

  • Jo

    Hi

    I downloaded the Norconex Collector and unzipped it. The examples worked fine.
    Then I downloaded the files from this article and unzipped them. I took the 4 files and put them in the “classes” dir of the collector. I edited the facebook-config.xml file and put my appId and appSecret there.

    I tried to run the script, calling the facebook-config.xml file, but I got an error, and this is the log:
    – I saw that another person had the same issue, but I do have the “FacebookDocumentSplitter” file in the “classes” dir.

    com.norconex.collector.core.CollectorException: Cannot load crawler configurations.

    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:93)

    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:183)

    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)

    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:76)

    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

    Caused by: com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: “com.norconex.blog.facebook.crawler.FacebookDocumentSplitter”.

    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:190)

    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:333)

    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:265)

    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:115)

    at com.norconex.importer.ImporterConfig.loadImportHandlers(ImporterConfig.java:204)

    at com.norconex.importer.ImporterConfig.loadFromXML(ImporterConfig.java:171)

    at com.norconex.importer.ImporterConfigLoader.loadImporterConfig(ImporterConfigLoader.java:88)

    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:339)

    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)

    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:83)

    … 4 more

    Caused by: java.lang.ClassNotFoundException: com.norconex.blog.facebook.crawler.FacebookDocumentSplitter

    at java.net.URLClassLoader.findClass(Unknown Source)

    at java.lang.ClassLoader.loadClass(Unknown Source)

    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)

    at java.lang.ClassLoader.loadClass(Unknown Source)

    at java.lang.Class.forName0(Native Method)

    at java.lang.Class.forName(Unknown Source)

    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:188)

    … 13 more

    • Did you move the .class files (not the .java) and did you preserve the same directory structure (the java package structure)? In other words, the class it is not finding should be under “/classes/com/norconex/blog/facebook/crawler/FacebookDocumentSplitter.class”.

      • Jo

        Hi, thanks, now it works, though I get three files: mapdb, mapdb.p, and mapdb.t. What are these, and what is their encoding? Because I can’t read them properly, even when crawling the Disney example.

        • Those are internal files for caching URLs crawled so far. You should not rely on these. You need to specify a “Committer” that will tell where to store your crawled content. A few committers are offered here: https://www.norconex.com/collectors/#committers but you can also create your own.

          The sample config from this article uses the FileSystem committer to store crawled data locally at the path indicated in that config. But that format may not be the most friendly either (but at least it is text-based).

          If your target repository is not listed in the above URL, your best option is probably to write a custom committer.

          • Jo

            I appreciate your help but I still can’t figure it out.
            I downloaded, for example, the Google committer, copied all the files to the lib directory, and ran the examples again. Now where can I read the new docs? How do I configure it to use the Google committer, and after I configure it as a Google committer, where are the new files?
            I have some experience with Java and PHP programming, I just can’t figure out how this API works and how to get started to see the minimal crawled data.

            When I open the files in the mapdb crawlstore directory, all of them are in a bad encoding and I can’t read them; this is my main problem right now.

            thank you a lot for your help and patience.

          • Where do you want your files to go? The Google Committer is to save your crawled content to the Google Search Appliance. Is that what you want? If you want them on the file system, the example will save them to files already with the FileSystemCommitter, at the path given in the <directory> tag. But if you want them saved in a specific format or to a specific repository that is not already supported, you have to write your own Committer in Java. This may help: https://www.norconex.com/collectors/committer-core/create-your-own

          • Jo

            I want the files to be saved in a directory on my system just like it is now, but the problem is that when I want to open the crawled files that are saved on my computer, they open with bad encoding and I can’t read them. I tried opening them with Notepad, Notepad++, and other programs, but nothing… all with bad encoding.

          • Are you trying to read the “mapdb” files? They are NOT text files and you can’t read them as text. You have to rely on the Committer you chose. The FileSystem Committer in this example will save the files in the <directory> you provide, and they will be plain-text files. If that’s not the case for you and you suspect an issue, please open a ticket with details to reproduce (copy of your config, etc.), here: https://github.com/Norconex/collector-http/issues

          • Jo

            “The FileSystem Committer in this example will save the files in the <directory> you provide, and they will be plain-text files” – I opened the minimum-config example file just for checking, and I set this up as in the example:

            ./examples-output/minimum

            then I did get the crawled files in the /examples-output/minimum dir, and there is the crawlstore dir, with all the mapdb files (3 files).
            Are those not the files that are saved from the crawling? If not, where are the crawled files?

            again, sorry for the trouble and thank you a lot for your help.

          • The sample config found in the zip file you can download in this blog has this in it:

            <directory>${workdir}/crawledFiles</directory>

            The path in <directory> is where you will find text files when using the FileSystemCommitter. So: “committer” section, “directory” tag. I am not sure how else to explain it. 😉 While they will be text-based, the format may not suit you though. You are usually best to write your own Committer if you want the output in your own format or stored in a repository of your choice.

  • Aman

    Hi,
    I am getting this error: “Collector id attribute is mandatory”. Can you please tell me how to run the project? I have added the content of the “classes” folder into the “classes” folder of my Norconex HTTP Collector installation, as you mentioned in another comment.

    • Once you have unzipped the attachment from this article, locate this config file: /src/main/config/facebook-config.xml

      You will see this line at the top:

      <httpcollector id="Facebook Collector">

      The “id” attribute is what the error refers to. Make sure you have it.

      If you are wondering how to start the collector, see documentation here: https://www.norconex.com/collectors/collector-http/getting-started

      • Aman

        Hi Pascal,
        Thanks for the prompt response. Yes, I referred to that document. I also confirmed that ‘id’ attribute is present.
        Now, when I start the collector with the path set to “facebook-config.xml”, I am getting the error “com.norconex.collector.core.CollectorException: Cannot load crawler configurations”. The given examples executed successfully.
        I’d be grateful for your help.

        • Since the code samples in this article worked for others, I suspect something specific to what you are doing. Can you please create a ticket on github with steps to reproduce (copy of your config, full stacktrace for the error you are getting, and whatever else that may help): https://github.com/Norconex/collector-http/issues/

  • Gunjan Kr

    Dear Pascal,
    Great tutorial, I am able to run the given examples as expected. Next I want the output to go through the Solr or Elasticsearch committer. I need guidance in writing the <committer> section of the Norconex Collector configuration suitable for this tutorial.
    A sample is given below for Elasticsearch:

    So what option should I choose?

    Thanks
    Gunjan Kumar

  • Shimakaze

    Hi, is it possible to still use this for closed groups? Because Facebook restricted everything that has to do with groups recently.
    If not, then what are the alternatives?

  • peterwilli

    @pascal_essiembre:disqus hi there, I’m trying to add likes to the mix, but I’d like your feedback on this one. I don’t think I’m working it out very well. I currently run Norconex with a SolrCommitter and it stores everything neatly. Based on your original code, I created a new ‘FacebookLikeSplitter’ class that will accept and extract the /post/likes endpoint. The problem is that right now they are stored as separate objects in the same collection as the original Facebook data, like in your code.

    I was wondering how you would tackle this problem? Thanks!

    • Hello, I am not sure I fully get the challenge you are facing. If you have the “likes” as part of the JSON returned, you can extract them and do whatever you want with them. You likely need to iterate through the main collection first, then for each entry, iterate through the likes for each post (or comment). Wouldn’t this work for you? If not, please share a concrete example so I can understand better.

  • Philson

    @Pascal Hey, can we do it in Node.js?

    • This article focuses on doing it with the Norconex HTTP Collector, but the Facebook Graph API deals with HTTP calls, so I presume you should be able to pull it off.

  • Aaron

    The collector is up and working for me, so now I want to crawl a bigger list of Facebook pages for their posts. In the Graph API Explorer, I can do this using the search feature (search?q=a&type=page&fields=posts), which gives me a long list of posts. When I then enter this URL with the search feature as the start URL in my crawler, I get the error

    [Fatal Error] :26:61: The reference to entity “type” must end with the ‘;’ delimiter.

    Can you tell me what may be the cause for this and if there’s a better way of crawling a bigger sequence of different pages?

    • Aaron

      For this specific error, the problem was that I used an ampersand. In XML, I have to use &amp; to represent the ampersand symbol.
      My last question remains 🙂
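      For example, an escaped start URL would look like this (Graph API version assumed):

          <url>https://graph.facebook.com/v2.5/search?q=a&amp;type=page&amp;fields=posts</url>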

  • Balazs

    @pascal_essiembre:disqus
    Hi! Thanks for the solution. I’m facing a problem now with a rejected error. I checked my credentials and they look fine; my error text looks like this:

    INFO [CrawlerEventManager] REJECTED_ERROR: https://graph.facebook.com/v2.8/MupaBudapest/events/?fields=attending_count%2Cstart_time%2Cend_time%2Ccategory%2Cname%2Cplace%2Cdescription

    ERROR [AbstractCrawler] Facebook Events: Could not process document: https://graph.facebook.com/v2.8/MupaBudapest/events/?fields=attending_count%2Cstart_time%2Cend_time%2Ccategory%2Cname%2Cplace%2Cdescription (Could not obtain access token. Response was: {“access_token”:”XYZ…..|hE_QeH5Emvdxbmnq1v3PLyN1CzU”,”token_type”:”bearer”})

    My collector is version 2.6, can that be the problem?

    • Balazs

      I tried with 2.3.0 as well and it didn’t work out for me, even with the 2.5 API link.

    • Does obtaining that token work using the Graph API Explorer?

      • Balazs

        Probably I misunderstand something. I use the information from my FB app, the app ID and the secret, and in the Graph API I use a token generated there. How can I check if it’s working with the Graph API?

        • Balazs

          Yes, I checked the Graph API and the access token is the same: XYZ…|hE_QeH5Emvdxbmnq1v3PLyN1CzU, and it works.

          • It turns out the Graph API response has changed over time when getting the access token. I attached an updated version of the source code that will work with the latest Graph API version (2.8 when writing this).

          • Balazs

            thanks for your update, now it is working perfectly 🙂

  • Balazs

    @pascal_essiembre:disqus
    Hi!
    Thanks again for your last help, it is working well for me now. I’m just wondering if it is possible to put all the crawled data into 1 JSON file. I have 250 URLs to crawl now, and it is doing it well, but in separate files in separate folders as smth.cntnt files. I just want to avoid merging those files by hand; it could just append to the same JSON for me. All the data is coming in exactly the same format.

    I tried to figure it out but I didn’t find a solution for this in these 4 Java classes. Can you help me with this?
    Thank you,

  • Rickz Lee

    What do I need to run this program?
    I used the cmd but I keep getting an error:

    D:>collector-http[.bat|.sh] -a start -c test/examples/minimum/minimum-config.xml
    ‘collector-http[.bat’ is not recognized as an internal or external command,
    operable program or batch file.

    • [.bat|.sh] means use collector-http.bat or collector-http.sh according to your OS. You appear to be on Windows, so you should be using “collector-http.bat”.