
Crawling the web with Apache Nutch and Solr(cloud).

Crawling (spidering) the web.

Web Crawling

Crawling the whole web is an illusion, unless you want to spend the rest of your days in a cold data center!

Yes. This describes how I felt after I had spent over 500 hours crawling with a single Nutch instance and fetched “only” 16 million pages.

How many pages are on the web?

I asked Google and they answered:

The Indexed Web contains at least 4.62 billion pages (Wednesday, 30 March, 2016). By “Website” we mean unique hostname (a name which can be resolved, using a name server, into an IP Address). It must be noted that around 75% of websites today are not active, but parked domains or similar.

Note that these are figures from two years ago; since 2016 this number has probably doubled. Also note that these are only domain names. If I make a careful estimate, I would say there are currently around 10 billion active pages.

That’s a lot, and it doesn’t even come close to the 16 million pages I crawled during the two weeks I sat there on my own in the server room, with noisy servers and an aircon set just low enough to keep me cold.

Why didn’t I leave that room?

The answer was that Nutch wasn’t really stable and losing a crawl meant starting all over from the beginning…

Luckily Apache Nutch has become a lot more stable over the years, and the possibilities for the most critical stages (indexing and reverting the index) have improved dramatically with the arrival of Solr and SolrCloud. In my previous article I already explained how to set up a SolrCloud instance. In this article we will build on that and actually start crawling a few million pages. If you want to crawl and index the whole web I wish you lots of success, but be prepared for disappointments. The trick is never to give up and to start over again whenever necessary! In the end you probably won’t succeed, but at least you tried and (hopefully) learned a lot!

I’m not writing this to discourage you, but if you (like me) can’t afford a staff of 1000 experts and multiple data centers (like Google and Yahoo), chances are that you won’t succeed. You may, however, be able to crawl a big intranet and make a living off it.

With that said, let’s start.

There are many ways to index web pages. We could download them, parse them, and index them with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem: how do we fetch them? We could write our own software to do that, but that takes time and resources. That’s why this article will cover how to fetch and index web pages using Apache Nutch.

For the purpose of this task we will be using version 1.5.1 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org/.

Installation and Configuration

            1. First of all we need to install Apache Nutch. To do that we just extract the downloaded archive to the directory of our choice; for example, I installed it in /usr/share/nutch. Of course this is a single-server installation and it doesn’t include the Hadoop filesystem, but for the purpose of this article it will be enough. This directory will be referred to as $NUTCH_HOME.
            2. Then we open the $NUTCH_HOME/conf/nutch-default.xml file and set the value of http.agent.name to the desired name of your crawler (we’ll use ‘Crawler’ as the name). The entry should look like the following code:

            3. <property>
                 <name>http.agent.name</name>
                 <value>Crawler</value>
                 <description>HTTP 'User-Agent' request header.</description>
               </property>
              
            4. Now let’s create two empty directories called crawl and urls in the $NUTCH_HOME directory. After that we create the file seed.txt inside the new urls directory with the following content (steps 1 and 4 are also shown as shell commands in the sketch after this list):

               http://lucene.apache.org
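The installation steps above can also be run from the command line. Here is a minimal sketch, assuming the Nutch 1.5.1 binary archive has already been downloaded to the current directory as apache-nutch-1.5.1-bin.tar.gz (the archive name and the /usr/share/nutch location are assumptions; adjust them to your environment):

# Step 1: extract the binary archive into the install directory
sudo mkdir -p /usr/share/nutch
sudo tar -xzf apache-nutch-1.5.1-bin.tar.gz -C /usr/share/nutch --strip-components=1

# Make the install directory available as $NUTCH_HOME for the rest of the steps
export NUTCH_HOME=/usr/share/nutch
cd $NUTCH_HOME

# Step 4: create the crawl and urls directories and the seed file with the start URL
mkdir -p crawl urls
echo "http://lucene.apache.org" > urls/seed.txt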

           

          5. Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace the +. at the bottom of the file so that the entry looks like the following line:

             +^http://([a-z0-9]*\.)*lucene.apache.org/

          6. One last thing before fetching the data is the Solr configuration. We start by copying the index structure (a file called schema-solr4.xml) from the $NUTCH_HOME/conf/ directory to your Solr installation’s configuration directory (in my case /usr/share/solr/collection1/conf/). We also rename the copied file to schema.xml, because that is what Solr expects. (The copy command is shown in the sketch after this list.)
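Step 6 boils down to a single copy command. A sketch, assuming $NUTCH_HOME is set as above and your Solr core configuration lives in /usr/share/solr/collection1/conf/ (your Solr path may differ):

# Step 6: copy the Nutch index structure into the Solr core and give it the name Solr expects
cp $NUTCH_HOME/conf/schema-solr4.xml /usr/share/solr/collection1/conf/schema.xml

Don’t forget to restart Solr (or reload the core) afterwards, so it picks up the new schema.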

Note the differences between the actual URL and the regular expression. In this case the regex is used to limit the crawler to only follow links inside the lucene.apache.org domain (and its subdomains).
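If you want to check what the expression matches before you start crawling, you can test it with grep -E (this only exercises the regular expression itself; the leading + in crawl-urlfilter.txt is Nutch’s include marker, not part of the pattern, and the example URLs are made up):

# URLs inside lucene.apache.org (including subdomains) are printed, all others are not
echo "http://lucene.apache.org/solr/tutorial.html" | grep -E '^http://([a-z0-9]*\.)*lucene.apache.org/'   # match
echo "http://wiki.lucene.apache.org/index.html"    | grep -E '^http://([a-z0-9]*\.)*lucene.apache.org/'   # match (subdomain)
echo "http://example.com/somepage.html"            | grep -E '^http://([a-z0-9]*\.)*lucene.apache.org/'   # no match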

Starting the crawling process

Now that the configuration has been completed we can start the crawl using the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -threads 20

As you can see the command takes a few parameters: -dir, -depth, -topN and -threads. I’ll explain them below (and after the list you’ll find a variant of the command that also sends the results to Solr):

  • -dir – The local directory in which the crawl data and index are stored. In the above example this is the crawl directory directly under $NUTCH_HOME, so the complete path is $NUTCH_HOME/crawl/.
  • -depth – The crawler works recursively. To prevent the recursion from going on forever we set this depth to a shallow (3) or deep (5 to 25, or even deeper) value.
  • -topN – Use only the top N scoring documents in each round.
  • -threads – The number of simultaneous (Java) fetcher threads. You should be careful with this parameter: when you set it too high you will soon receive complaints from webmasters, because your Nutch crawler causes a high load on their sites.
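Since the whole point of this article is to get the crawled pages into Solr(Cloud), you will usually also pass the -solr parameter so the crawl is indexed in one go. A sketch, assuming Solr is reachable at http://localhost:8983/solr/ (the URL is an assumption; use the address of your own Solr or SolrCloud collection):

# Crawl and index into Solr in a single run
bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 3 -topN 50 -threads 10

Once the crawl finishes you can check the result by querying the collection from the Solr admin interface.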

Depending on your configuration and parameters the crawl process can be short or long. During the crawl you will see everything happening in the console, and if you want to see more details, read the contents of the log file:

$NUTCH_HOME/logs/hadoop.log
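To follow the log while the crawl is running, a simple tail is usually enough (assuming the default log location of the binary distribution):

# Follow the crawl log as it is being written (Ctrl-C to stop watching)
tail -f $NUTCH_HOME/logs/hadoop.log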

Happy Crawling!