Nutch 1 vs Nutch 2

Introduction

Nutch 2 has been out for seven years now, while version 1 is still available and under active development. We can currently choose between two major branches, 1.x and 2.x. The main difference is that the 2.x branch supports NoSQL databases for its storage, while Nutch 1.x still keeps its crawl data in Hadoop files and sends its index to Apache Solr.

The backend for the NoSQL connectivity is Apache Gora, which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores, distributed in-memory key-value stores, in-memory data grids, in-memory caches, distributed multi-model stores, and hybrid in-memory architectures.

Gora also enables analysis of data with extensive Apache Hadoop MapReduce™ and Apache Spark™ support. Gora uses the Apache Software License v2.0. Gora graduated from the Apache Incubator in January 2012 to become a top-level Apache project.
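
To make the idea of a pluggable persistence layer more concrete, here is a minimal Python sketch. It is not Gora's API (Gora is a Java library); the names WebPage, DataStore, InMemoryStore and crawl_step are hypothetical and exist only for this illustration. It shows a crawler writing through a backend-neutral interface, with each record translated into a generic row on the way in and out.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass
class WebPage:
    """Hypothetical crawl record; not Nutch's actual WebPage schema."""
    url: str
    status: int
    content: str


class DataStore(ABC):
    """Backend-neutral interface (illustrative only, not the Gora API)."""

    @abstractmethod
    def put(self, key: str, page: WebPage) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[WebPage]: ...


class InMemoryStore(DataStore):
    """Stand-in backend that keeps generic rows in a dict."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, object]] = {}

    def put(self, key: str, page: WebPage) -> None:
        # Each write maps the object to a generic row first.
        # This per-record translation is extra work that an
        # abstraction layer adds on top of the raw backend.
        self._rows[key] = asdict(page)

    def get(self, key: str) -> Optional[WebPage]:
        row = self._rows.get(key)
        return WebPage(**row) if row is not None else None


def crawl_step(store: DataStore, url: str, body: str) -> None:
    """The crawler only sees the interface, so backends can be swapped."""
    store.put(url, WebPage(url=url, status=200, content=body))


store = InMemoryStore()
crawl_step(store, "http://example.com/", "<html></html>")
```

Swapping InMemoryStore for another DataStore subclass would leave crawl_step untouched, which is the flexibility such a layer buys. The price is the mapping step on every read and write.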

At DigitalPebble, a study compared the two versions to determine which was faster. It concluded that Nutch 1.x was still faster on all fronts, mainly because Gora adds a lot of overhead.

Conclusion

At present there is no need to upgrade from 1.x to 2.x, and this will probably not change in the coming years.

 
