Apache Nutch

Open source web crawler

| Field | Value |
|---|---|
| Name | Apache Nutch |
| Screenshot caption | Nutch Web Interface Search |
| Original authors | Doug Cutting, Mike Cafarella |
| Developer | Apache Software Foundation |
| Latest release version | 1.x branch: 1.21; 2.x branch: 2.4 |
| Programming language | Java |
| Operating system | Cross-platform |
| Genre | Web crawler |
| License | Apache License 2.0 |

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Features

Nutch robot mascot

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
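
The idea of dispatching work to plug-ins by media type can be illustrated with a minimal sketch. Nutch's real plug-ins are written in Java against its own extension points, which are not reproduced here. The Python below is only a language-neutral illustration, and every name in it (`register_parser`, `parse`, the two sample parsers) is hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical registry mapping a media type to a parser function.
# None of these names come from Nutch's actual (Java) plug-in API.
Parser = Callable[[bytes], str]
_PARSERS: Dict[str, Parser] = {}


def register_parser(media_type: str) -> Callable[[Parser], Parser]:
    """Return a decorator that registers a parser plug-in for one media type."""
    def decorator(func: Parser) -> Parser:
        _PARSERS[media_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_plain(content: bytes) -> str:
    # Plain text only needs decoding.
    return content.decode("utf-8", errors="replace")


@register_parser("text/html")
def parse_html(content: bytes) -> str:
    # Deliberately naive tag stripping, for illustration only.
    text = content.decode("utf-8", errors="replace")
    return re.sub(r"<[^>]+>", "", text).strip()


def parse(media_type: str, content: bytes) -> str:
    """Dispatch to whichever plug-in is registered for this media type."""
    try:
        parser = _PARSERS[media_type]
    except KeyError:
        raise ValueError(f"no parser plug-in registered for {media_type!r}")
    return parser(content)
```

In this sketch, support for a new media type is added by registering one more function, without touching the dispatch code. For example, `parse("text/html", b"<p>Hello <b>Nutch</b></p>")` returns `"Hello Nutch"`.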

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two components were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.

In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl.

Release history

| 1.x | 2.x | Release date | Description |
|---|---|---|---|
| 1.1 | | 2010-06-06 | This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes, and speedups (e.g., to Fetcher2) have also been included. |
| 1.2 | | 2010-10-24 | This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including adding timing information to all Tool classes, and implementation of parser timeouts), and bug fixes (fixing an NPE in distributed search, fixing of XML formatting issues per Document fields). |
| 1.3 | | 2011-06-07 | This release includes several improvements (improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball, only about 2 MB). |
| 1.4 | | 2011-11-26 | This release includes several improvements including allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing. |
| 1.5 | | 2012-06-07 | This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. |
| | 2.0 | 2012-07-07 | This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in memory data store and various high-profile SQL stores. |
| 1.5.1 | | 2012-07-10 | This release is a maintenance release of the popular 1.5.X mainstream version of Nutch which has been widely adopted within the community. |
| | 2.1 | 2012-10-05 | This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. |
| 1.6 | | 2012-12-06 | This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. |
| | 2.2 | 2013-06-08 | This release includes over 30 bug fixes and over 25 improvements representing the third release of increasingly popular 2.x Nutch series. This release features inclusion of Crawler-Commons which Nutch now utilizes for improved robots.txt parsing, library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8. |
| 1.7 | | 2013-06-24 | This release includes over 20 bug fixes, as many improvements; most noticeably featuring a new pluggable indexing architecture which currently supports Apache Solr and Elastic Search. Shadowing the recent Nutch 2.2 release, parsing of Robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3. |
| | 2.2.1 | 2013-07-02 | This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3, it is predominantly a bug fix for NUTCH-1591 - Incorrect conversion of ByteBuffer to String. |
| 1.8 | | 2014-03-17 | Although this release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, it also provides over 30 bug fixes as well as 18 improvements. |
| | 2.3 | 2015-01-22 | Nutch 2.3 release now comes packaged with a self-contained Apache Wicket-based Web Application. The SQL backend for Gora has been deprecated. |
| 1.10 | | 2015-05-06 | This release includes library upgrades to Tika 1.6, also provides over 46 bug fixes as well as 37 improvements and 12 new features. |
| 1.11 | | 2015-12-07 | This release includes library upgrades to Hadoop 2.X, Tika 1.11, also provides over 32 bug fixes as well as 35 improvements and 14 new features. |
| | 2.3.1 | 2016-01-21 | This bug fix release contains around 40 issues addressed. |
| 1.12 | | 2016-06-18 | |
| 1.13 | | 2017-04-02 | |
| 1.14 | | 2017-12-23 | |
| 1.15 | | 2018-08-09 | |
| 1.16 | | 2019-10-11 | |
| | 2.4 | 2019-10-11 | Expected to be the last release on the 2.X series, as "no committer is actively working on it". |
| 1.17 | | 2020-07-02 | |
| 1.18 | | 2021-01-24 | |
| 1.19 | | 2022-08-22 | |
| 1.20 | | 2024-04-09 | |
| 1.21 | | 2025-07-20 | |

Scalability

IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. It found that a scale-out system such as Nutch/Lucene, running on a cluster of blades, could reach a performance level that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
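
To put the reported rate in more familiar units, the short sketch below converts 755.31 documents per second into documents per day and estimates how long a crawl of a given size would take. It assumes the average rate is sustained throughout, which a real crawl may not do. The function names are illustrative, and the 100-million-document corpus in the usage note is a hypothetical size, not a figure for ClueWeb09.

```python
DOCS_PER_SECOND = 755.31          # average speed reported for the ClueWeb09 crawl
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds


def docs_per_day(rate: float = DOCS_PER_SECOND) -> float:
    """Documents fetched in one day at a constant rate (documents per second)."""
    return rate * SECONDS_PER_DAY


def days_to_crawl(total_docs: int, rate: float = DOCS_PER_SECOND) -> float:
    """Days needed to fetch total_docs documents at a constant rate."""
    return total_docs / docs_per_day(rate)
```

At this rate, `docs_per_day()` comes to roughly 65.26 million documents per day, so a hypothetical corpus of 100 million documents would take about 1.5 days (`days_to_crawl(100_000_000)`).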

Search engines built with Nutch

  • Common Crawl – publicly available internet-wide crawls; started using Nutch in 2014.
  • Creative Commons Search – an implementation of Nutch, used from 2004 to 2006.
  • DiscoverEd – open educational resources search prototype developed by Creative Commons.
  • Krugle – uses Nutch to crawl web pages for code, archives and technically interesting content.
  • mozDex (inactive)
  • Wikia Search – launched 2008, closed down 2009.

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.
