Thursday, February 21, 2008

How to build a Google

Yahoo recently announced that it had shifted a significant portion of its search engine to Hadoop, a software project Yahoo has been involved in since 2006 that is designed to handle large-scale distributed computing tasks, ala Google. The Yahoo Search Webmap application runs on a 10,000 core Linux cluster and is producing data that is used in every Yahoo Web search query.

In December 2004, Google published a paper on the MapReduce algorithm. MapReduce allows Google to run large scale computations across large clusters of servers. Doug Cutting, an open source advocate and developer of the Lucene search technology, realized the importance of the paper in extending Lucene to allow for large search problems. He created Hadoop to allow applications based on the MapReduce model to run on large clusters of commodity hardware.

What follows is the time line defined by Yahoo as they announce their involvement in Hadoop in 2007:
  • 2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
  • December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
  • January 2006 - Doug Cutting joins Yahoo!
  • February 2006 - Apache Hadoop project official started to support the standalone development of Map-Reduce and HDFS.
  • March 2006 - Formation of the Yahoo! Hadoop team
  • May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
  • April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
  • May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
  • October 2006 - Research cluster reaches 600 Nodes
  • December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8
  • January 2006 - Research cluster reaches 900 node
  • April 2007 - Research clusters - 2 clusters of 1000 nodes
According to Yahoo, "the Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search." [Source]

The size of the output from this process is over 300TB (compressed) with over 5 Petabytes of raw disk used in the production cluster and roughly 1 trillion links between pages in the index. [Source]

These numbers don't come close to those published by Google, where as of last September Google was processing 20,000 terabytes of data (20 petabytes) a day : [Source]


In Shanghai last November IBM announced their plans for Blue Cloud. IBM's initiative represents their first foray into what is referred to as "Cloud Computing". Blue Cloud is a series of cloud computing offerings "that will allow corporate data centers to operate more like the Internet by enabling computing across a distributed, globally accessible fabric of resources, rather than on local machines or remote server farms."

At the heart of Blue Cloud lies Hadoop which will be used for parallel workload scheduling. It appears IBM's initiative is to have corporate customers running their Web 2.0 applications in the Blue Cloud on their hardware and software stack, but given IBM's large corporate base I wouldn't be surprised to see IBM offering to train, build and support the infrastructure in corporate data centers as well.

It would appear that Google paper written back in 2004 has legs. Yahoo and IBM are changing the very infrastructure they use to provide their service offerings based largely on a freely available implementation of both Hadoop MapReduce and the Hadoop Distributed File System that accompanies it.

There are some very good tutorials online for building your own single-node or multi-node Hadoop clusters. The tutorials will even walk you through a MapReduce WordCount example that uses your clusters to count the number of times a word occurs. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Having walked through these exercises I have the desire now to wire my basement up with racks of motherboards, hard drives and power supplies, or, perhaps a more economical and less "wife leaving me" effort would be to use the power of Amazon's Web Services.

Utilizing Amazon's EC2 and S3 services you can run massive Hadoop clusters. The combination of AWS and Hadoop will allow developers to write scalable algorithms and then bring up large numbers of servers for computing power which can then be shut down when they are not needed.

The Hadoop project has gained some significant momentum over the last year and it appears ready for large scale production use. Yahoo and IBM have access to an existing infrastructure to build out services around Hadoop. We have the infrastructure on loan to us through Amazon's Web Services if we want it. It will be interesting to see what comes of it and how far a good idea, ripe for this type of large scale distributed computing environment, can go.

0 comments: