3 and 1/2 minutes to sort a Terabyte, and a look at Hadoop's code structure

Apache Hadoop Wins Terabyte Sort Benchmark: "One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won."
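The arithmetic behind the benchmark is worth a quick sanity check: 10 billion records of 100 bytes each is exactly one terabyte, and sorting that in 209 seconds works out to an aggregate throughput of roughly 4.8 GB/s across the cluster:

```python
# Sort benchmark spec: 10 billion records, 100 bytes each
records = 10_000_000_000
record_size = 100                  # bytes
total_bytes = records * record_size
assert total_bytes == 10**12       # exactly 1 terabyte

# Hadoop's winning run: 209 seconds
throughput = total_bytes / 209
print(f"{throughput / 1e9:.2f} GB/s aggregate")  # 4.78 GB/s aggregate
```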

Amazing. You can see the history of the benchmark here. Just as interesting is that this is a fraction of Y!'s capacity - they're running 13,000+ Hadoop nodes. What does it mean? I think that Hadoop is becoming an important piece of open source infrastructure for data processing, and potentially as a storage system in its own right.  Core Hadoop development is highly active, with a string of associated projects building on it, such as Pig, Mahout, HBase, Hive, Cascading and Zookeeper.

I think Hadoop is an incredible project. Crudely, you can split Hadoop into two parts - compute and storage. There's a lot more going on than that, of course - metrics, scheduling, job tracking and so on, cool stuff that doesn't get as much publicity as the map/reduce feature, such as the ability to be rack aware, and potential features like pause/resume - the totality of what Hadoop does is amazing. Ideally then, if you buy into the fundamental data/compute split, there'd be clean layering - for example, the filesystem layers would have no upward dependency on the map/reduce layer, which is a processing library.
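To make the compute side concrete, here's a toy sketch of the map/reduce programming model itself - plain Python, not Hadoop's Java API - using the canonical word count: map emits (word, 1) pairs, the framework groups intermediate pairs by key, and reduce folds each group into a result.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key (the framework's job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's list of values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result["the"], result["fox"])  # 3 2
```

In the real framework the map and reduce phases run in parallel across the cluster and the shuffle moves data between nodes, but the contract is the same.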

I've been digging around the Hadoop code recently; I used Structure101 from the folks at Headway Software to help me scan the codebase and its package architecture. Aside from being a good way to get my head around Hadoop's innards, the results were interesting. The basic code metrics:

  • Packages (that contain classes): 58
  • Classes (outer): 611
  • Classes (all): 1,246
  • LOC (Non Comment Non Blank Lines Of Code): ~118K

First off, that's a much smaller LOC count than reported at Ohloh - almost half. Still, it's a fair chunk of code.

Here you can see Hadoop's top level package structure:

Those arrows that look like air traffic routes indicate cyclic dependencies amongst packages. The immediate thing that jumps out at this level is that the fs.* package has cyclic dependencies with 8 other packages. Here's a drilldown that gives a bit more detail on the fs package:

Cross-package dependencies seem like an abstract concern, but they're useful to know about as they tend to hamper developer productivity and can become problematic as the code grows. I have to point out, though, that these packages are not separate distributables today, so this is purely an observation on the structure and can't really be a critique of the design intent. Speaking of distributables, we can also look at what is, in Robert C. Martin's terms, the "unit of reuse" in Hadoop by looking for the largest cycle, or "tangle", in the code. The largest tangle is pretty big, traversing 11 packages. Here's a matrix view of that tangle:
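A "tangle" in this sense is a strongly connected component of the package dependency graph: a set of packages each reachable from all the others, so none can be released alone. Here's a minimal sketch of how a tool might detect one - the dependency graph and package names below are made up for illustration, not Hadoop's real structure:

```python
def reachable(graph, start):
    """All nodes reachable from start via depth-first search."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def tangle_of(graph, pkg):
    """The tangle containing pkg: every package that both
    reaches pkg and is reachable from pkg."""
    return {p for p in graph if pkg in reachable(graph, p)} & reachable(graph, pkg)

# Toy dependency graph (hypothetical package names)
deps = {
    "fs":     ["mapred", "util"],
    "mapred": ["fs", "ipc"],
    "ipc":    ["mapred"],
    "util":   [],
}
print(sorted(tangle_of(deps, "fs")))  # ['fs', 'ipc', 'mapred']
```

Note that util stays out of the tangle: it is depended on but depends on nothing back, so it could be released independently.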

and the structural view (it's very dense, you'll probably want to click through to see any detail):

What this suggests is that the maroon box is probably the smallest unit of release in Hadoop core at the moment. [As an aside, I think it would be very hard to detect that purely by looking at code, which is why code analysis tools like Structure101, JDepend, SonarJ (as used by SpringSource), et al. are very handy for understanding large or unfamiliar codebases. Spring's code, for example, is known for being cycle free and doing a fine job of managing complexity.]

So that's a glance at the superstructure. Let's look at "complexity". Here's the package structure.

The red/blue bars above indicate the level of complexity - more blue meaning less complexity. What's meant by 'complexity' here? Well, Robert C. Martin defined a number of measures for OO code years ago that are the basis for a number of code analysis tools today - for example, Mike Clark's JDepend uses Martin's metrics, such as afferent coupling, directly, and the red/blue line above is "XS", a variant that Structure101 uses. Anyway, the chewiest areas of the code in terms of measurable complexity are clearly the mapred and hdfs packages.
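For reference, Martin's coupling numbers are simple to compute by hand: afferent coupling (Ca) counts the packages that depend on you, efferent coupling (Ce) counts the packages you depend on, and instability is I = Ce / (Ca + Ce). A quick sketch on a toy dependency graph - the package names and numbers are illustrative, not Hadoop's real figures:

```python
# Toy graph: package -> packages it depends on (illustrative names only)
deps = {
    "mapred": ["fs", "util", "ipc"],
    "hdfs":   ["fs", "util"],
    "fs":     ["util"],
    "ipc":    ["util"],
    "util":   [],
}

def efferent(pkg):
    """Ce: how many packages pkg depends on."""
    return len(deps[pkg])

def afferent(pkg):
    """Ca: how many packages depend on pkg."""
    return sum(pkg in targets for targets in deps.values())

def instability(pkg):
    """I = Ce / (Ca + Ce); 0 = maximally stable, 1 = maximally unstable."""
    ca, ce = afferent(pkg), efferent(pkg)
    return ce / (ca + ce) if ca + ce else 0.0

print(instability("util"))    # 0.0 - everything depends on it, it depends on nothing
print(instability("mapred"))  # 1.0 - depends on others, nothing depends on it
```

The intuition behind the metric: heavily-depended-on packages (low I) are expensive to change, so you want them to stay simple and stable, while unstable packages are cheap to rework.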

Most of the structural complexity in hdfs is in the namenode package, and almost wholly located in the FSNamesystem class, which is a 4,400-line beast.


FSNamesystem is an important part of the Hadoop runtime, as it watches the DataNodes, along with machine lists and heartbeats. Arguably it wants to become a package in its own right, and there's been some recent activity to restructure the hdfs package. Sometimes I think Java classes like these don't get split up in OSS projects because a) they work as they are, and b) that kind of refactoring makes it harder to diff and backport changes across branches.
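The heartbeat bookkeeping FSNamesystem does can be sketched in miniature: each DataNode checks in periodically, and any node silent beyond a timeout is considered dead. This is a hypothetical simplification to show the pattern, not Hadoop's actual code:

```python
import time

class HeartbeatMonitor:
    """Toy namenode-style monitor: tracks the last heartbeat per node."""

    def __init__(self, timeout_secs):
        self.timeout = timeout_secs
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        """Record that a node has checked in."""
        self.last_seen[node] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        """Nodes whose last heartbeat is older than the timeout."""
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

mon = HeartbeatMonitor(timeout_secs=30)
mon.heartbeat("datanode-1", now=100)
mon.heartbeat("datanode-2", now=125)
print(mon.dead_nodes(now=140))  # ['datanode-1']
```

The real class does far more than this, of course - block maps, replication targets, lease management - which is part of why it has grown so large.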

The mapred complexity is spread out and so has a bit more to it:


TaskTracker and JobTracker are big classes, about 2,500 lines each. Like FSNamesystem, they're important classes - they track map/reduce tasks and job submissions respectively. Interestingly, JobConf, which is where these jobs are configured, has very low complexity, but it has a cycle going outside the mapred package. You can see mapred's internal dependencies below (click through for more detail):

So, it's always worth keeping an eye on the internal complexity of code as a project evolves, but is any of this actually a problem for Hadoop? Probably not right now - Hadoop's already widely deployed and proven at thousands of nodes. It's young for an infrastructure project - the next version will still only be 0.18.0. And let's not forget it can chew through a Terabyte of data like nothing else. As Martin Fowler suggests in DesignStaminaHypothesis, when and how much design to invest is a matter of tradeoffs:

Rather, the single biggest problem I've had coming to Hadoop and wanting to understand its internals is the test cycle, which can take hours to complete. All the structural state of the codebase suggests today is that it'll need some refactoring as the project evolves, hopefully splitting the file system out from the processing/compute layers.



    Hi Bill,

    Great to see you using Structure101 on Hadoop, as well as at Newbay.

    If any of the folks working on Hadoop, or its associate projects, want licenses just drop us a mail at Headway, http://www.headwaysoftware.com/about/....

    Structure101 is freely available for use on Open Source.


    Thanks for the post, really good reading. Special thanks for providing links to the various OS projects - I didn't know Yahoo was working on so many cool projects, such as Zoo, Hive etc.


    Thanks! This is interesting. We'll definitely keep an eye on such metrics as hadoop evolves.

    The plan is, as you suggest, to split Hadoop into three parts: core, mapred and hdfs, with the latter two dependent on the first. This should not be very difficult.

    The cycles you cite above are mostly trivial. For example, in the fs package (part of the core) one class (HarFileSystem, an archive tool) depends on one thing in mapred (LineRecordReader, a utility for line-by-line input), which is not difficult to fix. I have not explored the other dependencies, but suspect they're similarly shallow.

    We try to maintain clean layering. Some things have slipped through, but the central APIs are monitored more closely.

    Please see http://issues.apache.org/jira/browse/... for more on this.