Record Hadoop Performance Results

Recently I had the honor of working with some smart engineers at IBM Platform Computing and an external company called STAC Research to demonstrate how IBM Platform Symphony can help organizations deliver excellent results running big data workloads.

We chose to employ a benchmark called SWIM, developed by Yanpei Chen at the University of California, Berkeley. SWIM stands for Statistical Workload Injector for MapReduce. What’s cool about SWIM is that it lets users replay real-world workloads derived from production traces contributed by participating MapReduce users. In our case we elected to use a 2010 workload model gathered at Facebook by the benchmark authors.
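To give a feel for what trace-driven replay means (this is a simplified sketch of the idea, not SWIM's actual implementation), the benchmark boils down to submitting synthetic jobs whose data sizes and inter-arrival times are drawn from the production trace. The field names and the submit_job hook below are hypothetical:

```python
import time
from dataclasses import dataclass

# Illustrative only: the real SWIM harness generates replay scripts and runs a
# synthetic MapReduce job on the cluster; the structure below just conveys the
# trace-replay concept.
@dataclass
class TraceJob:
    inter_arrival_s: float   # seconds to wait after the previous submission
    input_bytes: int         # bytes read by the map phase
    shuffle_bytes: int       # bytes moved from map to reduce
    output_bytes: int        # bytes written by the reduce phase

def replay(trace, submit_job):
    """Submit synthetic jobs at the cadence recorded in the trace."""
    for job in trace:
        time.sleep(job.inter_arrival_s)
        submit_job(job.input_bytes, job.shuffle_bytes, job.output_bytes)

if __name__ == "__main__":
    # Two made-up jobs arriving a few seconds apart.
    trace = [
        TraceJob(0.0, 64 << 20, 16 << 20, 8 << 20),
        TraceJob(5.0, 128 << 20, 32 << 20, 4 << 20),
    ]
    replay(trace, lambda i, s, o: print(f"submit job: in={i} shuffle={s} out={o}"))
```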

To cut to the chase: using identical hardware and the same version of HDFS (from Apache Hadoop 1.0), we demonstrated that on average Platform Symphony reduced MapReduce run-times by a factor of 7.3. We decided to claim a more conservative six-times performance improvement, since the ratio of total run-time without Symphony to total run-time with Symphony was closer to a factor of six.
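The gap between the two numbers comes from how they are averaged: short jobs can show very large individual speedups, which pull up the mean of per-job ratios, while the ratio of total run-times is dominated by the longest jobs. A tiny sketch with made-up run-times (not the measured SWIM data) shows the effect:

```python
# Hypothetical per-job run-times in seconds, chosen only to illustrate why the
# two summary statistics can differ; these are not the benchmark results.
baseline = [1000.0, 20.0, 10.0]   # without Symphony
symphony = [ 200.0,  2.0,  1.0]   # with Symphony

per_job_speedups = [b / s for b, s in zip(baseline, symphony)]
mean_speedup = sum(per_job_speedups) / len(per_job_speedups)  # about 8.3x here
total_ratio  = sum(baseline) / sum(symphony)                  # about 5.1x here

print(f"mean of per-job speedups: {mean_speedup:.1f}x")
print(f"ratio of total run-times: {total_ratio:.1f}x")
```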

This is a significant result in my view. It shows that simply by employing the Platform Symphony middleware for scheduling, MapReduce users can achieve dramatically better results, or achieve the same results with much less hardware, realizing big cost savings.

I don’t personally have rights to redistribute the results, but for anyone interested, more information can be found on the IBM Platform Symphony website.