Breakthrough Hadoop MapReduce Performance

IBM recently introduced a new Adaptive MapReduce capability as a core feature of IBM InfoSphere® BigInsights™ Enterprise Edition, IBM’s big data Hadoop offering. Adaptive MapReduce, based on technology from IBM Platform Computing, provides customers with the option to replace the open source Hadoop implementation included in BigInsights with an advanced, low-latency workload scheduler proven in time-critical, high-performance computing applications. This enhanced MapReduce run-time can deliver higher levels of performance and reliability while maintaining full compatibility with a broad set of Hadoop MapReduce applications.

During the summer of 2013 I decided to put Adaptive MapReduce to the test. Along with some top talent at IBM, I engaged STAC® (the Securities Technology Analysis Center) to conduct a formal performance evaluation of InfoSphere BigInsights and contrast it with open-source Hadoop. STAC are experts at conducting benchmarks, and have published numerous performance audits and benchmarks for their clients in the hyper-competitive financial services industry.

While there are many Hadoop benchmarks to choose from, we selected a benchmark called SWIM (Statistical Workload Injection for MapReduce). SWIM has the advantage that it replicates real application patterns, and includes traces derived from production workloads at Facebook. It is freely available on GitHub, where it is shared under a Berkeley open source software license. To measure the performance advantage of IBM InfoSphere BigInsights, we ran identical workloads on identical infrastructure, against an identical HDFS file system, under controlled conditions.

To make a long story short, we achieved some impressive results. The key findings of the benchmark were:

In jobs derived from production Hadoop traces, IBM InfoSphere BigInsights accelerated Hadoop on average by approximately 4x.
The 4x improvement was consistent with an approximate 11x improvement in raw-scheduling performance in a corner-case workload designed to measure scheduling performance.
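
For readers who want to make an "on average" speedup figure concrete, here is a minimal sketch of how per-job speedups could be computed from paired runs of the same jobs on two systems. The job durations below are made-up placeholders, not numbers from the STAC Report, and the function name `speedups` is hypothetical. The report may well aggregate its results differently, which is why the sketch shows two common ways of averaging.

```python
from statistics import mean


def speedups(baseline_secs, candidate_secs):
    """Return the per-job speedup (baseline time / candidate time)."""
    if len(baseline_secs) != len(candidate_secs):
        raise ValueError("each job needs one duration per system")
    return [b / c for b, c in zip(baseline_secs, candidate_secs)]


# Placeholder durations in seconds for three jobs (illustrative only,
# NOT taken from the STAC Report).
baseline = [120.0, 45.0, 300.0]   # e.g. the reference system
candidate = [30.0, 15.0, 60.0]    # e.g. the system under test

ratios = speedups(baseline, candidate)

# Two common ways to summarize: they generally give different answers.
mean_of_ratios = mean(ratios)                     # every job weighted equally
ratio_of_totals = sum(baseline) / sum(candidate)  # long jobs weigh more

print(ratios, mean_of_ratios, round(ratio_of_totals, 2))
```

The two summaries differ whenever job lengths vary, so it is worth checking which one a benchmark report uses before comparing headline figures.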

Needless to say, being able to run Hadoop and other large-scale analytic workloads quickly can be important for many customers. In application areas like law enforcement, intelligence, fraud analytics and finance, seconds count, and obtaining timely results can be critical.

While performance is important, keeping infrastructure costs under control is also important. By deploying IBM InfoSphere BigInsights along with other analytic workloads on a shared Platform Computing infrastructure (requiring additional software licenses), customers have the opportunity to significantly reduce infrastructure spending while delivering better service levels to the business. Better performance at a lower cost is pretty compelling!

Learn more about IBM InfoSphere BigInsights and these audited performance results

Learn more about IBM Platform Computing and high-performance, multitenant distributed computing solutions

See the STAC Report™. Testing involved the SWIM benchmark (https://github.com/SWIMProjectUCB/SWIM) and jobs derived from production workload traces. Testing was conducted in controlled laboratory conditions.

Opinions expressed in this article are my own, and may not represent the views of IBM.
