Untangling YARN: Building a Multitenant Hadoop Environment
Everyone in the Hadoop world is excited about YARN. For those who don’t follow goofy topics like this, YARN is an acronym for “yet another resource negotiator”. YARN is an important development for organizations deploying Hadoop environments.
What YARN does is decouple Hadoop workload management from resource management. This means that multiple applications can share a common infrastructure pool. While this idea is not new, it is new to Hadoop. Earlier versions of Hadoop consolidated both functions into a single JobTracker. This resulted in limitations for customers hoping to run multiple applications on the same infrastructure.
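To make the separation concrete, here is a minimal toy sketch in Python. It is not YARN's API: the names `ResourcePool`, `Application`, `request`, and `release` are hypothetical and chosen only for illustration. The pool knows nothing about what the applications do. It only tracks how many slots each one holds, while each application decides for itself how much to ask for.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """Hypothetical stand-in for the resource-management side: it only tracks slots."""
    total_slots: int
    allocations: dict = field(default_factory=dict)  # application name -> slots held

    def free_slots(self) -> int:
        return self.total_slots - sum(self.allocations.values())

    def request(self, app: str, slots: int) -> int:
        """Grant up to `slots`, limited by what is free; return the number granted."""
        granted = min(slots, self.free_slots())
        if granted > 0:
            self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app: str) -> None:
        """Return everything the named application holds to the pool."""
        self.allocations.pop(app, None)


@dataclass
class Application:
    """Hypothetical stand-in for per-application workload management."""
    name: str
    tasks: int

    def run(self, pool: ResourcePool) -> int:
        # The application decides what to ask for; the pool decides what to grant.
        return pool.request(self.name, self.tasks)


# Two different applications share one pool of 10 slots.
pool = ResourcePool(total_slots=10)
mapreduce = Application("mapreduce-job", tasks=6)
other = Application("other-framework", tasks=6)

granted_a = mapreduce.run(pool)  # pool is empty, so the full request fits
granted_b = other.run(pool)      # only the remaining slots can be granted
pool.release("mapreduce-job")    # the first application finishes
granted_c = other.run(pool)      # the freed slots become available to the other tenant
```

In this sketch the second application is held back only while the first one occupies the pool. With a single combined component like the earlier JobTracker, there was no such neutral layer for different kinds of applications to share.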
Open source Hadoop 2.2.0 and later incorporate released versions of YARN. The community delivered the first generally available release of YARN in Hadoop 2.2.0 in October 2013, and major providers of Hadoop, including IBM, have been incorporating YARN into their commercial offerings.
Yet another resource negotiator
YARN is well named. While an important technology, the world is not suffering from a shortage of resource managers. Some commercial Hadoop providers including IBM are supporting open source innovations in YARN while others are supporting Apache Mesos. In addition, there is a plethora of general purpose batch workload managers supporting Hadoop as “yet another workload pattern” (YAWP – you heard it here first!). This includes our own Platform LSF with our freely available Hadoop Connector for LSF to enable existing Platform Computing customers to support Hadoop MapReduce natively on existing HPC clusters. Also, many distributed applications embed their own proprietary solutions for workload management in clustered environments.
To borrow from object-oriented programming terminology, multitenancy is an over-loaded term. It means different things to different people depending on their orientation and context. To say a solution is multitenant is not helpful unless we are specific about the meaning. Some interpretations of multitenancy in Big Data environments are:
- Support for multiple concurrent Hadoop jobs
- Support for multiple lines of business on a shared infrastructure
- Support for multiple application workloads of different types (Hadoop and non-Hadoop)
- Provisions for security isolation between tenants
- Contract-oriented service level guarantees for tenants
- Support for multiple versions of applications and application frameworks concurrently
- Reporting and analytics for usage reporting and chargeback accounting
YARN is beginning to address some of these requirements.
Standards ‘R’ us – but capabilities matter too
IBM’s view is simple. If open standards solve the business problem, this is the best solution.
In InfoSphere BigInsights, IBM offers a 100% standard Hadoop solution as the default choice, but also provides additional capabilities for clients that outgrow Hadoop. Good examples of innovations that can be optionally deployed are IBM GPFS for clients needing a POSIX-compliant, HDFS-compatible alternative, Adaptive MapReduce for clients needing agile high-performance scheduling, and IBM BigSQL for clients needing a 100% ANSI-compliant interface to data residing in HBase or other data formats in Hadoop. There are many others. Customers can choose to use YARN as a default resource manager, but for those with more challenging problems, we also offer more capable solutions.
True Multitenancy available now
In addition to Hadoop YARN, IBM offers a solution called IBM Platform Symphony, which has been profiled by Gartner research. IBM Platform Symphony has been designed from day one to support multitenancy with all of the sophisticated capabilities required, including security isolation, service level guarantees, chargeback accounting and dynamic resource sharing subject to policy. While IBM Platform Symphony supports your chosen Hadoop distribution, including IBM’s own offerings, we take a broader view, supporting not only Hadoop but a huge catalog of other applications as well: service-oriented apps, batch workloads, process-oriented workloads, MPI jobs and various long-running service frameworks.
IBM has been delivering multitenant distributed computing solutions for over ten years. We are proud that 12 of the top 20 global investment banks have deployed IBM Platform Symphony to manage their multitenant infrastructure. If you are struggling with hosting multiple Hadoop versions or instances on a common foundation we may well have the answer.
Learn about one client’s success
With over 9 million members, USAA’s mission is to facilitate the financial security of its members, associates, and their families by providing a full range of highly competitive financial products and services; in so doing, USAA seeks to be the provider of choice for the military community.
USAA began to adopt Hadoop in 2011. USAA have long been recognized for their innovative application of technology, and they have made unique advances in building and delivering a shared, multitenant big data infrastructure. In IBM’s view, USAA’s big data infrastructure is unique and represents an industry best practice. Robert Ghavidel is the technical architect with the USAA CTO office who has been instrumental in realizing a shared infrastructure environment at USAA.
Please join Robert and me on Wednesday May 21st in Las Vegas at IBM’s annual Edge Conference. Robert will speak to some of the business challenges that led to USAA embracing Hadoop, and describe how they have been able to position themselves for growth by working with IBM to build a multitenant big data environment.
Robert will be speaking in the Technical Edge track in Murano 3304 at the Las Vegas Venetian conference center on Wednesday May 21st.
You don’t want to miss it. We will post the presentation to the Edge website following the event. Follow me at @GJSissons to learn more.