Big Compute meets Big Data
HPC systems have evolved significantly over the last two decades. Once the domain of purpose-built supercomputers, HPC today is ruled by clustered systems. Horizontal scaling has proven to be the most cost-efficient way to increase capacity. What supercomputers all have in common today is their reliance on distributed computing.
The move toward parallelism and distributed computing was enabled by innovation in several areas. Low-latency interconnects like InfiniBand help distributed systems behave like a single computer. Programming models like MPI help developers simplify the coordination of multiple concurrent execution threads on distributed topologies, and cluster and workload management technologies make large numbers of jobs and hosts practical to manage. Distributed systems also demanded more capable storage architectures, and today’s supercomputers most often rely on parallel file systems. Connected via switched fabrics, modern storage subsystems deliver file system I/O at rates that exceed those of even locally connected disk.
HPC is changing again, and the catalyst this time around is Big Data. As storage becomes more cost-effective, and we acquire the means to electronically gather more data faster than ever before, data architectures are being re-considered once again. What happened to compute over two decades ago is happening today with storage.
Consider that Facebook generates approximately 60 terabytes of log data per day. We become numb to large numbers like this, and it is easy to forget just how much data this is. The time required to read 1 terabyte from a single disk at 50 megabytes per second is a little under 6 hours; reading 60 terabytes at this rate would take about two weeks. When confronted with these volumes of data, the only path forward is to harness distributed approaches and rely on parallelism. Leaders like Google and Yahoo did exactly this out of necessity, with Google creating the Google File System and that work leading to the creation of Hadoop. Hadoop relies on clusters of storage-dense nodes to store vast data volumes reliably and economically, while leveraging a programming framework that enables fast parallel processing of the data. One of the key ideas behind Hadoop MapReduce is that it is more efficient to send compute tasks to where the data already resides than to move the data across networks.
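The read-time estimate above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 terabyte = 10^12 bytes) and a sustained 50 megabytes per second per disk, and the 1,000-disk case is a hypothetical added purely to show what parallel reads buy. The function name is illustrative.

```python
# Back-of-the-envelope read times, assuming decimal units (1 TB = 10**12 bytes).
TB = 10**12
MB = 10**6

def read_time_seconds(data_bytes, throughput_bytes_per_s, disks=1):
    """Time to read data_bytes when it is split evenly across `disks`
    drives that are all read in parallel at the same sustained rate."""
    return data_bytes / (throughput_bytes_per_s * disks)

one_tb_hours = read_time_seconds(1 * TB, 50 * MB) / 3600
sixty_tb_days = read_time_seconds(60 * TB, 50 * MB) / 86400
# Hypothetical cluster: the same 60 TB spread over 1,000 disks.
parallel_minutes = read_time_seconds(60 * TB, 50 * MB, disks=1000) / 60

print(f"1 TB, one disk:     {one_tb_hours:.1f} hours")        # 5.6 hours
print(f"60 TB, one disk:    {sixty_tb_days:.1f} days")        # 13.9 days
print(f"60 TB, 1,000 disks: {parallel_minutes:.0f} minutes")  # 20 minutes
```

Under these idealized assumptions, a job that would take roughly two weeks on one disk takes about 20 minutes when the data is spread across 1,000 disks, which is the arithmetic behind the case for parallelism.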
While technology approaches are converging around distributed computing principles, many business problems are, interestingly, demanding this type of convergence at the same time. Problems increasingly require a combination of traditional high performance computing approaches and access to Big Data and analytics toolsets such as those found in Hadoop.
Consider an emerging field like precision agriculture. Optimizing crop yields depends on many factors. Weather modeling (essentially a computational fluid dynamics problem) is a major one, and this lies in the realm of HPC. If we can better predict weather, we have a better idea of when to plant, the type of seed to plant, and the fertilizer to apply. Many capabilities in the field of agri-business, however, rely on Big Data. These include electronic sensors that detect soil moisture, GPS-enabled tractors for precision guidance, geo-referenced soil mapping, and remote sensing via aircraft or satellite to detect soil aridity, pests, snowpack and more. With better information, seed producers can genetically tailor products to deliver better yields based on ever-changing environmental conditions.
Genomic sequencing, as employed in seed development and the life sciences, is another example of traditional HPC workloads being augmented by Big Data techniques. Staple tools like BLAST (Basic Local Alignment Search Tool), Picard and Bowtie, which run in traditional HPC environments, are being supplemented by new tools expressed as Hadoop MapReduce algorithms, including Quake, Contrail and BioPig, which rely on Hadoop file systems.
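To give a feel for what "expressed as a MapReduce algorithm" means, here is a toy sketch that counts k-mers (length-k substrings) across a set of reads, a common building block in sequence analysis. It is not taken from any of the tools named above and does not use Hadoop; the function names, the in-memory `shuffle` step and the sample reads are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every length-k window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Group values by key. In a real framework this happens across the
    cluster; here it is a plain in-memory stand-in."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

reads = ["ACGTAC", "GTACGT"]  # hypothetical toy reads
mapped = chain.from_iterable(map_kmers(read, 3) for read in reads)
counts = reduce_counts(shuffle(mapped))
print(counts)  # every 3-mer (ACG, CGT, GTA, TAC) appears twice
```

Because the map step looks at one read at a time, each node can process the reads it already stores, and only the small (k-mer, count) pairs need to cross the network.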
We may not think of insurance companies as wrestling with HPC, but like most financial firms, insurers rely on Monte Carlo simulation for actuarial analysis and to value portfolios of financial instruments over large numbers of market scenarios. As insurance products have become more complex, and have trended toward hybrid insurance-investment products, the need for computer modeling has increased dramatically. Insurers are also confronted with Big Data challenges. Competitive pressures demand that they account for innovations like collision avoidance technologies when pricing policies, and firms are deploying new technologies like in-vehicle telematics and providing rebates to customers for good driving habits. Insurers are looking toward Big Data technologies like click-stream analytics to understand user experience on their websites, and social media analytics to understand customer behaviors and seek out more desirable risk pools.
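For readers unfamiliar with the technique, the sketch below shows Monte Carlo simulation in miniature: it draws many random market scenarios for a single asset and summarizes the resulting values. It is a simplified illustration rather than an actuarial model. The lognormal (geometric Brownian motion) price model, the parameter values and the function name are all assumptions chosen for the example.

```python
import math
import random

def simulate_terminal_values(s0, mu, sigma, horizon_years, n_scenarios, seed=0):
    """Draw n_scenarios end-of-horizon values for one asset, assuming its
    price follows geometric Brownian motion (an illustrative model choice)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    drift = (mu - 0.5 * sigma ** 2) * horizon_years
    vol = sigma * math.sqrt(horizon_years)
    return [s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
            for _ in range(n_scenarios)]

# Hypothetical inputs: value 100 today, 5% drift, 20% volatility, 1 year.
values = simulate_terminal_values(100.0, 0.05, 0.2, 1.0, 100_000)

mean_value = sum(values) / len(values)
# 5th-percentile outcome: 95% of simulated scenarios end above this value.
p5 = sorted(values)[int(0.05 * len(values))]

print(f"mean simulated value:  {mean_value:.2f}")
print(f"5th percentile value:  {p5:.2f}")
```

Each scenario is independent of the others, so the work divides cleanly across many hosts. That is why this kind of workload fits clustered HPC systems so well.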
Despite the convergence taking place, organizations often feel pressure, for operational reasons, to implement systems that focus discretely on either HPC or Big Data problems. The problem with continuing to silo systems, however, is cost. Deploying separate infrastructures for each problem type is simply too expensive, and firms that do so will be at a competitive disadvantage relative to their peers.
Fortunately, tools and technologies already exist to make this convergence easier to manage. Modern workload schedulers like IBM Platform Symphony and IBM Platform LSF can run Hadoop workloads alongside traditional HPC workloads, allowing them to co-exist and share resources according to policy. Storage solutions such as parallel file systems directly support both Hadoop-oriented file access semantics on distributed nodes and traditional high-bandwidth parallel access to shared storage. By consolidating Hadoop and HPC storage onto a single infrastructure, cost savings can be substantial. Firms can avoid replicated copies of data and can simplify and accelerate workflows by giving both HPC applications and Hadoop native access to the same datasets.
Increasingly, Big Data is everywhere we look, and while not the answer for every problem, innovations like Hadoop have the potential to accelerate scientific discovery and help contain associated costs. For many, the innovations taking place in Big Data analytics are too important to ignore.
Much as distributed approaches revolutionized “big compute,” the same is now happening with Big Data. Merging these complementary capabilities increases the potential to solve even more complex challenges, and lets us draw on tried-and-true tools and techniques from both fields. A new era of discovery is upon us. Will you embrace it?
Gord Sissons is Product Marketing Manager for IBM Platform Symphony at IBM. He may be reached at editor@ScientificComputing.com.