Big Data: Something Borrowed, Something Blue
When you’re 100 years old, as IBM is this year, it would be easy to think that you’ve seen it all. What could possibly be new to Big Blue about “big data”? In the view of Robert LeBlanc, SVP of Middleware Software for the IBM Software Group, quite a bit.
The new problem set, defined by business opportunities that new sources of information are opening up, cannot be solved with traditional data systems alone. Kicking off the IBM Big Data Symposium for industry analysts at the Yorktown Research Center on May 11, LeBlanc itemized a number of challenges, including multi-channel analysis of customer sentiment and experience, detection of life-threatening conditions at hospitals in time to intervene, interdiction of Medicare fraud before payment is made, and prediction of weather patterns to optimize wind turbine locations. (Note: The next TDWI Solution Summit, September 25-27 in San Diego, will feature case studies focused on the theme of “Deep Analytics for Big Data.”)
“Big data” is both an evolutionary and a revolutionary phenomenon. Given that organizations have been working with large data warehouses and other types of files for some time, it should come as no surprise that the sheer quantity of data continues to grow. Data is a renewable resource; the more applications and systems that use it, the more data they tend to generate. Data warehouses will continue to be important, but even as the terabytes of structured data pile up, organizations are hunting down unstructured sources to tap their value and discover new competitive advantages.
IBM’s view of what makes big data revolutionary comes down to the convergence of the three “V’s”: volume, velocity, and variety. Volume is the easiest to understand, although IBM speakers at the Symposium described scenarios where so much data was streaming through in real time that storing it all was impossible. Huge data volumes, plus the velocity with which the data flows in, are opening up opportunities for alternative technologies, including Hadoop, MapReduce, and event stream processing. Variety, the third “V,” adds in the unstructured and complex data sources growing up on the Web, particularly in social media. Some organizations, of course, do store all this data; Eric Baldeschwieler, VP of Hadoop Development at Yahoo!, described how the company uses the Hadoop Distributed File System (HDFS) to store petabytes of data on nodes across its vast array of clusters. “Hadoop is behind everything we do,” he said.
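To make the MapReduce idea concrete, here is a minimal word-count sketch in Python. It runs locally in a single process purely to illustrate the programming model; the sample input lines are invented for illustration, and on a real Hadoop cluster the map and reduce steps would run in parallel across HDFS blocks rather than in one script.

    #!/usr/bin/env python
    # Minimal local illustration of the MapReduce programming model: word count.
    # On Hadoop, map and reduce tasks would run in parallel across HDFS blocks;
    # here they run in one process purely to show the idea.
    from collections import defaultdict

    def map_phase(lines):
        # Map: emit a (word, 1) pair for every word in every input line.
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Reduce: sum the counts for each word (Hadoop's shuffle groups the keys).
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return counts

    if __name__ == "__main__":
        sample = [
            "big data is both evolutionary and revolutionary",
            "big data is not about a single technology",
        ]
        for word, total in sorted(reduce_phase(map_phase(sample)).items()):
            print(word, total)

Event stream processing, by contrast, applies similar logic to data in motion, updating results continuously rather than batch-processing data already stored in HDFS.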
It was not surprising news, but Baldeschwieler and the IBM experts gave a full-throated defense of Apache Hadoop and of the importance of having open source software at the foundation of big data programs. IBM did not mention EMC explicitly, but it was clear that the company was responding to EMC’s May 9 announcement of the new Greenplum HD Data Computing Appliance, which offers its own distribution of Apache Hadoop. IBM execs warned of the dangers of “forking,” which is what happened when vendors created their own versions of the UNIX operating system and users had to deal with competing standards. Baldeschwieler and the IBM execs did acknowledge, however, that Apache Hadoop is far from a finished product, and in any case is not the solution to all problems.
I came away from the Symposium excited by the future of big data analytics but also aware that there’s a long way to go. “Big data” is not about a single technology, such as Hadoop or MapReduce (for more on Hadoop, see my colleague Philip Russom’s interview with the CEO of Cloudera here). These technologies are more a complement to data warehousing than a replacement for it. Yahoo!’s Baldeschwieler made the point that Yahoo! also has data warehouses. As each industry’s requirements become clearer, vendors such as IBM will assemble packages that bring together the strengths of their existing solutions with new technologies. Then organizations will have a better understanding of how to compare the vendors’ offerings. We’re not quite there yet.
Posted by David Stodder on May 17, 2011