Analysis: Microsoft [Hearts] Spark for Powering Big Data, Analytics
At this week's Spark Summit in San Francisco, Redmond announced the official availability of Spark for Azure HDInsight. It’s not just a commercial, supported Spark cloud service, however. The company is preparing a slew of Spark-related goodies as well.
- By Steve Swoyer
- June 8, 2016
Some contenders just keep racking up the endorsements.
Take the Spark cluster computing framework, for example. Since its debut, Spark’s been endorsed by just about everybody, from IBM and Hewlett Packard Enterprise to Microsoft.
Last July, Microsoft announced a public preview of its new Spark cloud computing service, Spark for Azure HDInsight. Yesterday, at the Spark Summit in San Francisco, Redmond announced the service's official availability.
Microsoft isn’t just offering a commercial, supported Spark cloud service, however. That would merely be keeping pace with competitors Amazon and Google, which also market Spark-in-the-cloud services. No, Microsoft's cooking up several other Spark-related goodies, including a Spark-ready version of “R Server” -- a product based on technology it acquired from the former Revolution Analytics -- for both on-premises environments and Azure HDInsight.
“Today we are announcing an extensive commitment for Spark to power Microsoft’s big data and analytics offerings including Cortana Intelligence Suite, Power BI, and Microsoft R Server,” wrote Tiffany Wissner, senior director of data platform marketing, in a post on Redmond's SQL Server Blog.
Microsoft also announced a new free R client that “data scientists [can use] to build high performance analytics using R,” Wissner continued. What does this have to do with Spark? According to Wissner, Microsoft's new free R client can be used with R Server running on the desktop or in tandem with remote instances of R Server -- such as R Server for HDInsight. This is the equivalent of “pushing the computation to a production instance of Microsoft R Server such as SQL Server R Services, R Server for Hadoop and HD Insight with Spark,” Wissner explained.
The software giant already supports using its Power BI service with the public preview of Spark for Azure HDInsight. Data scientists or analysts could notionally use other visual discovery tools (e.g., Tableau or Qlik Sense) with Spark for HDInsight, too. At Spark Summit this week, Microsoft also announced new support for Spark Streaming in Power BI.
“Spark support in Power BI is now expanded with new support for Spark Streaming scenarios. This allows you to publish real-time events from Spark Streaming directly into one of the fastest growing visualization tools in the market today,” Wissner wrote.
What's the Big Deal about Spark?
It's a good question. Spark is an in-memory cluster computing framework. It can run on either a standalone basis or in the context of a host platform such as Hadoop or Cassandra.
What was new and different -- and therefore valuable -- about Hadoop was its combination of (relatively) cheap distributed storage with (relatively) cheap general-purpose parallelism. What's new and different (and valuable) about Spark is that it's much better suited than Hadoop for certain workloads (e.g., highly iterative analytical algorithms, streaming analytics, certain kinds of SQL analytics).
Spark is not a database, however. Strictly speaking, Hadoop isn't a database, either: it's a combined distributed processing and storage platform that (in HBase, Hive, and related projects) can perform database-like workloads. Yes, it's possible to perform database-like workloads in Spark (e.g., SQL query via its Spark SQL interpreter), but Spark itself performs none of the functions of a database management system. Spark doesn't persist data into a file system, let alone implement (hierarchical, relational, etc.) database logic. It is a massively parallel computing environment.
For highly iterative advanced analytics, Spark is a superior massively parallel computing environment. That helps explain much of its allure to vendors such as Microsoft.
Most advanced analytics algorithms iterate, which means they make multiple passes over the same data set (or over an evolving data set) to get an answer. It is extremely inefficient to perform this kind of work in a SQL database -- even though most self-styled “analytics” SQL databases (e.g., Actian Matrix, Hewlett Packard Enterprise Vertica, IBM Netezza, SAP HANA, Teradata Aster) do incorporate hundreds of in-database analytics algorithms and functions.
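To make "iterate" concrete, here is a minimal sketch in plain Python (not Spark) of an algorithm that makes repeated passes over the same data set: gradient descent fitting a straight line. The function name `fit_line`, the learning rate, the number of passes, and the sample data are all illustrative assumptions, not anything drawn from Microsoft's or Databricks' materials.

```python
def fit_line(points, learning_rate=0.05, passes=2000):
    """Fit y = slope * x + intercept by gradient descent.

    Every pass re-reads the entire data set, which is why keeping the
    data in memory matters so much for this class of algorithm.
    """
    slope, intercept = 0.0, 0.0
    n = len(points)
    for _ in range(passes):
        grad_slope = 0.0
        grad_intercept = 0.0
        for x, y in points:  # one full pass over the data
            error = (slope * x + intercept) - y
            grad_slope += 2 * error * x / n
            grad_intercept += 2 * error / n
        # Nudge the parameters against the mean-squared-error gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical data generated from the line y = 3x + 1.
data = [(x, 3 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
slope, intercept = fit_line(data)
```

With five points, 2,000 passes are trivial. With billions of rows, each pass that has to go back to disk is where the time goes.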
It's also much faster to do as much processing as possible in physical memory. The real performance killer in iterative analytical workloads is I/O bandwidth, particularly with respect to time-consuming disk reads and writes. A standalone desktop workbench -- e.g., R, SAS, or SPSS -- will generally try to run the entirety of an advanced analytic workload in memory; failing that, these tools will try to partition a problem so that they're able to do multiple passes in memory, write the results to disk, and then bring everything together at the end. The problem with a standalone R workstation is that the amount of memory it can be stuffed with is capped by the desktop or workstation form factor: 64 GB, 128 GB, and in rarer cases, 256 GB of RAM.
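The partition-then-combine approach described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how R, SAS, or SPSS actually implement it: the helper names (`chunks`, `partial_stats`, `combine`) and the sample values are hypothetical, and the per-chunk summaries are kept in a list here, where a real tool would spill them to disk.

```python
def chunks(values, size):
    """Yield successive slices small enough to fit the memory budget."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_stats(chunk):
    """Summarize one chunk as (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)


def combine(partials):
    """Merge per-chunk summaries into an overall mean and variance."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / count
    variance = total_sq / count - mean * mean  # population variance
    return mean, variance


# Hypothetical data set, processed three values at a time.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
partials = [partial_stats(c) for c in chunks(values, size=3)]
mean, variance = combine(partials)
```

The sum-of-squares formula is chosen for brevity; it can lose precision on large or tightly clustered data, so a production implementation would likely use a more numerically stable method.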
Spark is a cluster computing framework. You can cluster multiple servers together to form a single, massively parallel processing clustered system. As a result, you can scale Spark to support extremely large configurations. Spark clusters can be stuffed with 256 or 512 processors (for a total of 512 or 1024 simultaneous threads) and 8 TB, 16 TB, or 32 TB of RAM. At such scale, it's possible to run iterative advanced analytical workloads entirely in memory and to run multiple iterative workloads (or multiple iterations of the same workload) simultaneously. The more quickly you can iterate through analytical passes, the more quickly you can test hypotheses, run regressions, and conduct simulations. Being able to iterate fast is key to advanced analytics.
Spark has the potential to extend the power of tools such as Power BI, Alteryx, or Tableau in ways that conventional technologies (e.g., an in-memory column store on the desktop; an in-memory database server; even a massively parallel processing database) simply cannot.
“[Spark is] good for connecting BI tools,” Vida Ha, a lead solutions engineer at Spark commercial parent company Databricks, told attendees at Spark Summit East in February.
“Instead of writing all of your data directly into Tableau, you can have Tableau point to Spark and then ... read in your data from [Tableau] and then you’ll be able to utilize all of the cluster and distributed systems properties of Spark to analyze a larger data set,” she pointed out.
About the Author
Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at evets@alwaysbedisrupting.com.