Four Reasons Data Lakes Are Moving to the Cloud
From reducing complexity to increasing scalability, we offer four reasons your enterprise can benefit from moving its data lakes to a cloud platform.
- By Ravindra Punuru
- September 3, 2019
An internet search on the future of Hadoop yields a handful of articles questioning whether Hadoop is "officially dead" or has become irrelevant. As recently as three years ago, Forrester predicted annual growth of nearly 33 percent for Hadoop. Less than a year later, its analysts concluded that Hadoop would "take its place alongside data warehouses and mainframes."
Hadoop reset expectations for managing and analyzing extreme data volumes and varieties, but there's no question that its place in the data ecosystem has shifted in the wake of competitive cloud offerings that deliver more flexibility, lower cost, and simpler development.
Solutions from cloud providers Amazon, Microsoft, and Google provide more flexible, agile environments that scale elastically on demand and can capture, store, process, and analyze modern data faster, more easily, and at a lower price point.
Organizations that have made significant investments in Hadoop are likely to continue using it to store enterprise data while building data lakes in the cloud to capture new types of data or to migrate data from Hadoop over time. For smaller companies or companies just getting started with data lakes, the trend now is to start in the cloud and forego Hadoop altogether. Here are four reasons for moving data lakes to the cloud.
Reason #1: Complexity and Cost
The Hadoop ecosystem is notoriously complex and therefore expensive to manage, requiring deep Java skills and knowledge of the Hadoop platform. It has been said that "to properly support a Hadoop project, two data engineers are needed for every data scientist." Organizations are frustrated by the steep learning curve, inefficiency with small data sets, lack of security, and slow performance for analytics. Data lakes built on cloud platforms are much more intuitive, requiring less technical knowledge and, in turn, less spending on highly skilled, difficult-to-find resources.
Meanwhile, cloud-based on-demand infrastructures eliminate the need for investment in hardware to store and process data, allowing businesses to pay only for what they use. They no longer pay for maintenance and upkeep of the hardware, and charges are typically based on actual storage and compute costs with billing per query, per terabyte, per month, and so on.
Much of the software that serves cloud data lakes is also built in the cloud and is serverless, allowing organizations to get started faster at less cost, again paying only for what they use.
Critics say that costs under this pay-as-you-go model can spiral out of control. It's true that spending must be closely monitored, but the savings in engineering, specialized talent, proprietary hardware, and other expenses more than make up for this potential drawback.
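As a rough illustration of why usage needs watching, the sketch below estimates a monthly bill under a hypothetical pay-per-use rate card. The prices, workload figures, and rate structure are assumptions for illustration only, not any provider's actual pricing.

```python
# Rough monthly cost estimate for a pay-per-use cloud data lake.
# All rates and workload figures below are hypothetical, for illustration only.

PRICE_PER_TB_SCANNED = 5.00   # assumed query price, USD per TB scanned
PRICE_PER_TB_STORED = 23.00   # assumed storage price, USD per TB per month

def monthly_cost(tb_stored: float, queries_per_day: int, tb_scanned_per_query: float) -> float:
    """Estimate one month's bill: storage charges plus query-scan charges."""
    storage = tb_stored * PRICE_PER_TB_STORED
    scanning = queries_per_day * 30 * tb_scanned_per_query * PRICE_PER_TB_SCANNED
    return storage + scanning

# Example: 50 TB stored, 200 queries per day, each scanning 0.5 TB on average.
print(f"Estimated monthly cost: ${monthly_cost(50, 200, 0.5):,.2f}")
```

Even modest per-query rates multiply quickly across thousands of queries, which is why usage monitoring and query tuning remain worthwhile even as the overall economics favor the cloud.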
Reason #2: Technology Maturity
It has become increasingly difficult to move today's larger, more complex data from a growing variety of data sources into on-premises data lakes. Traditional data integration tools -- with their extract, transform, and load architectures -- couldn't deliver the volumes of data to Hadoop fast enough. Meanwhile, business users became extremely impatient with the long response times to analyze the volumes of data within Hadoop.
Though organizations committed to Hadoop have developed custom tools and work-arounds to overcome its limitations, these are often costly because they are the work of rare, highly skilled resources not available to all organizations.
Today's cloud data lakes are backed by a more mature technology landscape that supports the full data journey, from source to target, including data integration, transformation, aggregation, and BI and visualization. These cloud-native tools are designed for the variety, volume, and velocity of modern data. Available "as a service," they're easier to deploy, more intuitive to use, and always up to date.
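As a simple illustration of the transformation-and-aggregation stage of that journey, the sketch below uses pandas on a small in-memory sample that stands in for raw events landed in cloud object storage; the column names, events, and output file are hypothetical.

```python
# Minimal sketch of a transform-and-aggregate step in a data lake pipeline.
# In practice the raw events would be read from cloud object storage; a small
# in-memory sample stands in for that landing zone here (an assumption).
import pandas as pd

raw_events = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103, 103],
    "event_type":  ["view", "purchase", "view", "view", "view", "purchase"],
    "amount_usd":  [0.0, 42.50, 0.0, 0.0, 0.0, 19.99],
})

# Transform: keep revenue-bearing events. Aggregate: revenue and order count per customer.
purchases = raw_events[raw_events["event_type"] == "purchase"]
summary = (
    purchases.groupby("customer_id")
    .agg(total_revenue=("amount_usd", "sum"), orders=("amount_usd", "count"))
    .reset_index()
)

# Load: write the curated aggregate back out for BI and visualization tools.
summary.to_csv("customer_revenue_summary.csv", index=False)
print(summary)
```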
It's also worth noting that cloud data lakes are better suited to the complex deep learning required for artificial intelligence and machine learning applications.
Reason #3: Scalability
On-premises data lakes require significant manual effort to add and configure servers to accommodate more data sets, additional users, and spikes in activity. The on-demand infrastructures offered by public cloud providers enable organizations to elastically scale data lakes to support these ebbs and flows without scaling up the maintenance and operating costs.
In fact, infrastructure-as-a-service offerings from the public cloud providers now include auto-scaling features that automatically optimize resource utilization based on rules organizations create. They decide on minimum and maximum instance counts to ensure applications stay running without breaking the budget.
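To make the idea concrete, here is a minimal sketch of the kind of rule such a policy encodes: scale on a utilization threshold while staying within minimum and maximum instance counts. The thresholds and limits are illustrative assumptions, not any provider's defaults.

```python
# Minimal sketch of a threshold-based auto-scaling rule with min/max bounds.
# The limits and thresholds below are illustrative assumptions, not provider defaults.

MIN_INSTANCES = 2      # keep the application running even when idle
MAX_INSTANCES = 20     # cap spend during spikes
SCALE_OUT_CPU = 0.75   # add capacity above 75% average CPU
SCALE_IN_CPU = 0.25    # remove capacity below 25% average CPU

def desired_instances(current: int, avg_cpu_utilization: float) -> int:
    """Return the instance count this rule would target for the next interval."""
    if avg_cpu_utilization > SCALE_OUT_CPU:
        target = current + 1
    elif avg_cpu_utilization < SCALE_IN_CPU:
        target = current - 1
    else:
        target = current
    # Clamp to the configured bounds so the app stays up without runaway cost.
    return max(MIN_INSTANCES, min(MAX_INSTANCES, target))

# Example: a traffic spike pushes average CPU to 90% across 4 instances.
print(desired_instances(current=4, avg_cpu_utilization=0.90))  # -> 5
```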
Cloud infrastructure is more scalable, and so are many of the technologies that support cloud data lakes. Software-as-a-service offerings are likewise pay-as-you-go and on-demand, available through the cloud and able to scale up and down to handle growing data volumes and user counts without additional implementations or hardware.
Reason #4: Security and Governance
Data privacy and security can be complicated within the Hadoop ecosystem. With so many tools in the stack, each one must be configured with the right data access controls, authentication, and encryption. Accomplishing this within the complex Hadoop environment requires the right kind of expertise and attention.
Most security and governance requirements are now built into the cloud providers' infrastructure-as-a-service offerings, some of which include their own credentialing tools. All of the major providers support compliance with regulations and standards such as HIPAA, PCI DSS, GDPR, ISO, FedRAMP, and Sarbanes-Oxley.
Though data security has come a long way with the cloud, it shouldn't be left entirely to the cloud providers. Every organization must take responsibility for ensuring data security and privacy.
Moving Forward
The movement to Hadoop certainly led many of us to rethink our expectations for data, but cloud technologies have continued to advance the conversation. The as-a-service model is only gaining in popularity, driving efficiencies in people and budgets at many organizations. For those challenged by their current data lake environment or considering the cloud, now is the time to make the leap.
The old model of complex on-premises data management and processing has proven costly and cumbersome, and hiring the right talent for it has become increasingly difficult. By moving these workloads to the cloud, organizations can quickly realize significant time and cost savings and establish a more efficient, reliable data management system.
About the Author
Ravindra Punuru is cofounder and CTO of Diyotta, Inc., where he is responsible for modern data integration technology strategy, product innovation, and direction. With more than 20 years of experience in data management and consulting, Ravindra has broad knowledge of corporate management and the strategic and tactical use of cloud-based, data-driven technologies to improve innovation, productivity, and efficiency. Ravindra's past roles have included architecting and delivering enterprise data warehouse programs for large corporations including AT&T, Time Warner Cable, and Bank of America. You can contact the author via email or LinkedIn.