
Avoid Ending Up with a Marshy Mess Instead of a Data Lakehouse

What do you need to know about implementing a data lakehouse and using it effectively? Ori Rafael, CEO and co-founder of Upsolver, shares his perspective.

We’ve heard about data lakes becoming swamps. Is your data lakehouse at risk as well? Ori Rafael, CEO and co-founder of Upsolver, spoke to Upside about data lakehouses: managing them, the best use cases, and how they interact with a data mesh.


Upside: How is a data lakehouse different from a data lake?

Ori Rafael: The defining characteristic of a data lake is that it allows you to cheaply and easily store any kind of data in its original format. This distinguishes it from a database or data warehouse, where the data must conform to the system’s (structured) native storage format.

A data lakehouse is an attempt to provide data warehouse functionality on top of a data lake. It is a solution built on top of a data lake’s cloud object storage that organizes the data to make it ready for analytics and provides a way to query the data. Data lakes by themselves are slow for analytics queries because the storage layer is optimized for cost and flexibility, not speed.

Technologies and products that bridge the gap between a raw data lake and a data lakehouse include data catalogs, data processing engines, SQL query engines, and optimization technologies such as conversion to columnar file formats, compression, compaction, and indexing.
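
To make that optimization step concrete, here is a minimal sketch in Python, assuming the pyarrow library and a hypothetical store of line-delimited JSON events; the paths, schema, and partition column are illustrative, not from any particular product. It rewrites raw events as compressed, partitioned Parquet, the kind of columnar layout a lakehouse query engine can scan efficiently.

    # A minimal sketch of one lakehouse optimization step: rewriting raw
    # JSON events as compressed, partitioned Parquet. Paths, the event
    # schema, and the partition column are hypothetical.
    import pyarrow.json as pj
    import pyarrow.parquet as pq

    # Read a batch of raw, line-delimited JSON events in their original format.
    events = pj.read_json("raw/events/2024-01-15.json")

    # Rewrite as columnar Parquet, compressed and partitioned by a column
    # analysts commonly filter on, so query engines can prune whole files.
    pq.write_to_dataset(
        events,
        root_path="lakehouse/events",
        partition_cols=["event_type"],
        compression="zstd",
    )

Compaction follows the same pattern: periodically rewriting many small files like these into fewer large ones so queries open fewer objects.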

Enterprises have been concerned that a data lake can turn into a data swamp. Can the same thing happen with a data lakehouse?

Three factors determine whether any data store is a “swamp”: technology, people, and processes. Data lakes are the poster child for the data swamp; they didn’t come with a built-in catalog, so a swamp was the default state. In the Hadoop days, they were often built as science projects (solutions looking for a problem) or for ad hoc projects (e.g., business-unit cloud data lakes).

In either case, a lack of disciplined governance led to a data swamp where data was stored but either not easily discoverable or not trusted to be correct. Of course, a data warehouse or data lakehouse can become “swampy” if there are no procedures for managing data and no people responsible for data hygiene.

What’s the best way to protect against that happening?

Clarity and discipline are the key requirements, so the most important thing to do is to approach data governance as a practice that requires formal management. Usually technology is not the issue. What makes governance hard is that it requires consistent and universal use of the technology, based on rules that have been well thought out and for which there is cross-organizational agreement. Of course, adequate resources for the effort are also required to ensure it doesn’t deteriorate after the initial enthusiasm has waned.

What does a data mesh have to do with lakehouses and data pipeline architecture? How do you avoid getting it all “meshed” up?

Think of the data mesh as a layer of organization implemented on top of the data stores you have in a company, be they databases, data warehouses, or data lakes, and whether they are owned centrally or controlled by distributed business units. The concept of the data mesh is that rather than copying data from various business units into a centralized store for cross-business sharing, you enforce quality and nomenclature standards at the business-unit level (called data domains in the data mesh literature) so that each domain exposes consistent data products to its peers.

A data mesh provides a cross-company catalog of data products. However, this uber-catalog is necessary but not sufficient for avoiding a data swamp. You have to monitor the domains creating and updating these products for compliance with corporate standards, and you must have the authority and power to bring them back in line when they diverge.

Does the data lakehouse have to be centralized or can it be decentralized into a data mesh?

The concepts operate at two different layers. The lakehouse is a tech stack for storing, processing, and querying data. It can be the property of a central authority or of a business domain. The data mesh is a construct for allowing domains to share data products, usually via a federated query engine, so a data mesh might connect data products that were created in domain-specific data lakehouses.
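
To give a hedged sense of what that sharing can look like in practice, here is a sketch using the Trino Python client as one possible federated engine; the host, catalog, schema, and table names are all hypothetical. Each domain’s data product appears under its own catalog, and a single query joins across them without centralizing the data.

    # A sketch of a federated query across two domain-owned data products.
    # The endpoint, catalogs, schemas, and tables are hypothetical.
    import trino

    conn = trino.dbapi.connect(
        host="trino.example.internal",  # hypothetical federated engine endpoint
        port=8080,
        user="analyst",
    )
    cur = conn.cursor()

    # Each domain exposes its data product under its own catalog; the mesh
    # lets one query join across domains without copying data centrally.
    cur.execute("""
        SELECT o.customer_id, SUM(o.order_total) AS total_spend
        FROM sales_domain.products.orders AS o
        JOIN marketing_domain.products.customers AS c
          ON o.customer_id = c.customer_id
        WHERE c.segment = 'enterprise'
        GROUP BY o.customer_id
    """)
    for row in cur.fetchall():
        print(row)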

Will the data lakehouse eliminate the use of data warehouses? Why or why not?

As envisioned, a fully realized lakehouse delivers data warehouse functionality on top of a low-cost, open cloud object store, which means the choice comes down to economics. That, in turn, depends on the mix of use cases for a given organization.

The data lakehouse will be attractive to the extent an organization works with complex and streaming data that is more economically stored and processed on a data lake. For organizations working solely with structured batch data, a data warehouse should suffice. Of course, data warehouse vendors are working to add support for data lake use cases (a data warelake?), so the lines will continue to blur.

How do data lakehouse systems compare in performance and cost to data warehouses?

Price/performance comparisons depend on the use case. A data lakehouse takes advantage of affordable cloud object storage and affordable processing instances. In general, a data lakehouse should offer better economics for use cases that require continual processing of complex semistructured data, or of any data at high scale. If the use case is mostly processing structured data in smaller volumes with more lenient data freshness requirements, then the data warehouse may prevail.

How easy is it for data analysts to use a data lakehouse?

The standard method for accessing a data lakehouse will be a SQL query engine, which is no different from a data warehouse. If a lakehouse is properly cataloged, then there will be no meaningful difference in access.
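
As a hedged illustration, here is a sketch that uses DuckDB as one possible engine over the Parquet files from the earlier example; the path and columns are hypothetical. The point is that the SQL is indistinguishable from what an analyst would write against a warehouse table.

    # A hedged illustration that lakehouse access is ordinary SQL. DuckDB
    # stands in for whichever engine sits on the catalog; the path and
    # columns are hypothetical.
    import duckdb

    result = duckdb.sql("""
        SELECT event_type, COUNT(*) AS events
        FROM read_parquet('lakehouse/events/**/*.parquet', hive_partitioning=true)
        GROUP BY event_type
        ORDER BY events DESC
    """).fetchall()
    print(result)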

[Editor’s note: Ori Rafael is the CEO and co-founder of Upsolver, a no-code data lake engineering platform for agile cloud analytics. Before founding Upsolver, he held various technology management roles in the IDF’s elite technology intelligence unit, followed by corporate roles. Rafael has a bachelor’s degree in computer science and an MBA.]
