How Generative AI and Data Management Can Augment Human Interaction with Data
With access to a broad sample of good-quality data and a thorough, disciplined tuning process, organizations will be better equipped to overcome the limitations of the current breed of generative AI functionality.
- By Alberto Pan
- March 6, 2024
Generative AI, thanks to natural language processing (NLP) capabilities and advances in large language models (LLMs), promises to transform content-creation and application development, as well as the very ways we interact with digital solutions and data. If business users had their way, generative AI would be built into applications across the whole organization, from marketing and operations to finance and beyond, to improve efficiencies, streamline processes, and reduce costs.
However, generative AI is in a nascent phase, and its shortcomings -- including hallucinations, limited training data, and concerns about governance, ethics, and copyright issues -- are well-known. Ultimately, generative AI will only be able to fulfill its promises if it has a strong data foundation. Data management will play a critical role in enabling generative AI to reach its full potential.
Why a Data Foundation Is Critical for Generative AI
Generative AI applications are developed using foundation LLMs as a starting point. These models are publicly available and licensable, offered by providers such as Amazon, Cohere, and Stability AI. From there, the foundation models are tuned, enhanced, and otherwise customized to meet the needs of the new application.
These custom-developed, AI-supporting models require data from which to identify patterns and ultimately learn. For example, the initial free version of ChatGPT was built on a model that leveraged gigabytes of data scraped from the internet. In contrast, a model supporting a business application would leverage a mix of internal and external data that might come from a CRM, one or more internal knowledge bases, and potentially many other sources of information. These models can never get enough data: the more they get, the more they can learn and the more accurate their answers become. However, in keeping with the time-tested principle of "garbage in, garbage out," they require data that is clean, accurate, and governed.
Therein lies the challenge for any generative AI development project. Organizations are uncertain about the most efficient way to get the necessary data to the model, especially when some of it lies in legacy systems, and they want to avoid the lengthy, complex ETL processes needed to move the data to where the model can access it. One solution can be found in a logical data management approach.
How Logical Data Management Impacts Data Fabrics
In contrast with ETL processes, logical data management solutions enable real-time connections to disparate data sources without physically replicating any data. This is accomplished with data virtualization, a data integration method that establishes a virtual abstraction layer between data consumers and data sources. With this architecture, logical data management solutions enable organizations to implement flexible data fabrics above their disparate data sources, regardless of whether they are legacy or modern; structured, semistructured, or unstructured; cloud or on-premises; local or overseas; or static or streaming. The result is a data fabric that seamlessly unifies these data sources so data consumers can use the data without knowing the details about where and how it is stored.
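To make the idea concrete, here is a minimal, purely illustrative sketch of a virtual abstraction layer. The class and source names are hypothetical, and the hardcoded rows stand in for live connections; the point is that the consumer queries one logical interface while the layer routes each request to the source that owns the data, with no copying.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Uniform query interface every adapter must implement."""
    @abstractmethod
    def fetch(self, entity: str) -> list[dict]:
        ...

class CrmSource(DataSource):
    """Stands in for a live CRM connection (illustrative data only)."""
    def fetch(self, entity: str) -> list[dict]:
        if entity == "customers":
            return [{"id": 1, "name": "Acme", "region": "EMEA"}]
        return []

class WarehouseSource(DataSource):
    """Stands in for a cloud data warehouse (illustrative data only)."""
    def fetch(self, entity: str) -> list[dict]:
        if entity == "orders":
            return [{"customer_id": 1, "total": 950.0}]
        return []

class VirtualLayer:
    """Routes each logical entity to the source that owns it at query time."""
    def __init__(self) -> None:
        self.routes: dict[str, DataSource] = {}

    def register(self, entity: str, source: DataSource) -> None:
        self.routes[entity] = source

    def query(self, entity: str) -> list[dict]:
        # The consumer never learns where or how the data is stored.
        return self.routes[entity].fetch(entity)

fabric = VirtualLayer()
fabric.register("customers", CrmSource())
fabric.register("orders", WarehouseSource())

print(fabric.query("customers"))
print(fabric.query("orders"))
```

A production data fabric adds query federation, pushdown optimization, caching, and governance on top of this routing idea, but the abstraction boundary is the same: consumers see entities, not storage systems.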
In the case of generative AI, where an LLM is the “consumer,” the LLM can simply leverage the available data, regardless of its storage characteristics, so the model can do its job. Another advantage of a data fabric is that because the data is universally accessible, it can also be universally governed and secured. With these capabilities, the data fabric can easily furnish AI models with high-quality sample data in real time.
Data Catalogs Reimagined
Data fabrics also enable a new breed of data catalogs that list all of the authoritative data sources made available through the data fabric and enhance each listing with rich metadata and a clear display of the lineage of any given data set. With generative AI, however, the role of data catalogs expands further, because generative AI supports data catalogs and vice versa. Data catalogs will assist generative AI by providing models with a single source of data across the entire enterprise, using data that is governed, consistent, and expressed in accordance with a unified semantic layer couched in standard business terminology.
At the same time, generative AI will strengthen the data catalog by enabling users to simply submit their queries using natural language, either spoken or written. Through AI-powered data catalogs, and data catalog–powered LLMs, organizations can build generative AI applications that democratize data access, bringing us one step closer to having meaningful, real-time conversations with applications.
Within the logical data management interface itself, additional AI-powered enhancements could kick in after the user submits a query. For example, in less than a second, the interface could do three things:
- Display the SQL required to perform the desired query, including all of the necessary commands and parameters
- In a different text box, explain the steps the query performed, including groups and joins and which tables and data sets were involved
- In yet another box, display the end result
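The three outputs above can be sketched in miniature. In this hypothetical example, a lookup table stands in for the LLM that would translate the natural-language question, and an in-memory SQLite database stands in for the data fabric; the `answer` function and its schema are assumptions for illustration, not a real product API.

```python
import sqlite3

def answer(question: str, conn: sqlite3.Connection) -> dict:
    """Return the generated SQL, a plain-language explanation, and the result."""
    # Stand-in for the LLM translation step: a canned question-to-SQL mapping.
    translations = {
        "total sales by region": (
            "SELECT region, SUM(total) AS sales "
            "FROM orders JOIN customers ON orders.customer_id = customers.id "
            "GROUP BY region ORDER BY region",
            "Joins orders to customers on customer_id, then groups by region "
            "and sums the order totals.",
        )
    }
    sql, explanation = translations[question.lower()]
    rows = conn.execute(sql).fetchall()
    return {"sql": sql, "explanation": explanation, "result": rows}

# Illustrative schema and data standing in for the governed data fabric.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'AMER');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 200.0);
""")

out = answer("Total sales by region", conn)
print(out["sql"])          # box 1: the SQL a skilled user could tweak
print(out["explanation"])  # box 2: how the query was performed
print(out["result"])       # box 3: the end result
```

Returning the SQL and the explanation alongside the result is what makes the answer auditable: a skilled user can edit the query directly, while a novice can verify the reasoning without touching SQL.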
By providing this explainability and history, the catalog would enable skilled users to go back and tweak the SQL to fine-tune the results. It would also enable novice users to get an immediate, high-confidence answer without filing a request with IT, while knowing how the model arrived at the results.
The tools are in place to take generative AI to its next incarnation, where it can be embedded in myriad business processes for a transformative effect. With access to a broad sample of good-quality data and a thorough, disciplined tuning process, organizations will be better equipped to overcome the limitations of the current breed of generative AI functionality. More important, generative AI will then be strong enough to enable organizations to democratize data and do so in a powerfully effective way.
About the Author
Alberto Pan is EVP and chief technical officer at Denodo, a provider of data management software. He has led product development tasks for all versions of the Denodo Platform and has authored more than 25 scientific papers in areas such as data virtualization, data integration, and Web automation.