Adding Corporate Data to LLMs with Nicolas Decavel-Bueff
From the rise of RAG to data privacy considerations, Nicolas Decavel-Bueff explains what your organization needs to pay attention to when it comes to AI.
- By Upside Staff
- June 20, 2024
In the latest Speaking of Data podcast, Nicolas Decavel-Bueff, data science lead at Further, spoke about adding corporate data to large language models. Decavel-Bueff will be teaching a course at TDWI San Diego (August 4 - 9) on August 5, Hands-On Introduction to Customizing Large Language Models. [Editor’s note: Speaker quotations have been edited for length and clarity.]
In the past year and a half, Decavel-Bueff has focused on generative AI. He’s been working with ChatGPT and large language models (LLMs) and finding ways to give these systems formulated tasks so users can start getting value quickly. Given his hands-on experience with LLMs, the conversation started by exploring why LLMs have grown so much recently.
Decavel-Bueff’s explanation was not highly technical (he saves that for his workshop, he says). “LLMs have been here for a while. Just because we've recently heard about ChatGPT and everybody's talking about generative AI doesn't mean that this is a net new thing. Natural language processing started when language translation was a huge issue people were trying to tackle; since then, we've had small language models that needed less data for training. These small models use traditional deep learning architectures such as recurrent neural networks, and they were really good at solving particular tasks. If you wanted to identify the sentiment behind a tweet -- whether it was positive or negative -- you could do that with a small language model. You would need a lot of labeled data, so the data curation effort was high, but you could do it.
“Back in 2017, Google wrote a paper that introduced the transformer architecture that the large language models we talk about today are based on. That was the big leap into the large language model world, along with training on so much more data.
“As they used this architecture, and as they trained on more data, they found they were able to reach a point where the model could generalize to a lot of different tasks with something they call few-shot learning. You don't need a thousand tweets with their sentiment. You could just give a few examples, maybe three or four -- up to 20 for more complex tasks -- to help guide the model on how to answer. That was a really exciting change.
“Then, obviously there’s ChatGPT. It's actually been super fun as a data scientist because I've been talking about this for a while, but it wasn't until last year when I could talk to my parents about it. They would ask me questions, they would show me on their phone, ‘Look what ChatGPT said; why did it say that?’ It led to some great conversations, so it's nice that my parents know what I do.”
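To make the few-shot idea concrete, here is a minimal sketch of a sentiment classifier built from a handful of labeled examples rather than thousands. The example tweets, the gpt-4o model name, and the use of the OpenAI Python client are illustrative assumptions only, not something Decavel-Bueff prescribes.

```python
# A minimal few-shot sentiment classifier, assuming the OpenAI Python client
# (any chat-style LLM API would work the same way). Labels and model name are
# illustrative, not from the interview.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A handful of labeled examples replaces the thousands a small model would need.
FEW_SHOT_EXAMPLES = [
    ("Loved the new release, setup took five minutes!", "positive"),
    ("The app crashes every time I open my cart.", "negative"),
    ("Shipping was fine, nothing special.", "neutral"),
]

def classify_sentiment(tweet: str) -> str:
    """Classify a tweet's sentiment using few-shot prompting."""
    examples = "\n".join(f"Tweet: {text}\nSentiment: {label}"
                         for text, label in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify the sentiment of the final tweet as positive, negative, or neutral.\n\n"
        f"{examples}\n\nTweet: {tweet}\nSentiment:"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_sentiment("Support answered in two minutes and fixed everything."))
```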
RAG Tackles LLMs’ Deficiencies
LLMs are not perfect. “Let's talk about where large language models fail and why retrieval-augmented generation (RAG) is a solution. Whenever you interact with something like ChatGPT, it will say, ‘Knowledge cutoff: July 2023’; even the recently released GPT-4 Omni model has a cutoff of October 2023.” Thus, ChatGPT isn’t using the most current information, which is an issue for some use cases.
The other problem Decavel-Bueff points to is hallucinations. He used ChatGPT as his example because it's accessible enough that people can relate to it, but he acknowledges there are many other LLMs available.
“What ChatGPT and other LLMs are really good at is natural language generation -- being able to create a response that sounds right. The issue is you'll tend to believe something if it sounds correct. It'll give you an essay (or a few essays) about why that answer is correct, but the issue with hallucinations is sometimes that information isn't correct. If you just trust what you see, you get into an issue of accountability in terms of using that information for part of a decision-making process.
“What's great about retrieval-augmented generation is that you're able to connect these systems to data sources that are more up to date. You can connect it with your corporate data, whether that's Confluence pages or a database holding product information or customer reviews; the possibilities for what you can connect it to are endless.
“With RAG you solve the problem of data recency. As for hallucinations, you can now directly reference material your organization has vetted as accurate -- the information you'd want cited when generating these responses. You're going to hear more and more about it because it's where organizations are finding practical use cases of large language models rather than just the fun ones.”
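As a rough illustration of the pattern he describes, the sketch below embeds a few internal documents, retrieves the most relevant ones for a question, and passes them to the model as context. The in-memory document list, the OpenAI embedding and chat calls, and the model names are assumptions for this example; a production system would typically use a vector database over sources such as Confluence pages or product data.

```python
# A minimal retrieval-augmented generation (RAG) sketch, assuming the OpenAI
# client for both embeddings and generation. Document contents are invented
# placeholders standing in for vetted corporate data.
import numpy as np
from openai import OpenAI

client = OpenAI()

DOCUMENTS = [
    "PTO policy: full-time employees accrue 1.5 vacation days per month.",
    "Promotion cycles run twice a year, in March and September.",
    "Health benefits enrollment opens every November.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

DOC_VECTORS = embed(DOCUMENTS)

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve: rank documents by cosine similarity to the question.
    q = embed([question])[0]
    scores = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    context = "\n".join(DOCUMENTS[i] for i in np.argsort(scores)[::-1][:top_k])

    # Generate: ask the model to answer only from the retrieved, vetted context.
    prompt = (
        "Answer using only the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(answer("How does my PTO accrue?"))
```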
Data Privacy
How can you take a generative AI chat program and use it within an organization without having your data released to the public?
“Look at the new GPT-4 Omni, where you can now talk with ChatGPT and it's collecting a ton of data from users. If you're in the free tier, or even the paid tier, it will still collect your data. It will use that data to further train the model and strengthen their competitive advantage.”
To avoid this, Decavel-Bueff says “If you're using something like ChatGPT, you can get their enterprise plan. It's a bit more expensive but it makes sure that all the information you’re using remains private and that they don't use it for training. Another option is to use an open source model. For example, Meta’s Llama 2 has made great strides in the open source community. For some use cases, you don't need the fanciest or the best model available; you can solve your problem with something that's just slightly below what the closed source models deliver, but you can host that all internally or on your own cloud tenant to make sure it's all secure. You know exactly where all the data is going, and it's a closed system.”
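For the self-hosted route he mentions, a minimal sketch using the Hugging Face transformers library might look like the following, so prompts and outputs never leave your own infrastructure. The specific model ID and generation settings are illustrative assumptions; Llama 2 weights also require accepting Meta's license on the Hugging Face Hub.

```python
# A minimal self-hosting sketch, assuming the Hugging Face transformers library
# and a downloaded open-weight model. Nothing here calls an external API, so
# prompts and responses stay inside your own environment or cloud tenant.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model ID
    device_map="auto",                      # place layers on available GPUs/CPU
)

prompt = "Summarize our refund policy for a customer in two sentences."
output = generator(prompt, max_new_tokens=128, do_sample=False)
print(output[0]["generated_text"])
```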
Choosing the Right Model
What should you consider when you choose a model? According to Decavel-Bueff, “When you decide to use something like Llama 2 or GPT-4, you want to understand what data was used to train it. You want to consider all its limitations. What are the privacy considerations? If you don't, and you choose a model based only on the perceived quality of its output, then you're going to face these issues down the line.”
What are some best practices that organizations should consider when implementing their own custom RAG solution?
“With any AI use case, if you don't have a problem for it, maybe you shouldn't be using it. You really don’t want to define the solution first (that's always a bad, bad way to start), but define the problem, define how you're evaluating that problem, and define how the problem relates directly to business impact. Otherwise, it’s hard to have a successful AI implementation.
“Assuming you have that, you get into the next biggest issue, which is data governance and quality. Suppose your company has created an HR bot and it allows people to talk with a chatbot that answers questions such as ‘How does my PTO work? What are my benefits? How does promotion work?’ It's able to properly source its responses. The issue arises when it's sourcing information that the individual asking shouldn't have access to. If I were to ask, ‘How much is my boss paid?’ and it starts returning that information, that's a good indication that you aren't at a data maturity level ready to implement RAG at that scale.
“With any AI, bad quality data leads to a bad quality model. Make sure your data is of high quality, that the permissions are appropriate, and that someone is accountable for that process. You can’t say, ‘We have everything in the drive, everybody's permissions should be valid,’ because RAG opens this can of worms.” He also cautions against giving your program access to outdated information because that information is likely to be used. “Making sure you understand what information is being accessed is critical.”
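One way to act on that caution is to attach permission metadata to every document and filter the retrieved set against the requesting user before anything reaches the prompt. The sketch below is an illustrative pattern only: Document, user_can_read, and the group-based check are hypothetical helpers, not a specific product's API.

```python
# An illustrative permission-aware RAG step: filter retrieved documents by the
# requesting user's entitlements *before* they are added to the prompt.
# `retrieve` and `user_can_read` are assumed helpers, not a real library's API.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    allowed_groups: set  # e.g. {"all-employees"} or {"hr-only"}

def user_can_read(user_groups: set, doc: Document) -> bool:
    """A user may read a document if they share at least one group with it."""
    return bool(user_groups & doc.allowed_groups)

def build_context(question: str, user_groups: set, retrieve) -> str:
    candidates = retrieve(question)  # ranked Document objects
    permitted = [d for d in candidates if user_can_read(user_groups, d)]
    # Only vetted, permitted material ever reaches the prompt.
    return "\n".join(d.text for d in permitted[:3])

# Example: an employee asking about compensation should not see HR-only records.
docs = [
    Document("PTO accrues at 1.5 days per month.", {"all-employees"}),
    Document("Manager salary bands: ...", {"hr-only"}),
]
print(build_context("How much is my boss paid?", {"all-employees"}, lambda q: docs))
```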
Finally, many implementations fail or struggle to get past the proof-of-concept phase because of a lack of metric-driven experimentation. “It's really easy to get stuck in a loop when you're testing these models. You're testing different prompts as well as the process, but you don't track your experiments as well as you should. Three months into the project, you don't have a clear leaderboard of your best experiments. You have to make sure you have good machine learning practices in place. I always highly recommend using good metrics, specifically ones focused on retrieval-augmented generation. Also gaining popularity is the idea of using a different large language model to help score your responses.”
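The scoring idea he mentions -- using a second large language model as a judge -- can be sketched roughly as below, with each experiment's average score appended to a simple leaderboard file so results stay comparable months into a project. The rubric, 1-5 scale, CSV format, and model name are assumptions for illustration.

```python
# An illustrative "LLM as judge" scorer plus a tiny experiment log, assuming the
# OpenAI client. The rubric and CSV leaderboard are invented for this sketch,
# not a method prescribed in the interview.
import csv
import datetime
from openai import OpenAI

client = OpenAI()

def judge(question: str, context: str, answer: str) -> int:
    """Ask a second model to grade how faithful an answer is to its sources (1-5)."""
    prompt = (
        "Score from 1 to 5 how faithfully the answer is supported by the context. "
        "Reply with the number only.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

def log_experiment(name: str, scores: list[int], path: str = "leaderboard.csv") -> None:
    """Append one row per experiment so there is always a current leaderboard."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.date.today(), name, round(sum(scores) / len(scores), 2)]
        )
```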
Don’t forget the ethical considerations when using these LLMs. “That’s critical, especially when you think about responsible AI and how you deliver these things -- not just confidently but knowing you're not creating a security risk or putting your organization in a spot for really bad press in the future.”