Harnessing Retrieval Augmented Generation (RAG) for LLMs
Understanding Retrieval Augmented Generation
The rise of generative AI has sparked significant interest in AI assistants capable of performing diverse tasks, from product shopping to information retrieval. These applications leverage advancements in large language models (LLMs), which are trained on extensive textual datasets to create a robust knowledge repository. Yet, LLMs often struggle with effectively retrieving and applying this knowledge, resulting in challenges such as hallucination (generating incorrect information), knowledge cutoffs, and misunderstandings of niche topics. So, how can we enhance an LLM's capacity to access and utilize reliable information?
"If AI assistants are to play a more useful role in everyday life, they need to be able not just to access vast quantities of information but, more importantly, to access the correct information." — source
The answer lies in one of the most widely adopted methods for enriching LLMs with knowledge: retrieval augmented generation (RAG). This overview delves into RAG, which is remarkably straightforward to implement and highly effective at integrating LLMs with external data sources. It can improve the factuality of an LLM, update the model's knowledge with recent data, or even specialize a model over proprietary information without the need for extensive finetuning.
What is Retrieval Augmented Generation?
To fully grasp the technical aspects of this overview, it's crucial to understand retrieval augmented generation (RAG), its operation, and its advantages. LLMs inherently possess substantial knowledge within their pretrained weights (parametric knowledge) that can be accessed through prompting. However, these models are prone to hallucination, which indicates that the parametric knowledge may not always be trustworthy. Fortunately, LLMs can utilize in-context learning, defined as the ability to employ information present in the prompt to yield superior output. RAG enhances the LLM's knowledge base by integrating pertinent context into the prompt, leveraging in-context learning to improve response quality through this additional context.
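To make this concrete, below is a minimal, purely illustrative contrast between a prompt that relies only on parametric knowledge and the same question augmented with a retrieved passage:

```python
# Parametric knowledge only: the model must answer from whatever it memorized during training.
plain_prompt = "When was the Hubble Space Telescope launched?"

# In-context learning: the same question, with a retrieved passage placed directly in the prompt.
retrieved_passage = (
    "The Hubble Space Telescope was launched into low Earth orbit "
    "by the Space Shuttle Discovery on April 24, 1990."
)
augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {retrieved_passage}\n\n"
    "Question: When was the Hubble Space Telescope launched?"
)
```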
The Framework of a RAG Pipeline
"A RAG process takes a query and assesses if it relates to subjects defined in the paired knowledge base. If yes, it searches its knowledge base to extract information related to the user's question." — source
When presented with an input query, an LLM typically processes this query (often as part of a prompt template) to generate a response. RAG alters this method by linking the LLM with a searchable knowledge base. Essentially, the input query is first employed to seek relevant information from an external dataset. The retrieved information is then incorporated into the model's prompt during output generation, allowing the LLM to utilize this context (through its in-context learning capabilities) to create a more accurate and factual response. By merging the LLM with a non-parametric data source, we can supply the model with correct, specific, and up-to-date information.
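The sketch below captures this flow end to end. The `retriever.search()` and `llm.generate()` calls are hypothetical placeholders for whatever search index and LLM client a real system would use:

```python
def rag_answer(query: str, retriever, llm, k: int = 3) -> str:
    """Sketch of the RAG flow: retrieve, augment the prompt, generate."""
    # 1. Use the input query to search the external knowledge base.
    chunks = retriever.search(query, top_k=k)  # hypothetical retriever interface

    # 2. Incorporate the retrieved chunks into the model's prompt as added context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # 3. The LLM generates a response grounded in the retrieved context.
    return llm.generate(prompt)  # hypothetical LLM client interface
```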
The video titled "Sam Partee on Retrieval Augmented Generation (RAG)" explores the practical applications of RAG in enhancing LLMs.
Cleaning and Chunking Data
RAG necessitates access to a dataset containing accurate and relevant information to enrich the LLM's knowledge base. To facilitate this, we need to construct a pipeline for searching relevant data within the knowledge base. External data sources may present information in various formats (e.g., PDF, markdown), requiring us to clean and extract raw textual content. Once this is achieved, the next step is to "chunk" the data, or divide it into shorter sequences typically containing 100-500 tokens.
The objective of chunking is to create units of retrieval (text pieces that can be fetched as search results). A complete document may be too large for effective retrieval, thus necessitating its division into smaller chunks. The fixed-size chunking strategy is the most common, segmenting longer texts into shorter sequences based on a predetermined number of tokens. However, other methods can be employed, especially if the data is naturally segmented (e.g., social media posts or product descriptions).
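As a simple illustration, here is a minimal fixed-size chunker. It approximates token counts by splitting on whitespace; a production pipeline would count tokens with the embedding model's own tokenizer and tune the chunk size and overlap to the data:

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into roughly chunk_size-token pieces with a small overlap.

    Whitespace splitting stands in for a real tokenizer; the overlap keeps
    sentences from being cut cleanly in half at chunk boundaries.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
    return chunks


document = " ".join(f"word{i}" for i in range(2000))
print(len(chunk_text(document)))  # number of ~256-token chunks produced
```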
Searching Through Chunks
After cleaning and segmenting our data into retrievable chunks, we must develop a search engine to match input queries with these chunks. Previous discussions on AI-powered search have covered essential concepts that can be adapted for building an accurate search engine to link input queries with textual chunks in RAG.
We aim to establish a dense retrieval system by utilizing an embedding model to generate vector representations for each chunk and indexing these vectors in a vector database. Subsequently, we can embed the input query using the same embedding model and conduct an efficient vector search to retrieve semantically related chunks.
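The sketch below shows what this dense retrieval setup might look like, assuming the sentence-transformers library as one possible embedding model and a plain NumPy array standing in for a real vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # one possible embedding model

chunks = [
    "RAG augments an LLM's prompt with retrieved context.",
    "Fixed-size chunking splits documents into short token windows.",
    "Vector databases support efficient nearest-neighbor search over embeddings.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing: embed every chunk once and store the vectors
# (a real system would load these into a vector database).
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Querying: embed the query with the same model, then rank chunks by cosine similarity.
query_vector = model.encode(["How does RAG reduce hallucinations?"], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector
best = int(np.argmax(scores))
print(chunks[best], float(scores[best]))
```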
Many RAG applications utilize pure vector search for identifying relevant textual chunks; however, we can enhance the retrieval process by integrating existing methods from AI-powered search. Specifically, we can supplement dense retrieval with a lexical (or keyword-based) retrieval component, forming a hybrid search algorithm. Additionally, we can implement a fine-grained re-ranking step to prioritize candidate chunks based on relevance.
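The sketch below hints at one way to blend the two signals. The keyword-overlap function is a crude stand-in for a proper lexical scorer such as BM25, and the blending weight is purely illustrative; a cross-encoder re-ranker could then reorder the top few results:

```python
def lexical_score(query: str, chunk: str) -> float:
    """Crude keyword-overlap score; a real system would use BM25 or similar."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0


def hybrid_rank(query: str, chunks: list[str], dense_scores: list[float], alpha: float = 0.5) -> list[int]:
    """Blend dense and lexical scores with an illustrative weight, then rank chunks."""
    combined = [
        alpha * dense_scores[i] + (1 - alpha) * lexical_score(query, chunk)
        for i, chunk in enumerate(chunks)
    ]
    return sorted(range(len(chunks)), key=lambda i: combined[i], reverse=True)


# dense_scores are placeholders for the cosine similarities produced by the vector search.
ranking = hybrid_rank(
    "reduce hallucinations",
    ["RAG helps reduce hallucinations.", "Chunking splits documents into pieces."],
    dense_scores=[0.71, 0.12],
)
print(ranking)  # indices of chunks, most relevant first
```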
The video "Introduction to Gen AI & Retrieval Augmented Generation (RAG)" provides insights into the foundational concepts and methodologies of RAG.
Generating Outputs with RAG
Once we have retrieved the relevant textual chunks, the final stage of RAG is to incorporate these chunks into the language model's prompt and generate an output. RAG encompasses the comprehensive process of receiving an input query, identifying relevant textual chunks, combining this context with the input query, and employing an LLM to produce an output based on the merged input. This method presents numerous advantages.
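A minimal sketch of this final step is shown below, assuming an OpenAI-style chat client; the client, model name, and prompt wording are illustrative, and any LLM API could be substituted:

```python
from openai import OpenAI  # one possible LLM client; any chat-style API works similarly


def generate_with_context(query: str, retrieved_chunks: list[str], model: str = "gpt-4o-mini") -> str:
    """Combine the retrieved chunks with the query and generate an answer."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    messages = [
        {
            "role": "system",
            "content": (
                "Answer using only the provided context. "
                "If the context does not contain the answer, say you do not know."
            ),
        },
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    client = OpenAI()  # reads the API key from the environment
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```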
The Advantages of RAG
"RAG systems consist of a retrieval and an LLM-based generation module, supplying LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, minimizing the risk of hallucinations." — from [8]
Implementing RAG allows us to tailor an LLM to a knowledge base of our choice. Compared to other knowledge injection methods, such as finetuning (or continued pretraining), RAG is simpler to deploy and more cost-effective, and it generally produces better results than continued pretraining alone. RAG does require more effort than merely prompting a pretrained LLM, but the benefits make it worthwhile.
- Minimizing Hallucinations: The primary appeal of RAG lies in its capacity to reduce hallucinations (i.e., the generation of incorrect information by the LLM). While LLMs often produce erroneous information when relying solely on their parametric knowledge, grounding responses in retrieved context significantly lowers the occurrence of hallucinations, improving the overall quality of an LLM application and fostering greater trust among users.
- Accessing Up-to-Date Information: LLMs relying on parametric knowledge alone are limited by a knowledge cutoff date, and updating that knowledge through continued training is costly and often ineffective. RAG lets us easily supplement the LLM's output with accurate and current information from the knowledge base.
- Data Security: Including data in an LLM's training set carries the risk of leaking this information in its outputs. Researchers have indicated that LLMs can be susceptible to data extraction attacks that may reveal the contents of an LLM's pretraining dataset through prompting techniques. RAG mitigates this security risk by not requiring the model to be trained on proprietary data.
- Ease of Implementation: Lastly, one of the most compelling reasons to utilize RAG is the straightforward implementation process compared to alternatives like finetuning. The core principles from the original RAG paper can be executed in just five lines of code, eliminating the need to train the LLM itself. Instead, we can concentrate our finetuning efforts on enhancing the quality of the smaller, specialized models used for retrieval within RAG, which is significantly less resource-intensive.
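For reference, the sketch below mirrors the Hugging Face transformers implementation of the models from the original RAG paper; the class and checkpoint names follow that library's documentation, though exact APIs can vary across versions:

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

# Load the pretrained RAG components released alongside the original paper.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

# The retriever fetches supporting passages before the model generates an answer.
inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```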
From the Origins of RAG to Modern Applications
Numerous concepts underlying RAG stem from prior research on question answering. Interestingly, the initial proposal of RAG was heavily influenced by a single earlier paper that integrated a similar retrieval mechanism into the language model pretraining process. RAG was conceived to assist with knowledge-intensive tasks that require access to external knowledge sources. Although pretrained language models contain a wealth of information, their ability to access and manipulate this knowledge is often lacking.
"The retriever provides latent documents conditioned on the input, and the seq2seq model then conditions on these latent documents together with the input to generate the output." — from [1]
The essence of RAG is to enhance a pretrained language model's capacity to access and utilize knowledge by linking it with a non-parametric memory store, typically a collection of documents or textual data that can be retrieved. This approach not only provides additional factual context but also allows users to examine the retrieval process, offering greater insight into the LLM's problem-solving capabilities.
With a clear understanding of RAG's framework and its advantages, practitioners can effectively harness its potential to build robust applications that leverage LLMs for improved performance.