Retrieval-Augmented Generation (RAG) enhances LLM text generation by incorporating external knowledge sources, making responses more accurate, relevant, and up-to-date. RAG combines an information retrieval component with a text generation model, allowing the LLM to access and process information from external databases before generating text. This approach addresses challenges like domain knowledge gaps, factuality issues, and hallucinations often associated with LLMs.
Key Aspects of RAG:
- Augmentation: RAG augments LLMs with external knowledge, bridging the gap between the LLM’s inherent knowledge and the vast, dynamic repositories of external databases.
- Retrieval: RAG uses retrieval mechanisms to find relevant information from external databases based on user queries.
- Generation: The LLM uses the retrieved information as context to generate more accurate and contextually appropriate responses.
Benefits of RAG:
- Reduced Hallucination: By grounding the LLM in external knowledge, RAG reduces the likelihood of the LLM generating incorrect or nonsensical information (hallucinations).
- Improved Accuracy: RAG ensures that LLMs generate responses that are factually accurate and aligned with the latest information.
- Continuous Knowledge Updates: RAG allows for easy integration of new or updated information without retraining the underlying LLM.
- Domain-Specific Knowledge: RAG enables LLMs to be specialized for specific domains or organizations, providing them with domain-specific knowledge.
How RAG Works:
- User Query: A user submits a query to the LLM application.
- Retrieval: The query is passed to a retrieval model (often an embedding model) that converts it into a numerical representation.
- Knowledge Base Search: The numerical representation is used to search a vector database containing embeddings of external knowledge sources.
- Information Retrieval: The retrieval model identifies the most relevant information from the knowledge base.
- Text Generation: The LLM uses the retrieved information as context to generate a response to the user’s query. (The sketch below summarizes this flow.)
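The same flow can be compressed into a few lines of Python-style pseudocode. This is only a conceptual sketch: `embed`, `knowledge_base.search`, and `llm` are hypothetical stand-ins for whatever embedding model, vector store, and language model you use; the rest of this guide builds a concrete version step by step.

```python
# Conceptual RAG loop (hypothetical helpers, not a real API)
def rag_answer(query, knowledge_base, embed, llm):
    query_vector = embed(query)                    # 1. embed the user query
    context = knowledge_base.search(query_vector)  # 2. retrieve the most similar documents
    prompt = f"Context: {context}\n\nQuestion: {query}"
    return llm(prompt)                             # 3. generate an answer grounded in the context
```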
Examples of RAG Applications:
- FAQ Bots: RAG can be used to create FAQ bots that answer questions using an organization’s internal knowledge base.
- Research Assistants: RAG can be used to build research assistants that access and synthesize information from various sources.
- Customer Support Tools: RAG can be used to create customer support tools that provide accurate and relevant answers to customer queries.
Welcome to this beginner-friendly guide! In this project, you’ll learn how to:
- Use LLaMA, a powerful large language model (LLM), for text generation.
- Retrieve relevant text snippets using Sentence Transformers, a popular tool for embedding text.
- Combine these techniques to answer questions based on context provided from retrieved snippets.
Setup: Installing the Necessary Libraries

First, we need to install some Python libraries that will help us:
- transformers: Provides access to LLaMA and other pre-trained models for text generation.
- sentence-transformers: Helps generate embeddings, which are essential for comparing text snippets.

Run this command to install everything you need:

```python
!pip install -Uq sentence-transformers
```

Then import the libraries we will use throughout the project:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer
import torch
```
Step 1: Model Setup and Tokenizer
```python
checkpoint = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
llama_model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Create a simple text generation pipeline
generator = pipeline("text-generation", model=llama_model, tokenizer=tokenizer, device="cuda")
```
What’s happening:
- We specify the Llama 3.2 3B Instruct model from Meta.
- The tokenizer converts text to tokens (numerical representations) the model can understand.
- We load the model weights with bfloat16 precision (a memory-efficient floating-point format).
- We create a text generation pipeline that handles:
  - Tokenization
  - Model inference
  - Decoding back to text
- The pipeline runs on the GPU (“cuda”) for faster processing.
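Before moving on, you can optionally sanity-check the pipeline with a one-off prompt. This is just a quick smoke test, assuming the checkpoint above loaded successfully; the exact wording of the reply will vary from run to run.

```python
# Quick smoke test of the generation pipeline (optional)
test_messages = [{"role": "user", "content": "Say hello in one short sentence."}]
test_output = generator(test_messages, max_new_tokens=20)
print(test_output[0]["generated_text"][-1]["content"])  # prints the assistant's reply
```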
Step 2: Text Snippet Retrieval Setup
```python
text_snippets = [
    "Fiona thanked Ethan for his unwavering support and promised to cherish their friendship.",
    "As they ventured deeper into the forest, they encountered a wide array of obstacles.",
    "Ethan and Fiona crossed treacherous ravines using rickety bridges, relying on each other's strength.",
    "Overwhelmed with joy, Fiona thanked Ethan and disappeared into the embrace of her family.",
    "Ethan returned to his cottage, heart full of memories and a smile brighter than ever before.",
]
```
Purpose:
- We create a small knowledge base of text snippets that will serve as our retrieval corpus.
- In a real application, this would typically be a much larger database or document collection.
Step 3: Convert text snippets to embeddings for later comparison.
```python
# Generate embeddings for the text snippets
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_text_snippets = model.encode(text_snippets)
```
Key concepts:
- We use the “all-MiniLM-L6-v2” model, which is optimized for semantic similarity tasks.
- encode() converts each text snippet into a 384-dimensional vector (embedding).
- These embeddings capture semantic meaning in a way that allows mathematical comparison (illustrated below).
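To make the idea of an embedding concrete, you can inspect the array the encoder returns and compute one cosine similarity by hand. This is a small illustration only; it is not part of the RAG pipeline.

```python
import numpy as np

# Five snippets, each mapped to a 384-dimensional vector
print(embeddings_text_snippets.shape)  # (5, 384)

# Cosine similarity between the first two snippets, computed manually
a, b = embeddings_text_snippets[0], embeddings_text_snippets[1]
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cosine), 3))  # a value between -1 and 1; higher means more similar
```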
Step 4: Create a function to retrieve the closest matching snippet using cosine similarity.
```python
def retrieve_snippet(query):
    # Encode the query to obtain its embedding
    query_embedded = model.encode([query])
    # Calculate cosine similarities between the query embedding and the snippet embeddings
    similarities = model.similarity(embeddings_text_snippets, query_embedded)
    # Retrieve the text snippet with the highest similarity
    retrieved_texts = text_snippets[similarities.argmax().item()]
    return retrieved_texts
```
How retrieval works:
- The user’s query is converted to an embedding.
- We compute the cosine similarity between the query embedding and all snippet embeddings.
- Cosine similarity measures the cosine of the angle between two vectors (1 = pointing the same way, -1 = opposite).
- The snippet with the highest similarity score is returned (see the quick check below).
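You can try the retriever on its own before wiring it to the LLM. The exact snippet returned depends on the embedding model, but a query about the ravines should match the bridge snippet:

```python
# Retrieval-only check: no LLM involved yet
print(retrieve_snippet("How did they cross the ravines?"))
# Expected: "Ethan and Fiona crossed treacherous ravines using rickety bridges, ..."
```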
Step 5: Create a function to generate the answer based on the retrieved snippet and query.
```python
# In this step, we use the retrieved context snippet to generate a relevant answer with LLaMA,
# illustrating how RAG enhances the quality of responses.
def ask_query(query):
    retrieved_texts = retrieve_snippet(query)

    # Prepare the messages for the text generation pipeline
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant. "
            "Answer ONLY the following query based on the context provided below. "
            "Do not generate or answer any other questions. "
            "Do not make up or infer any information that is not directly stated in the context. "
            "Provide a concise answer.\n\n"
            f"Context: {retrieved_texts}"},
        {"role": "user", "content": query},
    ]

    # Generate a response using the text generation pipeline
    response = generator(messages, max_new_tokens=128)[-1]["generated_text"][-1]["content"]

    print(f"Query: \n\t{query}")
    print(f"Context: \n\t{retrieved_texts}")
    print(f"Answer: \n\t{response}")
```
RAG Process:
- Retrieve the most relevant context for the query.
- Construct a prompt that:
  - Sets system instructions
  - Includes the retrieved context
  - Presents the user query
- Key instructions to the model:
  - Answer only based on the provided context
  - Be concise
  - Don’t hallucinate information
- Generate a response limited to 128 new tokens.
Step 6: Ask a Question
```python
query = "Why did Fiona thank Ethan?"
ask_query(query)
```

OUTPUT:

```
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Query:
	Why did Fiona thank Ethan?
Context:
	Fiona thanked Ethan for his unwavering support and promised to cherish their friendship.
Answer:
	Fiona thanked Ethan for his unwavering support.
```
Why this works:
- The retrieval system found the most relevant snippet about Fiona thanking Ethan.
- The LLM was constrained to only use this context.
- The response is accurate and directly supported by the context.
Key Benefits of This RAG Approach
- Factual Accuracy: By grounding responses in retrieved documents, we reduce hallucinations.
- Up-to-date Information: The knowledge base can be updated without retraining the model.
- Transparency: Users can see the source context for answers.
- Efficiency: There is no need to fine-tune the large language model.
This implementation shows a basic RAG pipeline that can be extended with:
- Larger document collections
- More sophisticated retrieval methods (for example, returning the top-k matches, as sketched below)
- Better prompt engineering
- Post-processing of responses
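As a minimal sketch of one such extension, the single-snippet retriever could be generalized to return the top-k most similar snippets. This assumes the `model`, `embeddings_text_snippets`, and `text_snippets` objects defined earlier are still in scope:

```python
# A possible extension: return the k most similar snippets instead of just one
def retrieve_top_k(query, k=2):
    query_embedded = model.encode([query])
    # Cosine similarities between every snippet embedding and the query embedding
    similarities = model.similarity(embeddings_text_snippets, query_embedded).flatten()
    # Indices of the k highest-scoring snippets, best first
    top_indices = similarities.argsort(descending=True)[:k]
    return [text_snippets[int(i)] for i in top_indices]

# Example: the prompt could then include several snippets joined together
# print("\n".join(retrieve_top_k("Why did Fiona thank Ethan?", k=2)))
```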