This article continues the journey we embarked on a few weeks back with our last practical AI blog post: “Ollama: from zero to running an LLM in less than 2 minutes!” where we leveraged Ollama to procure and serve an LLM in a virtual machine equipped with a GPU, Scaleway's H100 PCIe GPU Instance. After going through that article you may have been inspired to integrate AI capabilities into your own applications (Did you? Let me know via the Scaleway Community!) and you may have realized that even though thousands of possibilities opened up for you, there may still be some scenarios missing in the picture, such as the ability to make an LLM interact with your data. This is where RAG, the focus of this article, comes in.
The term RAG stands for Retrieval-Augmented Generation, a technique that extends the usefulness of an LLM by enabling it to generate responses based on an extended set of information you provide. This “extended set of information” can be basically any type of structured data (your typical database or a spreadsheet) or unstructured data (text documents, or even media files), and it needs to be further processed and stored in a specific way so that the model can easily find patterns within it and retrieve the right information. If such information cannot be found, instead of confidently providing a hallucinated answer, the LLM can be instructed to simply say “Hey, good question! I don't know ¯\_(ツ)_/¯” or whatever response you consider appropriate for your use case.
The work we did when using Ollama to run an LLM laid the foundations for this new blog post, where we will use that same hands-on approach to harness the power of AI: we will focus only on the really important concepts and leave the more complex ones for later. This also means we will continue to use Python, and I'll assume you have an Instance running your preferred LLM with Ollama.
Hands-on with RAG
The importance of RAG lies in its ability to improve an LLM's accuracy and reliability. LLMs by themselves rely entirely on the knowledge gained during their training phase to generate output, which can sometimes result in inaccurate or outdated responses. RAG addresses this issue by incorporating external sources of information into the response generation pipeline, with the added benefit of not needing to update or “fine-tune” the original model (a process that might require large amounts of compute power), making it a simpler and more efficient approach.
We will build a simple app that will use an LLM (Llama2:70b) to go through Scaleway's public documentation repository and try to find the answer to an input question provided by the user. The base example has 50 lines of code, and we will see how we can improve its functionality by adding a few more here and there.
We will use LlamaIndex, “a simple, flexible data framework for connecting custom data sources to large language models” (as they describe it), as our main tool to achieve our goal. We will also make use of an 'embedding model' that transforms documents, or chunks of data, into a numerical representation (vectors, to use the proper term) based on their attributes. And finally, a 'Vector Database' that stores the numerical representations of our documents for easier consumption by the whole pipeline.
Architectural Overview
The system looks something like this:
Setup
All the commands and code are meant to be run inside your GPU Instance. Feel free to check the documentation if you need a refresher.
You can use your preferred text editor. In my case, I still like Visual Studio Code, and its Remote Development feature lets me connect to my Instance via SSH: it automatically installs a server on the Instance that allows me to edit and run code living there just the same way as I would in my local environment. But if you know how to exit Vim, by all means, feel free to use it.
The environment
It's always a good idea to set up a virtual environment for your project, and I like to keep things simple, so I default to virtualenv.
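A minimal setup could look like this (the environment name rag-env is just an example, not from the original post):

pip install virtualenv
virtualenv rag-env
source rag-env/bin/activate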
There are many “Vector Databases” to choose from nowadays. Qdrant is an open source one that's written in Rust, has many official client libraries, and can be easily run via Docker:
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
And if for some reason you decide to use a different Vector Database, LlamaIndex makes it easy for you to migrate with a few tweaks.
Dependencies
We'll need to install the LlamaIndex package, our open source workhorse:
pip install llama-index
And while we're at it, why not install all the other dependencies? Here's what each one is for (the install command follows the list below):
llama-index-llms-ollama is the LlamaIndex wrapper that allows us to use a model served by Ollama
llama-index-embeddings-huggingface is the LlamaIndex wrapper for HuggingFace embedding models (more on those later on)
llama-index-vector-stores-qdrant is the LlamaIndex 'Vector Store' integration for Qdrant
qdrant-client is the official Python Qdrant library
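To grab them all in one go, a single pip command covers the packages listed above:

pip install llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-vector-stores-qdrant qdrant-client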
Getting the “data source”
As mentioned before, this example will use the Scaleway Documentation as its data source. Scaleway docs are maintained by a dedicated team of professional technical writers, but they're also a collaborative effort that the community can contribute to. That's why it is available as an open source repository on GitHub. For this example, we will only clone the main branch with a depth of 1.
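Something like the following should do it (this assumes the repository is the public docs-content repo in Scaleway's GitHub organization; double-check the URL on GitHub before cloning):

git clone --depth 1 --branch main https://github.com/scaleway/docs-content.git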
If you explore the repo, you'll find several directories and files linked to the deployment process, which are not important to us. The content we're after lives inside the files with the mdx extension. These MDX files use the Markdown syntax and have a Frontmatter header including associated metadata (title, description, categories, tags, publishing date, etc).
The code
Imports, constants, and settings
Don't focus too much on the imports; we're simply bringing in the packages we installed before, along with a couple more included in the standard library.
After the imports, we set three constants: the local directory where we want to store the IDs and hashes associated with the documents we will feed to our vector database, the location of our documents (the Scaleway documentation), and the name of the collection we want to use for this app in our vector database (think of it as a database name).
import sys
from pathlib import Path
from llama_index.core import Settings, StorageContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

STORAGE_DIR = "./storage"
DOCS_DIR = "./docs-content"
COLLECTION_NAME = "scw_docs"

# Using a local LLM served by Ollama
llm = Ollama(model="llama2:70b")
# Assigning an embedding model from HuggingFace
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/multi-qa-distilbert-dot-v1", embed_batch_size=768, device="cuda")

Settings.llm = llm
Settings.embed_model = embed_model
The next few lines define the two models we will use: the LLM and the embedding model. Finally, LlamaIndex's Settings.llm and Settings.embed_model set those values globally within this app's context.
Embeddings Model
We've been mentioning embeddings and vector databases for a while now, and it's time to spend a few lines making sure we have a basic understanding of their relationship. As mentioned before, an 'embedding model' is capable of taking in input data, such as text, a document, or an image, and projecting it into a vector (an array of numbers) that represents the entity's meaning or features. Once an entity is converted into a numerical representation (a vector), a machine can establish relationships between entities by calculating their positions and proximity within the vector space. The way an entity is represented in that space depends on the embedding model being used. There are embedding models specifically trained for semantic text search, question answering, or finding images based on text input (and vice versa). On top of that, you have to consider the languages these models have been trained on, the amount of data they were fed with, etc. A good place to start learning more is the Sentence Transformers framework documentation.
Here I picked multi-qa-distilbert-dot-v1 because it's been trained on Q&A tasks from various sources and it showed good results when compared with other embedding models.
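To make this less abstract, here's a small standalone illustration (not part of our app) of what an embedding model does, using the same model. It assumes the sentence-transformers package is installed, and the example sentences are made up:

from sentence_transformers import SentenceTransformer, util

# Load the same embedding model we use in the app
model = SentenceTransformer("sentence-transformers/multi-qa-distilbert-dot-v1")

documents = [
    "Instances are virtual machines you can create in the Scaleway console.",
    "Serverless Jobs let you run container images on a schedule.",
]
query = "What is an Instance?"

doc_vectors = model.encode(documents)   # each text becomes a fixed-size vector
query_vector = model.encode(query)

# This model was trained with the dot product as its similarity function,
# so a higher score means the texts are closer in the vector space
scores = util.dot_score(query_vector, doc_vectors)
print(scores)  # the first document should score higher for this query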
Setting up the Vector Store
Calling qdrant_client.QdrantClient() without any arguments will use the default connection values which will point to localhost on port 6333. By the way, you can visit <your instance's public domain>:6333/dashboard to check out your Qdrant's Web UI.
Then we have the “vector store”. A vector store is a storage system that holds the embedding vectors of nodes (document chunks), and the nodes themselves. These stores are used in machine learning and AI applications to efficiently store and retrieve high-dimensional vectors, which are often used to represent complex data like text, images, and more.
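Here are the corresponding lines, which you'll also find in the complete listing further down:

client = qdrant_client.QdrantClient()  # connects to localhost:6333 by default
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)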
Once the vector store and storage context are created, we can now move to the next stage: loading the files and converting them into documents. “Wait, files are not documents?” you may be wondering, and no, in this context, “A Document is a generic container around any data source [...] By default, a Document stores text along with some other attributes”. The main attributes are the metadata and relationships dictionaries, which contain additional information for a document (by default the file path, name, size, creation date, and last modified date), and their relationship with other documents and Nodes, respectively. A Node is a chunk of a Document.
The get_documents function receives a path string (in this case, the path to the Scaleway documentation directory) and defines a list of directories and files we know we want to exclude from the 'document loading' process, like the .git folder, because it's not relevant, and the index.mdx files, because their contents don't actually add any useful information.
The SimpleDirectoryReader class takes in the path to the Scaleway documentation directory, a list of extensions we want it to look for (remember to add the . before the extension. It will save you hours of debugging time :/ ), whether or not we want it to recursively look for subdirectories (we do!), and the list of things we want to exclude. The load_data method will return the documents, which will include the text found in each file, along with some metadata.
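Here's the function, as it appears in the complete listing:

def get_documents(dir_path):
    # Directories and files that aren't relevant to the documentation content
    ignore_these = ['.git/**', '.github/**', '.husky/**', 'assets/**', 'bin/**', 'blocks/**',
                    'changelog/**', 'components/**', 'docs/**', 'menu/**', 'styles/**',
                    'contribute.mdx', 'index.mdx']
    return SimpleDirectoryReader(
        input_dir=dir_path,
        required_exts=[".mdx"],  # note the leading dot
        recursive=True,
        exclude=ignore_these
    ).load_data()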
In the code below, the if statement checks whether this is the first time the script has been executed by checking if the storage directory exists in the filesystem. If it is the first run (that's the else branch), the get_documents function is called and a storage context is created.
LlamaIndex uses a StorageContext to, well… store things. In this case, it points to the vector_store, which is our Qdrant vector database.
vector_index is then created from the documents previously generated: VectorStoreIndex.from_documents splits them into chunks (nodes), computes their embeddings, and loads them into the vector database.
Finally, on the else branch, we persist to disk the document IDs and hashes that point to the vector database elements, and that's what happens on the last line when vector_index.storage_context.persist is called.
On the if branch, we load the StorageContext from the file system by passing the path in the persist_dir argument, then create a vector index the same way as previously mentioned, except that, instead of creating it from_documents, it is created from_vector_store, because the data already exists in the vector database.
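Here's that branching logic, again taken from the complete listing:

if Path(STORAGE_DIR).exists():
    storage_context = StorageContext.from_defaults(persist_dir=STORAGE_DIR)
    vector_index = VectorStoreIndex.from_vector_store(
        vector_store=vector_store,
        storage_context=storage_context,
        show_progress=True
    )
else:
    docs = get_documents(DOCS_DIR)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index = VectorStoreIndex.from_documents(
        documents=docs,
        storage_context=storage_context,
        show_progress=True
    )
    vector_index.storage_context.persist(STORAGE_DIR)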
At this point, a reference to the LLM has been passed to LlamaIndex, the document embeddings have been created and stored in the vector database, and all that's left to do is query the vector_index:
if __name__ == "__main__":
    if len(sys.argv) > 1:
        question_string = sys.argv[1]
        query_engine = vector_index.as_query_engine()
        response = query_engine.query(str(question_string))
        print(response)
    else:
        print("You forgot to pass in your question :-) simply put it within quotes after invoking this script: python3 main.py \"what is an instance?\"")
First, we check if the script is being loaded as the main program, then we check the script arguments to make sure there's a query after the script call — we want to be able to call the script and pass a query along directly, such as python3 main.py “what is an Instance?”.
Calling vector_index.as_query_engine() creates a basic Query Engine instance, which is then executed by passing the query string to its query method.
The result
When you run your script for the first time with a query such as “how do I create a serverless job?”
python3 main.py "how do I create a serverless job?"
You will get an answer similar to this:
You can create a serverless job using the Scaleway console, Terraform, API, or CLI.

Using the Scaleway console, you can easily create a job definition and track your job runs. You can also monitor your jobs using Scaleway Cockpit.

Alternatively, you can use Terraform to integrate serverless jobs into your infrastructure as code via the Terraform provider and resources.

The Scaleway HTTP API allows you to manage your serverless resources via HTTP calls, which can be useful when integrating jobs management into your automated tasks or continuous integration.

You can also use the Scaleway CLI, a simple command-line interface that allows you to create, update, delete, and list your serverless jobs. For example, you can use the CLI to deploy a job with the following command: `scw jobs definition create name=testjob cpu-limit=70 memory-limit=128 image-uri=docker.io/alpine:latest command=ls`.

Finally, Scaleway SDKs are available for Go, JS, and Python, allowing you to manage your resources directly using your favorite languages.
This is great! The LLM by itself wasn't trained on the latest release of the Scaleway documentation. But it doesn’t have to be! It can go through the document nodes retrieved by the 'Query Engine' from the vector database and use them as the context to not only return a single document's text, but to generate an appropriate response based on the set of available documents and nodes.
As promised, this example can deliver great results with just 50 lines of code. Here's the complete code:
import sys
from pathlib import Path
from llama_index.core import Settings, StorageContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

STORAGE_DIR = "./storage"
DOCS_DIR = "./docs-content"
COLLECTION_NAME = "scw_docs"

llm = Ollama(model="llama2:70b")
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/multi-qa-distilbert-dot-v1", embed_batch_size=768, device="cuda")

# If you're using a system with lower VRAM than the 80GB of the H100 PCIe Instance, such as the L4 GPU Instance,
# you can use the smaller models below. They are not as powerful as their larger counterparts, but they'll get the job done
# llm = Ollama(model="llama2:7b")
# embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/multi-qa-MiniLM-L6-dot-v1", embed_batch_size=384, device="cuda")

Settings.llm = llm
Settings.embed_model = embed_model

client = qdrant_client.QdrantClient()
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

def get_documents(dir_path):
    ignore_these = ['.git/**', '.github/**', '.husky/**', 'assets/**', 'bin/**', 'blocks/**',
                    'changelog/**', 'components/**', 'docs/**', 'menu/**', 'styles/**',
                    'contribute.mdx', 'index.mdx']
    return SimpleDirectoryReader(
        input_dir=dir_path,
        required_exts=[".mdx"],
        recursive=True,
        exclude=ignore_these
    ).load_data()

if Path(STORAGE_DIR).exists():
    storage_context = StorageContext.from_defaults(persist_dir=STORAGE_DIR)
    vector_index = VectorStoreIndex.from_vector_store(
        vector_store=vector_store,
        storage_context=storage_context,
        show_progress=True
    )
else:
    docs = get_documents(DOCS_DIR)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index = VectorStoreIndex.from_documents(
        documents=docs,
        storage_context=storage_context,
        show_progress=True
    )
    vector_index.storage_context.persist(STORAGE_DIR)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        question_string = sys.argv[1]
        query_engine = vector_index.as_query_engine()
        response = query_engine.query(str(question_string))
        print(response)
    else:
        print("You forgot to pass in your question :-) simply put it within quotes after invoking this script: python3 main.py \"what is an instance?\"")
Next steps
This app can serve as the foundation for bigger things. In this example we took a simple approach that relies on many of LlamaIndex's default settings, but there are endless possibilities for what you can achieve. You can try out different LLM and embedding models, feed it different kinds of data, try out different vector databases, create different vector stores for different types of data, and then process each using a different model. Let's say you want to create a chatbot (did I mention that besides a Query Engine, LlamaIndex also supports a Chat Engine?) that can help onboard new developers to your company. You'd want them to be able to quickly find the answers they need, but as is sometimes the case, information is spread across many sources, like Confluence (who doesn't just love Confluence's search?) or Notion pages for guidelines and "How-to" guides, but also Google Docs for meeting notes, spreadsheets for reports, and your repository's README and CONTRIBUTING files for detailed practical information on specific projects. All of these different sources can be loaded thanks to the many different integrations available on Llama Hub, the go-to place for data loaders and tools that can make it easier for your app to go even further.
Custom Metadata
One addition that can take our example app a step further is to make the document-loading process include an additional step: customizing each document's metadata. As mentioned before, by default the SimpleDirectoryReader will take the following file attributes as metadata: file_path, file_name, file_size, creation_date, and last_modified_date. Some of these are not entirely helpful in our case, but there's something quite useful we can get out of the file path. As it turns out, the Scaleway documentation website's build process keeps the relative file paths as they are, only prepending the base path https://www.scaleway.com/en/docs/ and removing the .mdx extension. Knowing this, we can create new metadata that includes the public URL of the document. To do so, we need to create a new function that we will pass as the value of SimpleDirectoryReader's file_metadata argument. This function will in turn receive the file path string and needs to return a dictionary of metadata key-value pairs.
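A minimal sketch of such a function could look like this (the helper name get_file_metadata is mine, and it assumes the cloned docs live under DOCS_DIR; the public_url key is the one our custom prompt refers to later):

def get_file_metadata(file_path: str) -> dict:
    # Turn the local file path into the public documentation URL:
    # strip the local checkout prefix and the .mdx extension, then prepend the docs base URL
    relative = Path(file_path).resolve().relative_to(Path(DOCS_DIR).resolve()).with_suffix("")
    return {
        "file_path": file_path,
        "public_url": f"https://www.scaleway.com/en/docs/{relative.as_posix()}",
    }

# Then pass it to the reader:
# SimpleDirectoryReader(..., file_metadata=get_file_metadata)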
What do we get after this? Well, not much. But this is only the first step towards something useful: instructing the LLM to generate a response following our guidelines.
Custom Prompt
Under the hood, LlamaIndex passes many default prompts to the LLM to give it the required instructions for different steps of the generation process, depending on several factors. However, we have the ability to set our own custom prompts. One such prompt is the text_qa_template the Query Engine can receive. This prompt allows us to define several instructions, as you can see below:
# ...
from llama_index.core import PromptTemplate
# ...
    qa_prompt_str = (
        "You're a helpful technical expert who provides answers based on the Scaleway Documentation.\n"
        "Assume every question you receive is related to Scaleway. If you can't find the data to answer a question, or the question is out of the scope of Scaleway, say `I don't know.`, suggest visiting the documentation website and don't provide any further information.\n"
        "Context information is below.\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "\nInstructions:\n"
        "- Based on the above Context information and no prior knowledge, provide a concise answer to the user Query below.\n"
        "- Prioritize documents with the shallowest 'file_path' depth. If you can't find data to answer a question within the Scaleway Documentation, say I don't know.\n"
        "- Always finish your answer with a separate paragraph linking to the most relevant document using the value of its 'metadata' 'public_url'.\n"
        "Query: {query_str}\n"
        "Answer: "
    )
    query_engine = vector_index.as_query_engine(text_qa_template=PromptTemplate(qa_prompt_str))
    response = query_engine.query(str(question_string))
    print(response)
# ...
We're using this prompt to instruct the LLM to set the scope of the generated answer to the Scaleway platform using the provided context documents, disregarding any prior knowledge, and asking it to provide the public URL of the document it thinks is the most relevant.
You will notice two variables there, context_str and query_str, both of which are filled in automagically by LlamaIndex once qa_prompt_str is wrapped in a PromptTemplate (a new class we need to import) and used by the Query Engine. query_str is the query provided by the user, and context_str is the information retrieved from the vector database, which was previously queried using the user's query_str. The LLM takes the instructions provided in the prompt, the context information, and the user query to finally synthesize a response.
When invoking the script with the question “is there a cli?”, the generated response we get back looks like this:
Yes, you can use the Scaleway CLI to manage your resources. The Scaleway CLI allows you to interact with the Scaleway platform using commands instead of the graphical user interface. You can install the Scaleway CLI on your local machine and use it to create, modify, and delete resources.

To get started with the Scaleway CLI, you can refer to the Scaleway CLI documentation. This document provides an overview of the available commands, flags, and options, as well as examples of how to use them. Additionally, you can use the command-line help system to get information about a specific command or flag.

For more information on using the Scaleway CLI, please refer to the following document: https://developer.scaleway.com/en/compute/instances/api-cli/
And the query we used for the initial version, “how do I create a serverless job?” would look like this:
Sure! Here's how you can create a serverless job:

1. Log in to your Scaleway account and navigate to the Serverless section in the side menu.
2. Click on "Jobs" and then click on "Create job".
3. Select the Scaleway Container Registry as the source of the container image.
4. Choose the desired container registry namespace, select a container image and tag, and enter a name for the job.
5. Select the region where you want to create the job and choose the resources that you want to allocate to the job.
6. Define a cron schedule to specify when you want the job to run, and add any environment variables or startup commands as needed.
7. Set a maximum duration for the job and verify the estimated cost.
8. Click "Create a job definition" to finish creating the job.

You can find more detailed instructions and information on creating serverless jobs in the Scaleway documentation here: <https://www.scaleway.com/en/docs/serverless/jobs/how-to/create-job-from-scaleway-registry>.
Both responses accurately and concisely answer the question, and both link to the relevant documentation page so the user can learn more about the topic.
Further improvements
As discussed before, this example serves as a base to build on top of, and many improvements can be made. In our case, the mdx files include frontmatter that contains relevant information that can be leveraged, such as the file's title, a description of its contents, the categories it applies to, and the tags it can be grouped under. Additionally, the fact that mdx files not only use Markdown syntax but can also include markup elements, or components, such as <Macro id="requirements" /> or <Navigation title="See also">, can confuse the embedding model. In this case, parsing the documents with a proper Reader from Llama Hub, or creating your own, can improve the overall performance and accuracy of your app.
Other improvements can include the automation of the process of pulling the latest version of the documentation and the associated vector database update, using improved storage methods, experimenting with other databases, tweaking the model's parameters, definitely trying out different Response Modes, and protecting our Instance so that only people allowed to access these resources can consume them.
Conclusion
In conclusion, RAG is a powerful technique that can improve the accuracy and reliability of generative AI models. By using external sources of information, RAG enables developers to create sophisticated AI systems that are more accurate and extensible. In this article, we went through the very basics of how to get started with RAG by leveraging LlamaIndex, Qdrant, Ollama, and sentence-transformers embedding models. We covered various aspects of RAG, including setting up the environment, loading documents, running a vector database, creating a vector store, and creating a Query Engine.
We then considered the many possibilities that lie beyond this base setup and improved its functionality by prompting the model to generate responses that include the answer's public documentation page URL. By following these steps, you can create your own RAG system that can be used for various applications that leverage your data with the power of open source tools, LLMs, and Scaleway's AI solutions.
Tooling around AI has made it possible for us to use its powers without having to understand what’s happening under the hood, just like we don’t have to know how a car engine works before driving it.