This content originally appeared on DEV Community and was authored by Abhishek Gupta
Local large language models (LLMs) provide significant advantages for developers and organizations. Key benefits include enhanced data privacy, since sensitive information remains entirely within your own infrastructure, and offline functionality, enabling uninterrupted work even without internet access. While cloud-based LLM services are convenient, running models locally gives you full control over model behavior and performance tuning, along with potential cost savings. This makes them ideal for experimentation before running production workloads.
The ecosystem for local LLMs has matured significantly, with several excellent options available, such as Ollama, Foundry Local, Docker Model Runner, and more. Most popular AI/agent frameworks, including LangChain and LangGraph, provide integrations with these local model runners, making it easy to incorporate them into your projects.
What will you learn?
This blog post will illustrate how to use local LLMs with Azure Cosmos DB as a vector database for retrieval-augmented generation (RAG) scenarios. It will guide you through setting up a local LLM solution, configuring Azure Cosmos DB, loading data, performing vector searches, and executing RAG queries. You can either use the Azure Cosmos DB emulator for local development or connect to an Azure Cosmos DB account in the cloud. You will be using Ollama, an open-source solution for running LLMs locally on your own machine. It lets you download, run, and interact with a variety of LLMs (like Llama 3, Mistral, and others) using simple commands, without needing cloud access or complex setup.
By the end of this blog post, you will have a working local RAG setup that leverages Ollama and Azure Cosmos DB. The sample app uses the LangChain integration with Azure Cosmos DB to perform embedding, data loading, and vector search. You can easily adapt it to other frameworks like LlamaIndex.
Alright, let's dive in!
Set up Ollama
To get started with Ollama, follow the official installation guide on GitHub to install it on your system. The installation process is straightforward across different platforms. For example, on Linux systems, you can install Ollama with a single command:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, start the Ollama service by running:
ollama serve
This blog post demonstrates the integration using two specific models from the Ollama library:
- mxbai-embed-large - A high-quality embedding model with 1024 dimensions, ideal for generating vector representations of text
- llama3 - The 8B parameter variant of Meta's Llama 3, which serves as our chat model for the RAG pipeline
Download both models using the following commands. Note that this process may take several minutes depending on your internet connection speed, as these are substantial model files:
ollama pull mxbai-embed-large
ollama pull llama3:8b
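Before moving on, it can be useful to confirm that both models actually respond. Here is a minimal sanity-check sketch (not part of the sample repository) that calls Ollama's local HTTP API with the `requests` library, assuming Ollama is listening on its default port 11434:

```python
# Quick sanity check that both models respond via Ollama's local HTTP API
# (assumes Ollama is running on its default port, 11434).
import requests

OLLAMA_URL = "http://localhost:11434"

# Embedding model: should return a 1024-dimensional vector for mxbai-embed-large
emb = requests.post(
    f"{OLLAMA_URL}/api/embeddings",
    json={"model": "mxbai-embed-large", "prompt": "hello world"},
).json()
print("embedding dimensions:", len(emb["embedding"]))

# Chat model: a minimal, non-streaming generation request
gen = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "llama3:8b", "prompt": "Say hi in one word.", "stream": False},
).json()
print("llama3 response:", gen["response"])
```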
Something to keep in mind ...
While tools like Ollama make it straightforward to run local LLMs, hardware requirements depend on the specific model and your performance expectations. Lightweight models (such as Llama 2 7B or Phi-2) can run on modern CPUs with as little as 8 GB RAM, though performance may be limited. Larger models (like Llama 3 70B or Mixtral) typically require a dedicated GPU with at least 16 GB VRAM for efficient inference.
Ollama supports both CPU and GPU execution. On CPU-only systems, you can expect slower response times, especially with larger models or concurrent requests. Using a compatible GPU significantly accelerates inference, which is especially important for demanding workloads.
Set up Azure Cosmos DB
Since you're working with local models, you'll likely want to use the Azure Cosmos DB emulator for local development. The emulator provides a local environment that mimics the Azure Cosmos DB service, enabling you to develop and test your applications without incurring costs or requiring an internet connection.
Alternatively, you can use the cloud-based Azure Cosmos DB service. Simply create an Azure Cosmos DB for NoSQL account and enable the vector search feature. Make sure to log in using the `az` CLI with an identity that has RBAC permissions for the account, since the application uses `DefaultAzureCredential` for authentication (not key-based authentication).
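To illustrate the difference between the two targets, here is a hedged sketch (the sample app's actual wiring may differ) of how a Cosmos client can be constructed for either one. The emulator uses its well-known local key, while a cloud account uses `DefaultAzureCredential`:

```python
# Sketch: constructing a Cosmos client for the emulator vs. a cloud account.
# The sample app's actual code may differ; this only illustrates the two auth paths.
import os

from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

if os.environ.get("USE_EMULATOR") == "true":
    # The emulator ships with a single, publicly documented account key.
    EMULATOR_KEY = (
        "C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
    )
    client = CosmosClient("https://localhost:8081", credential=EMULATOR_KEY)
else:
    # Cloud account: token-based auth, no keys. Requires an identity with
    # Cosmos DB data-plane RBAC permissions (e.g. signed in via `az login`).
    client = CosmosClient(os.environ["COSMOS_DB_URL"], credential=DefaultAzureCredential())
```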
The emulator is available as a Docker container, which is the recommended way to run it. Here are the steps to pull and start the Cosmos DB emulator. The commands shown are for Linux - refer to the documentation for other platform options.
If you don't have Docker installed, please refer to the Docker installation guide.
docker pull mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest
docker run --publish 8081:8081 -e AZURE_COSMOS_EMULATOR_PARTITION_COUNT=1 mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest
Next, configure the emulator SSL certificate. For example, on the Linux system I was using, I ran the following commands to download the certificate and regenerate the certificate bundle:
curl --insecure https://localhost:8081/_explorer/emulator.pem > ~/emulatorcert.crt
sudo update-ca-certificates
You should see output similar to this:
Updating certificates in /etc/ssl/certs...
rehash: warning: skipping ca-certificates.crt,it does not contain exactly one certificate or CRL
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
Load data into Azure Cosmos DB
Now that both Ollama and Azure Cosmos DB are set up, it's time to populate our vector database with some sample data. For this demonstration, we'll use Azure Cosmos DB's own documentation as our data source. The loader will fetch markdown content directly from the Microsoft Docs repository, specifically focusing on articles about Azure Cosmos DB vector search functionality.
Our data loading process will read these documentation articles, generate embeddings using the `mxbai-embed-large` model, and store both the content and vector representations in Azure Cosmos DB for retrieval.
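Under the hood, the loader relies on LangChain's `AzureCosmosDBNoSqlVectorSearch` integration together with Ollama embeddings. The sketch below shows the general shape of that flow; it is a simplified illustration rather than a copy of `load_data.py`, and the vector path, chunking parameters, and policy values are assumptions:

```python
# Simplified sketch of the loading flow (not a copy of load_data.py):
# fetch markdown, split it into chunks, embed with Ollama, store in Cosmos DB.
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
    AzureCosmosDBNoSqlVectorSearch,
)
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load one of the documentation pages and split it into chunks
docs = WebBaseLoader(
    "https://raw.githubusercontent.com/MicrosoftDocs/azure-databases-docs/refs/heads/main/articles/cosmos-db/nosql/vector-search.md"
).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200).split_documents(docs)

# Embeddings come from the local Ollama model
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

# Store content + vectors in Cosmos DB (emulator endpoint shown; "/embedding" path assumed)
vector_store = AzureCosmosDBNoSqlVectorSearch.from_documents(
    documents=chunks,
    embedding=embeddings,
    cosmos_client=CosmosClient("https://localhost:8081", credential="<emulator key>"),
    database_name="rag_local_llm_db",
    container_name="docs",
    vector_embedding_policy={
        "vectorEmbeddings": [
            {"path": "/embedding", "dataType": "float32", "distanceFunction": "cosine", "dimensions": 1024}
        ]
    },
    indexing_policy={
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": '/"_etag"/?'}],
        "vectorIndexes": [{"path": "/embedding", "type": "quantizedFlat"}],
    },
    cosmos_container_properties={"partition_key": PartitionKey(path="/id")},
    cosmos_database_properties={"id": "rag_local_llm_db"},
)
```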
Begin by cloning the GitHub repository containing the sample application:
git clone https://github.com/abhirockzz/local-llms-rag-cosmosdb
cd local-llms-rag-cosmosdb
Before running the loader application, ensure you have Python 3 installed on your system. Create a virtual environment and install the required dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
Next, configure the environment variables and execute the loading script. The example below uses the Azure Cosmos DB emulator for local development. If you prefer to use the cloud service instead, simply set the `COSMOS_DB_URL` variable to your Azure Cosmos DB account URL and remove the `USE_EMULATOR` variable.
# export COSMOS_DB_URL="https://<Cosmos DB account name>.documents.azure.com:443/"
export USE_EMULATOR="true"
export DATABASE_NAME="rag_local_llm_db"
export CONTAINER_NAME="docs"
export EMBEDDINGS_MODEL="mxbai-embed-large"
export DIMENSIONS="1024"
python3 load_data.py
The script will automatically create the database and container if they don't already exist. Once the data loading process completes successfully, you should see output similar to this:
Uploading documents to Azure Cosmos DB ['https://raw.githubusercontent.com/MicrosoftDocs/azure-databases-docs/refs/heads/main/articles/cosmos-db/nosql/vector-search.md', 'https://raw.githubusercontent.com/MicrosoftDocs/azure-databases-docs/refs/heads/main/articles/cosmos-db/nosql/multi-tenancy-vector-search.md']
Using database: rag_local_llm_db, container: docs
Using embedding model: mxbai-embed-large with dimensions: 1024
Created instance of AzureCosmosDBNoSqlVectorSearch
Loading 26 document chunks from 2 documents
Data loaded into Azure Cosmos DB
To confirm that your data has been loaded successfully, you can inspect the results using the Azure Cosmos DB Data Explorer. If you're using the emulator, navigate to https://localhost:8081/_explorer/index.html in your browser. You should see the same number of documents in your container as the number of chunks reported by the loader application.
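If you prefer a programmatic check, a short sketch like the one below counts the documents in the container using the `azure-cosmos` SDK. It is shown against the emulator with its well-known key; adjust the endpoint and credential for a cloud account:

```python
# Sketch: count the documents in the container to confirm the load
# (emulator endpoint and well-known key shown; adjust for a cloud account).
from azure.cosmos import CosmosClient

EMULATOR_KEY = (
    "C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
)
client = CosmosClient("https://localhost:8081", credential=EMULATOR_KEY)
container = client.get_database_client("rag_local_llm_db").get_container_client("docs")

count = list(
    container.query_items("SELECT VALUE COUNT(1) FROM c", enable_cross_partition_query=True)
)[0]
print(f"Documents in container: {count}")  # should match the number of loaded chunks
```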
Run vector search queries
Now that your data is loaded, let's test the vector search functionality. Set the same environment variables used for data loading and run the vector search script with your desired query:
# export COSMOS_DB_URL="https://<Cosmos DB account name>.documents.azure.com:443/"
export USE_EMULATOR="true"
export DATABASE_NAME="rag_local_llm_db"
export CONTAINER_NAME="docs"
export EMBEDDINGS_MODEL="mxbai-embed-large"
export DIMENSIONS="1024"
python3 vector_search.py "show me an example of a vector embedding policy"
The script will process your query through the embedding model and perform a similarity search against the stored document vectors. You should see output similar to the following:
Searching top 5 results for query: "show me an example of a vector embedding policy"
Using database: rag_local_llm_db, container: docs
Using embedding model: mxbai-embed-large with dimensions: 1024
Created instance of AzureCosmosDBNoSqlVectorSearch
Score: 0.7437641827298191
Content: ```
### A policy with two vector paths
//....
The output shows the top five results ordered by their similarity scores, with higher scores indicating better matches to your query.
To modify the number of results returned, you can add the `top_k` argument. For example, to retrieve the top 10 results, run:
python3 vector_search.py "show me an example of a vector embedding policy" 10
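For reference, the core of such a query boils down to a single LangChain call. The sketch below assumes a `vector_store` built the same way as in the loading step; it is an illustration of the idea, not the exact contents of `vector_search.py`:

```python
# Sketch: the essence of the vector search step, assuming `vector_store` is an
# AzureCosmosDBNoSqlVectorSearch instance constructed as during data loading.
query = "show me an example of a vector embedding policy"
results = vector_store.similarity_search_with_score(query, k=5)

for doc, score in results:
    print(f"Score: {score}")
    print(f"Content: {doc.page_content[:200]}")
```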
Execute Retrieval-Augmented Generation (RAG) queries
Now we will put it all together with a simple chat-based interface that leverages the `llama3` model to generate responses based on the contextual information retrieved from Azure Cosmos DB.
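Conceptually, the chain wires a Cosmos DB retriever into a prompt and the local `llama3` chat model. Here is a hedged LCEL-style sketch of that idea, again assuming a `vector_store` built as in the loading step; the repository's `rag_chain.py` may be structured differently (for example, to track chat history):

```python
# Sketch of the RAG flow with LangChain Expression Language (LCEL), assuming
# `vector_store` was built as in the loading step. rag_chain.py may differ.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

retriever = vector_store.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

llm = ChatOllama(model="llama3")

def format_docs(docs):
    # Concatenate retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("show me an example of a vector embedding policy"))
```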
Configure the environment variables needed for the RAG application and launch the script:
# export COSMOS_DB_URL="https://<Cosmos DB account name>.documents.azure.com:443/"
export USE_EMULATOR="true"
export DATABASE_NAME="rag_local_llm_db"
export CONTAINER_NAME="docs"
export EMBEDDINGS_MODEL="mxbai-embed-large"
export DIMENSIONS="1024"
export CHAT_MODEL="llama3"
python3 rag_chain.py
Once the application initializes, you'll see output confirming the RAG chain setup:
Building RAG chain. Using model: llama3
Using database: rag_local_llm_db, container: docs
Using embedding model: mxbai-embed-large with dimensions: 1024
Created instance of AzureCosmosDBNoSqlVectorSearch
Enter your questions below. Type 'exit' to quit, 'clear' to clear chat history, 'history' to view chat history.
[User]:
Ask questions about the Azure Cosmos DB vector search documentation that you've loaded. For instance, try asking `show me an example of a vector embedding policy`, and you'll see a response like this (note that responses may vary slightly in your case, even across different runs):
//...
[User]: show me an example of a vector embedding policy
[Assistant]: Here is an example of a vector embedding policy:
{
"vectorEmbeddings": [
{
"path":"/vector1",
"dataType":"float32",
"distanceFunction":"cosine",
"dimensions":1536
},
{
"path":"/vector2",
"dataType":"int8",
"distanceFunction":"dotproduct",
"dimensions":100
}
]
}
This policy defines two vector embeddings: one with the path `/vector1`, using `float32` data type, cosine distance function, and having 1536 dimensions; and another with the path `/vector2`, using `int8` data type, dot product distance function, and having 100 dimensions.
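If you want to define a policy like this yourself, outside of LangChain, recent versions of the `azure-cosmos` Python SDK accept it when creating a container. Below is a minimal sketch under that assumption; the endpoint, key, and path names are illustrative only:

```python
# Sketch: creating a container with a vector embedding policy and a matching
# vector index directly via the azure-cosmos SDK (names are illustrative).
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://localhost:8081", credential="<emulator key>")
database = client.create_database_if_not_exists("rag_local_llm_db")

vector_embedding_policy = {
    "vectorEmbeddings": [
        {"path": "/embedding", "dataType": "float32", "distanceFunction": "cosine", "dimensions": 1024}
    ]
}
indexing_policy = {
    "vectorIndexes": [{"path": "/embedding", "type": "quantizedFlat"}]
}

container = database.create_container_if_not_exists(
    id="docs",
    partition_key=PartitionKey(path="/id"),
    vector_embedding_policy=vector_embedding_policy,
    indexing_policy=indexing_policy,
)
```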
To further explore the capabilities of your RAG system, try these additional example queries:
- "What is the maximum supported dimension for vector embeddings in Azure Cosmos DB?"
- "Is it suitable for large scale data?"
- "Is there a benefit to using the flat index type?"
You can enter 'exit' to quit the application, 'clear' to clear chat history, or 'history' to view your previous interactions. Feel free to experiment with different data sources and queries. To modify the number of vector search results used as context, you can add the `TOP_K` environment variable (defaults to 5).
Wrap up
In this walkthrough, you followed step-by-step instructions to set up a complete RAG application that runs entirely on your local infrastructure: from installing and configuring Ollama with embedding and chat models, to setting up Azure Cosmos DB for vector storage, loading documentation data, and running RAG queries through an interactive chat interface.
Running models locally brings clear advantages in terms of cost, data privacy, and independence from connectivity constraints. However, you need to plan for appropriate hardware, particularly for larger models that perform best with dedicated GPUs and sufficient memory. The trade-off between model size, performance, and resource requirements is crucial when planning your local AI setup.
Have you experimented with local LLMs in your projects? What challenges or benefits have you encountered when moving from cloud-based to local AI solutions? Perhaps you have used both approaches? Share your experience and feedback!