Build a Fully Offline RAG Agent with LangGraph, Ollama, and Embedded Qdrant
Author: Nino, Senior Tech Editor

Most Retrieval-Augmented Generation (RAG) tutorials follow a predictable script: the first line of code usually involves setting an OPENAI_API_KEY. While this is fine for prototyping, it creates a dependency on external cloud providers, introduces latency, and raises data privacy concerns for enterprise applications. In this guide, we will implement a fully offline RAG agent using LangGraph for orchestration, Ollama for local LLM and embedding execution, and Qdrant in its embedded mode for vector storage.
This architecture follows the 'swappable boundary' design principle. By the end of this tutorial, you will see how to switch between local development and a hosted production environment, such as one powered by n1n.ai, by changing a configuration flag rather than your core logic.
The Local Stack Architecture
To run a RAG agent without any API keys, we need three core local components:
- Ollama: This serves as our local inference engine. We will use it to run two distinct models: qwen3.5:9b (or DeepSeek-V3 if your hardware allows) for reasoning and bge-m3 for generating high-quality multilingual embeddings.
- Embedded Qdrant: Unlike traditional databases that require a Docker container or a remote server, Qdrant's embedded mode allows the vector store to run as a library within your Python process, persisting data directly to a local directory.
- LangGraph: This provides the stateful orchestration required for the ReAct (Reasoning and Acting) loop, allowing the agent to decide when to search the documents and when to answer the user.
Step 1: Setting Up the Local Models
First, ensure you have Ollama installed. Pull the models required for both chat and embeddings:
ollama pull qwen3.5:9b # reasoning model
ollama pull bge-m3 # 1024-dim multilingual embeddings
By running these locally, you eliminate the per-token cost associated with cloud providers. However, when you're ready to scale your application for thousands of concurrent users, transitioning to a high-speed aggregator like n1n.ai ensures that your production environment remains stable and cost-effective.
Step 2: Implementing the Swappable Embedding Factory
The key to a professional RAG system is abstraction. We use a factory pattern to handle embeddings so the system doesn't care if it's talking to Ollama or a cloud provider.
# app/llm/embeddings.py
from functools import lru_cache

from langchain_core.embeddings import Embeddings

# Assumption: get_settings lives in app/config.py; its module path is not shown in the original.
from app.config import get_settings


@lru_cache
def get_embeddings() -> Embeddings:
    """Return the embedding client selected by settings.embedding_provider."""
    settings = get_settings()
    provider = settings.embedding_provider.lower()
    if provider == "ollama":
        # Imported lazily so only the selected provider's package is required
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(
            model=settings.embedding_model,
            base_url=settings.ollama_url,
        )
    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        # Seamlessly switch to n1n.ai for production scale
        return OpenAIEmbeddings(
            base_url="https://api.n1n.ai/v1",
            api_key=settings.n1n_api_key,
            model=settings.embedding_model,
        )
    raise ValueError(f"Unknown embedding_provider: {provider}")
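The factory relies on a get_settings() helper that this guide does not show. Here is a minimal sketch of what it might look like, assuming plain environment variables. The field names come from the snippets in this guide; the environment variable names, the defaults, and the app/config.py location are assumptions made for illustration.

```python
# app/config.py (hypothetical location; not shown in the original)
import os
from dataclasses import dataclass
from functools import lru_cache
from typing import Optional


@dataclass(frozen=True)
class Settings:
    # Field names mirror the attributes used in this guide's snippets.
    embedding_provider: str
    embedding_model: str
    ollama_url: str
    n1n_api_key: Optional[str]
    qdrant_url: Optional[str]
    qdrant_api_key: Optional[str]
    qdrant_path: str


@lru_cache
def get_settings() -> Settings:
    """Read settings once from the environment (variable names are assumed)."""
    env = os.environ
    return Settings(
        embedding_provider=env.get("EMBEDDING_PROVIDER", "ollama"),
        embedding_model=env.get("EMBEDDING_MODEL", "bge-m3"),
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434"),
        n1n_api_key=env.get("N1N_API_KEY") or None,
        qdrant_url=env.get("QDRANT_URL") or None,  # unset or empty -> embedded mode
        qdrant_api_key=env.get("QDRANT_API_KEY") or None,
        qdrant_path=env.get("QDRANT_PATH", "./qdrant_data"),
    )
```

Because of lru_cache, the environment is read once per process. In tests, call get_settings.cache_clear() after changing a variable.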
Step 3: The Embedded Vector Store
Using Qdrant in embedded mode allows for a 'zero-infrastructure' setup during development. The vector store writes to a local path instead of a network URL.
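# app/rag/store.py
from functools import lru_cache

from qdrant_client import QdrantClient

# Assumption: get_settings lives in app/config.py; its module path is not shown in the original.
from app.config import get_settings


@lru_cache
def get_client() -> QdrantClient:
    """Return a remote client when qdrant_url is set, else an embedded one."""
    s = get_settings()
    if s.qdrant_url:
        # Remote/Cloud mode for Production
        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)
    # Embedded mode for Local Development: data persists under qdrant_path
    return QdrantClient(path=s.qdrant_path)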
Pro Tip: Embedded mode locks the directory to a single process. If you try to run an ingestion script while your FastAPI server is running, you will encounter a database lock error. Always ingest first, then start the server.
Step 4: The Intelligent Ingestion Pipeline
Different embedding models produce different vector dimensions (e.g., bge-m3 is 1024, while text-embedding-3-small is 1536). Instead of hard-coding these values, we use a 'probe' technique to detect the dimension at runtime.
# scripts/ingest.py
# ... loading and splitting code ...
# Probe the embedding dimension dynamically
embeddings = get_embeddings()
probe_vector = embeddings.embed_query("probe")
dim = len(probe_vector)
# Ensure the Qdrant collection matches the model
ensure_collection(collection_name="docs", vector_size=dim)
get_vector_store().add_documents(chunks)
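This ensures that if you switch your embedding provider (the embedding_provider setting) from Ollama to n1n.ai, the ingestion script creates the collection with the new model's dimension instead of a hard-coded one. Vectors stored under the old model still have to be re-ingested.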
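The decision behind the probe can be sketched without any vector database. The example below is a library-free illustration, not the project's ensure_collection: plan_collection and FakeEmbeddings are hypothetical names, and the only interface assumed from LangChain is embed_query(text), which returns a list of floats.

```python
from typing import List, Optional, Protocol


class Embedder(Protocol):
    """Anything exposing LangChain's embed_query(text) -> list of floats."""

    def embed_query(self, text: str) -> List[float]: ...


def probe_dimension(embeddings: Embedder) -> int:
    """Embed a throwaway string and measure the vector length."""
    return len(embeddings.embed_query("probe"))


def plan_collection(existing_dim: Optional[int], probed_dim: int) -> str:
    """Decide what ingestion should do with the target collection.

    existing_dim is None when the collection does not exist yet.
    """
    if existing_dim is None:
        return "create"
    if existing_dim == probed_dim:
        return "reuse"
    # Stored vectors have a different size: drop, recreate, and re-ingest.
    return "recreate"


class FakeEmbeddings:
    """Stand-in for a real model so the sketch runs without Ollama."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * self.dim


if __name__ == "__main__":
    dim = probe_dimension(FakeEmbeddings(1024))
    print(plan_collection(None, dim))
    print(plan_collection(1536, dim))
```

A matching dimension only shows that the sizes agree. It does not prove the collection was built with the same model, so recording the model name alongside the collection is a sensible extra check.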
Step 5: Running the ReAct Loop
In LangGraph, the agent follows a cycle: Agent Node → Tool Node (Retrieval) → Agent Node (Synthesis). Here is how a real local run looks when asking about project memory:
- HumanMessage: "How is memory implemented?"
- AIMessage: (Calls search_docs tool)
- ToolMessage: (Returns chunks from local Qdrant)
- AIMessage: "Short-term memory uses PostgreSQL, long-term uses Zep. Sources: doc-a.md."
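To make the cycle concrete, here is a library-free sketch of the control flow. It is not LangGraph's API: the scripted model stands in for the LLM, search_docs is a hypothetical stub rather than a real Qdrant query, and messages are plain dictionaries instead of LangChain message objects.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def search_docs(query: str) -> str:
    """Hypothetical retrieval tool; a real one would query Qdrant."""
    return "doc-a.md: short-term memory uses PostgreSQL, long-term uses Zep."


def scripted_model(messages: List[Message]) -> Message:
    """Stand-in for the LLM: request the tool once, then answer from its output."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "ai", "content": "Answer based on: " + last["content"]}
    return {"role": "ai", "tool_call": "search_docs", "content": last["content"]}


def run_react_loop(
    question: str,
    model: Callable[[List[Message]], Message],
    max_steps: int = 5,
) -> List[Message]:
    """Alternate agent and tool steps until the model stops requesting tools."""
    messages: List[Message] = [{"role": "human", "content": question}]
    for _ in range(max_steps):
        reply = model(messages)  # agent node
        messages.append(reply)
        if "tool_call" not in reply:  # no tool requested: this is the final answer
            break
        result = search_docs(reply["content"])  # tool node
        messages.append({"role": "tool", "content": result})
    return messages


if __name__ == "__main__":
    for message in run_react_loop("How is memory implemented?", scripted_model):
        print(message["role"], "->", message["content"])
```

The max_steps bound matters with small local models, which can otherwise keep requesting the tool without ever producing an answer.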
Lessons Learned: The 'Gotchas' of Local RAG
Running everything offline is empowering, but it reveals challenges that cloud-based tutorials often ignore:
- Synthesis Whiffs: Occasionally, a small model in the 8-9B range, such as Qwen or Llama 3, might retrieve the correct documents but return an empty response (finish_reason='stop'). The likely cause is that smaller models struggle to synthesize an answer from a long, multi-chunk context. Larger models, for example via n1n.ai, are less prone to this.
- The Cold Start Problem: Ollama loads a model into memory (VRAM when a GPU is available) on the first request. Your first query might take around 10 seconds, while subsequent queries take closer to 200ms. Exclude the first run from any benchmark.
- Dimension Mismatch: You cannot query a collection created with OpenAI embeddings using Ollama embeddings. The vector sizes differ (1536 versus 1024 for the models mentioned in Step 4), and even equal-sized vectors from different models are not comparable. If you switch providers, you must re-ingest your data.
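Two of these gotchas can be handled with small helpers. The sketch below is one possible approach, not something the project ships: retry_on_empty and timed_runs are hypothetical names, and both take a zero-argument callable so they stay independent of any particular LLM client.

```python
import time
from statistics import median
from typing import Callable, List


def retry_on_empty(call: Callable[[], str], max_attempts: int = 3) -> str:
    """Call again when the model returns an empty or whitespace-only answer."""
    for _ in range(max_attempts):
        answer = call()
        if answer.strip():
            return answer
    raise RuntimeError(f"empty response after {max_attempts} attempts")


def timed_runs(call: Callable[[], object], warmup: int = 1, runs: int = 5) -> float:
    """Return the median latency in seconds, discarding warm-up calls."""
    for _ in range(warmup):
        call()  # absorbs the one-time cost of loading the model
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return median(samples)
```

In practice, call would wrap the agent invocation, for example a lambda that sends one fixed question. Retrying hides the symptom rather than fixing it, so it is worth logging how often the retry path is taken.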
Conclusion
Building a local RAG agent with LangGraph and Ollama proves that you don't need a massive cloud budget to develop sophisticated AI tools. By designing your code around swappable providers, you can build on your laptop and deploy to the cloud without rewriting a single line of logic.
When you are ready to move from local experimentation to a production-ready environment with low latency and 99.9% uptime, n1n.ai provides the API infrastructure you need to scale effortlessly.