Build a Local AI Assistant with Persistent Memory Using LM Studio and Big RAG
By Nino, Senior Tech Editor
The dream of a fully private, high-performance AI assistant that lives entirely on your hardware is no longer out of reach. With the release of models like Google's Gemma 4 and the evolution of tools like LM Studio, developers can now bypass the privacy concerns and subscription costs associated with cloud providers. However, two major hurdles remain: the 'forgetfulness' of standard LLM sessions and the inability to interact with local documentation.
In this technical guide, we will bridge that gap. We are going to build a system that uses RAG (Retrieval-Augmented Generation) to index your personal files and, more importantly, we will modify the Big RAG plugin to implement persistent memory. This ensures your local AI remembers context across different chat sessions, similar to how advanced cloud models operate, but with 100% data sovereignty. While local setups are ideal for privacy, developers often need to compare local performance against industry benchmarks; for those scenarios, n1n.ai offers the most stable access to high-end models like Claude 3.5 Sonnet and OpenAI o3.
Why Local RAG and Persistent Memory Matter
Standard RAG allows a model to query a vector database of your documents (PDFs, Markdown, etc.) in real-time. Instead of hallucinating, the model retrieves relevant text chunks and uses them as context. However, most RAG implementations are 'stateless'—they forget the conversation history once the window is closed. By adding a persistent memory layer using a local JSON-based database, we create an AI that evolves with your projects.
Step 1: Setting Up the Foundation with LM Studio
First, download and install LM Studio from its official site. It serves as the local inference engine for our models.
Model Selection: Gemma 4 and Nomic Embed
For our primary brain, we will use Google's Gemma 4. It offers an excellent balance of instruction following and efficiency. In the LM Studio 'Discover' tab, search for gemma-4. You must choose a quantization level based on your VRAM/RAM:
| Quantization | RAM Required | Performance Profile |
|---|---|---|
| Q4_K_M | 16 GB | Optimal balance for daily use |
| Q3_K_M | 8 GB | Fast, but slightly lower accuracy |
| Q8_0 | 24 GB+ | Maximum quality for high-end GPUs |
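If you are unsure which row fits your machine, a back-of-envelope estimate helps: weight memory is roughly parameters × bits per weight ÷ 8, plus headroom for the KV cache and runtime buffers. The sketch below is only an approximation (the bits-per-weight figures are rough averages for each quantization), not an exact file size:

```typescript
// Rough RAM estimate for a quantized GGUF model (illustrative only).
// bitsPerWeight is an approximate average: ~4.8 for Q4_K_M,
// ~3.9 for Q3_K_M, ~8.5 for Q8_0.
function estimateRamGb(paramsBillions: number, bitsPerWeight: number): number {
  const weightsGb = (paramsBillions * bitsPerWeight) / 8
  return weightsGb * 1.2 // ~20% headroom for KV cache and buffers
}

// Example: a hypothetical 12B model at Q4_K_M:
// estimateRamGb(12, 4.8) ≈ 8.6 GB
```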
Next, we need an embedding model. Unlike the LLM, the embedding model converts text into mathematical vectors for searching. Search for nomic-ai/nomic-embed-text-v1.5-GGUF. This model is lightweight (~270 MB) and highly effective for local retrieval tasks.
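To make 'vectors for searching' concrete: retrieval boils down to comparing the query's embedding against each chunk's embedding, typically with cosine similarity. Big RAG handles this internally; the sketch below only illustrates the ranking step, with embeddings assumed to come from nomic-embed-text via LM Studio.

```typescript
// Cosine similarity: 1.0 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Rank document chunks against a query embedding, highest score first.
function rankChunks(queryVec: number[], chunks: { text: string; vec: number[] }[]) {
  return chunks
    .map(c => ({ text: c.text, score: cosineSimilarity(queryVec, c.vec) }))
    .sort((a, b) => b.score - a.score)
}
```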
Step 2: Installing and Bootstrapping Big RAG
Big RAG is a powerful plugin for LM Studio that handles the indexing of folders. To install it, you need Node.js (LTS) installed on your machine. We will use the lms CLI to bridge the plugin with the core application.
```bash
# Bootstrap the CLI for macOS/Linux
~/.lmstudio/bin/lms bootstrap

# Or for Windows
cmd /c %USERPROFILE%/.lmstudio/bin/lms.exe bootstrap
```
Clone the repository and build the plugin:
```bash
git clone https://github.com/ari99/lm_studio_big_rag_plugin.git
cd lm_studio_big_rag_plugin
npm install
npm run build
```
Move the build folder into your LM Studio plugins directory. For macOS, this is typically ~/.lmstudio/plugins/. Once copied, restart LM Studio and enable the plugin in the Settings menu.
Step 3: Implementing Persistent Memory
This is where we go beyond the standard setup. We will modify the src/promptPreprocessor.ts file to include both session-based history and a cross-session JSON database. We will use lowdb for a lightweight storage solution.
Install the dependency:
```bash
npm install lowdb
```
Modifying the Preprocessor Logic
Open src/promptPreprocessor.ts and import the necessary modules. We need to define a MemorySchema to store timestamps, user queries, and summaries of the interaction. This allows the model to look back at what you discussed yesterday without needing to re-read the entire chat log.
```typescript
import { JSONFilePreset } from 'lowdb/node'
import * as path from 'path'

type MemorySchema = {
  history: Array<{
    timestamp: string
    user_text: string
    summary: string
  }>
}

// Helper to initialize or load the memory file
async function getMemory(vectorStoreDir: string) {
  const dbPath = path.join(vectorStoreDir, 'chat_memory.json')
  const defaultData: MemorySchema = { history: [] }
  return await JSONFilePreset<MemorySchema>(dbPath, defaultData)
}
```
Inside the preprocess() function, we will pull the current session history using ctl.pullHistory() and combine it with the data from our chat_memory.json. By injecting this into the prompt context, Gemma 4 gains a 'long-term memory'.
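For the memory to persist, each interaction also has to be written back to chat_memory.json. Where exactly you hook this in depends on the plugin's lifecycle, but a minimal sketch (assuming you call it near the end of preprocess() with the user's text and a short, model-generated summary) could look like this:

```typescript
// Hypothetical helper: append the latest exchange to the memory file.
// `userText` and `summary` are assumed inputs; how the summary is
// produced (e.g. a one-line recap from the model) is up to you.
async function rememberInteraction(
  db: Awaited<ReturnType<typeof getMemory>>,
  userText: string,
  summary: string
) {
  // lowdb's update() mutates the data and writes it to disk in one step
  await db.update(({ history }) => {
    history.push({
      timestamp: new Date().toISOString(),
      user_text: userText,
      summary,
    })
  })
}
```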
Step 4: Refining the Context Injection
When building the final prompt, it is crucial to manage the context window. If you inject too much data, the model's performance will degrade (latency > 5000ms). We recommend limiting the persistent memory to the last 5 relevant interactions and the session history to the last 3 exchanges.
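First, the session history. The exact shape returned by ctl.pullHistory() depends on the plugin SDK, so treat the following as a sketch that assumes an array of role/content messages:

```typescript
// Sketch: build historyContext from the current session, capped at the
// last 3 exchanges (~6 messages). The message shape and whether the
// call is async are assumptions; check the SDK types for the actual API.
const sessionMessages = await ctl.pullHistory();
const lastExchanges = sessionMessages.slice(-6);
const historyContext = lastExchanges.length > 0
  ? "\n\nCurrent session history:\n" +
    lastExchanges.map(m => `${m.role}: ${m.content}`).join("\n")
  : "";
```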
```typescript
// Example of assembling the final context
const memoryDb = await getMemory(vectorStoreDir);
const pastMemories = memoryDb.data.history.slice(-5); // last 5 persistent interactions

const persistentMemory = pastMemories.length > 0
  ? "\n\nPersistent memory from past sessions:\n" +
    pastMemories.map(m => `- [${m.timestamp}] ${m.summary}`).join("\n")
  : "";

// historyContext is the capped session history built above
ragContextFull += historyContext + persistentMemory;
```
After modifying the source, run npm run build again and replace the plugin files in the LM Studio directory. Toggle the plugin off and on to apply changes.
Pro Tips for Local AI Optimization
- Threshold Tuning: If your RAG system returns irrelevant data, increase the Affinity Threshold in the plugin settings to 0.5. If it finds nothing, lower it to 0.2.
- Chunking Strategy: For technical documentation, use a chunk size of 700 tokens. For creative writing or short notes, 300 tokens is more efficient (see the sketch after this list).
- Hybrid Workflow: While local models are great for sensitive data, complex reasoning tasks might still require larger models. You can use n1n.ai to access DeepSeek-V3 or GPT-4o for heavy lifting, then feed the results back into your local knowledge base.
- Hardware Monitoring: Keep an eye on your VRAM. If you experience crashes, switch to a lower quantization (like Q3_K_M).
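To make the chunking trade-off concrete, here is a minimal, illustrative chunker. It splits on whitespace, so its counts only approximate real tokenizer output; in practice, Big RAG's built-in chunking settings are the way to configure this.

```typescript
// Illustrative chunker: splits text into ~chunkSize-word pieces with a
// small overlap so sentences at chunk boundaries are not lost.
// Whitespace-separated words only approximate model tokens.
function chunkText(text: string, chunkSize = 700, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '))
  }
  return chunks
}
```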
Conclusion
By combining LM Studio, Gemma 4, and a customized Big RAG plugin, you have created a sovereign AI assistant that grows smarter with every interaction. This setup ensures that your proprietary data never leaves your machine while providing the convenience of a persistent memory assistant. For developers looking to scale these capabilities to the cloud or integrate multiple LLMs into a single workflow, n1n.ai provides the robust API infrastructure needed for production-grade AI applications.
Get a free API key at n1n.ai