Eliminating RAG Hallucinations through Architectural Constraints

Author: Nino, Senior Tech Editor

If you have ever shipped a Retrieval-Augmented Generation (RAG) assistant, you have likely written some variation of this line in your system prompt: "If the answer is not in the provided context, say you do not know. Do not make things up." You have also likely watched the model, whether it is a legacy GPT-4 or a modern powerhouse like Claude 3.5 Sonnet, cheerfully ignore that instruction under pressure. A confident-sounding question comes in, retrieval returns something tangentially related, and the model stitches together an answer that is plausible, fluent, and entirely wrong.

The fundamental problem is that telling a language model not to hallucinate is merely a suggestion. In the hierarchy of model behaviors, suggestions almost always lose to the model's overwhelming prior toward being helpful. When you use high-performance LLM aggregators like n1n.ai to access top-tier models, you are getting the best reasoning capabilities available, but even the smartest models are prone to 'pleasing' the user if given the chance.

The Fallacy of Prompt-Based Refusal

Prompt engineering has reached a point of diminishing returns for hallucination prevention. The reason is simple: LLMs are trained on massive datasets where the goal is to predict the next token, and that training biases them toward producing a fluent, coherent response to whatever they are asked. If the retrieved context is weak or irrelevant, the model fills the gaps from its internal knowledge, often producing details that contradict the provided source or inventing facts altogether.

Instead of fighting this with increasingly desperate prompt wording, we should shift the framing. While building the MCP (Model Context Protocol) SDK Docs Assistant, a specialized tool for the TypeScript SDK, I realized that the best way to handle refusal is not to ask the model to refuse. It is to take away the material it fabricates from. The core idea is that, in a RAG pipeline, the model hallucinates most readily when you hand it weakly related material to hallucinate from.

The Architectural Gate: Moving Refusal to the Tool Layer

In a traditional RAG pipeline, the retrieval tool fetches data and passes it to the LLM regardless of the quality of the matches. The LLM is then expected to judge the relevance. In a robust architecture, the refusal decision lives in the retrieval tool, before the model ever sees the data. If nothing clears a specific confidence bar, the tool returns an empty result set. The model is then left with no source text to spin into an answer, forcing it to state that the documentation does not cover the topic.

In practice, the implementation of this "Refusal Gate" looks like this:

// Wrapper name and signature are illustrative; the original shows only the body.
async function searchDocs(query: string, version: string) {
  const candidates = await hybridSearch(query, { version, limit: 12 });

  // If the best match doesn't meet our strict similarity threshold,
  // return an empty set so the LLM has nothing to fabricate from.
  if (!hasConfidentMatch(candidates)) {
    // example: best cosine similarity < 0.45
    return { relevant: false, results: [] };
  }

  // Candidates cleared the gate: rerank them before returning.
  const results = await rerank(query, candidates, 6);
  return { relevant: true, results };
}

By shaping the system this way, the only coherent next move for the model when results come back empty is to admit it doesn't have the information. Refusal stops being a personality trait you're hoping for and becomes a property of the architecture. In my experience this approach works particularly well with models that follow tool outputs closely, such as DeepSeek-V3 or OpenAI o3 via n1n.ai, which generally respect the empty state of a retrieval result.
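The snippet above calls hasConfidentMatch without showing it. A minimal sketch of what such a check could look like follows, assuming each candidate carries its raw cosine similarity. The Candidate shape and the MIN_SIMILARITY constant are assumptions for illustration; only the 0.45 figure comes from the comment in the original code.

```typescript
// Hypothetical candidate shape; the article does not show the real type.
interface Candidate {
  id: string;
  text: string;
  cosineSimilarity: number; // higher means closer to the query
}

// Illustrative threshold, taken from the "< 0.45" comment in the gate snippet.
const MIN_SIMILARITY = 0.45;

// Returns true only if at least one candidate clears the similarity bar.
// An empty candidate list never clears it.
function hasConfidentMatch(
  candidates: Candidate[],
  minSimilarity: number = MIN_SIMILARITY,
): boolean {
  return candidates.some((c) => c.cosineSimilarity >= minSimilarity);
}
```

The threshold is worth tuning against real queries rather than copying: set too high, the gate refuses answerable questions; set too low, it lets weak context through.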

Advanced Retrieval: Hybrid Search and Reciprocal Rank Fusion

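To make this "Refusal Gate" effective, your retrieval needs high recall as well as high precision. If relevant chunks fail to clear the threshold, you get false refusals: the assistant declines to answer even though the information exists. If irrelevant chunks clear it, the gate passes weak material through to the model. To address both, the MCP Docs Assistant uses a hybrid search pipeline.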

  1. Semantic Similarity: Using pgvector on Postgres to perform vector cosine similarity. This captures the intent and meaning behind a query.
  2. Lexical Matching: Using Postgres full-text search to find exact term matches. This is crucial for technical documentation where specific method names (e.g., callTool) must be matched exactly.
  3. Reciprocal Rank Fusion (RRF): We fuse these two result sets. RRF scores each result by its rank position in each list rather than by its raw score, so it combines rankings from different search methods without needing to normalize scores. A result that ranks high in either the vector search or the text search stays near the top, and a result that ranks high in both rises above the rest.
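The fusion step in the list above can be sketched as follows. This is a generic RRF implementation, not the assistant's actual code: each result earns 1 / (k + rank) from every list it appears in, and the sums are sorted. The function name, the chunk ids, and the choice of k = 60 (a commonly used default) are assumptions for illustration.

```typescript
// Fuse several ranked lists of document ids with Reciprocal Rank Fusion.
// Each list is ordered best-first and is assumed to contain no duplicate ids.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical ids from a vector search and a full-text search.
const vectorHits = ["chunk-a", "chunk-b", "chunk-c"];
const lexicalHits = ["chunk-b", "chunk-d", "chunk-a"];
const fused = reciprocalRankFusion([vectorHits, lexicalHits]);
// fused order: chunk-b, chunk-a, chunk-d, chunk-c
```

Note that RRF scores depend only on rank positions, so they are not comparable to a similarity threshold. A gate like the one described earlier would most likely need to test the raw cosine similarity rather than the fused score.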

The Challenge of Version-Correctness

Fast-moving libraries like the MCP TypeScript SDK present a unique challenge: breaking changes. When a library moves from v1 to v2, methods are renamed and packages are split. A generic documentation bot often blends v1 and v2 snippets together, resulting in non-functional code.

To solve this, versioning must be a first-class dimension of retrieval. Every chunk in your vector database should be tagged with its version. Retrieval should filter by the version currently in scope. When combined with the refusal gate, this ensures the assistant stays version-correct, cites the exact source line, and refuses when it should.
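One way to make version a first-class filter is to push it into the vector query itself. The sketch below builds a parameterized pgvector query. The table name doc_chunks and its columns (content, source_url, sdk_version, embedding) are hypothetical, as is the function name; only the use of pgvector on Postgres comes from the article.

```typescript
// A parameterized query: SQL text plus positional values.
interface SqlQuery {
  text: string;
  values: Array<string | number>;
}

// Builds a version-scoped similarity query against a hypothetical doc_chunks table.
function buildVersionedVectorQuery(
  queryEmbedding: number[],
  version: string,
  limit: number = 12,
): SqlQuery {
  // pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  return {
    // <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
    text: `
      SELECT id, content, source_url,
             1 - (embedding <=> $1::vector) AS similarity
      FROM doc_chunks
      WHERE sdk_version = $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    values: [vectorLiteral, version, limit],
  };
}
```

With a client such as node-postgres, the result could be passed along as query(q.text, q.values). Filtering in the WHERE clause, rather than after retrieval, means chunks from another version never compete for the limited result slots.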

Validating the System with Golden Sets

How do you know your refusal gate actually works? You cannot rely on vibes. You need a "Golden Set"—a curated list of questions with known correct behaviors.

In my testing, I developed an 18-case golden set that scores three behaviors:

  1. Refusal Accuracy: Does it refuse when the answer is truly missing?
  2. Citation Integrity: Does every claim have a valid source link?
  3. Version Correctness: Does it provide the right code for the requested SDK version?
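The three checks above could be scored with a small harness like the sketch below. All type and field names are hypothetical, and the checks are deliberately simplified: citation integrity is approximated as "the answer carries at least one source link", and version correctness as "every cited chunk matches the requested version". A real harness would likely verify individual claims.

```typescript
interface GoldenCase {
  question: string;
  version: string;
  expected: "answer" | "refuse";
}

// Hypothetical shape of the assistant's reply to one golden question.
interface AssistantReply {
  refused: boolean;
  citations: string[]; // source links attached to the answer
  citedVersions: string[]; // SDK version of each cited chunk
}

interface GoldenScore {
  refusalAccuracy: number;
  citationIntegrity: number;
  versionCorrectness: number;
}

// Treats an empty denominator as a perfect score, since nothing could fail.
function ratio(hits: number, total: number): number {
  return total === 0 ? 1 : hits / total;
}

function scoreGoldenSet(cases: GoldenCase[], replies: AssistantReply[]): GoldenScore {
  if (cases.length !== replies.length) {
    throw new Error("cases and replies must have the same length");
  }
  let refusalHits = 0;
  let answered = 0;
  let cited = 0;
  let versionOk = 0;
  cases.forEach((c, i) => {
    const r = replies[i];
    // Counts both correct refusals and correct decisions to answer.
    if (r.refused === (c.expected === "refuse")) refusalHits++;
    if (!r.refused) {
      answered++;
      if (r.citations.length > 0) cited++;
      const allMatch = r.citedVersions.every((v) => v === c.version);
      if (r.citedVersions.length > 0 && allMatch) versionOk++;
    }
  });
  return {
    refusalAccuracy: ratio(refusalHits, cases.length),
    citationIntegrity: ratio(cited, answered),
    versionCorrectness: ratio(versionOk, answered),
  };
}
```

Scoring refusal in both directions matters here: it catches an assistant that fabricates when it should refuse, and one that refuses when the documentation does cover the question.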

During early runs, I found that even when retrieval surfaced strong hits (similarity > 0.70), the model would sometimes still claim it didn't know the answer because it was over-analyzing the relevance. The fix was to explicitly tell the model that if the tool returns results, those results ARE relevant. This alignment between the retrieval tool's confidence and the model's reasoning is key. For developers looking to experiment with these configurations, n1n.ai provides the low-latency access needed to run these evaluation loops efficiently.
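The article does not say where that instruction lives. One option, sketched here as an assumption, is to state it in the text the tool hands back to the model, so the gate's verdict travels with the results. The type names, function name, and wording of the messages are all hypothetical.

```typescript
interface RetrievedChunk {
  sourceUrl: string;
  content: string;
}

// Mirrors the { relevant, results } shape returned by the refusal gate.
interface ToolResult {
  relevant: boolean;
  results: RetrievedChunk[];
}

// Renders the tool result as text for the model, making the gate's verdict explicit.
function renderToolResult(result: ToolResult): string {
  if (!result.relevant || result.results.length === 0) {
    return "No documentation matched this question. Tell the user the docs do not cover it.";
  }
  const header =
    "These passages passed the relevance gate. Treat them as relevant and answer from them, citing each source.";
  const body = result.results
    .map((chunk, i) => `[${i + 1}] ${chunk.sourceUrl}\n${chunk.content}`)
    .join("\n\n");
  return `${header}\n\n${body}`;
}
```

Keeping both branches in one function makes the contract symmetric: an empty result tells the model to refuse, and a non-empty result tells it not to second-guess the retrieval.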

Conclusion: Constraints Over Suggestions

If you are building RAG systems in 2025, the lesson is clear: behavior you genuinely care about—refusing, citing, and staying within a version—is more reliable when it is enforced in code than when it is requested in a prompt. A prompt is a hope; a gate in the retrieval tool is a constraint. Constraints survive contact with adversarial questions in a way that polite instructions never do.

By moving the logic from the "soft" layer of prompting to the "hard" layer of architectural gates, you create an assistant that is not just helpful, but fundamentally trustworthy.
