Building a Bitemporal Knowledge Graph for LLM Agent Memory: A 92% LongMemEval Case Study

Author: Nino, Senior Tech Editor
Modern Large Language Models (LLMs) like Claude 3.5 Sonnet and GPT-4o have achieved remarkable reasoning capabilities. However, when deployed as persistent agents (e.g., Claude Code, Cursor, or custom enterprise bots), they often suffer from a 'Groundhog Day' effect. Every new session feels like onboarding a brilliant but amnesiac employee. Traditional Retrieval-Augmented Generation (RAG) using vector databases often falls short because it relies on cosine similarity—a method that struggles with contradictions, temporal reasoning, and complex entity relationships.

To solve this, I developed Memento, a bitemporal knowledge graph memory system, and benchmarked it against LongMemEval, achieving a 92.4% task-averaged score. This guide explores the architecture, the pitfalls of standard RAG, and how to implement a memory system that actually reasons.

The Failure of Vector-Only Memory

Most AI memory implementations follow a predictable pattern: user input is embedded into a vector, stored in a database (like Pinecone or Milvus), and retrieved via nearest-neighbor search. While effective for simple 'what did we talk about?' queries, this approach breaks down in three critical scenarios:

  1. Entity Confusion: It cannot distinguish between 'John' (a generic name) and 'John Smith' (the VP of Sales) if their embeddings are too similar or if the context is sparse.
  2. Temporal Blindness: A fact from January 2023 is treated with the same weight as a contradictory fact from yesterday. Vector search has no inherent concept of 'now' versus 'then'.
  3. Complex Synthesis: If an answer requires connecting information across three different sessions (e.g., 'What changed in the Alpha project since the last meeting?'), simple document retrieval often fails to provide the necessary relational context.
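The temporal blindness problem is easy to demonstrate with a toy in-memory "vector store". The hand-made 3-d embeddings below stand in for a real model; the point is that nearest-neighbor search scores two contradictory facts almost identically and never consults the timestamp, so the stale fact can outrank the fresh one:

```python
from datetime import date

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

# Each entry: (embedding, text, timestamp). The timestamp is stored
# but plays no role in retrieval.
store = [
    ((0.9, 0.1, 0.0), "John is VP of Sales", date(2023, 1, 5)),
    ((0.88, 0.12, 0.01), "John is SVP of Sales", date(2024, 10, 1)),
]

query = (0.89, 0.11, 0.0)  # toy embedding of "What is John's title?"
ranked = sorted(store, key=lambda e: cosine(query, e[0]), reverse=True)

# Both contradictory facts score within a hair of each other; here the
# stale 2023 fact actually ranks first.
for emb, text, when in ranked:
    print(f"{cosine(query, emb):.4f}  {text}  ({when})")
```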

For developers seeking the highest performance from their agents, integrating a more sophisticated memory layer with high-speed APIs from n1n.ai is the first step toward true autonomy.

Architecture: The Memento Approach

Memento replaces the flat vector store with a Bitemporal Knowledge Graph. The architecture consists of several specialized layers:

  • Ingestion Pipeline: Extracts entities (People, Projects, Organizations) and their properties using LLMs.
  • Entity Resolution: Uses a tiered matching system (Exact > Fuzzy > Phonetic > Embedding > LLM Tiebreaker) to ensure 'the sales VP' and 'John Smith' map to the same node.
  • Bitemporal Logic: Tracks two timelines: valid time (when the fact was true in the world) and system time (when the memory system learned the fact).
  • Verbatim Fallback: Uses SQLite's FTS5 (Full-Text Search) and vector search as a safety net to ensure extraction errors don't lead to total data loss.
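The bitemporal layer can be sketched as a pair of timestamps on every fact. This is a minimal illustration, not Memento's actual API: the `Fact` record and `current_facts` helper are hypothetical names, and a real store would index rather than scan:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Fact:
    subject: str
    statement: str
    valid_from: date           # when the fact became true in the world
    valid_to: Optional[date]   # None = still valid
    recorded_at: date          # when the memory system learned it

def current_facts(facts, subject, as_of):
    """Return facts about `subject` that were valid at `as_of` (world time)."""
    return [
        f for f in facts
        if f.subject == subject
        and f.valid_from <= as_of
        and (f.valid_to is None or as_of < f.valid_to)
    ]

facts = [
    Fact("John Smith", "VP of Sales", date(2023, 1, 1), date(2024, 10, 1), date(2023, 1, 5)),
    Fact("John Smith", "SVP of Sales", date(2024, 10, 1), None, date(2024, 10, 2)),
]

# Only the SVP fact is valid now; the VP fact stays queryable historically.
print([f.statement for f in current_facts(facts, "John Smith", date(2024, 11, 1))])
```

Because the old fact is closed out rather than deleted, the system can answer both "what is John's title?" and "what was John's title last June?".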

Benchmarking with LongMemEval

To validate the system, I used LongMemEval, a rigorous benchmark featuring 500 questions across five categories: Single-session recall, Preference tracking, Multi-session reasoning, Knowledge updates, and Temporal reasoning.

I utilized Claude 3.5 Sonnet via n1n.ai for the extraction and reasoning backbone, leveraging its superior instruction-following capabilities. The evaluation judge was GPT-4o, following the methodology established in the original LongMemEval paper.

The Iterative Optimization Process

Run 1: The Baseline (91.0% Overall)

The first run revealed two gaps. First, the system lacked session timestamps, making temporal reasoning impossible. Second, the verbatim search was too coarse. By piping session dates into the ingestion pipeline and indexing individual chat turns in FTS5, the baseline hit a strong 91%.
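The per-turn indexing fix can be sketched with SQLite's FTS5 directly (assuming SQLite is compiled with FTS5, as CPython's bundled build is; the table and column names here are illustrative, not Memento's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per chat turn, not per session, so matches are fine-grained.
con.execute(
    "CREATE VIRTUAL TABLE turns USING fts5(session_id, spoken_at, content)"
)
con.executemany(
    "INSERT INTO turns VALUES (?, ?, ?)",
    [
        ("s1", "2024-09-12", "Kickoff for Project Alpha with John Smith"),
        ("s1", "2024-09-12", "Budget is still under review"),
        ("s2", "2024-10-03", "Alpha budget approved at 2M"),
    ],
)
# MATCH retrieves only the relevant turns, ranked by FTS5's built-in
# relevance score, instead of a whole coarse session blob.
rows = con.execute(
    "SELECT session_id, content FROM turns WHERE turns MATCH 'budget' ORDER BY rank"
).fetchall()
for sid, text in rows:
    print(sid, text)
```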

Run 2: The 'More is Better' Fallacy (89.6% Overall)

I attempted to improve multi-session reasoning by doubling the retrieval window (top_k 10 → 20, context 4K → 8K tokens). Surprisingly, accuracy dropped. This is a phenomenon known as Context Dilution. When you flood the prompt with 8K tokens of loosely related data, the LLM's 'needle-in-a-haystack' performance degrades, leading to second-guessing and hallucinations.

Run 3: Adaptive Retrieval (90.8% Overall)

Instead of widening the window for every query, I implemented Adaptive Retrieval. The system classifies the query first: is it 'Wide' (e.g., 'How many times did we mention X?') or 'Narrow' (e.g., 'What is John's phone number?')? The retrieval parameters are then adjusted dynamically.
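A minimal sketch of the adaptive step, using a simple keyword heuristic as the query classifier (the cue words and parameter values below are illustrative assumptions, not the system's actual classifier or tuning):

```python
WIDE_CUES = ("how many", "how often", "every time", "all the", "summarize")

def retrieval_params(query: str) -> dict:
    """Pick top_k / context budget based on query breadth."""
    wide = any(cue in query.lower() for cue in WIDE_CUES)
    if wide:
        # Aggregation questions genuinely need many snippets.
        return {"top_k": 20, "max_context_tokens": 8000}
    # Pinpoint questions stay narrow to avoid context dilution.
    return {"top_k": 5, "max_context_tokens": 2000}

print(retrieval_params("How many times did we mention Alpha?"))
print(retrieval_params("What is John's phone number?"))
```

The design point is that the wide budget is paid only when the question demands it, so narrow queries keep the precision that Run 2 lost.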

Implementation Guide

You can integrate Memento into your Python projects or use it as an MCP (Model Context Protocol) server for tools like Claude Desktop or Cursor.

Installation

# Install with your preferred provider
pip install "memento-memory[anthropic]"
# Or use n1n.ai for multi-model flexibility

Basic Usage in Python

from memento import MemoryStore

# Initialize the store with bitemporal support
store = MemoryStore()

# Ingesting information
store.ingest("John Smith was promoted to SVP of Sales on Oct 1st.")
store.ingest("Alpha Corp's current strategy is aggressive expansion.")

# Recalling with graph traversal
# This doesn't just search vectors; it walks the 'John Smith' -> 'Alpha Corp' relationship
context = store.recall("Give me a briefing on John's current status.")
print(context.text)

Key Lessons for AI Engineers

  1. Retrieval Quality > Quantity: A focused 4K token context consistently beats a cluttered 8K context. Prioritize precision over recall to avoid confusing the model.
  2. Avoid Multi-Pass Chains: Every additional LLM call (self-verification, chain-of-thought validation) introduces a 'probability of corruption.' The simplest pipeline that achieves the goal is usually the most robust.
  3. Entity Resolution is the Secret Sauce: The ability to merge disparate mentions of the same entity into a single node is what transforms a 'search engine' into a 'memory system.'
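The tiered matching idea (Exact > Fuzzy > later tiers) can be sketched with the standard library alone. This is a toy with a flat alias table; the phonetic, embedding, and LLM-tiebreaker tiers from the architecture section are omitted for brevity, and `KNOWN` and `resolve` are hypothetical names:

```python
import difflib

KNOWN = {"john smith": "John Smith", "alpha corp": "Alpha Corp"}

def resolve(mention: str):
    key = mention.strip().lower()
    if key in KNOWN:                        # Tier 1: exact match
        return KNOWN[key], "exact"
    close = difflib.get_close_matches(key, KNOWN, n=1, cutoff=0.8)
    if close:                               # Tier 2: fuzzy (edit-distance-like)
        return KNOWN[close[0]], "fuzzy"
    return None, "unresolved"               # hand off to later tiers

print(resolve("John Smith"))
print(resolve("Jon Smith"))     # typo → caught by the fuzzy tier
print(resolve("the sales VP"))  # needs a context-aware tier (embedding/LLM)
```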

To build these systems at scale, you need reliable access to the world's best models. Using an aggregator like n1n.ai allows you to switch between Claude for extraction and GPT for evaluation without changing your infrastructure.

Conclusion

Moving from vector-based RAG to a Bitemporal Knowledge Graph is the key to unlocking true long-term memory for AI agents. By focusing on entity relationships and temporal logic, we can build agents that don't just search—they remember.

Get a free API key at n1n.ai.