Building Reliable AI Agents with Reflexion and LangGraph
By Nino, Senior Tech Editor
An LLM that cannot reflect on its mistakes is not an agent — it is an autocomplete on steroids. You have probably seen it: you give a large language model (LLM) a complex task — write a research report, debug a multi-file codebase, plan a multi-step strategy — and it confidently produces something that sounds right but is subtly, sometimes catastrophically, wrong. It has no mechanism to pause, question itself, or try again with a better strategy. This is where high-performance API aggregators like n1n.ai come in, providing the underlying stability and speed required for the iterative loops that define modern agentic systems.
This article bridges the gap between raw prompting and reliable systems. We will explore how to build self-improving AI agents using Reflexion (a technique where agents critique and retry their own outputs) and LangGraph (a framework for stateful, graph-based workflows).
The Structural Weakness of Raw LLMs
Large language models are extraordinarily powerful pattern completers. Whether you are using OpenAI o3, Claude 3.5 Sonnet, or DeepSeek-V3, the base model generates in a single left-to-right pass: it reads the input and emits tokens until it stops. There is no internal loop, no self-checking, no backtracking. This leads to four primary failure modes:
- Hallucination: Inventing facts that sound authoritative but do not exist.
- Premature Convergence: Settling on the first reasonable answer without exploring better alternatives.
- Context Blindness: Losing track of constraints as tasks grow in scale.
- Silent Failure: Unlike software crashes, a wrong LLM output looks identical to a correct one.
For developers building on n1n.ai, these failures represent an architectural challenge, not just a prompting issue. To solve them, we must move from linear chains to iterative graphs.
The Reflexion Framework: Verbal Reinforcement Learning
Introduced by Shinn et al. (2023), Reflexion is an inference-time technique that turns a static model into a self-improving agent without fine-tuning. It consists of three core components:
- The Actor: The LLM that performs the task. It takes the task description and past memory to generate an output.
- The Evaluator (Critic): A function (or another LLM) that scores the output. This could be a unit test, a linter, or a factuality checker.
- The Reflector: An LLM that analyzes the output and the evaluator's feedback to produce a verbal self-critique. This critique is stored in episodic memory for the next attempt.
By externalizing the critique into natural language, we leverage the LLM's ability to reason about its own errors. On benchmarks such as HotpotQA, adding reflection cycles has been reported to roughly double accuracy (for example, from around 30% to 60%) with no change to the underlying model.
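Before introducing LangGraph, the three components can be sketched as a plain-Python loop. This is a minimal sketch, not the paper's reference implementation: the actor, evaluator, and reflector are passed in as callables so the control flow is visible without any LLM calls.

```python
from typing import Callable, List, Tuple

def reflexion_loop(
    task: str,
    actor: Callable[[str, List[str]], str],     # task + past critiques -> attempt
    evaluator: Callable[[str], float],          # attempt -> score in [0, 1]
    reflector: Callable[[str, float], str],     # attempt + score -> verbal critique
    max_iterations: int = 3,
    threshold: float = 0.8,
) -> Tuple[str, float]:
    """Run Actor -> Evaluator -> Reflector cycles until the score passes
    the threshold or the retry budget runs out. Returns the best attempt."""
    reflections: List[str] = []
    best_attempt, best_score = "", 0.0
    for _ in range(max_iterations):
        attempt = actor(task, reflections)
        score = evaluator(attempt)
        if score > best_score:
            best_attempt, best_score = attempt, score
        if score >= threshold:
            break
        # Episodic memory: store the critique for the next attempt
        reflections.append(reflector(attempt, score))
    return best_attempt, best_score
```

In a real system the three callables wrap LLM or tool calls; here the loop structure is the point: generate, score, critique, remember, retry.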
Building the Backbone with LangGraph
While LangChain is great for simple sequences, LangGraph is designed for cycles. It treats agent workflows as directed graphs where nodes are functions and edges are transitions. This is critical for Reflexion because it allows for:
- Explicit State Management: The agent's "working memory" is a typed Python object.
- Conditional Branching: Routes like "If score < 0.8 -> reflect_node" are natively supported.
- Persistence: Checkpointing allows agents to pause and resume, which is vital for long-running tasks.
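The conditional-branching point above maps to an ordinary routing function in LangGraph: it inspects the state and returns the name of the next node. A hedged sketch, where the 0.8 threshold and the node names are illustrative choices, not fixed by the framework:

```python
def route_after_evaluation(state: dict) -> str:
    """Decide the next node: stop on success or an exhausted budget,
    otherwise send the agent to the reflection node."""
    passed = bool(state["scores"]) and state["scores"][-1] >= 0.8
    exhausted = state["iteration"] >= state["max_iterations"]
    return "end" if passed or exhausted else "reflect_node"
```

LangGraph calls a function like this on every pass through the conditional edge, which is exactly the cycle a linear chain cannot express.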
Step-by-Step Implementation
1. Define the State
The ReflexionState must capture the history of the agent's attempts and critiques.
```python
from typing import List, TypedDict

class ReflexionState(TypedDict):
    task: str
    attempts: List[str]
    reflections: List[str]
    scores: List[float]
    iteration: int
    max_iterations: int
```
2. The Actor Node
The Actor must be aware of its previous failures. This is where using a high-context model like Claude 3.5 Sonnet via n1n.ai excels, as it can process long histories of critiques accurately.
```python
def actor_node(state: ReflexionState):
    # Inject the history of past attempts and critiques into the prompt
    history = ""
    for i, (att, ref) in enumerate(zip(state["attempts"], state["reflections"])):
        history += f"\nAttempt {i + 1}: {att}\nCritique {i + 1}: {ref}"

    prompt = f"Task: {state['task']}\n{history}\n\nGenerate a better response:"

    # `llm` is assumed to be a chat model client (e.g. Claude 3.5 Sonnet via n1n.ai)
    response = llm.invoke(prompt)
    return {
        **state,
        "attempts": state["attempts"] + [response.content],
        "iteration": state["iteration"] + 1,
    }
```
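Critique histories grow with every cycle, so even a high-context model benefits from a bound on how much history enters the prompt. A small sketch of one approach, keeping only the most recent attempt/critique pairs; the cap of 3 is an arbitrary illustrative choice:

```python
from typing import List

def build_history(attempts: List[str], reflections: List[str], limit: int = 3) -> str:
    """Format only the `limit` most recent attempt/critique pairs
    so the prompt stays bounded as iterations accumulate."""
    pairs = list(zip(attempts, reflections))[-limit:]
    return "\n".join(f"Attempt: {att}\nCritique: {ref}" for att, ref in pairs)
```

Dropping older pairs is the simplest policy; summarizing them with a cheap model is a common alternative when early critiques still matter.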
3. Choosing the Evaluator
Your evaluator determines the quality ceiling of the system. Refer to this table for selection logic:
| Task Type | Best Evaluator |
|---|---|
| Code Generation | Unit test runner (Deterministic) |
| Research/Q&A | Fact-check LLM prompt |
| API Integration | JSON Schema validation |
| Math | Python eval() or symbolic solver |
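For the API-integration row, a deterministic evaluator can be as simple as parsing the output and checking for required fields. A minimal sketch using only the standard library; the scoring rule (fraction of required keys present) is an illustrative choice:

```python
import json
from typing import List

def evaluate_json_output(output: str, required_keys: List[str]) -> float:
    """Score 0.0 for invalid JSON; otherwise return the fraction
    of required keys present in the parsed object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict) or not required_keys:
        return 0.0
    present = sum(1 for key in required_keys if key in data)
    return present / len(required_keys)
```

A graded score like this gives the Reflector more to work with than a pass/fail bit: "half the required keys are missing" is actionable feedback.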
4. The Reflexion Node
The Reflector must be precise. Avoid vague feedback like "do better." Force the model to identify specific lines of code or logical fallacies.
```python
def reflect_node(state: ReflexionState):
    last_attempt = state["attempts"][-1]
    last_score = state["scores"][-1]
    prompt = (
        f"Your last attempt:\n{last_attempt}\n"
        f"It scored {last_score}. Identify exactly what went wrong "
        f"and how to fix it. Be specific."
    )
    reflection = llm.invoke(prompt).content
    return {**state, "reflections": state["reflections"] + [reflection]}
```
Pro Tip: Optimizing for Cost and Latency
Every reflection cycle adds API calls: one each for the Actor, the Evaluator, and the Reflector. If you are running 5 cycles with a GPT-4o-class model, expect costs and latency to grow roughly fivefold.

Strategy: Asymmetric Modeling. Use a high-reasoning model (like OpenAI o3) for the Actor and a faster, cheaper model (like DeepSeek-V3) for the Evaluator and Reflector. By using n1n.ai, you can switch between these models seamlessly through a single unified API, ensuring you only pay for high-reasoning power when it is actually doing the work.
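One way to express asymmetric modeling is a per-role model map that each node consults before making its call. The model identifiers below are illustrative placeholders, not guaranteed n1n.ai model IDs:

```python
# Illustrative role-to-model routing; the model names are placeholders.
MODEL_BY_ROLE = {
    "actor": "openai/o3",                  # expensive, high-reasoning: does the real work
    "evaluator": "deepseek/deepseek-v3",   # cheap and fast: scores outputs
    "reflector": "deepseek/deepseek-v3",   # cheap and fast: writes critiques
}

def model_for(role: str) -> str:
    """Return the model ID a node should use, defaulting to the cheap model."""
    return MODEL_BY_ROLE.get(role, "deepseek/deepseek-v3")
```

Centralizing the mapping makes cost experiments trivial: swap one dictionary entry and rerun, rather than editing every node.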
Real-World Success: Devin and AlphaCode 2
The industry is moving toward this architecture. Devin, the AI software engineer, uses a Reflexion-like loop: it writes code, runs it in a terminal, and iterates on the error output. On SWE-bench, this approach lifted solve rates significantly because the "Evaluator" was real terminal output rather than a model's self-assessment. AlphaCode 2 applies the same principle at scale, sampling many candidate programs and filtering them by executing them against test cases.
Summary of Trade-offs
| Pros | Cons |
|---|---|
| Significantly higher accuracy | Increased token cost |
| No fine-tuning required | Higher latency (not for real-time chat) |
| Fully debuggable state | Risk of "reflection poisoning" |
Conclusion
The move from simple LLMs to reliable AI systems is not about finding a "magic prompt." It is about building a robust architectural scaffolding. Reflexion provides the cognitive loop, while LangGraph provides the execution infrastructure. Together, they transform an autocomplete tool into a production-grade agent.
Get a free API key at n1n.ai