Building an AI that Audits Other AI Agents: A Guide to A2A Production Systems
By Nino, Senior Tech Editor
The audit report arrived at 2:47 am. I wasn't expecting it—I had triggered the test run before bed, more out of habit than anticipation. But there it was: a comprehensive score, six dimension breakdowns, and a remediation plan with specific line numbers. The auditor was an AI. The subject being audited was also an AI. This exchange involved seven turns of natural language conversation with zero human intervention. This is what Agent-to-Agent (A2A) communication looks like in a production environment.
The Hidden Crisis of Token Waste
Most engineering teams building on Large Language Models (LLMs) focus on output quality, latency, and user satisfaction. However, token efficiency—the ratio of useful information to total tokens consumed—is rarely measured. In our production testing, we found that agents consistently waste between 40% and 60% of their token budget on avoidable inefficiencies. This isn't just a technical debt issue; it is a massive financial drain.
Common sources of waste include:
- Bloated System Prompts: Carrying 3x more context than the specific task requires.
- Suboptimal Model Selection: Using heavy models like GPT-4o or Claude 3.5 Sonnet for tasks that a lighter model could handle.
- Noisy RAG Context: Retrieved data where 80% is irrelevant to the query.
- Redundant Calls: Identical requests made without any caching mechanism.
- Sequential Overhead: Requests that could be batched but are sent one by one.
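To make the metric concrete, token efficiency can be expressed as a simple ratio of useful to total tokens. The helper below is illustrative, not part of any SDK, and the numbers are hypothetical:

```python
def token_efficiency(useful_tokens: int, total_tokens: int) -> float:
    """Ratio of useful information to total tokens consumed (0.0 to 1.0)."""
    if total_tokens == 0:
        return 0.0
    return useful_tokens / total_tokens

# Example: a call that consumed 4,000 tokens, of which only 1,800
# directly contributed to the answer (the rest was boilerplate system
# prompt and irrelevant RAG context) -- squarely in the 40-60% waste band.
ratio = token_efficiency(1800, 4000)
print(f"Efficiency: {ratio:.0%}, waste: {1 - ratio:.0%}")  # Efficiency: 45%, waste: 55%
```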
To manage these costs effectively, developers often turn to n1n.ai, which provides a unified interface for high-speed LLM access, allowing for easier switching between models like DeepSeek-V3 and OpenAI o3 to optimize for cost and performance.
Why Dashboards Fail and A2A Succeeds
Initially, we built a standard observability dashboard. It tracked metrics and logs, but it failed to solve the underlying problem. Logs only show what happened; they don't explain why it happened. The most significant inefficiencies are architectural. They reside in how an agent is designed to 'think'—which prompts it selects, how it routes to different models, and how it manages memory. You cannot infer intent from a raw JSON log.
The solution was to ask. We built 'Gary,' an auditing agent designed to interrogate other agents. Gary uses a natural language interrogation strategy to extract architectural details without requiring any SDK integration or code changes on the target agent's side.
The Seven Pillars of Agent Interrogation
Gary follows a strict protocol of seven questions to evaluate the target agent's architecture:
- Model Routing: How do you decide between different models for different tasks?
- System Prompt Scope: What is the approximate length and purpose of your system prompt?
- Context Handling: What logic determines which information is included in the context window?
- Output Constraints: Do you enforce specific response lengths or formats?
- Retrieval Strategy: If you use RAG, how are chunks retrieved and ranked?
- Caching Logic: Do you store previous LLM responses to avoid repeat calls?
- Batching Capabilities: Can you group multiple user intents into a single inference call?
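The seven-pillar protocol above can be sketched as a plain question loop. Here `ask_agent` is a hypothetical callable standing in for whatever transport carries the conversation with the target agent:

```python
INTERROGATION_PROTOCOL = [
    ("model_routing", "How do you decide between different models for different tasks?"),
    ("system_prompt_scope", "What is the approximate length and purpose of your system prompt?"),
    ("context_handling", "What logic determines which information is included in the context window?"),
    ("output_constraints", "Do you enforce specific response lengths or formats?"),
    ("retrieval_strategy", "If you use RAG, how are chunks retrieved and ranked?"),
    ("caching_logic", "Do you store previous LLM responses to avoid repeat calls?"),
    ("batching", "Can you group multiple user intents into a single inference call?"),
]

def run_audit(ask_agent) -> dict:
    """Ask each pillar question in order and collect the raw answers.

    `ask_agent` is any callable that sends one natural-language question
    to the target agent and returns its text reply.
    """
    return {key: ask_agent(question) for key, question in INTERROGATION_PROTOCOL}
```

Because the protocol is just natural language, the same loop works against any agent that can hold a conversation, which is what makes the zero-integration claim possible.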
By analyzing the responses, Gary calculates a score across six dimensions. For instance, an audit of a RAG-based support bot might reveal that it routes simple FAQ queries to expensive models. By switching these simple tasks to a model like GPT-4o-mini via n1n.ai, teams can save up to 45% on model spend immediately.
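A minimal routing layer of the kind such an audit recommends might look like the sketch below. The keyword heuristic and the model names are illustrative assumptions; a production router would more likely use a small classifier or embedding similarity:

```python
def route_model(query: str) -> str:
    """Pick a model tier with a cheap heuristic.

    The marker list is a placeholder for a real intent classifier,
    and the model names are examples, not a recommendation.
    """
    simple_markers = ("refund policy", "opening hours", "reset password", "shipping")
    if any(marker in query.lower() for marker in simple_markers):
        return "gpt-4o-mini"  # cheap tier for FAQ-style queries
    return "gpt-4o"           # heavy tier for everything else

print(route_model("What is your refund policy?"))      # gpt-4o-mini
print(route_model("Debug this stack trace for me"))    # gpt-4o
```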
Technical Implementation: The A2A Protocol
The audit implementation uses an emerging A2A protocol based on JSON-RPC over HTTPS. It utilizes tasks/send and tasks/get methods with Server-Sent Events (SSE) for streaming responses. Below is a simplified example of how a client agent initiates an audit task:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tasks/send",
  "params": {
    "id": "audit-session-xyz",
    "message": {
      "role": "user",
      "parts": [
        { "type": "text", "text": "Initiate architectural audit. Target: CustomerServiceBot v2." }
      ]
    }
  }
}
```
When the target agent responds, Gary processes the natural language to extract specific architectural patterns. This is where the power of modern LLMs like DeepSeek-V3 shines; they are capable of high-level reasoning about system design based on conversational cues.
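For illustration, the same tasks/send request could be issued from Python with only the standard library. The endpoint URL is hypothetical and error handling is omitted; this is a sketch of the envelope, not a full A2A client:

```python
import json
import urllib.request

A2A_ENDPOINT = "https://agents.example.com/a2a"  # hypothetical audit endpoint

def build_task_payload(task_id: str, text: str) -> dict:
    """Construct a tasks/send JSON-RPC envelope."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tasks/send",
        "params": {
            "id": task_id,
            "message": {"role": "user", "parts": [{"type": "text", "text": text}]},
        },
    }

def send_task(task_id: str, text: str) -> dict:
    """POST the request over HTTPS and return the parsed JSON-RPC response."""
    body = json.dumps(build_task_payload(task_id, text)).encode("utf-8")
    req = urllib.request.Request(
        A2A_ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```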
Pro Tip: The Self-Awareness Gap
One surprising finding from our production runs is that agents are remarkably honest about their design flaws, except for context usage. While an agent can accurately describe its system prompt, it almost always underestimates the actual token count of its context window. This is why a hybrid approach is best: use A2A for architectural discovery and pair it with the high-performance throughput of n1n.ai to monitor real-time token consumption.
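To make the gap concrete, the sketch below compares an agent's self-reported context size against an independently measured estimate. The 4-characters-per-token heuristic is a crude stand-in for a real tokenizer, and the figures are invented for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English text).

    A production monitor would use the provider's own tokenizer; this
    heuristic only illustrates the comparison.
    """
    return max(1, len(text) // 4)

def self_awareness_gap(claimed_tokens: int, actual_context: str) -> float:
    """Fraction by which the agent underestimates its real context size."""
    measured = estimate_tokens(actual_context)
    return (measured - claimed_tokens) / measured

# Hypothetical: the agent claims ~900 tokens of context, but the
# captured request body measures ~1,750 tokens by the heuristic.
context = "system prompt " * 500
gap = self_awareness_gap(claimed_tokens=900, actual_context=context)
```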
Optimization Metrics and Remediation
A typical remediation plan provided by the A2A auditor looks like this:
| Dimension | Score | Finding | Remediation |
|---|---|---|---|
| Model Fit | 62/100 | Over-reliance on GPT-4o for simple logic. | Implement a routing layer for GPT-4o-mini. |
| Context Window | 71/100 | Prepending full history to every call. | Use a sliding window with summarization. |
| RAG Efficiency | 55/100 | Chunk size is too large (2000 tokens). | Reduce chunk size to 512 tokens + reranking. |
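The "sliding window with summarization" remediation from the table can be sketched in a few lines. The summary line here is a placeholder string; a real implementation would call a cheap model to compress the older turns:

```python
def sliding_window(history: list[str], keep_last: int = 4) -> list[str]:
    """Keep only the most recent turns; fold the rest into one summary entry."""
    if len(history) <= keep_last:
        return history
    older, recent = history[:-keep_last], history[-keep_last:]
    summary = f"[summary of {len(older)} earlier turns]"  # placeholder summarization
    return [summary] + recent

turns = [f"turn {i}" for i in range(10)]
print(sliding_window(turns))
# ['[summary of 6 earlier turns]', 'turn 6', 'turn 7', 'turn 8', 'turn 9']
```

This replaces the "prepend full history to every call" pattern flagged in the Context Window row: the prompt stays a bounded size no matter how long the conversation runs.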
Conclusion
If you are building complex agentic workflows, flying blind on token efficiency is no longer an option. Implementing an A2A auditing layer allows you to treat your AI infrastructure as a living, self-optimizing system. By combining these insights with the reliable, multi-model API access provided by n1n.ai, you can ensure your AI agents are both intelligent and economically sustainable.
Get a free API key at n1n.ai