Frontier Models Score Below 50% on ITBench-AA for Enterprise IT Tasks
Author: Nino, Senior Tech Editor
The transition from Large Language Models (LLMs) as simple chatbots to autonomous 'Agents' is the defining trend of 2025. However, a new reality check has arrived in the form of ITBench-AA, a benchmark co-developed by Artificial Analysis and IBM. The results are sobering: even the most advanced frontier models fail to reach a 50% success rate on complex, real-world enterprise IT tasks. For developers using n1n.ai to power their automation pipelines, these findings show where current technology stands and what it will take to close the gap.
What is ITBench-AA?
ITBench-AA (Artificial Analysis IT Benchmark) is the first comprehensive evaluation framework designed specifically for Agentic Enterprise IT Tasks. Unlike general reasoning benchmarks (like MMLU) or coding benchmarks (like HumanEval), ITBench-AA simulates the messy, multi-step environments of a corporate IT department.
It evaluates models across several domains:
- System Administration: Managing users, permissions, and configurations.
- Cloud Infrastructure: Provisioning and troubleshooting AWS/Azure/GCP resources.
- Database Management: Schema migrations, query optimization, and recovery.
- Security & Compliance: Identifying vulnerabilities and enforcing policy.
The benchmark requires models to act as agents, meaning they must use tools (CLI, APIs, documentation) to solve problems over multiple turns. When testing these models through high-performance aggregators like n1n.ai, developers can see firsthand how latency and tool-calling accuracy impact these scores.
The Performance Gap: Why Frontier Models are Struggling
According to the report, the top-performing models, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, all scored below 50%. This points to an 'Agentic Gap' between simple instruction following and reliable multi-step task execution.
| Model | Overall Score (ITBench-AA) | Key Weakness |
|---|---|---|
| Claude 3.5 Sonnet | ~48% | Multi-step state tracking |
| GPT-4o | ~46% | Over-reliance on 'hallucinated' CLI flags |
| Gemini 1.5 Pro | ~42% | Context window management in long loops |
| Llama 3.1 405B | ~38% | Tool-calling syntax errors |
The 'State Tracking' Problem
In a typical IT task, such as 'Troubleshoot a failing Kubernetes pod and fix the underlying storage issue,' the model must maintain an accurate mental model of the system state. As the agent executes commands and receives errors, it often loses track of previous attempts, leading to repetitive loops or catastrophic failures. Accessing these models via a stable API provider like n1n.ai ensures that the networking layer isn't the bottleneck, allowing developers to focus on refining the agent's logic.
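One way to reduce the repetitive loops described above is to keep the attempt history outside the model and check it before each new command. The following is a minimal sketch, not part of ITBench-AA or any n1n.ai SDK; the `AgentState` class, its method names, and the sample command outcomes are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Explicit record of what the agent has already tried (hypothetical helper)."""
    attempts: list = field(default_factory=list)            # (command, outcome) pairs, in order
    command_counts: Counter = field(default_factory=Counter)

    def record(self, command: str, outcome: str) -> None:
        """Store one executed command and what came back."""
        self.attempts.append((command, outcome))
        self.command_counts[command] += 1

    def is_looping(self, command: str, max_repeats: int = 2) -> bool:
        """True if this exact command has already been tried max_repeats times."""
        return self.command_counts[command] >= max_repeats

    def summary(self, last_n: int = 5) -> str:
        """Compact history of the most recent attempts, suitable for the next prompt."""
        recent = self.attempts[-last_n:]
        return "\n".join(f"{cmd} -> {outcome}" for cmd, outcome in recent)


# Illustrative usage with made-up outcomes
state = AgentState()
state.record("kubectl describe pod web-0", "error: FailedMount")
state.record("kubectl describe pod web-0", "error: FailedMount")
print(state.is_looping("kubectl describe pod web-0"))  # True
print(state.is_looping("kubectl get pvc"))             # False
```

Passing `state.summary()` into each prompt keeps the history short and ordered, and `is_looping` gives the orchestration code a cheap signal to change strategy or escalate instead of retrying the same command.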
Technical Deep Dive: Building Resilient IT Agents
To overcome the sub-50% performance, developers must move beyond simple 'Zero-shot' prompting. Implementing a robust agentic loop requires sophisticated error handling and state management. Below is a conceptual implementation using a multi-model approach, which can be easily executed through the n1n.ai unified API.
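
    import n1n_sdk  # Hypothetical SDK for n1n.ai

    # Initialize the client with n1n.ai
    client = n1n_sdk.Client(api_key="YOUR_N1N_KEY")

    MAX_RETRIES = 3  # Cap self-correction so a failing step cannot loop forever


    def execute_it_task(task_description):
        # it_tools_definition and verify_system_state are placeholders the caller must supply
        state = {"history": [], "status": "pending"}

        # Use a high-reasoning model for planning
        plan = client.chat(model="o3-mini", prompt=f"Plan this IT task: {task_description}")

        for step in plan.steps:
            for _ in range(MAX_RETRIES):
                # Execute action using a fast, tool-optimized model
                response = client.chat(
                    model="claude-3-5-sonnet",
                    messages=[{"role": "user", "content": f"Execute: {step}. Current State: {state}"}],
                    tools=it_tools_definition,
                )

                # Critical: Verify the output
                verification = verify_system_state(response)
                if verification.success:
                    state["history"].append(f"Completed {step}")
                    break

                # Self-correction loop: record the failure so the retry can see it
                state["history"].append(f"Failed {step}: {verification.error}")
            else:
                # Retries exhausted: stop instead of reporting success
                state["status"] = "failed"
                return f"Task Failed at step: {step}"

        state["status"] = "completed"
        return "Task Completed"
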
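The snippet above leaves `it_tools_definition` and `verify_system_state` undefined. The sketch below shows one plausible shape for each. It assumes a JSON-Schema style tool list similar to what several function-calling APIs accept, and a tool result delivered as a dict with `exit_code` and `stderr` keys. Both shapes, the `run_shell_command` tool name, and the `Verification` class are invented for this example and may not match what n1n.ai expects.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool list in a JSON-Schema style; the real format may differ.
it_tools_definition = [
    {
        "type": "function",
        "function": {
            "name": "run_shell_command",
            "description": "Run a shell command on the target host and return its output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string", "description": "Command to run"},
                },
                "required": ["command"],
            },
        },
    }
]


@dataclass
class Verification:
    """Result of checking one executed step."""
    success: bool
    error: Optional[str] = None


def verify_system_state(response: dict) -> Verification:
    """Treat a step as successful only if the tool reported exit code 0.

    Assumes `response` is a dict with "exit_code" and "stderr" keys;
    this shape is an assumption made for the sketch.
    """
    exit_code = response.get("exit_code")
    if exit_code == 0:
        return Verification(success=True)
    return Verification(success=False, error=response.get("stderr") or f"exit code {exit_code}")


print(verify_system_state({"exit_code": 0}).success)                              # True
print(verify_system_state({"exit_code": 1, "stderr": "permission denied"}).error)  # permission denied
```

Checking an observable signal such as an exit code, rather than asking the model whether the step worked, keeps verification independent of the model being evaluated.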
Pro Tips for Enterprise AI Implementation
- Tool-Calling Precision: IT tasks often fail because the model generates a CLI command with a non-existent flag. Use Pydantic schemas to strictly define and validate the arguments the model passes to each tool before anything is executed.
- Hybrid RAG: Don't rely on the model's internal knowledge of software versions. Provide real-time documentation snippets via a RAG (Retrieval-Augmented Generation) pipeline.
- Latency Matters: In multi-turn agentic loops, high latency compounds. If an agent takes 10 turns to solve a task, a 500ms difference in API response time adds up to 5 seconds of idle time. This is why low-latency providers like n1n.ai are essential for production agents.
- Human-in-the-loop (HITL): Since models score < 50%, always implement a 'check-point' for destructive actions (e.g., `rm -rf` or database drops).
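Three of the tips above can be sketched as small guard functions. This example uses only the standard library as a stand-in for the Pydantic approach, so it checks flags against a hand-written allowlist rather than a schema. The allowlist contents, the destructive-action markers, and all function names are assumptions for illustration; a real deployment would derive them from actual tool documentation and policy.

```python
import shlex

# Hypothetical allowlist of flags the agent may use, keyed by program name.
ALLOWED_FLAGS = {
    "ls": {"-l", "-a", "-h"},
    "df": {"-h"},
}

# Hypothetical markers of destructive actions that need human sign-off.
DESTRUCTIVE_MARKERS = ("rm -rf", "drop database", "drop table")


def unknown_flags(command: str) -> list:
    """Return flags in `command` that are not on the allowlist for its program."""
    tokens = shlex.split(command)
    if not tokens:
        return []
    allowed = ALLOWED_FLAGS.get(tokens[0], set())
    return [t for t in tokens[1:] if t.startswith("-") and t not in allowed]


def needs_human_approval(command: str) -> bool:
    """True if the command contains any destructive marker (case-insensitive)."""
    normalized = " ".join(command.lower().split())
    return any(marker in normalized for marker in DESTRUCTIVE_MARKERS)


def added_idle_seconds(turns: int, extra_latency_ms: float) -> float:
    """Extra wall-clock time when every turn makes one call that is slower by extra_latency_ms."""
    return turns * extra_latency_ms / 1000


print(unknown_flags("ls -l --fast"))             # ['--fast']
print(needs_human_approval("sudo rm -rf /data"))  # True
print(added_idle_seconds(10, 500))                # 5.0
```

The flag check is deliberately strict: combined short flags such as `-la` are reported as unknown unless listed explicitly, which errs on the side of rejecting a command rather than running one the allowlist does not cover.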
The Future of IT Automation
The ITBench-AA results are not a sign of failure, but a baseline for growth. As models evolve from 'Reasoning' (o1/o3 series) to 'Action' (Agentic), we expect these scores to climb. For enterprises, the strategy should be to start with 'Copilot' workflows, where the AI suggests actions and a human approves them, and gradually move toward full autonomy as benchmark scores improve.
By leveraging the diverse model ecosystem available on n1n.ai, organizations can swap models as soon as a new leader emerges on the ITBench-AA leaderboard without rewriting their entire integration layer.
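Swapping models without touching the integration layer usually comes down to keeping model identifiers in configuration rather than in call sites. A minimal sketch, assuming the planner and executor roles from the earlier snippet; the `DEFAULT_MODELS` mapping and `resolve_model` function are hypothetical names, not part of any n1n.ai API.

```python
# Role-to-model mapping kept in one place (hypothetical configuration).
DEFAULT_MODELS = {
    "planner": "o3-mini",
    "executor": "claude-3-5-sonnet",
}


def resolve_model(role: str, overrides: dict = None) -> str:
    """Return the model ID for a role, letting overrides replace the defaults."""
    models = {**DEFAULT_MODELS, **(overrides or {})}
    if role not in models:
        raise KeyError(f"No model configured for role {role!r}")
    return models[role]


print(resolve_model("planner"))                                   # o3-mini
print(resolve_model("executor", {"executor": "new-model-id"}))    # new-model-id
```

With this in place, moving to a new benchmark leader is a one-line configuration change, and the agent loop keeps calling `resolve_model("executor")` unchanged.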
Get a free API key at n1n.ai