Model Reviews
Frontier Models Struggle with Enterprise IT Tasks in ITBench-AA Benchmark
ITBench-AA, the first comprehensive benchmark for agentic enterprise IT tasks, shows that even leading models such as Claude 3.5 Sonnet and GPT-4o score below 50%, pointing to a wide gap between current model capabilities and what enterprise IT automation requires.