Benchmark Results: SmolLM3 3B and Phi-4-mini Lead Agent Coding Tests
By Nino, Senior Tech Editor
The landscape of Large Language Models (LLMs) has long been dominated by the philosophy that 'bigger is better.' However, the second round of the Works With Agents agent coding benchmark has completely upended this narrative. In a comprehensive test involving 32 models—a significant increase from the initial 10—the results have sent shockwaves through the developer community. A 3-billion-parameter model from Hugging Face, SmolLM3 3B, didn't just compete; it dominated, scoring a 93.3 and leaving frontier models like Claude Sonnet 4 and various GPT-5 iterations in the dust.
For developers looking to integrate these high-performance models into their production environments, n1n.ai provides a unified API to access the latest small and large language models with industry-leading stability. This shift toward 'Small Language Models' (SLMs) for specialized tasks like agentic coding suggests that efficiency and architectural focus might be more critical than raw parameter count.
The Rise of the Tiny Giants: Benchmark Overview
The benchmark results highlight a surprising trend: the top of the leaderboard is crowded with models that can comfortably run on a modern laptop. SmolLM3 3B secured the gold medal, followed closely by Microsoft’s Phi-4-mini. Even the Qwen2.5 variants (1.5B and 3B) managed to tie with the much larger Claude Sonnet 4.
| Rank | Model | Score |
|---|---|---|
| 🥇 | SmolLM3 3B | 93.3 |
| 🥈 | Phi-4-mini | 90.0 |
| 🥉 | Claude Sonnet 4 | 85.0 |
| 4 | Qwen2.5 1.5B | 85.0 |
| 5 | Qwen2.5 3B | 85.0 |
| 6 | Granite 3.2 2B | 82.5 |
| 7 | Ministral 3B | 81.7 |
| 8 | Mistral Large 3 | 79.6 |
| 9 | Gemma 4 31B | 78.3 |
| 10 | Gemma 4 26B A4B | 78.3 |
Why Small Models are Winning at Agentic Coding
Agentic coding is fundamentally different from standard code completion. It requires a model to operate within a loop, handling multi-file edits, executing shell commands, and recovering from its own errors. The Works With Agents benchmark evaluates models over 12 rigorous rounds, focusing on:
- Multi-file edits: Modifying Python scripts and shell files across a directory structure.
- Git operations: Performing clones, branching, and commits autonomously.
- Shell command execution: Interacting with the OS to run tests or build scripts.
- Bash scripting: Using complex pipes and redirects (e.g., `grep | awk | sed`).
- Error recovery: The ability to see a traceback and fix the code without human intervention (a minimal sketch of this loop follows the list).
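To make the error-recovery criterion concrete, here is a minimal sketch of the retry loop such a harness might run. The `fix_command` callback, which would ask the model to repair a failed command given its stderr, is a hypothetical stand-in; the benchmark's actual harness is not published.

```python
import subprocess
from typing import Callable

def run_with_recovery(
    command: str,
    fix_command: Callable[[str, str], str],  # hypothetical: model proposes a repaired command
    max_retries: int = 3,
) -> str:
    """Run a shell command; on failure, hand stderr back to the model for a fix."""
    for _ in range(max_retries):
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Error recovery: the model sees the traceback and proposes a corrected command.
        command = fix_command(command, result.stderr)
    raise RuntimeError(f"No working command after {max_retries} attempts")
```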
Small models like SmolLM3 3B seem to have a 'purity' of instruction following that larger models lose due to over-alignment or excessive 'reasoning' steps that lead to hallucination in tool-calling sequences. When you use n1n.ai to orchestrate these models, you can leverage this high efficiency for a fraction of the cost of frontier models.
The Failure of the 'Pro' Variants
One of the most startling findings was the underperformance of 'Pro' and 'Large' variants compared to their 'Flash' or 'Mini' counterparts.
| Model | Score |
|---|---|
| Claude Sonnet 4 | 85.0 |
| GPT-5.4 | 76.6 |
| Gemini 2.5 Flash | 76.4 |
| Grok 4.20 | 75.0 |
| DeepSeek V4 Flash | 60.0 |
| GPT-5.4 Pro | 51.6 |
| DeepSeek V4 Pro | 38.3 |
DeepSeek V4 Pro, despite its massive parameter count, scored a dismal 38.3, while its Flash variant achieved a 60.0. Similarly, GPT-5.5 Pro and GPT-5.4 Pro both underperformed their base models. This suggests that the 'reasoning' overhead in larger models might actually hinder their ability to execute straightforward tool calls efficiently. They often 'overthink' the solution, leading to unnecessary steps that penalize their efficiency score (which accounts for 30% of the total benchmark weight).
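The article states only that efficiency carries 30% of the total weight; assuming the remaining 70% is correctness and the combination is linear (an assumption, since the exact formula is not published), the penalty for step-heavy runs is easy to quantify:

```python
def benchmark_score(correctness: float, efficiency: float) -> float:
    """Hypothetical linear split: 70% correctness, 30% efficiency (assumed)."""
    return 0.7 * correctness + 0.3 * efficiency

# A model that solves everything but burns steps still bleeds points:
print(benchmark_score(correctness=100.0, efficiency=10.0))  # 73.0
print(benchmark_score(correctness=85.0, efficiency=90.0))   # 86.5
```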
Implementation Guide: Building an Agent with SmolLM3
To implement an agent using a high-scoring model like SmolLM3 3B via n1n.ai, you need to focus on the system prompt and tool-definition structure. Below is a Python example that initializes the client and issues a tool-enabled request; the surrounding agent loop is sketched afterwards.
```python
import openai

# Configure n1n.ai API access
client = openai.OpenAI(
    base_url="https://api.n1n.ai/v1",
    api_key="YOUR_N1N_API_KEY",
)

def run_agent_task(prompt: str):
    messages = [
        {"role": "system", "content": "You are a coding agent. Use shell tools to solve problems. Be concise."},
        {"role": "user", "content": prompt},
    ]
    # Using SmolLM3 3B for high efficiency
    response = client.chat.completions.create(
        model="smollm3-3b",
        messages=messages,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "execute_shell",
                    "description": "Run a shell command and return its output.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "command": {"type": "string"},
                        },
                        "required": ["command"],
                    },
                },
            }
        ],
    )
    return response
```
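The call above only requests a single completion; a full agent also has to execute whatever tool calls come back and loop the results into the next turn. Here is a minimal sketch of that dispatch step, reusing `run_agent_task` from above. The prompt is illustrative, and model-generated shell commands should only ever run inside a sandbox.

```python
import json
import subprocess

response = run_agent_task("List all Python files modified in the last commit.")
message = response.choices[0].message

for tool_call in message.tool_calls or []:
    if tool_call.function.name == "execute_shell":
        args = json.loads(tool_call.function.arguments)
        # Danger: executing model output. Restrict to a sandboxed environment.
        result = subprocess.run(
            args["command"], shell=True, capture_output=True, text=True
        )
        print(result.stdout or result.stderr)
```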
Technical Deep Dive: The 1.5B Threshold
While 3B models are thriving, the benchmark revealed an 'intelligence floor' around the 1.5B parameter mark for reasoning models. Models like DeepSeek-R1 1.5B and Qwen3.5 0.8B struggled to complete basic tool sequences, scoring 27.5 and 26.0, respectively.
More concerning was the performance of Google's Lyria suite. Lyria 3 Pro scored a meager 8.3, and Lyria 3 Clip scored a literal zero. These models were unable to produce any working output for the agentic tasks, highlighting a significant gap in their instruction-tuning for real-world environment interaction.
Pro Tips for Enterprise AI Strategy
- Don't Default to the Largest Model: For internal dev-ops agents or automated PR reviewers, a model like SmolLM3 or Phi-4-mini is not only faster but statistically more accurate in this specific benchmark.
- Monitor Efficiency: The benchmark weights efficiency at 30%. In production, a model that takes 50 steps to do what another does in 5 is a massive cost and latency liability.
- Use an Aggregator: A router like n1n.ai lets you swap models dynamically. If a task calls for heavy creative writing, route to Claude; if it needs a 12-round coding sequence, switch to SmolLM3 (a minimal routing sketch follows this list).
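As a sketch of that routing idea, assuming the `client` configured earlier (the model identifiers and task labels here are illustrative placeholders, not a published routing table):

```python
# Hypothetical task router: map task types to the cheapest model that fits.
MODEL_ROUTES = {
    "creative_writing": "claude-sonnet-4",
    "agentic_coding": "smollm3-3b",
}

def route_model(task_type: str) -> str:
    """Return the routed model, falling back to a generalist for unknown tasks."""
    return MODEL_ROUTES.get(task_type, "claude-sonnet-4")

completion = client.chat.completions.create(
    model=route_model("agentic_coding"),
    messages=[{"role": "user", "content": "Refactor utils.py into a package."}],
)
```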
Conclusion
The Works With Agents benchmark proves that we are entering an era of 'Model Specialization.' The era of the general-purpose monolith is being challenged by highly optimized, small-scale models that excel in agentic workflows. SmolLM3 3B and Phi-4-mini are leading the charge, proving that for coding agents, size isn't the bottleneck—execution logic is.
Get a free API key at n1n.ai