How to Reduce LLM API Costs by 80% with a Simple Router

Scaling a generative AI application often leads to a painful realization: high-performance models like Claude 3.5 Sonnet or OpenAI o3 are expensive. When I first launched my RAG-based (Retrieval-Augmented Generation) customer support tool, my monthly bill reached $340 within weeks. After implementing a simple 50-line Python router, that bill dropped to $67 the following month, with no perceptible loss in quality.

By using n1n.ai, I was able to access all these models through a single unified interface, making the implementation of this routing logic significantly easier. In this guide, I will walk you through the architecture of a complexity router, how to build it, and how to add a semantic caching layer for even greater savings.

The Problem: The Overkill Trap

Most developers start by pointing every request to the most powerful model available. It is the safest path during development. However, in a production environment, not every user query requires a 'PhD-level' reasoning engine.

Consider these two user queries in a typical RAG system:

Query A: "What are your business hours on Sundays?"
Query B: "Can you compare the technical specifications of the X1 and Y2 models and tell me which one is better for high-altitude photography?"

Using Claude 3.5 Sonnet for Query A is like hiring a rocket scientist to help you find your car keys. It works, but it is a massive waste of resources. Query A can be handled perfectly by a 'small' model like GPT-4o mini or DeepSeek-V3 for a fraction of the cost.

The Solution: Complexity-Based Routing

The goal is to create a middleware layer—a router—that inspects the incoming query and decides which model to invoke. We categorize queries into 'Simple' and 'Complex'.

Step 1: Defining the Router Logic

To build this, you need a way to classify each query. Sending every query to another LLM for classification adds latency and cost, so the router below first applies a cheap heuristic (query length plus a keyword check) and only falls back to a lightweight classification prompt on a fast model, accessed through n1n.ai, when the heuristic is inconclusive. A small self-hosted model such as Llama-3-8B could fill the same fallback role.

import openai
from typing import Literal

# Using n1n.ai unified endpoint configuration
client = openai.OpenAI(
    base_url="https://api.n1n.ai/v1",
    api_key="YOUR_N1N_API_KEY"
)

def classify_complexity(query: str) -> Literal["simple", "complex"]:
    """
    Classifies the query intent to determine model routing.
    """
    # Heuristic check for length and keywords
    if len(query) < 50 and not any(kw in query.lower() for kw in ["compare", "analyze", "explain"]):
        return "simple"

    # Fallback to a cheap LLM classification if needed
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify if this query needs deep reasoning (complex) or simple fact retrieval (simple). Reply with one word."},
            {"role": "user", "content": query}
        ],
        max_tokens=5
    )
    # message.content can be None, so fall back to an empty string
    answer = (response.choices[0].message.content or "").lower()
    return "complex" if "complex" in answer else "simple"

Step 2: Implementing the Route Function

Now we implement the actual routing logic. We will route 'simple' queries to GPT-4o mini and 'complex' queries to Claude 3.5 Sonnet. By using the n1n.ai aggregator, we don't need to manage different SDKs for Anthropic and OpenAI.

def get_llm_response(query: str, context: str = "") -> str:
    complexity = classify_complexity(query)

    # Select model based on complexity
    target_model = "gpt-4o-mini" if complexity == "simple" else "claude-3-5-sonnet"

    print(f"Routing query to: {target_model}")

    response = client.chat.completions.create(
        model=target_model,
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

Step 3: Adding a Semantic Cache Layer

Even with routing, you might still be paying for redundant queries. If 100 users ask the same question about your refund policy, why call the API 100 times? A cache layer can bring your costs down even further.

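While a simple dictionary works for exact matches, a semantic cache uses vector embeddings to catch queries that mean the same thing but use different words. The demonstration below implements only the exact-match variant, keyed on a hash of the normalized query. Note that the key ignores the context argument, so it is only safe when the retrieved context for a given query is stable.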

import hashlib

# Simple in-memory cache for demonstration
# In production, use Redis or a Vector DB
_cache = {}

def get_cache_key(query: str) -> str:
    return hashlib.md5(query.strip().lower().encode()).hexdigest()

def route_with_cache(query: str, context: str = "") -> str:
    key = get_cache_key(query)

    if key in _cache:
        print("Cache hit! Cost: $0.00")
        return _cache[key]

    result = get_llm_response(query, context)
    _cache[key] = result
    return result
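
The dictionary cache above only hits when the normalized strings are identical. A semantic lookup could be sketched roughly as follows. Everything named here (SemanticCache, toy_embed, the 0.8 threshold) is a hypothetical illustration, not part of the original router. The toy_embed function is a bag-of-words stand-in so the example runs offline; in practice you would swap in a real embedding model and tune the threshold on your own traffic.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional

Vector = dict  # maps a token to its weight


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(re.findall(r"[a-z0-9']+", text.lower())))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear-scan cache; fine for a demo, too slow for large stores."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold  # assumed value, tune for your data
        self.entries = []           # list of (vector, answer) pairs

    def get(self, query: str) -> Optional[str]:
        """Return the answer of the closest stored query, if close enough."""
        query_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

To use it, call get before get_llm_response and put after a miss, in the same places where route_with_cache reads and writes _cache. A threshold that is too low will return answers to questions the user did not ask, so it is worth erring on the strict side.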

Model Pricing Comparison Table

To understand why this saves so much money, look at the price disparity between models available on n1n.ai (prices per 1M tokens, estimated):

Model Name	Input Price (per 1M)	Output Price (per 1M)	Best Use Case
GPT-4o mini	$0.15	$0.60	Basic Q&A, Summarization
DeepSeek-V3	$0.14	$0.28	Coding, Math, Logic
Claude 3.5 Sonnet	$3.00	$15.00	Creative Writing, Nuance
OpenAI o3-mini	$1.10	$4.40	Complex Reasoning

By routing 80% of traffic to GPT-4o mini instead of Claude 3.5 Sonnet, you cut input costs by 20x and output costs by 25x for those specific requests.
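
To see what that split means for the overall bill, here is a small worked calculation using the table prices. It assumes, purely for illustration, that simple and complex queries use the same number of tokens and that the remaining 20% of traffic still goes to Claude 3.5 Sonnet. It ignores the cost of the classification call and any cache hits.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}


def blended_price(simple_share: float, kind: str) -> float:
    """Average price per 1M tokens when simple_share of traffic is routed cheap."""
    cheap = PRICES["gpt-4o-mini"][kind]
    premium = PRICES["claude-3-5-sonnet"][kind]
    return simple_share * cheap + (1 - simple_share) * premium


def savings_fraction(simple_share: float, kind: str) -> float:
    """Fraction saved versus sending everything to the premium model."""
    premium = PRICES["claude-3-5-sonnet"][kind]
    return 1 - blended_price(simple_share, kind) / premium


for kind in ("input", "output"):
    print(kind, round(blended_price(0.8, kind), 2), f"{savings_fraction(0.8, kind):.1%}")
```

Under these assumptions the blended input price is about $0.72 per 1M tokens and the blended output price about $3.48, a saving of roughly 76% to 77% from routing alone.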

Real-Time Cost Logging

To maintain visibility, you should log the cost of every transaction. OpenAI-compatible chat completion responses include a usage object with prompt_tokens and completion_tokens counts. You can map these to a price table to monitor your savings in real time.

COST_TABLE = {
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
    "claude-3-5-sonnet": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000}
}

def calculate_cost(model, usage):
    in_cost = usage.prompt_tokens * COST_TABLE[model]["input"]
    out_cost = usage.completion_tokens * COST_TABLE[model]["output"]
    return in_cost + out_cost
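
One way to turn calculate_cost into a running log is a small per-model ledger, sketched below. FakeUsage and CostLedger are hypothetical names introduced for this example. FakeUsage only mimics the two token-count fields so the snippet runs without an API call; with a real response you would pass response.usage instead.

```python
from dataclasses import dataclass, field

# Same per-token prices as the table above (USD).
COST_TABLE = {
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
    "claude-3-5-sonnet": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
}


@dataclass
class FakeUsage:
    """Hypothetical stand-in for the usage object on an API response."""
    prompt_tokens: int
    completion_tokens: int


def calculate_cost(model, usage) -> float:
    in_cost = usage.prompt_tokens * COST_TABLE[model]["input"]
    out_cost = usage.completion_tokens * COST_TABLE[model]["output"]
    return in_cost + out_cost


@dataclass
class CostLedger:
    """Accumulates spend per model so savings can be inspected later."""
    totals: dict = field(default_factory=dict)

    def record(self, model: str, usage) -> float:
        cost = calculate_cost(model, usage)
        self.totals[model] = self.totals.get(model, 0.0) + cost
        return cost

    def total(self) -> float:
        return sum(self.totals.values())


ledger = CostLedger()
ledger.record("gpt-4o-mini", FakeUsage(prompt_tokens=1000, completion_tokens=500))
ledger.record("claude-3-5-sonnet", FakeUsage(prompt_tokens=1000, completion_tokens=500))
print({model: round(cost, 6) for model, cost in ledger.totals.items()})
```

For the same 1,000 prompt tokens and 500 completion tokens, the small model costs about $0.00045 and the large one about $0.0105, which makes the gap easy to spot in logs.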

When Routing Fails

Model routing is not a silver bullet. You should avoid routing in the following scenarios:

Creative Consistency: If the user is in a long conversation, switching models mid-stream can change the 'personality' or 'tone' of the assistant.
Strict Formatting: If you require complex JSON output that only larger models can reliably produce, don't route to smaller models.
Latency-Sensitive Tasks: The classification step adds a small amount of overhead (usually < 200ms). If your UI needs to be instantaneous, consider using a static regex-based router instead of an LLM classifier.
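
For the latency-sensitive case, a static router with no LLM call might look like the sketch below. The pattern list and the classify_static name are illustrative assumptions; the 50-character limit mirrors the heuristic used earlier. Unlike classify_complexity, anything that is not clearly simple is treated as complex, which trades some cost for predictability.

```python
import re
from typing import Literal

# Word stems that suggest the query needs reasoning (illustrative, not exhaustive).
COMPLEX_PATTERN = re.compile(r"\b(?:compar|analy[sz]|explain)\w*", re.IGNORECASE)
MAX_SIMPLE_LENGTH = 50  # same cutoff as the length heuristic above


def classify_static(query: str) -> Literal["simple", "complex"]:
    """Classify with string checks only, so no network round trip is needed."""
    text = query.strip()
    if len(text) >= MAX_SIMPLE_LENGTH or COMPLEX_PATTERN.search(text):
        return "complex"
    return "simple"


print(classify_static("What are your business hours on Sundays?"))
print(classify_static("Can you compare the X1 and Y2 models?"))
```

The first query is short and has no trigger words, so it routes to the small model; the second contains "compare" and routes to the large one.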

Conclusion

Cost optimization in the LLM space isn't just about finding the cheapest provider; it's about using the right tool for the right job. By implementing a complexity router and leveraging the unified API from n1n.ai, you can ensure that your application remains both powerful and profitable.

Start small: identify your top 5 most common queries and see if a smaller model can handle them. You might be surprised at how much 'intelligence' you've been overpaying for.

Get a free API key at n1n.ai.

Source: https://dev.to/chnby/how-i-cut-my-llm-api-bill-by-80-with-a-simple-router-3246