Scaling Edge LLM Deployment with Distillation and Embeddings

Author: Nino, Senior Tech Editor

Deploying Large Language Models (LLMs) in production is rarely about just hitting an API endpoint. When building the Wu Wei Planner—an AI companion for a physical deck of facilitation cards—the challenge wasn't just intelligence; it was the brutal reality of economics and latency. Operating with a self-imposed budget of $2 per user for a lifetime complimentary service meant that every token became a liability.

To achieve this, we leveraged the power of n1n.ai, the premier LLM API aggregator, to access high-speed models like GPT-4o-mini and Claude 3.5 Sonnet. This article chronicles the journey from a bloated 17,000-token system prompt to a streamlined, high-performance architecture using vector embeddings and Cloudflare Workers.

The Infrastructure: LLMs at the Edge

The project runs entirely on Cloudflare Workers. For developers looking for zero-cold-start performance, the edge is the only logical choice. By deploying logic in 200+ global locations, we ensure that the orchestration layer—the code that handles prompt assembly and API calls—is physically close to the user.

Cloudflare Workers provide a unique advantage: native streaming support. When using n1n.ai to fetch responses from models like DeepSeek-V3 or OpenAI o3, the Worker can pipe the stream directly to the client, reducing perceived latency even when the underlying model is processing a complex query.
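The pattern can be sketched as a plain handler object. The endpoint path, request body shape, and the `N1N_API_KEY` secret binding below are assumptions (an OpenAI-compatible format), not confirmed API details; in a deployed Worker this object would be the module's default export.

```javascript
// Sketch of a Worker that proxies a streaming completion from
// n1n.ai to the client. Endpoint, body shape, and env binding
// are illustrative assumptions.
const worker = {
  async fetch(request, env) {
    const upstream = await fetch("https://api.n1n.ai/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.N1N_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        stream: true,
        messages: await request.json(), // client sends the message array
      }),
    });
    // Pipe the upstream SSE body straight through: the first tokens
    // reach the client while the model is still generating.
    return new Response(upstream.body, {
      headers: { "Content-Type": "text/event-stream" },
    });
  },
};
// In a real Worker: export default worker;
```

Because the Worker never buffers the body, its memory footprint stays flat regardless of how long the model's answer is.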

Phase 1: The Brute Force and the Token Trap

Initially, we attempted to give the AI everything. The knowledge base—50 card descriptions, 15 professional contexts, and trauma-sensitive guidelines—totaled 17,000 tokens. While modern models like GPT-4o handle large contexts easily, the math for a chat interface is punishing.

In a stateful conversation, the history grows. If the base context is 17k tokens, message 1 costs 17k input tokens, message 10 about 20k, and message 20 about 25k. At standard pricing, a 20-message thread could cost roughly $0.08. On a $2 total budget, a user would run out of credit in just 25 conversations. We needed a way to scale without breaking the bank.
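The arithmetic can be sanity-checked with a quick back-of-envelope model. The per-turn growth rate and the price per million tokens below are illustrative assumptions (roughly in line with small-model input pricing), and the model counts the input side only; output tokens push the real figure toward the $0.08 cited.

```javascript
// Back-of-envelope input-token cost for a stateful chat that
// resends a 17k-token base context on every turn. Both constants
// below are assumptions for illustration, not measured values.
const BASE_CONTEXT = 17_000;  // tokens resent with every message
const GROWTH_PER_MSG = 400;   // tokens of history added per turn (assumed)
const PRICE_PER_MTOK = 0.15;  // USD per million input tokens (assumed)

function threadCost(messages) {
  let tokens = 0;
  for (let i = 0; i < messages; i++) {
    tokens += BASE_CONTEXT + i * GROWTH_PER_MSG; // history grows each turn
  }
  return (tokens / 1_000_000) * PRICE_PER_MTOK;
}
```

Under these assumptions a 20-message thread burns over 400k input tokens, which is why the base context dominates everything else.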

Phase 2: Prompt Distillation (AI-to-AI Communication)

Humans need prose; LLMs need semantics. We realized that the 300-word descriptions for cards like "Fire" were filled with filler words. We moved toward "Prompt Distillation," a technique where we strip away the fluff.

Original Context (~300 tokens): "Fire represents passion and destruction. It can be a destructive force or a warming community element. Facilitators should hold space for both..."

Distilled Context (~100 tokens): CARD:Fire | kw:passion,intensity,chaos,transformation | !caution:trauma | core:Duality (gen/dest) | prompts:What does this fire bring up?
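This format is mechanical enough to generate from structured data; a serializer might look like the sketch below. The input record shape is an assumption about how the card dataset is stored.

```javascript
// Serialize a structured card record into the distilled pipe format.
// The input record shape is an assumption, not the project's schema.
function distillCard(card) {
  const parts = [`CARD:${card.name}`, `kw:${card.keywords.join(",")}`];
  if (card.caution) parts.push(`!caution:${card.caution}`); // e.g. trauma flag
  parts.push(`core:${card.core}`);
  parts.push(`prompts:${card.prompts.join(";")}`);
  return parts.join(" | ");
}

// Reproduces the distilled "Fire" line shown above:
const fireLine = distillCard({
  name: "Fire",
  keywords: ["passion", "intensity", "chaos", "transformation"],
  caution: "trauma",
  core: "Duality (gen/dest)",
  prompts: ["What does this fire bring up?"],
});
```

Keeping the source of truth structured means the distilled prompt can be regenerated whenever a card's description changes.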

By distilling the entire knowledge base, we dropped the system prompt from 17k to 12k tokens. While a 30% saving is good, it wasn't enough to meet our $2 target.

Phase 3: The Router Failure

We then tried an LLM Router. The idea was to use a fast model (like Gemini 1.5 Flash via n1n.ai) to analyze the user's query and decide which specific chunks of the knowledge base to inject.

The router was instructed to output JSON. However, we hit two major walls:

  1. JSON Malformation: Even high-tier models occasionally fail to close a brace or escape a quote, leading to 20% failure rates in our orchestration.
  2. Sequential Latency: A router call takes 4-5 seconds, and the main LLM takes another 1-2 seconds. A 6-second delay before the first token is unacceptable for a modern UX.
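A defensive parser is the usual mitigation for the first wall: extract the outermost brace-delimited span and fall back to a default context selection when parsing still fails. The function and field names below are illustrative, and note that this does nothing for the latency problem.

```javascript
// Defensive wrapper around a router model's "JSON" output. Models
// often wrap JSON in prose or code fences, or emit unbalanced
// braces; fall back to a safe default instead of failing the request.
function parseRouterOutput(raw, fallback) {
  try {
    const match = raw.match(/\{[\s\S]*\}/); // outermost brace-delimited span
    if (!match) return fallback;
    return JSON.parse(match[0]);
  } catch {
    return fallback; // malformed JSON → default context, not a 500
  }
}
```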

Phase 4: The Vector Embedding Solution

This led us to the final, most stable architecture: Retrieval-Augmented Generation (RAG) using local vector similarity. Instead of asking an LLM to "choose" the context, we use text-embedding-3-small to convert the user's query into a 1536-dimensional vector.

We pre-computed the vectors for all 50 cards and 15 professions into a 300 KB static JSON file. When a user sends a message:

  1. We generate an embedding for the query (< 20ms).
  2. We run a cosine similarity check against our local 300 KB file (< 5ms).
  3. We inject only the top 3 cards and the top 1 profession into the prompt.

The Math of Cosine Similarity:

// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] ** 2;
    normB += b[i] ** 2;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
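Putting the pieces together, the retrieval step might look like the sketch below. The cosine function is repeated so the snippet is self-contained, and the n1n.ai embeddings endpoint and response shape are assumptions (an OpenAI-compatible format).

```javascript
// Retrieval sketch: embed the query, score it against the
// pre-computed vectors, keep the top k. `items` stands for the
// static 300 KB file parsed into [{ id, vector }, ...].
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] ** 2;
    normB += b[i] ** 2;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Endpoint path and response shape are assumptions.
async function embedQuery(query, apiKey) {
  const res = await fetch("https://api.n1n.ai/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: query }),
  });
  return (await res.json()).data[0].embedding; // 1536-dimensional vector
}

// Score every pre-computed vector and keep the k best matches.
function topK(queryVec, items, k) {
  return items
    .map((item) => ({ ...item, score: cosineSimilarity(queryVec, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

`topK(queryVec, cards, 3)` and `topK(queryVec, professions, 1)` give exactly the context selection described in the steps above.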

By moving to this architecture, the "token overhead" dropped from 12,000 to just 150 (the query embedding). This allows us to provide hundreds of conversations for that same $2 budget.

Quality Control with LLM-as-a-Judge

Once the infrastructure was fast and cheap, we focused on quality. Using promptfoo and the llm-rubric assertion, we graded the responses. Instead of checking for keywords, we used a "Judge LLM" to evaluate if the response honored the "Wu Wei" philosophy (non-directive, metaphorical, warm).

If the judge found the response too clinical, we adjusted the system prompt. This feedback loop, powered by the diverse model selection at n1n.ai, allowed us to fine-tune the personality of the Wu Wei Planner until it felt human.

Pro-Tips for Edge LLM Scaling

  1. Threshold Calibration: Don't trust the default 0.7 similarity threshold. For metaphorical content, 0.3 or 0.4 is often the sweet spot.
  2. One LLM is Better Than Two: Avoid sequential chains if possible. Use embeddings for retrieval and a single powerful model like GPT-4o-mini for reasoning.
  3. Pre-compute Everything: If your dataset is under 1MB, don't use a vector database. A static JSON file in a Cloudflare Worker is faster and cheaper.
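Tips 1 and 2 combine naturally: filter by a calibrated threshold before taking the top k, so an off-topic query injects nothing rather than the least-bad card. The 0.35 cutoff below is one point inside the 0.3-0.4 range suggested above; the input shape is illustrative.

```javascript
// Threshold-then-top-k selection. `scored` is a list of
// { id, score } pairs produced by the cosine similarity check.
const THRESHOLD = 0.35; // calibrated for metaphorical content (see Tip 1)

function selectContext(scored, k = 3) {
  return scored
    .filter((item) => item.score >= THRESHOLD) // drop weak matches entirely
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```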

Get a free API key at n1n.ai