Scale LLM Workloads to Zero on Kubernetes with Hearth

In the current gold rush of generative AI, the most expensive waste isn't failed experiments; it is idle silicon. For developers self-hosting Large Language Models (LLMs) on Kubernetes, a single NVIDIA A100 or H100 instance can cost thousands of dollars per month. When that GPU is pinned to a model that only receives traffic during an eight-hour business day, you are effectively burning money for the other 16 hours. While managed services like n1n.ai offer high-speed access to models like Claude 3.5 Sonnet and OpenAI o3 without infrastructure overhead, many enterprises still require private deployments for data sovereignty or custom fine-tuning.

Enter Hearth: a vendor-neutral Kubernetes operator designed to turn 'running a model' into a managed, cost-efficient service. Hearth introduces a critical capability that has been notoriously difficult to implement in the AI stack: Scale-to-Zero.

The Problem with Existing LLM Serving Stacks

If you have tried to implement autoscaling for LLMs using standard tools, you likely encountered several friction points:

The Heavyweight Tax: Frameworks like KServe or Seldon often pull in a heavy stack of dependencies, including Knative and Istio. For a team that just wants to run a Qwen-7B or DeepSeek-V3 instance, this is often overkill.
Hardware Lock-in: Most serving stacks assume an NVIDIA-first environment. If you operate in regions where Ascend NPUs or other domestic chips are the primary available compute, integration becomes a manual nightmare.
The Cold Start Dilemma: Scaling an LLM from 0 to 1 instance isn't like scaling a Go microservice. Loading a 14 GB model into GPU memory can take 30 to 90 seconds. The standard Kubernetes HPA (Horizontal Pod Autoscaler) does not scale to zero by default, and nothing in a stock setup holds incoming requests while the model loads, so clients time out or their initial requests are dropped during this window.
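
To see why the cold start lands in the tens of seconds, here is a back-of-the-envelope sketch. The throughput figures and the 30-second client timeout are illustrative assumptions, not measurements from Hearth or any specific cluster; only the 14 GB model size comes from the text above.

```python
def estimate_load_seconds(model_gb: float, throughput_gb_per_s: float) -> float:
    """Rough time to move model weights from storage into accelerator memory."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return model_gb / throughput_gb_per_s


def exceeds_timeout(load_seconds: float, client_timeout_s: float) -> bool:
    """True if a client with this timeout would give up before the model is ready."""
    return load_seconds > client_timeout_s


if __name__ == "__main__":
    # Hypothetical effective throughputs (storage + network + deserialization).
    for throughput in (0.5, 0.25, 0.15):
        seconds = estimate_load_seconds(14, throughput)
        # 30 s is an assumed client timeout, chosen only for illustration.
        gave_up = exceeds_timeout(seconds, client_timeout_s=30.0)
        print(f"{throughput:.2f} GB/s -> {seconds:.0f} s, client timed out: {gave_up}")
```

Even before pod scheduling and image pulls are counted, a client with a short timeout will abandon the request unless something keeps the connection alive.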

Hearth Architecture: How Scale-to-Zero Works

Hearth solves the 'idle GPU' problem by decoupling the request gateway from the inference engine. When an LLMService is scaled to zero, Hearth keeps a lightweight gateway active.

When a request arrives:

The Hearth Gateway intercepts the call and holds the connection open.
It sends a signal to the Hearth Controller to spin up a pod.
While the model loads (Cold Start), the gateway sends periodic SSE (Server-Sent Events) heartbeats to the client to prevent timeouts.
Once the backend (e.g., vLLM) is ready, the gateway proxies the request, streams the tokens, and then monitors for inactivity.
After a period of silence, the controller scales the deployment back to zero, freeing up the GPU for other workloads.
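
The flow above can be sketched in a few lines of asyncio. This is a minimal illustration of the hold-and-heartbeat idea, not Hearth's actual gateway code: `wake_backend`, `stream_tokens`, and `should_scale_to_zero` are hypothetical names, and the backend is simulated with sleeps. The heartbeat uses an SSE comment line (a line starting with a colon), which SSE clients ignore.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

HEARTBEAT = ": heartbeat\n\n"  # SSE comment line; keeps the connection alive


async def hold_and_proxy(
    wake_backend: Callable[[], Awaitable[None]],
    stream_tokens: Callable[[], AsyncIterator[str]],
    heartbeat_interval: float = 5.0,
) -> AsyncIterator[str]:
    """Yield heartbeats until the backend is ready, then yield tokens as SSE data."""
    # Hold the connection and ask for a pod to be started.
    wake = asyncio.create_task(wake_backend())
    while True:
        done, _ = await asyncio.wait({wake}, timeout=heartbeat_interval)
        if done:
            break
        yield HEARTBEAT  # cold start still in progress
    wake.result()  # re-raise if the backend failed to start
    async for token in stream_tokens():
        yield f"data: {token}\n\n"


def should_scale_to_zero(last_request_at: float, now: float, idle_timeout: float) -> bool:
    """Idle check a controller could run after the stream ends (assumed logic)."""
    return now - last_request_at >= idle_timeout


async def demo() -> list[str]:
    async def fake_wake() -> None:
        await asyncio.sleep(0.2)  # stands in for a 30-90 s model load

    async def fake_tokens() -> AsyncIterator[str]:
        for token in ("Hello", "world"):
            yield token

    return [chunk async for chunk in hold_and_proxy(fake_wake, fake_tokens, 0.05)]


if __name__ == "__main__":
    for event in asyncio.run(demo()):
        print(repr(event))
```

The client sees a stream that starts immediately with heartbeats and then switches to tokens, so the cold start looks like a slow first token rather than a failed request.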

Implementation Guide: Deploying Qwen on Hearth

To get started, you define an LLMService. This manifest is significantly cleaner than a standard Kubernetes Deployment because it abstracts the complexity of model sourcing and hardware selection.

apiVersion: serving.hearth.dev/v1alpha1
kind: LLMService
metadata:
  name: deepseek-v3-small
  namespace: ai-prod
spec:
  model:
    source:
      uri: modelscope://deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  runtime:
    selector: { vendor: [nvidia, ascend] } # Multi-backend support
  resources:
    accelerators: 1
  scaling:
    min: 0 # The magic number
    max: 5
    metric: queueDepth
    target: 5

After applying this, you can check the status:

kubectl get llmservice -n ai-prod

If there is no traffic, the REPLICAS count will be 0. As soon as you hit the endpoint, Hearth triggers the orchestration.
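
To make the scaling block in the manifest concrete, here is one plausible reading of it, written as a small function. It assumes `target` means "queued requests per replica" and that the result is clamped to `min` and `max`; the source does not spell out Hearth's actual algorithm, so treat this as a sketch in the spirit of the HPA's ratio-based formula.

```python
import math


def desired_replicas(queue_depth: int, target: int, min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one handles at most `target` queued requests."""
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(queue_depth / target)
    # Clamp to the configured bounds; min_replicas=0 allows scale-to-zero.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Values from the manifest: min 0, max 5, target 5.
    for depth in (0, 1, 6, 12, 40):
        print(depth, "->", desired_replicas(depth, target=5, min_replicas=0, max_replicas=5))
```

With an empty queue the function returns 0, and a single queued request is enough to ask for the first replica.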

Comparing Infrastructure Strategies

When deciding how to serve your models, consider the following trade-offs between self-hosting with Hearth and using a hosted API aggregator like n1n.ai.

Feature	Hearth (Self-Hosted)	n1n.ai (API Aggregator)
Cost Model	Pay for GPU Time (Idle = $0 with Hearth)	Pay per Token
Cold Start	30s - 120s	< 200ms
Hardware	Your own (NVIDIA/Ascend)	Managed High-End Clusters
Privacy	Full Data Sovereignty	Encrypted Transit
Maintenance	Kubernetes Management	Zero Ops
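
The cost row is easier to reason about with numbers. The sketch below compares an always-on GPU, a scale-to-zero GPU, and per-token pricing. Every price here is a hypothetical placeholder, not a quote from any provider, and the scale-to-zero figure assumes idle GPU hours are not billed, which only holds if the node is released or reused by other workloads.

```python
def monthly_gpu_cost(hourly_rate: float, active_hours_per_day: float, days: int = 30) -> float:
    """GPU spend when you pay only for the hours a replica is running."""
    return hourly_rate * active_hours_per_day * days


def monthly_token_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Spend under per-token pricing."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_tokens(gpu_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which per-token pricing costs as much as the GPU."""
    return gpu_cost / price_per_million * 1_000_000


if __name__ == "__main__":
    rate = 3.0             # hypothetical $/hour for one GPU
    price_per_million = 2.0  # hypothetical $/1M tokens

    always_on = monthly_gpu_cost(rate, 24)
    business_hours = monthly_gpu_cost(rate, 8)
    print(f"always on:        ${always_on:,.0f}")
    print(f"scale-to-zero:    ${business_hours:,.0f}")
    print(f"saved per month:  ${always_on - business_hours:,.0f}")
    print(f"break-even tokens: {break_even_tokens(business_hours, price_per_million):,.0f}")
```

Below the break-even volume, per-token pricing is cheaper; above it, the self-hosted GPU wins, provided the cold start penalty is acceptable.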

Technical Deep Dive: The Multi-Backend Strategy

One of Hearth's strongest features is its InferenceRuntime abstraction. Instead of hard-coding support for every new chip, Hearth uses a declarative approach. Adding support for a new hardware vendor (like Moore Threads or Biren) involves creating a runtime manifest that defines how to pull the image and map the resources.

For instance, the Ascend backend uses the vllm-ascend image. Because Hearth abstracts the model source (supporting HuggingFace and ModelScope), you can move a workload from an NVIDIA-based cloud to an on-premise Ascend cluster by changing just one line in your configuration.
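
The declarative idea can be illustrated with a small registry. This is a conceptual sketch, not Hearth's real InferenceRuntime schema: the class fields, the `pick_runtime` helper, the NVIDIA image name, and both device-plugin resource names are assumptions for illustration, and treating the vendor list as a preference order is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class InferenceRuntime:
    vendor: str
    image: str          # container image that serves the model
    resource_name: str  # Kubernetes extended resource to request


# Adding a vendor means adding an entry, not changing the selection code.
RUNTIMES = {
    "nvidia": InferenceRuntime("nvidia", "vllm/vllm-openai", "nvidia.com/gpu"),
    "ascend": InferenceRuntime("ascend", "vllm-ascend", "huawei.com/Ascend910"),
}


def pick_runtime(preferred_vendors: Iterable[str], available_vendors: Set[str]) -> InferenceRuntime:
    """Return the first preferred vendor that the cluster actually offers."""
    for vendor in preferred_vendors:
        if vendor in available_vendors and vendor in RUNTIMES:
            return RUNTIMES[vendor]
    raise LookupError("no matching runtime for the available hardware")


def container_resources(runtime: InferenceRuntime, accelerators: int) -> dict:
    """Build the resources stanza for the serving container."""
    return {"limits": {runtime.resource_name: accelerators}}


if __name__ == "__main__":
    # Same selector as the manifest, evaluated on an Ascend-only cluster.
    runtime = pick_runtime(["nvidia", "ascend"], available_vendors={"ascend"})
    print(runtime.image, container_resources(runtime, accelerators=1))
```

The workload definition stays the same on both clusters; only the set of available vendors changes which runtime is selected.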

Pro Tips for Production Scaling

Pre-warm Logic: If you know your traffic spikes at 9:00 AM, use a CronJob to set the service's scaling.min to 1 at 8:55 AM and revert it to 0 in the evening.
Queue Depth over CPU or Memory: For LLMs, CPU and memory utilization are poor scaling signals because the bottleneck is the accelerator and the request queue in front of it. Scale on queueDepth instead. Hearth's gateway provides this metric natively, so you can scale up before latency becomes unbearable.
Hybrid Failover: For mission-critical applications, use Hearth for your primary workloads. If your local cluster hits its GPU limit or a node fails, configure your application to fail over to n1n.ai. This keeps the application available even when your local infrastructure is under pressure.
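
The pre-warm tip boils down to patching one field on a schedule. The sketch below builds the JSON merge patch and the `kubectl patch` arguments a CronJob container could run. It assumes the field lives at `spec.scaling.min`, as in the manifest shown earlier, and that the LLMService resource accepts merge patches; the 18:00 cut-off is an arbitrary stand-in for "the evening".

```python
import json
from datetime import time
from typing import List


def min_replicas_for(now: time, warm_from: time = time(8, 55), warm_until: time = time(18, 0)) -> int:
    """Keep one replica warm during the window, allow zero outside it."""
    return 1 if warm_from <= now < warm_until else 0


def scaling_patch(min_replicas: int) -> str:
    """JSON merge patch that touches only the minimum replica count."""
    return json.dumps({"spec": {"scaling": {"min": min_replicas}}})


def kubectl_patch_args(name: str, namespace: str, min_replicas: int) -> List[str]:
    """Argument list suitable for subprocess.run or a CronJob command."""
    return [
        "kubectl", "patch", "llmservice", name,
        "-n", namespace,
        "--type", "merge",
        "-p", scaling_patch(min_replicas),
    ]


if __name__ == "__main__":
    replicas = min_replicas_for(time(9, 0))
    print(" ".join(kubectl_patch_args("deepseek-v3-small", "ai-prod", replicas)))
```

Running the same command with a value of 0 in the evening hands control back to the idle-based scale-down.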

Conclusion

Hearth is currently in its alpha phase (v0.1.0), but it already addresses the most painful part of LLM operations: the cost of silence. By implementing scale-to-zero, teams can experiment with a wider variety of models without fear of a massive cloud bill at the end of the month.

Whether you are building RAG pipelines with LangChain or deploying massive DeepSeek clusters, keeping inference costs under control is paramount. For those who want the power of these models without the Kubernetes headache, n1n.ai remains the fastest path to production.


Source: https://dev.to/kubegopher/idle-gpus-also-burn-money-a-kubernetes-operator-that-can-scale-large-models-down-to-zero-ofa