vLLM vs TensorRT-LLM vs Ollama vs llama.cpp: Choosing the Best Inference Engine for RTX 5090

Author: Nino, Senior Tech Editor
The release of the NVIDIA RTX 5090 has fundamentally shifted the landscape for local LLM inference. With 32GB of GDDR7 VRAM and a massive memory bandwidth of 1.79 TB/s, this consumer-grade Blackwell card rivals data-center hardware for specific workloads. However, hardware is only half the battle. Choosing the right inference engine—vLLM, TensorRT-LLM, Ollama, or llama.cpp—determines whether you actually harness that power or leave it idling. While local inference on an RTX 5090 is powerful, many developers prefer the managed stability and unified access of n1n.ai for production workloads that require 99.9% uptime.

The Core Contenders: An Overview

When evaluating inference backends, we must distinguish between 'serving' and 'running.' Some engines are designed to provide a high-throughput API for hundreds of concurrent users, while others focus on getting a model running on a single laptop as quickly as possible.

  1. vLLM: The industry standard for high-throughput serving. It pioneered PagedAttention and is the go-to for Python-based production environments.
  2. TensorRT-LLM: NVIDIA’s own highly optimized library. It offers the absolute maximum throughput by compiling models into specialized engines, but it comes with a steep learning curve.
  3. Ollama: A user-friendly wrapper around llama.cpp. It is designed for developers who want a 'one-click' experience for local testing.
  4. llama.cpp: The foundation of the local LLM movement. Written in C++, it is incredibly portable and supports almost every hardware platform imaginable.

Technical Feature Comparison

| Feature | vLLM | TensorRT-LLM | Ollama | llama.cpp |
| --- | --- | --- | --- | --- |
| Primary Focus | Production Serving | Maximum Throughput | Developer Simplicity | Portability & Efficiency |
| Quantization | AWQ, GPTQ, FP8 | FP8, FP4, INT8 | GGUF | GGUF (Q4_K_M, etc.) |
| Memory Management | PagedAttention | In-flight Batching | Static Allocation | Static/Unified Memory |
| API Compatibility | OpenAI-compatible | Custom / Triton | OpenAI-compatible | OpenAI-compatible |
| RTX 5090 Support | Yes (v0.15.1+) | Partial (SM120 gaps) | Yes | Yes |
| Mamba/SSM Support | Native | Limited | Via GGUF | Via GGUF |

vLLM: The Pragmatic King of Production

vLLM's core innovation is PagedAttention, which manages the Key-Value (KV) cache like virtual memory pages. Traditional engines allocate the KV cache contiguously for each request's maximum length, leading to significant internal fragmentation. vLLM eliminates most of this waste, allowing much larger batch sizes on the same VRAM. If you find the setup of vLLM or TensorRT-LLM too cumbersome, n1n.ai provides a unified interface to access these high-performance models without the local hardware headache.
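The memory math behind PagedAttention can be sketched with a quick back-of-the-envelope calculation. The model dimensions below are illustrative assumptions (not the real Nemotron layout), and the 16-token block size mirrors vLLM's default page size:

```python
# Rough KV-cache sizing sketch. A contiguous allocator must reserve the full
# maximum sequence length up front, while a paged allocator only claims
# fixed-size blocks as tokens are actually generated.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for keys and values; dtype_bytes=2 corresponds to BF16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed example config for a 9B-class model
layers, kv_heads, head_dim = 32, 8, 128
max_len, actual_len = 4096, 512

contiguous = kv_cache_bytes(layers, kv_heads, head_dim, max_len)

block = 16  # tokens per page (vLLM's default block size)
pages = -(-actual_len // block)  # ceiling division
paged = kv_cache_bytes(layers, kv_heads, head_dim, pages * block)

print(f"contiguous reservation: {contiguous / 1e6:.0f} MB")  # ~537 MB
print(f"paged allocation:       {paged / 1e6:.0f} MB")       # ~67 MB
```

For a request that stops at 512 tokens, the contiguous scheme holds roughly 8x more VRAM than the paged one; that reclaimed memory is what lets vLLM pack more sequences into each batch.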

On an RTX 5090 running the Nemotron Nano 9B v2 Japanese model in BF16, vLLM shows its strength in batched scenarios:

  • Single Request: ~83 tokens/second
  • 10 Concurrent Requests: ~630 tokens/second (Total Throughput)
  • Time to First Token (TTFT): 45–60 ms
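You can reproduce this kind of aggregate measurement yourself against the OpenAI-compatible endpoint. This is a hedged sketch: the model name and endpoint match the setup section later in this article, but adapt them to your deployment; the `benchmark` helper is not called here because it requires a running server:

```python
# Measuring aggregate throughput of a vLLM server under concurrency.
import time
from concurrent.futures import ThreadPoolExecutor

def aggregate_throughput(completion_tokens, elapsed_s):
    # Total generated tokens across all requests / wall-clock time.
    return sum(completion_tokens) / elapsed_s

def benchmark(client, model, prompt, concurrency=10, max_tokens=256):
    """Fire `concurrency` simultaneous requests and report tokens/second."""
    def one_request(_):
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return r.usage.completion_tokens

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = list(pool.map(one_request, range(concurrency)))
    return aggregate_throughput(tokens, time.perf_counter() - start)

# The arithmetic behind the numbers above: 10 requests generating
# 630 tokens each over 10 seconds of wall time.
print(aggregate_throughput([630] * 10, 10.0))  # 630.0 tokens/s total
```

Note that per-request speed drops under load (continuous batching shares the GPU), but the total tokens/second climbs sharply, which is exactly the trade-off that matters for serving.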

For developers building RAG (Retrieval-Augmented Generation) pipelines with LangChain or LlamaIndex, vLLM is the most logical choice because it integrates seamlessly with the Python ecosystem. Its support for Mamba-hybrid architectures (like Nemotron) is currently the best in class, utilizing specialized kernels that other engines haven't fully optimized for Blackwell's SM120 compute capability.

TensorRT-LLM: Raw Power at the Cost of Complexity

TensorRT-LLM is designed for the data center. On H100 or B200 clusters, it is the undisputed champion, often outperforming vLLM by 30–50% in raw throughput. It excels at FP8 and FP4 quantization, which are critical for maximizing the performance of models like DeepSeek-V3 or Claude-style architectures when deployed at scale.

However, on the RTX 5090 the experience is less polished. The SM120 architecture of consumer Blackwell GPUs often lacks the specific fused kernels found in the enterprise versions; during testing, many users encounter messages like "Fall back to unfused MHA for data_type = bf16". Furthermore, the installation process typically requires a specific Docker container from NVIDIA's NGC registry, making it difficult for solo developers to maintain.
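The complexity comes from TensorRT-LLM's compile-then-serve workflow. The sketch below shows the general shape; the exact conversion script and its flags vary per model family (each has its own `convert_checkpoint.py` in the TensorRT-LLM examples), and all paths here are placeholders:

```shell
# 1. Convert a Hugging Face checkpoint into TensorRT-LLM's format
python convert_checkpoint.py --model_dir ./model-hf \
    --output_dir ./ckpt \
    --dtype bfloat16

# 2. Compile the checkpoint into a serialized engine for the target GPU.
#    The engine is GPU-architecture-specific: one built on an H100 will
#    not run on an RTX 5090.
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine
```

Compare this with vLLM or Ollama, where the model is loaded directly from the checkpoint at startup with no offline compilation step.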

Ollama and llama.cpp: The Local Heroes

Ollama is the 'apt-get' of the AI world: it abstracts away the complexity of CUDA versions and Python environments. If you need to test a model like Llama 3.1 or Mistral in under 60 seconds, Ollama is unbeatable. However, it lacks continuous batching: send five requests simultaneously and Ollama works through them one by one, whereas vLLM would batch them together and serve them concurrently thanks to PagedAttention.
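The 'one-click' experience in practice looks like this (the model tag is an example; any model from the Ollama library works the same way):

```shell
# Pull and chat with a model in a single command - no CUDA or Python setup
ollama run llama3.1

# Ollama also exposes an OpenAI-compatible endpoint on port 11434,
# so existing OpenAI client code can point at it unchanged
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello"}]
    }'
```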

llama.cpp remains the most versatile of the four. Its GGUF format is the gold standard for quantized model distribution. On the RTX 5090, the 1.79 TB/s of bandwidth lets llama.cpp achieve excellent speeds for single-user chat. It also supports hybrid inference: you can offload some layers to the GPU and keep the rest in system RAM, which is essential when running a 70B or 405B model that exceeds the 32GB VRAM of the 5090.
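Hybrid inference is controlled with a single flag. In this sketch the GGUF filename and layer count are placeholders; a 70B Q4_K_M quant is roughly 40 GB, so only part of it fits in the 5090's 32 GB and the remainder stays in system RAM:

```shell
# llama-server from llama.cpp; --n-gpu-layers sets how many transformer
# layers are offloaded to VRAM, with the rest executed from system RAM.
./llama-server \
    -m Llama-3.1-70B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 60 \
    --ctx-size 8192 \
    --port 8080
```

Tune `--n-gpu-layers` until VRAM is nearly full: every layer moved onto the GPU raises tokens/second, since the CPU-resident layers are the bottleneck.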

Implementation Guide: vLLM on RTX 5090

To run a high-performance server on your 5090, you will likely need the latest CUDA 13 nightly builds. Here is a sample implementation for a reasoning-capable model:

# Starting the vLLM server
# vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
#    --trust-remote-code \
#    --max-num-seqs 64 \
#    --mamba_ssm_cache_dtype float32

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API
client = OpenAI(base_url="http://localhost:8000/v1", api_key="n1n-dummy-key")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese",
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "Explain the benefits of GDDR7 for LLM inference."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)

Why vLLM Wins for the 5090 Developer

For a solo developer or a small team using the RTX 5090, vLLM offers the best balance. It provides production-grade features like continuous batching and prefix caching without the extreme DevOps overhead of TensorRT-LLM. While llama.cpp is great for personal use, vLLM allows you to serve an entire portfolio of projects—from Shogi AI evaluation to legal document summarization—from a single GPU instance.

In the broader context of the AI industry, the gap between local and cloud is narrowing. For those who need production-grade performance without the hardware overhead, n1n.ai offers a seamless alternative, providing access to top-tier models through a single, high-speed API.

Summary Recommendation

  • Use vLLM if: You are building an API, using RAG, or need to serve multiple users on a single RTX 5090.
  • Use TensorRT-LLM if: You are in an enterprise environment with H100s and need every last drop of FP8 performance.
  • Use Ollama if: You want to experiment with new models locally without writing a single line of setup code.
  • Use llama.cpp if: You are running massive models that require system RAM offloading or are working on non-NVIDIA hardware.

As the Blackwell ecosystem matures, we expect TensorRT-LLM to improve its consumer support, but for now, vLLM remains the pragmatic choice for high-performance local inference.

Get a free API key at n1n.ai