CUDA


AI Tutorials | June 20, 2026
GPU-Resident Top-K for Agentic RAG: Optimizing Retrieval Latency with CUDA Kernels
Discover how building a custom GPU-resident Top-K CUDA kernel eliminates PCIe transfer bottlenecks in Agentic RAG pipelines, delivering microsecond-level retrieval for high-performance LLM applications.
AI Tutorials | June 9, 2026
PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory
A deep dive into how vLLM uses PagedAttention to eliminate memory fragmentation and increase LLM inference throughput by up to 24x.
AI Tutorials | May 22, 2026
TitanCore Core-1 LLM Training Infrastructure with C++ CUDA and ZeRO-3
Explore TitanCore Core-1, a high-performance C++/CUDA infrastructure for trillion-parameter LLM training that combines ZeRO-3 with custom fused kernels for a 2.6x speedup.
Industry News | May 11, 2026
Why CUDA Proves Nvidia Is a Software Company
While the world focuses on Nvidia's H100 and Blackwell GPUs, the real secret to their trillion-dollar dominance lies in CUDA. This deep dive explores how software, not just silicon, created an unassailable moat for AI development.
Model Reviews | February 14, 2026
Optimizing GPU Performance with Custom Kernels from Claude and Codex
Explore how modern LLMs like Claude 3.5 Sonnet and OpenAI Codex are revolutionizing GPU programming by generating high-performance Triton and CUDA kernels.
Model Reviews | January 29, 2026
Using Claude 3.5 Sonnet to Build CUDA Kernels and Train Open Models
An in-depth technical exploration of how Claude 3.5 Sonnet is revolutionizing low-level GPU programming and serving as a high-fidelity teacher for open-source model distillation.
AI Tutorials | January 17, 2026
Reducing LLM Memory Usage by 84% with Fused Kernels
Discover how fused Triton kernels can drastically reduce memory overhead in the final LLM layers, preventing OOM errors during training and fine-tuning.