GPU-OPTIMIZATION

AI Tutorials | February 15, 2026
Mastering Multi-GPU Communication: Point-to-Point and Collective Operations in PyTorch
A deep dive into the mechanics of distributed AI training using PyTorch, covering P2P and collective communication primitives essential for scaling large models like DeepSeek-V3 and Llama 3.
Model Reviews | February 14, 2026
Optimizing GPU Performance with Custom Kernels from Claude and Codex
Explore how modern LLMs like Claude 3.5 Sonnet and OpenAI Codex are revolutionizing GPU programming by generating high-performance Triton and CUDA kernels.
Model Reviews | January 29, 2026
Using Claude 3.5 Sonnet to Build CUDA Kernels and Train Open Models
An in-depth technical exploration of how Claude 3.5 Sonnet is revolutionizing low-level GPU programming and serving as a high-fidelity teacher for open-source model distillation.
AI Tutorials | January 27, 2026
vLLM and PagedAttention: Optimizing LLM Inference for Speed and Efficiency
A deep dive into how vLLM uses PagedAttention to solve GPU memory fragmentation and boost LLM serving throughput.
AI Tutorials | January 12, 2026
Accelerate LLM Inference by 2.4x with Speculative Decoding
A deep dive into speculative decoding, a technique that speeds up LLM inference by 2-4x without changing the model's weights or degrading output quality.
AI Tutorials | January 10, 2026
vLLM Quickstart: High-Performance LLM Serving and Optimization
A comprehensive guide to deploying and tuning vLLM, a widely used inference engine that relies on PagedAttention for high-throughput LLM serving.
AI Tutorials | January 5, 2026
Mosaic: Sharding Attention Across GPUs for 150,000-Token Sequences
Discover how Mosaic enables 150,000-token sequence processing by sharding attention across multiple GPUs, overcoming the quadratic memory bottleneck.
