DeepSeek-V4-Flash-DSpark Benchmark on GPUStack

The landscape of Large Language Model (LLM) inference is shifting from simple scaling to architectural optimization. DeepSeek-V4-Flash-DSpark, benchmarked here on Day 0 of its release, is an early example: by adding a specialized speculative decoding module to DeepSeek-V4-Flash, it roughly doubles single-stream throughput and cuts time to first token by more than half in the tests below. For developers using n1n.ai, keeping up with these hardware-software co-optimizations is important for running competitive AI services.

The Architecture of Speed: DeepSeek-V4-Flash-DSpark

DeepSeek-V4-Flash-DSpark is not just another fine-tune; it is a structural enhancement. It retains the core weights of the original DeepSeek-V4-Flash and adds a 'DSpark' speculative decoder. Speculative decoding addresses one of the primary bottlenecks in LLM inference: autoregressive generation is memory-bound. In standard inference, the model generates tokens one at a time, and every token requires a full forward pass that reads the model weights from memory. Speculative decoding uses a smaller, faster 'draft' model (or module) to propose several future tokens cheaply, which the larger 'target' model then verifies in a single pass. Each accepted token is one the target model did not need a separate pass to produce.
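The draft-then-verify loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not DSpark's actual implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, verification is greedy, and the single batched verification pass of a real engine is simulated with a plain loop.

```python
from typing import List, Tuple


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the expensive target model (greedy next token)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Hypothetical cheap draft: agrees with the target except at every 4th position."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def autoregressive_generate(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the tokens and the number of target passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1) The draft proposes k tokens, each conditioned on its own earlier guesses.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2) The target checks all k positions plus one extra. A real engine does
        #    this in one batched forward pass; the loop below only simulates it.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if i == k or proposal[i] != expected:
                break  # either the bonus token, or the first rejected guess
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], target_passes


prompt = [1, 2, 3]
baseline = autoregressive_generate(prompt, 20)
speculated, passes = speculative_generate(prompt, 20, k=4)
print("same output as baseline:", speculated == baseline)
print("target passes:", passes, "vs", 20, "for the baseline")
```

Because every emitted token is the target model's own greedy choice, the output matches plain autoregressive decoding; only the number of expensive target passes changes. The average number of tokens emitted per target pass is the acceptance length discussed in the analysis section.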

Platforms like n1n.ai provide API access so that these architectural gains carry over to real-world applications. In the following benchmark, we look at how GPUStack uses SGLang to get the most out of this specific model on self-managed hardware.

Hardware and Environment Setup

The benchmark was conducted on a high-density compute node featuring:

GPU: 8× NVIDIA H20-141G (141 GB of high-bandwidth memory per GPU; a part built for the Chinese market)
Software Stack: GPUStack v2, SGLang 0.5.14
Backend: Custom SGLang image with DSpark support

The H20 GPUs, while capped in compute throughput compared to the H100, offer large VRAM capacity and high memory bandwidth. That makes them good candidates for Mixture-of-Experts (MoE) models like DeepSeek-V4, where decoding is often limited by memory throughput rather than by compute.

Step-by-Step Deployment on GPUStack

GPUStack provides an abstraction layer for managing inference backends. To deploy the DSpark variant, we need a specific container image that includes the patched SGLang environment.

1. Configure the Inference Backend

Navigate to the Inference Backends section in the GPUStack UI. Locate the SGLang card and select Edit. Under the version configuration, add a new entry:

Name: dspark
Image: swr.cn-north-4.myhuaweicloud.com/desaysv/gpustack/sglang-dspark:v1.0
Framework: CUDA
Entrypoint: sglang serve
Command: --model-path {{model_path}} --host {{worker_ip}} --port {{port}}
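To make the role of the `{{model_path}}`, `{{worker_ip}}`, and `{{port}}` placeholders concrete, here is a minimal sketch of how such a template could expand into the final launch command. How GPUStack performs the substitution internally is not described here; `render_command` and the example values (path, IP address, port) are hypothetical and serve only as illustration.

```python
import re
import shlex
from typing import Dict, List

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_command(entrypoint: str, template: str, values: Dict[str, object]) -> List[str]:
    """Fill {{name}} placeholders and return the command as an argument list."""

    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return str(values[key])

    rendered = PLACEHOLDER.sub(substitute, template)
    return shlex.split(entrypoint) + shlex.split(rendered)


# Entrypoint and command template as entered in the backend version form.
ENTRYPOINT = "sglang serve"
COMMAND = "--model-path {{model_path}} --host {{worker_ip}} --port {{port}}"

# Hypothetical values; the real ones are supplied per deployment.
argv = render_command(
    ENTRYPOINT,
    COMMAND,
    {"model_path": "/models/DeepSeek-V4-Flash-DSpark", "worker_ip": "10.0.0.5", "port": 40000},
)
print(shlex.join(argv))
```

Raising on an unknown placeholder, rather than leaving it in the command, surfaces a typo in the template before the container starts.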

2. Model Deployment

From the Deployments page, select Deploy Model:

Source: ModelScope
Model ID: deepseek-ai/DeepSeek-V4-Flash-DSpark
Backend: SGLang
Backend Version: dspark-custom (the dspark entry added in step 1, as it appears in the version list)

3. Advanced Parameter Tuning

For an 8-GPU H20 setup, the following parameters are critical for maximizing the MoE architecture's efficiency:

--context-length 1000000
--trust-remote-code
--tp-size 8
--ep-size 8
--moe-runner-backend flashinfer_mxfp4
--speculative-moe-runner-backend flashinfer_mxfp4
--speculative-algorithm DSPARK
--speculative-eagle-topk 1
--speculative-num-steps 1
--mem-fraction-static 0.85
--cuda-graph-max-bs 32
--max-running-requests 32
--disable-overlap-schedule
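Before pasting a flag list like this into a deployment form, it can help to parse it and run a few sanity checks. The sketch below is only an illustration: `parse_flags` and `check_flags` are hypothetical helpers, and the three rules they apply (tensor parallel size equal to the GPU count, expert parallel size dividing it, running requests not exceeding the CUDA graph batch size) are assumptions of this example, not validation performed by SGLang or GPUStack.

```python
import shlex
from typing import Dict, List, Union

FLAGS = """
--context-length 1000000
--trust-remote-code
--tp-size 8
--ep-size 8
--moe-runner-backend flashinfer_mxfp4
--speculative-moe-runner-backend flashinfer_mxfp4
--speculative-algorithm DSPARK
--speculative-eagle-topk 1
--speculative-num-steps 1
--mem-fraction-static 0.85
--cuda-graph-max-bs 32
--max-running-requests 32
--disable-overlap-schedule
"""


def parse_flags(text: str) -> Dict[str, Union[str, bool]]:
    """Turn '--name value' / '--switch' text into a dict (switches map to True)."""
    tokens = shlex.split(text)
    parsed: Dict[str, Union[str, bool]] = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if not name.startswith("--"):
            raise ValueError(f"expected a flag, got {name!r}")
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            parsed[name] = tokens[i + 1]
            i += 2
        else:
            parsed[name] = True
            i += 1
    return parsed


def check_flags(parsed: Dict[str, Union[str, bool]], gpu_count: int) -> List[str]:
    """Apply this example's own sanity rules; returns a list of problems found."""
    problems = []
    tp, ep = int(parsed["--tp-size"]), int(parsed["--ep-size"])
    if tp != gpu_count:
        problems.append(f"--tp-size {tp} does not match {gpu_count} GPUs")
    if tp % ep != 0:
        problems.append(f"--ep-size {ep} does not divide --tp-size {tp}")
    if int(parsed["--max-running-requests"]) > int(parsed["--cuda-graph-max-bs"]):
        problems.append("some batch sizes would fall outside the captured CUDA graphs")
    return problems


parsed = parse_flags(FLAGS)
print(check_flags(parsed, gpu_count=8) or "no problems found")
```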

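Note on MXFP4: MXFP4 is a 4-bit microscaling floating-point format in which small blocks of values share a scale factor. Running the MoE experts through the flashinfer_mxfp4 backend significantly reduces their memory footprint and the bandwidth needed per token, typically with little loss of accuracy.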

Benchmark Results: The 2x Breakthrough

We compared the DSpark variant against the original DeepSeek-V4-Flash (DSV4F) with Multi-Token Prediction (MTP) enabled. The results were measured using sglang.bench_serving.
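The exact benchmark invocation is not given, so the following is only a sketch of how the two workloads in this section might be assembled for sglang.bench_serving. The flag names are the ones I recall from SGLang's benchmark script and should be checked against `python -m sglang.bench_serving --help`; the host, port, prompt counts, and the reading of 1K, 3K, and 64K as 1024, 3072, and 65536 tokens are all assumptions.

```python
import shlex
from typing import List


def bench_command(
    host: str,
    port: int,
    input_len: int,
    output_len: int,
    concurrency: int,
    num_prompts: int,
) -> List[str]:
    """Build (but do not run) a bench_serving command using random prompts."""
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--host", host,
        "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(num_prompts),
    ]


# Hypothetical endpoint and prompt counts; token lengths assume 1K = 1024 tokens.
workload_a = bench_command("10.0.0.5", 40000, 1024, 1024, concurrency=1, num_prompts=10)
workload_b = bench_command("10.0.0.5", 40000, 65536, 3072, concurrency=10, num_prompts=50)

for argv in (workload_a, workload_b):
    print(shlex.join(argv))
```

The printed commands can be run from a machine that has SGLang installed and can reach the deployed endpoint, or passed to `subprocess.run(argv, check=True)`.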

Workload A: Single-Stream Throughput (1K Input / 1K Output)

This represents a standard chat or coding assistant scenario.

Original DSV4F: 96.20 tokens/s | TTFT: 300.45 ms
DSpark (DSV4FD): 195.18 tokens/s | TTFT: 129.34 ms
Improvement: ~2.03× throughput increase and a ~57% reduction in Time to First Token.

Workload B: Long Context Concurrency (64K Input / 3K Output)

This simulates RAG (Retrieval-Augmented Generation) or document analysis with 10 concurrent users.

Original DSV4F: 198.60 tokens/s
DSpark (DSV4FD): 338.17 tokens/s
Improvement: ~1.7× throughput increase.

Metric	Original DSV4F	DSpark (DSV4FD)	DSpark vs. original
1K/1K Throughput	96.20 tokens/s	195.18 tokens/s	2.03× higher
1K/1K TTFT	300.45 ms	129.34 ms	0.43× (about 57% lower)
64K/3K Throughput	198.60 tokens/s	338.17 tokens/s	1.7× higher
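The gain figures follow directly from the measured values. A short sketch that recomputes them, using only the numbers reported above (the helper names are illustrative):

```python
# (baseline, DSpark) pairs taken from the reported measurements.
RESULTS = {
    "1K/1K throughput (tokens/s)": (96.20, 195.18),
    "1K/1K TTFT (ms)": (300.45, 129.34),
    "64K/3K throughput (tokens/s)": (198.60, 338.17),
}


def ratio(baseline: float, new: float) -> float:
    """DSpark value expressed as a multiple of the baseline value."""
    return new / baseline


def percent_reduction(baseline: float, new: float) -> float:
    """How much lower the new value is, as a percentage of the baseline."""
    return (baseline - new) / baseline * 100


for name, (baseline, new) in RESULTS.items():
    print(f"{name}: {ratio(baseline, new):.2f}x")

ttft_base, ttft_new = RESULTS["1K/1K TTFT (ms)"]
print(f"TTFT reduction: {percent_reduction(ttft_base, ttft_new):.0f}%")
```

This prints ratios of 2.03x, 0.43x, and 1.70x, and a TTFT reduction of 57%. For throughput a ratio above 1 is an improvement; for TTFT, where lower is better, the improvement shows up as a ratio below 1.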

Technical Analysis: Why DSpark Wins

The performance leap is attributed to the Acceptance Length. In the single-stream test, DSpark achieved an acceptance length of 4.42, meaning that for every expensive pass of the main model, it successfully 'guessed' and verified over 4 tokens. In contrast, the standard MTP approach only managed 2.71 tokens.
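A rough back-of-the-envelope check relates the two acceptance lengths to the measured throughput. It rests on a simplifying assumption that is not established here: that each target-model verification step costs about the same in both configurations, so throughput would scale in proportion to tokens accepted per step.

```python
# Figures reported for the single-stream (1K/1K) workload.
ACCEPT_DSPARK = 4.42      # tokens accepted per target pass with DSpark
ACCEPT_MTP = 2.71         # tokens accepted per target pass with MTP
THROUGHPUT_DSPARK = 195.18  # tokens/s
THROUGHPUT_MTP = 96.20      # tokens/s

# If each verification step cost the same, throughput would scale with this ratio.
acceptance_ratio = ACCEPT_DSPARK / ACCEPT_MTP

# What was actually measured end to end.
throughput_ratio = THROUGHPUT_DSPARK / THROUGHPUT_MTP

# Part of the speedup that the acceptance length alone does not explain.
residual = throughput_ratio / acceptance_ratio

print(f"acceptance-length ratio: {acceptance_ratio:.2f}x")
print(f"measured throughput ratio: {throughput_ratio:.2f}x")
print(f"unexplained factor: {residual:.2f}x")
```

Under that assumption, the longer acceptance length accounts for about 1.63× of the 2.03× gain. The remaining factor of about 1.24× would have to come from elsewhere, for example cheaper drafting or lower per-step overhead, though the benchmark does not break this down.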

Furthermore, the reduction in TTFT (Time to First Token) is vital for user experience. By optimizing the CUDA graph capture and using the DSpark speculative algorithm, the initial response time drops by about 57% (from 300.45 ms to 129.34 ms), making the AI feel significantly more responsive. For enterprises building real-time applications, these metrics are the difference between a clunky interface and a seamless one. If you are looking for managed access to these high-performance models without the overhead of maintaining 8× H20 clusters, n1n.ai offers an alternative with optimized routing.

Conclusion

The Day 0 benchmark of DeepSeek-V4-Flash-DSpark on GPUStack demonstrates that software optimization alone can roughly double the single-stream throughput of existing hardware. Moving from 96 to 195 tokens/s means about twice the generation speed per request at the same infrastructure cost. Under the 10-user long-context workload the gain is about 1.7×, so how many additional users the same hardware can serve depends on the workload.


Source: https://dev.to/gpustack/day-0-benchmark-deploying-deepseek-v4-flash-dspark-on-gpustack-doubles-throughput-1b8h