Speculative decoding batch size. We propose an adaptive speculative decoding strategy that adjusts the speculation length according to the batch size in use. Before deployment, it runs a short profiling phase and builds a mapping from each batch size to its corresponding optimal speculation length.

Our contribution establishes the minimal synchronization requirements for correctness at any batch size, providing a principled foundation that clarifies what correct batch speculative decoding requires.

Multi-token prediction: MTP-1 reduces per-token latency but degrades text throughput under high concurrency, because speculative tokens consume KV-cache capacity and thereby reduce the effective batch size. Motivated by these works, we consider a batch version of the speculative decoding algorithm with a simpler parallel structure (left of Figure 3). With batch_size=1 enforced by speculative decoding, each of the 10 concurrent users had to wait for all previous requests to complete; requests queued serially. Serving is often one request, one stream, generating tokens one by one.

Mamba-2 hybrid: the SSM state cache (mamba_ssm_cache) is distinct from the KV cache. Unlike external draft models, the additional KV-cache and latency overhead is minimal, since only a single layer is called per predicted token. Under our definition, per-token latency increases with batch size because the FLOPs performed within one token's latency window are multiplied by the batch size; this applies to both regular decoding and speculative decoding.
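The profiling phase described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `run_batch_step` is a hypothetical callback that runs one speculative decoding step at a given batch size and speculation length and returns the number of tokens accepted.

```python
import time

def profile_speculation_lengths(run_batch_step,
                                batch_sizes=(1, 2, 4, 8, 16),
                                candidate_lengths=(1, 2, 3, 4, 6, 8)):
    """Measure throughput for each (batch size, speculation length) pair
    and keep the fastest speculation length for each batch size.

    run_batch_step(batch_size, spec_len) is assumed to execute one
    draft-then-verify step and return the number of accepted tokens.
    """
    best = {}
    for bs in batch_sizes:
        throughput = {}
        for k in candidate_lengths:
            start = time.perf_counter()
            accepted = run_batch_step(bs, k)
            elapsed = time.perf_counter() - start
            throughput[k] = accepted / elapsed  # tokens per second
        # Map this batch size to its empirically best speculation length.
        best[bs] = max(throughput, key=throughput.get)
    return best
```

At serving time, the resulting table is simply consulted per scheduled batch, so the adaptation adds no per-step overhead beyond a dictionary lookup.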
Sampling and decoding optimization: this page documents optimization techniques that accelerate token generation during the decode phase of LLM inference through speculative execution and multi-token prediction strategies. These techniques address the sequential bottleneck inherent in autoregressive language model generation, where each token must be produced before the next can begin.

In a naive speculative decoding implementation, each speculative head would have its own KV cache; instead, we modify the paged attention kernel developed in the vLLM project to enable efficient KV-cache maintenance. Tree-structured speculation further increases parallelism, but is often brittle when ported across heterogeneous backends and accelerator stacks. That matters for speculative decoding because the draft model is usually run in a small-batch, latency-sensitive regime. This layer functions as a tail-augmented draft model (similar to EAGLE or other MTP heads) for speculative decoding.

Batch size and numerical stability: changes in batch size may cause variations in logprobs and output probabilities, potentially due to non-deterministic behavior in batched operations or numerical instability. Our Algorithm 4 can be viewed as a simplified approximation of those batch algorithms.

Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. This ensures that throughput does not degrade at larger batch sizes. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3× throughput improvement at batch size 8 compared to batch size 1, scaling efficiently through batch size 8 while maintaining 95% output equivalence.
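The verification step that accepts multiple candidate tokens per target-model invocation can be sketched as follows. This is a greedy-verification variant for illustration only, not the exact algorithm above; real systems typically also sample a "bonus" token when every draft token is accepted, which this sketch omits.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Greedy batch verification for speculative decoding.

    draft_tokens:  per-sequence lists of k tokens proposed by the draft model.
    target_tokens: per-sequence lists of the target model's argmax at the
                   same k positions (computed in one parallel forward pass).

    Each sequence accepts draft tokens up to the first disagreement, then
    takes the target's token at that position as a free correction.
    """
    accepted = []
    for drafts, targets in zip(draft_tokens, target_tokens):
        n = 0
        while n < len(drafts) and drafts[n] == targets[n]:
            n += 1
        seq = drafts[:n]
        if n < len(drafts):
            # Mismatch: the target model's own prediction is still valid output.
            seq = seq + [targets[n]]
        accepted.append(seq)
    return accepted
```

Because every sequence gains at least one token per target invocation (the correction token), a single verification pass never produces less progress than plain autoregressive decoding for that step.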