Defeating Nondeterminism in LLM Inference
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/?utm_source=substack&utm_medium=email
1. Introduction
The paper begins by addressing a problem many AI developers face but rarely fully understand: why do LLMs (like ChatGPT) produce different outputs even when using greedy sampling with temperature 0? While randomness from sampling is expected, true nondeterminism can occur even in supposedly deterministic settings, especially during inference. This challenges reproducibility in LLM systems and highlights a need for deeper insight into GPU computation and floating-point arithmetic.
2. The Original Sin: Floating-Point Non-Associativity
Floating-point arithmetic is not associative, meaning the order of operations affects results. The loss of precision when adding numbers of very different scales (e.g. 1e-10 vs 1e+3) introduces rounding errors. If the same set of numbers is added in different orders (e.g. by different GPU threads), the final result can vary, sometimes yielding dozens or hundreds of subtly different outputs. This mathematical nuance is a key source of nondeterminism.
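To make this concrete, here is a minimal illustration (not taken from the paper) of non-associativity with 32-bit floats; the values 1e-10 and 1e+3 mirror the scales mentioned above.

```python
import numpy as np

a = np.float32(1e-10)
b = np.float32(1e3)
c = np.float32(-1e3)

left = (a + b) + c   # 1e-10 is rounded away when added to 1e3, then cancelled: 0.0
right = a + (b + c)  # b and c cancel first, so the tiny value survives: 1e-10

print(left, right, left == right)  # 0.0 1e-10 False
```

A GPU kernel that sums thousands of such terms in a thread- or tile-dependent order accumulates exactly this kind of discrepancy.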
3. Why Don't Kernels Always Add Numbers in the Same Order?
The paper challenges the common "concurrency + floating point" explanation. While concurrency (parallel thread execution) can create nondeterminism, the authors find that LLM inference engines like vLLM do not depend on atomic adds or nondeterministic concurrency. Instead, the real culprit is batch processing, where the ordering of operations changes depending on batch size or structure, even if the model, inputs, and sampling settings stay the same.
4. When Are Atomic Adds Needed?
Atomic adds (synchronized operations between threads) are sometimes used in GPU computations, but LLM inference rarely requires them. So they are not the root cause of randomness in inference outputs. This shifts the focus from thread concurrency to batch invariance: a property that, if violated, can make inference results dependent on batch structure rather than input content alone.
The authors redefine what determinism should mean: if the same input produces different outputs depending on what other examples it is batched with, then the system is not batch-invariant, and thus not truly deterministic. Batch size, sequence length, and padding all influence execution order inside kernels, especially when leveraging fused optimizations.
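The lack of batch invariance in off-the-shelf kernels is easy to observe. The check below is a sketch in the spirit of the post's demonstration (the shapes are arbitrary, and on CPU or some GPUs the two results may happen to match):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2048, 4096)   # a "batch" of 2048 rows
w = torch.randn(4096, 4096)

out_single = x[:1] @ w         # first row, processed with batch size 1
out_batched = (x @ w)[:1]      # the same row, processed inside the full batch

# Mathematically identical, but the matmul kernel may pick a different
# tiling/reduction strategy for each shape, so the bits can differ.
print(torch.equal(out_single, out_batched))
print((out_single - out_batched).abs().max())
```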
5. How Do We Make Kernels Batch-Invariant?
This core technical section lays out a path to defeating nondeterminism: building batch-invariant kernel implementations.
Batch-Invariant RMSNorm: The authors propose a new implementation of RMSNorm (Root Mean Square Layer Norm) that guarantees the same output regardless of batch size or sequence placement.
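As a rough illustration of that property (a plain-PyTorch sketch, not the authors' kernel; the function name, chunk size, and float32 accumulator are assumptions here), the reduction over the hidden dimension can run in fixed-size chunks in a fixed left-to-right order, so the order never depends on how many rows are in the batch:

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6, chunk: int = 256) -> torch.Tensor:
    """RMSNorm whose per-row reduction order is set by `chunk` alone,
    never by the batch size, so a row is normalized to the same bits
    whether it appears alone or inside a large batch."""
    xf = x.float()
    # Accumulate the sum of squares over the hidden dim in fixed-size
    # chunks, always left to right, in float32.
    acc = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, x.shape[-1], chunk):
        piece = xf[..., start:start + chunk]
        acc = acc + (piece * piece).sum(dim=-1)
    inv_rms = torch.rsqrt(acc / x.shape[-1] + eps)
    return (xf * inv_rms.unsqueeze(-1) * weight.float()).to(x.dtype)
```

A production version would be a single GPU kernel that commits to one data-parallel reduction strategy for every batch size rather than switching strategies when the batch is small.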
Batch-Invariant Matrix Multiplication: They introduce techniques to force consistent memory access and accumulation order during matrix multiplication, even across variable batch structures.
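A minimal sketch of the same idea for matmul (again plain Python for readability; the function name and block size are illustrative): reduce over the K dimension in fixed-size blocks in a fixed order, independent of how many rows are being multiplied. In practice this corresponds to pinning one kernel configuration (fixed tile sizes, no split-K) and using it for every batch size.

```python
import torch

def matmul_batch_invariant(a: torch.Tensor, b: torch.Tensor,
                           k_block: int = 512) -> torch.Tensor:
    """Computes a @ b, always reducing over K in the same fixed-size blocks
    and in the same left-to-right order, regardless of a's batch dimension."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = torch.zeros(m, n, dtype=torch.float32, device=a.device)
    for start in range(0, k, k_block):
        out += a[:, start:start + k_block].float() @ b[start:start + k_block, :].float()
    return out.to(a.dtype)
```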
Batch-Invariant Attention: Finally, they rewrite attention mechanisms to behave identically even when queries or keys are padded, batched, or reordered, addressing one of the biggest sources of inconsistency.
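For attention, the reduction runs over the key/value sequence, so the block boundaries must depend only on absolute KV positions, not on how many queries or requests are processed at once. Below is a hedged single-head sketch (masking, batching, and the `kv_block` size are simplifications; the real kernels operate on a paged KV cache):

```python
import torch

def attention_fixed_kv_blocks(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                              kv_block: int = 256) -> torch.Tensor:
    """Single-head scaled dot-product attention, q: (Tq, d), k/v: (Tkv, d).
    The KV sequence is reduced in fixed-size blocks with an online softmax,
    in the same order whether this is a long prefill or a one-token decode."""
    q, k, v = q.float(), k.float(), v.float()  # compute in float32 for this sketch
    Tq, d = q.shape
    scale = d ** -0.5
    # Running max, normalizer, and weighted-value accumulator (online softmax).
    m = torch.full((Tq,), float("-inf"), device=q.device)
    l = torch.zeros(Tq, device=q.device)
    acc = torch.zeros(Tq, v.shape[-1], device=q.device)
    for start in range(0, k.shape[0], kv_block):
        ks, vs = k[start:start + kv_block], v[start:start + kv_block]
        s = (q @ ks.T) * scale                       # (Tq, block) scores
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)                 # rescale previous state
        p = torch.exp(s - m_new.unsqueeze(-1))
        l = l * alpha + p.sum(dim=-1)
        acc = acc * alpha.unsqueeze(-1) + p @ vs
        m = m_new
    return acc / l.unsqueeze(-1)
```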
6. Implementation
The techniques were implemented in PyTorch-compatible kernels. The authors focused on reproducibility without sacrificing performance. They share code examples and recommendations for building truly deterministic LLM inference systems.
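A simple way to exercise such kernels (a sketch; `model_forward` is a placeholder for whatever forward function wraps the batch-invariant ops) is to demand bitwise-equal logits for a sequence whether it runs alone or inside a larger batch:

```python
import torch

def assert_batch_invariant(model_forward, tokens: torch.Tensor) -> None:
    """tokens: (B, T) token ids; model_forward maps them to (B, T, V) logits.
    Passes only if the first sequence's logits are bit-identical whether it
    is run with batch size 1 or alongside the rest of the batch."""
    with torch.no_grad():
        solo = model_forward(tokens[:1])       # batch size 1
        batched = model_forward(tokens)[:1]    # same sequence, full batch
    assert torch.equal(solo, batched), "kernels are not batch-invariant"
```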
7. Experiments
Through extensive testing, the authors validated their techniques:
How Nondeterministic Are Completions?: Even with deterministic sampling, batching inputs differently caused LLMs to return different completions for the same prompt (a minimal version of such a check is sketched after this list). Their fixes eliminated this behavior.
Performance: The batch-invariant implementations ran with acceptable overhead, showing that determinism does not have to come at a prohibitive cost in speed.
True On-Policy RL: In reinforcement learning, determinism is essential: if the sampler and the trainer compute numerically different results, nominally on-policy training silently becomes off-policy. The authors applied their methods to train RLHF-style agents truly on-policy, achieving stable and reproducible training results, something that was previously very difficult.
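The completion-level check referenced above can be as simple as the sketch below (`generate` is a hypothetical client call; a deterministic, batch-invariant server should return exactly one distinct completion):

```python
from collections import Counter

def count_unique_completions(generate, prompt: str, n: int = 100) -> Counter:
    """Send the same greedy (temperature-0) request n times and count how
    many distinct completions come back."""
    completions = [generate(prompt, temperature=0.0) for _ in range(n)]
    return Counter(completions)

# Usage with a hypothetical client:
# counts = count_unique_completions(my_client.complete, "Tell me about physics.")
# print(len(counts), "distinct completions")
```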
8. Conclusion
The paper ends with a clear message: nondeterminism in LLM inference is solvable, but only by rethinking how GPU kernels are written. Floating-point quirks and batch structure must be accounted for. This work lays the groundwork for a more reproducible, debuggable, and trustworthy AI future.




