Defeating Nondeterminism in LLM Inference
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/?utm_source=substack&utm_medium=email
1. Introduction
The paper begins by addressing a problem many AI developers face but rarely fully understand: why do LLMs (like ChatGPT) produce different outputs even when using greedy sampling with temperature 0? While randomness from sampling is expected, true nondeterminism can occur even in supposedly deterministic settings, especially during inference. This challenges reproducibility in LLM systems and highlights a need for deeper insight into GPU computation and floating-point arithmetic.
2. The Original Sin: Floating-Point Non-Associativity
Floating-point arithmetic is not associative, meaning the order of operations affects results. The loss of precision when adding numbers of very different scales (e.g. 1e-10 vs 1e+3) introduces rounding errors. If the same set of numbers is added in different orders (e.g. by different GPU threads), the final result can vary, sometimes yielding dozens or hundreds of subtly different outputs. This mathematical nuance is a key source of nondeterminism.
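To make this concrete, here is a minimal illustration (not taken from the paper) of non-associativity with 32-bit floats; the values 1e-10 and 1e+3 mirror the scales mentioned above.

```python
import numpy as np

a = np.float32(1e-10)
b = np.float32(1e3)
c = np.float32(-1e3)

left = (a + b) + c   # 1e-10 is rounded away when added to 1e3, then cancelled: 0.0
right = a + (b + c)  # b and c cancel first, so the tiny value survives: 1e-10

print(left, right, left == right)  # 0.0 1e-10 False
```

A GPU kernel that sums thousands of such terms in a thread- or tile-dependent order accumulates exactly this kind of discrepancy.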
3. Why Don't Kernels Always Add Numbers in the Same Order?
The paper challenges the common "concurrency + floating point" explanation. While concurrency (parallel thread execution) can create nondeterminism, the authors find that LLM inference engines like vLLM do not depend on atomic adds or nondeterministic concurrency. Instead, the real culprit is batch processing, where the ordering of operations changes depending on batch size or structure, even if the model, inputs, and sampling settings stay the same.
4. When Are Atomic Adds Needed?
Atomic adds (synchronized operations between threads) are sometimes used in GPU computations, but LLM inference rarely requires them. So they are not the root cause of randomness in inference outputs. This shifts the focus from thread concurrency to batch invariance: a property that, if violated, can make inference results dependent on batch structure rather than input content alone.
The authors redefine what determinism should mean: if the same input produces different outputs depending on what other examples it is batched with, then the system is not batch-invariant, and thus not truly deterministic. Batch size, sequence length, and padding all influence execution order inside kernels, especially when leveraging fused optimizations.
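The lack of batch invariance in off-the-shelf kernels is easy to observe. The check below is a sketch in the spirit of the post's demonstration (the shapes are arbitrary, and on CPU or some GPUs the two results may happen to match):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2048, 4096)   # a "batch" of 2048 rows
w = torch.randn(4096, 4096)

out_single = x[:1] @ w         # first row, processed with batch size 1
out_batched = (x @ w)[:1]      # the same row, processed inside the full batch

# Mathematically identical, but the matmul kernel may pick a different
# tiling/reduction strategy for each shape, so the bits can differ.
print(torch.equal(out_single, out_batched))
print((out_single - out_batched).abs().max())
```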
5. How Do We Make Kernels Batch-Invariant?
This core technical section lays out a path to defeating nondeterminism: building batch-invariant kernel implementations.
Batch-Invariant RMSNorm: The authors propose a new implementation of RMSNorm (Root Mean Square Layer Norm) that guarantees the same output regardless of batch size or sequence placement.
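As a rough illustration of that property (a plain-PyTorch sketch, not the authors' kernel; the function name, chunk size, and float32 accumulator are assumptions here), the reduction over the hidden dimension can run in fixed-size chunks in a fixed left-to-right order, so the order never depends on how many rows are in the batch:

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6, chunk: int = 256) -> torch.Tensor:
    """RMSNorm whose per-row reduction order is set by `chunk` alone,
    never by the batch size, so a row is normalized to the same bits
    whether it appears alone or inside a large batch."""
    xf = x.float()
    # Accumulate the sum of squares over the hidden dim in fixed-size
    # chunks, always left to right, in float32.
    acc = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, x.shape[-1], chunk):
        piece = xf[..., start:start + chunk]
        acc = acc + (piece * piece).sum(dim=-1)
    inv_rms = torch.rsqrt(acc / x.shape[-1] + eps)
    return (xf * inv_rms.unsqueeze(-1) * weight.float()).to(x.dtype)
```

A production version would be a single GPU kernel that commits to one data-parallel reduction strategy for every batch size rather than switching strategies when the batch is small.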
Batch-Invariant Matrix Multiplication: They introduce techniques to force consistent memory access and accumulation order during matrix multiplication, even across variable batch structures.
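A minimal sketch of the same idea for matmul (again plain Python for readability; the function name and block size are illustrative): reduce over the K dimension in fixed-size blocks in a fixed order, independent of how many rows are being multiplied. In practice this corresponds to pinning one kernel configuration (fixed tile sizes, no split-K) and using it for every batch size.

```python
import torch

def matmul_batch_invariant(a: torch.Tensor, b: torch.Tensor,
                           k_block: int = 512) -> torch.Tensor:
    """Computes a @ b, always reducing over K in the same fixed-size blocks
    and in the same left-to-right order, regardless of a's batch dimension."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = torch.zeros(m, n, dtype=torch.float32, device=a.device)
    for start in range(0, k, k_block):
        out += a[:, start:start + k_block].float() @ b[start:start + k_block, :].float()
    return out.to(a.dtype)
```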
Batch-Invariant Attention: Finally, they rewrite attention mechanisms to behave identically even when queries or keys are padded, batched, or reordered, addressing one of the biggest sources of inconsistency.
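For attention, the reduction runs over the key/value sequence, so the block boundaries must depend only on absolute KV positions, not on how many queries or requests are processed at once. Below is a hedged single-head sketch (masking, batching, and the `kv_block` size are simplifications; the real kernels operate on a paged KV cache):

```python
import torch

def attention_fixed_kv_blocks(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                              kv_block: int = 256) -> torch.Tensor:
    """Single-head scaled dot-product attention, q: (Tq, d), k/v: (Tkv, d).
    The KV sequence is reduced in fixed-size blocks with an online softmax,
    in the same order whether this is a long prefill or a one-token decode."""
    q, k, v = q.float(), k.float(), v.float()  # compute in float32 for this sketch
    Tq, d = q.shape
    scale = d ** -0.5
    # Running max, normalizer, and weighted-value accumulator (online softmax).
    m = torch.full((Tq,), float("-inf"), device=q.device)
    l = torch.zeros(Tq, device=q.device)
    acc = torch.zeros(Tq, v.shape[-1], device=q.device)
    for start in range(0, k.shape[0], kv_block):
        ks, vs = k[start:start + kv_block], v[start:start + kv_block]
        s = (q @ ks.T) * scale                       # (Tq, block) scores
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)                 # rescale previous state
        p = torch.exp(s - m_new.unsqueeze(-1))
        l = l * alpha + p.sum(dim=-1)
        acc = acc * alpha.unsqueeze(-1) + p @ vs
        m = m_new
    return acc / l.unsqueeze(-1)
```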
6. Implementation
The techniques were implemented in PyTorch-compatible kernels. The authors focused on reproducibility without sacrificing performance. They share code examples and recommendations for building truly deterministic LLM inference systems.
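A simple way to exercise such kernels (a sketch; `model_forward` is a placeholder for whatever forward function wraps the batch-invariant ops) is to demand bitwise-equal logits for a sequence whether it runs alone or inside a larger batch:

```python
import torch

def assert_batch_invariant(model_forward, tokens: torch.Tensor) -> None:
    """tokens: (B, T) token ids; model_forward maps them to (B, T, V) logits.
    Passes only if the first sequence's logits are bit-identical whether it
    is run with batch size 1 or alongside the rest of the batch."""
    with torch.no_grad():
        solo = model_forward(tokens[:1])       # batch size 1
        batched = model_forward(tokens)[:1]    # same sequence, full batch
    assert torch.equal(solo, batched), "kernels are not batch-invariant"
```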
7. Experiments
Through extensive testing, the authors validated their techniques:
How Nondeterministic Are Completions?: Even with deterministic sampling, batching inputs differently caused LLMs to return different completions for the same prompt (a minimal version of such a check is sketched after this list). Their fixes eliminated this behavior.
Performance: The batch-invariant implementations ran with acceptable overhead, showing that determinism does not have to come at a prohibitive cost in speed.
True On-Policy RL: In reinforcement learning, determinism is essential: if the sampler and the trainer compute numerically different results, nominally on-policy training silently becomes off-policy. The authors applied their methods to train RLHF-style agents truly on-policy, achieving stable and reproducible training results, something that was previously very difficult.
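The completion-level check referenced above can be as simple as the sketch below (`generate` is a hypothetical client call; a deterministic, batch-invariant server should return exactly one distinct completion):

```python
from collections import Counter

def count_unique_completions(generate, prompt: str, n: int = 100) -> Counter:
    """Send the same greedy (temperature-0) request n times and count how
    many distinct completions come back."""
    completions = [generate(prompt, temperature=0.0) for _ in range(n)]
    return Counter(completions)

# Usage with a hypothetical client:
# counts = count_unique_completions(my_client.complete, "Tell me about physics.")
# print(len(counts), "distinct completions")
```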
8. Conclusion
The paper ends with a clear message: nondeterminism in LLM inference is solvable, but only by rethinking how GPU kernels are written. Floating-point quirks and batch structure must be accounted for. This work lays the groundwork for a more reproducible, debuggable, and trustworthy AI future.




