From Words to Worlds: Spatial Intelligence is AI’s Next Frontier
https://shre.ink/From-Words-to-Worlds
1. Introduction
Fei-Fei Li argues that we are moving beyond an era in which AI focuses only on words and into one where it must master worlds: the spatial, physical, embodied, three-dimensional reality in which we live. She suggests that while large language models (LLMs) have made great strides with text, they lack a true grasp of space: distance, motion, geometry, the physics of objects and how they relate to one another. Without that, AI remains fundamentally limited. She proposes that the next major frontier of AI is what she calls spatial intelligence, the scaffolding of human cognition, built on perception, action and understanding of the physical world.
She emphasizes that this isn’t a niche add-on to existing systems, but a paradigm shift. Just as human intelligence evolved from sensing and moving in the world, so too must AI evolve from processing words to interacting with and reasoning about entire worlds.
2. Spatial Intelligence: The Scaffolding of Human Cognition
Fei-Fei Li traces human intelligence back to the earliest perceptual loops: organisms sensed space, moved in it and manipulated objects. These interactions formed the basis of reasoning. She argues that this spatial awareness underlies our ability to build models of the world, to think about geometry, to understand cause and effect in physical space. It wasn’t words that first shaped intelligence; it was motion, space, coordination.
Scientists, engineers and creatives have relied on spatial reasoning to transform the world, from Eratosthenes measuring the Earth’s circumference, to Watson and Crick modeling the structure of DNA, to inventors visualizing machines and structures in three dimensions. These breakthroughs weren’t purely linguistic. They required a “sense” of space. And yet many AI systems are stuck in the realm of “words”: excellent at text but poor at spatial or embodied reasoning. The scaffolding is missing.
She argues that for AI to go beyond the narrow tasks it currently excels at (such as summarizing text, translating and generating images), it needs to internalize spatial intelligence. In other words, it needs to understand how objects relate in space, how they move, how physical laws apply, how agents act and react. Without this, it cannot fully understand or shape the world.
3. The Next Decade of AI: Building Truly Spatially Intelligent Machines
Fei-Fei Li outlines a vision for the next decade: AI systems built around world models, that is, models that do not merely process words or images but maintain the perceptual, geometric and physical consistency of whole environments. These models, she argues, must incorporate three major capabilities:
Generative: World Models Can Generate Worlds with Perceptual, Geometrical and Physical Consistency: world models should be able to create plausible 3D (or 4D) environments where geometry, material, physics and perception are consistent. Just as LLMs generate coherent text, world models must generate coherent spaces: an object’s shape must match its shadow, its movement must align with gravity, its spatial relations must make sense. Fei-Fei Li argues this is a far higher bar than text generation, but one that, once cleared, unlocks new possibilities: simulation, design, virtual/augmented reality, robotics.
Multimodal: World Models Are Multimodal by Design: they integrate text, images, depth, motion, touch, perhaps even smell or physics signals. The world we inhabit is not just visual or linguistic; it is richly multimodal. A model of the world must handle prompts in text, interpret images, understand motion and material, and reason about space. Fei-Fei Li emphasizes that spatial intelligence means crossing modalities, linking words and worlds.
Interactive: World Models Can Output the Next States Based on Input Actions: given an action, they can predict the next state of the world. They can model “what happens if I move here”, “what happens if I pick up this object” or “how will this agent navigate this space”. This sense of action, reaction and dynamics is central to embodied intelligence. Without interactivity, you have a static model, but intelligence unfolds in time, through doing and reacting. Fei-Fei Li argues that machines must internalize this loop of perception → action → consequence to become spatially intelligent.
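The interactive capability can be made concrete with a toy example. The sketch below is not taken from the essay; it assumes a deliberately tiny world (a single ball above a floor), and the names BallState, step and rollout, the time step and the simple Euler integration are all illustrative choices. It only shows the shape of the interface: a state and an action go in, and a next state that respects gravity and the floor comes out.

```python
from dataclasses import dataclass
from typing import List, Sequence

GRAVITY = -9.81  # m/s^2, acting along the vertical axis
DT = 0.1         # time step in seconds (arbitrary choice for this sketch)


@dataclass(frozen=True)
class BallState:
    """Hypothetical world state: a ball's height and vertical velocity."""
    height: float    # metres above the floor
    velocity: float  # metres per second, positive is upward


def step(state: BallState, push: float) -> BallState:
    """Predict the next state given an action (an upward acceleration in m/s^2)."""
    velocity = state.velocity + (GRAVITY + push) * DT
    height = state.height + velocity * DT
    if height <= 0.0:
        # Physical consistency: the ball cannot pass through the floor.
        height, velocity = 0.0, 0.0
    return BallState(height, velocity)


def rollout(state: BallState, pushes: Sequence[float]) -> List[BallState]:
    """Apply a sequence of actions and return every state visited, start included."""
    states = [state]
    for push in pushes:
        state = step(state, push)
        states.append(state)
    return states


start = BallState(height=1.0, velocity=0.0)
falling = rollout(start, [0.0] * 20)   # no action: the ball falls and rests on the floor
hovering = step(start, push=9.81)      # a push that exactly cancels gravity
```

A learned world model would replace the hand-written physics inside step with a trained predictor, but the loop of state, action and predicted consequence would keep the same form.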
4. The Scope of This Challenge Exceeds Anything AI Has Faced Before
Language models have one core training task: next-token prediction. World models must handle geometry, physics, multiple modalities, actions and time, often across large environments. This is far broader and deeper than any AI challenge to date. Here are some examples of current research topics:
A New, Universal Task Function for Training: Fei-Fei Li argues that we need a new “universal task function” akin to next-token prediction, but for spatial/world modeling. What is the objective function? What is the loss metric? How do we train a machine to anticipate motion, layout, causality, interaction? We must develop novel task definitions, benchmarks and training objectives that capture the full richness of world modeling.
Large-Scale Training Data: to train world models we need massive datasets of 3D scans, videos, multimodal sequences, spatial layouts and physics simulations, many orders of magnitude more complex than text corpora. The data must include depth, geometry, motion, multiple sensor types, embeddings of physical laws. Collecting, curating and processing such data is one of the most daunting tasks ahead.
New Model Architecture and Representational Learning: spatial intelligence will likely require new architectural primitives: 3D/4D aware tokenization, embedding geometry and physics, memory over space and time, interaction loops, perhaps even embodied simulation. Representational learning must change: from flattened sequences to structured spatial representations, graphs, meshes, dynamics.
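To make the analogy with next-token prediction concrete, one candidate objective is next-state prediction: given the current state and an action, predict the state that follows and penalize the error. The sketch below is an assumption for illustration only. The essay leaves the right objective as an open question, and the mean squared error, the name next_state_loss and the flat list-of-floats state representation are hypothetical choices, not a proposal from the source.

```python
from typing import Callable, List, Sequence

State = List[float]
Action = List[float]
Model = Callable[[State, Action], State]


def next_state_loss(model: Model,
                    trajectory: Sequence[State],
                    actions: Sequence[Action]) -> float:
    """Mean squared error between predicted and observed next states.

    trajectory holds the observed states s_0..s_T; actions holds a_0..a_{T-1},
    where a_t is the action taken in s_t that led to s_{t+1}.
    """
    if not actions or len(trajectory) != len(actions) + 1:
        raise ValueError("need at least one action and exactly one more state than actions")
    total, count = 0.0, 0
    for state, action, target in zip(trajectory, actions, trajectory[1:]):
        predicted = model(state, action)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count


def additive_model(state: State, action: Action) -> State:
    """Toy model: the action is a displacement added to the state."""
    return [s + a for s, a in zip(state, action)]


def static_model(state: State, action: Action) -> State:
    """Toy model that ignores the action and predicts no change."""
    return list(state)


observed = [[0.0], [1.0], [3.0]]   # 1D positions over three time steps
taken = [[1.0], [2.0]]             # displacements applied between them

perfect = next_state_loss(additive_model, observed, taken)  # 0.0
ignoring = next_state_loss(static_model, observed, taken)   # 2.5
```

The model that accounts for actions scores a loss of zero on this trajectory, while the static model is penalized, which is the kind of signal an action-conditioned objective is meant to provide. A real objective would also have to cover geometry, appearance and multiple modalities, which this sketch does not attempt.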
5. Using World Models to Build a Better World for People
Moving from the technical to the practical, Fei-Fei Li describes how spatially intelligent AI can transform industries and human experience:
Creativity: Superpowering Storytelling and Immersive Experiences: imagine authors, artists, designers generating entire worlds with consistency, immersing users in 3D environments, interactive narratives, virtual architecture. Spatial intelligence enables AI to not just describe a scene, but to build it and let users inhabit it. This is potentially transformative for entertainment, education, training, simulation.
Robotics: Embodied Intelligence in Action: if machines truly understand space, they can navigate, manipulate, coordinate in real-world environments. A robot with a world model can anticipate how objects will move, how agents will behave, where it can step or reach. Spatial intelligence moves robotics from rule-based automation to intuitive, adaptive, human-like embodied intelligence.
The Longer Horizon: Science, Healthcare and Education: in science, world models can simulate complex physical systems and model climate, materials and healthcare systems, things humans can’t easily test. In healthcare, immersive, spatially grounded AI can aid surgery, rehabilitation, diagnostics in 3D. In education, it can create virtual labs, real-world simulations, personalized spatial experiences.
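The robotics point above, that a robot with a world model can anticipate where it can step or reach, can be illustrated with a very small planner. This is a minimal sketch under stated assumptions: the grid world, the predict and choose_move functions and the one-step greedy lookahead are all hypothetical simplifications, and a real system would plan over many steps with a learned model.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]

# Candidate actions and their (dx, dy) displacement on the grid.
MOVES: Dict[str, Cell] = {
    "right": (1, 0),
    "up": (0, 1),
    "left": (-1, 0),
    "down": (0, -1),
}


def predict(position: Cell, move: str, obstacles: Set[Cell]) -> Cell:
    """Toy world model: a move into an obstacle leaves the robot in place."""
    dx, dy = MOVES[move]
    target = (position[0] + dx, position[1] + dy)
    return position if target in obstacles else target


def choose_move(position: Cell, goal: Cell, obstacles: Set[Cell]) -> str:
    """Pick the move whose predicted outcome is closest to the goal.

    Distance is Manhattan distance; ties go to the earliest entry in MOVES.
    """
    def distance_after(move: str) -> int:
        x, y = predict(position, move, obstacles)
        return abs(goal[0] - x) + abs(goal[1] - y)

    return min(MOVES, key=distance_after)


open_floor = choose_move((0, 0), (2, 2), obstacles=set())       # "right"
blocked_path = choose_move((0, 0), (2, 2), obstacles={(1, 0)})  # "up"
```

The robot never attempts the blocked move: it consults the model first, sees that stepping right would achieve nothing, and chooses to go up instead. A one-step greedy lookahead like this can still get stuck behind larger obstacles, which is one reason richer world models and longer planning horizons matter.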
6. Conclusion
Fei-Fei Li argues that we stand at a watershed moment: the move from language to space is not optional; it’s inevitable if AI is to be truly intelligent, embodied and useful. She calls for the research community, industry, funders and educators to commit to spatial intelligence as the next frontier. It will require new objectives, new data and new architectures, but the payoff is potentially the next wave of AI breakthroughs.
Her vision is optimistic yet grounded: yes, the challenge is enormous. But the practical constraints are easing: sensors, compute and virtual/augmented reality technologies are all improving. What remains is conceptual and engineering work: defining tasks, collecting data, building systems.
She envisions a world where AI doesn’t just answer our questions, but understands the spaces we live in, helps us build new ones, and bridges the digital and the physical in profound ways. Spatial intelligence, she argues, will be the scaffolding of human-machine partnership in the decades ahead.




