Artificial intelligence has developed rapidly across text, image, audio, and video, with outputs going from rudimentary to nearly indistinguishable from human work within a few years of each modality’s foundational breakthroughs. Building on that trajectory, experts now anticipate that AI-generated virtual worlds, three-dimensional environments that users can explore and interact with, could be the next significant milestone.
Central to this vision is Fei-Fei Li, the renowned AI researcher widely credited with key advances in computer vision. In November, Li’s new company, World Labs, unveiled its first commercial product, Marble. The platform lets users generate exportable 3D environments simply by providing prompts in the form of text, images, or videos, a capability that could streamline complex creative work for design professionals. Yet Li has a broader ambition: the development of what she terms “spatial intelligence,” described in her recent manifesto as “the frontier beyond language—the capability that links imagination, perception and action.” Unlike today’s AI, which can interpret visual information, spatially intelligent systems would be able to act meaningfully within their environment.
While virtual worlds are familiar in the form of video games accessed via screens or headsets, creating them has traditionally demanded substantial technical skill and labor. AI promises to simplify that process, opening the door to personalized and potentially limitless virtual spaces. Today these models remain nascent, but they are progressing along a path similar to earlier AI modalities such as text and video generation. Ben Mildenhall, a co-founder of World Labs alongside Li, anticipates an evolution from early fascination to widespread recognition of AI-crafted virtual worlds, paralleling the trajectory of recent AI-generated media.
AI-driven video generation has notably advanced, as evidenced by popular models from OpenAI and Midjourney, and companies such as Captions, Runway, and Synthesia have successfully commercialized AI-produced video content. MIT assistant professor Vincent Sitzmann, an authority on AI world modeling, describes video models as “proto-world models” that lay the groundwork for full spatial simulations.
World Labs’ Marble platform accepts multiple kinds of input, including text descriptions, photographs, videos, and existing 3D scenes, and produces immersive worlds navigable from the first-person viewpoint typical of video games. Initially these worlds are static, though developers can add motion and interactivity using specialized tools. The current limitations are evident, however: visual distortions and incoherent structures appear within moments of exploration, a sign of how early these environments still are.
Constructing full virtual worlds is more complex than generating videos alone. According to Mildenhall, because the barrier to entry for 3D world creation is so much higher, tools like Marble deliver visible value earlier. Sitzmann credits World Labs with integrating and scaling a decade of computer vision advances, a significant achievement that offers a glimpse of the products these technologies may eventually enable.
Li emphasizes the potential to create a multitude of virtual worlds that connect, extend, or complement the physical world. Immediate applications in entertainment are clear, supplying novel experiences for users. In professional arenas such as architecture and engineering, the ability to simulate numerous design alternatives at reduced cost presents compelling efficiency gains. Nevertheless, the broader use cases proposed for robotics, science, and education confront substantial challenges.
A central difficulty lies in data availability, particularly for robotics. While copious video and camera footage supports video model training, appropriate data for humanoid robots, especially proprioceptive or "action data" that links specific motor movements to physical effects, remains scarce. By contrast, self-driving car systems can rely on millions of hours of video paired with corresponding human driver actions due to their limited input scope. For humanoid robots, with myriad joints and potential movements, such comprehensive datasets do not yet exist, hindering accurate simulation development.
Li proposes that world models could play a critical role in addressing robotics’ data limitations, though experts like Sitzmann note that current explanations leave open questions about how these models would actually overcome the deficit. A faithful simulator requires accurate correlations between motor movements and their physical effects, and that data does not yet exist.
Additional hurdles arise for scientific and educational uses of world models. Unlike entertainment, where realism suffices, these fields demand simulations faithful to real-world dynamics. Li envisions applications such as immersive exploration of cellular interiors or surgical training within virtual anatomies. However, such simulations are only valuable if truly accurate. World Labs’ leadership is aware of the trade-offs between visual realism and fidelity to underlying facts, expressing optimism that models will eventually achieve both simultaneously.
Presently, AI exhibits significant limitations in spatial reasoning compared to language processing. Li asserts that advancing spatial intelligence is essential for AI to surpass existing thresholds and unlock further growth, potentially impacting industries worth trillions. Yet, whether current multimodal language models like ChatGPT will encounter insurmountable barriers remains uncertain. What is observable is a continuous improvement trend across AI modalities.
Mildenhall envisions a future in which AI models can recreate the experience of any real-world event. Users could engage with virtual environments multimodally, transforming them dynamically at will. Combined with parallel progress in reasoning models and virtual-reality hardware, this could give individuals access to boundless, interactive generative worlds. Instead of passively watching videos, users could directly explore and manipulate immersive environments that respond to their intentions.
Despite this promising outlook, current technological and economic constraints such as high computational costs and extended rendering durations imply such comprehensive virtual worlds remain distant prospects. Christoph Lassner, another World Labs co-founder, acknowledges the gap between current capabilities and this vision. Sitzmann concurs that while the concept is plausible, substantial barriers persist.
Li emphasizes that these technologies should augment human capabilities and preserve collaborative relationships between people and AI. She expresses confidence in humanity’s progressive trajectory, rejecting both dystopian and utopian extremes. Advocating responsibility, she stresses the shared duty to guide AI development toward beneficial outcomes, and the importance of pairing hope with action to ensure humanity prospers amid growing technological power.