A Culinary Introduction to AI's Rumored Next Frontier
World Models. The terminology alone makes them sound quite impressive. How on earth can one model our world? It seems like quite the invention, and many big names in AI are spreading that message as well. Yann LeCun, Fei-Fei Li, Yoshua Bengio, Demis Hassabis, and Dutch pride Pim de Witte: all telling us that world models are our ticket, or at least the next station, to creating human-like intelligence. While I might be a bit more critical – likely because, unlike the names above, I have no large financial interest in them – I'm still excited about the technology. Their role as a key component in self-driving cars, video(game) generation, and your future robotic assistant should make you care too. So what are they, how do they work, and, perhaps most importantly, why is the narrative that they are the next major step in AI becoming so popular? This write-up will answer all of those questions, and in doing so give you a first-principles, conceptual understanding of world models, so they don't become just another buzzword in this rat's nest of AI news. Speaking of rats, they are actually where world models find their foundation.

The Origin Story
World models are a concept hailing from psychology, with Edward Tolman as their pioneer. Tolman believed animals - and let's not forget, we humans are animals too - form internal mental representations of their environment. In other words: they build mental maps of the world around them, much like the one you rely on to navigate from your bedroom to the bathroom in the dark. This ran contrary to the then-popular behaviourist views of B.F. Skinner, whose research held that animals simply learn to react to operant conditioning - "turn left = food, turn right = shock" - and who rejected the idea that they learn any representation of the world around them.
Given how dominant this status quo was, Tolman needed to test his hypothesis to get his ideas any real scientific attention. So, like any good experimentalist, he got some lab rats together. He split them into three groups and put them inside a maze. These groups had but one key difference: when (or if) they received food. The first group always had food at the end of the maze. Another, unfortunately for them, was never allowed to find any. The third, and the one Tolman was most interested in, was the group that at first wasn't allowed to find food but, starting on day 11, was.
The first two groups showed behaviour in accordance with Skinner's views. The well-fed group 1 quickly learned to traverse the maze after a couple of days, and the food-deprived group 2 never did. However, the third group learned to navigate the maze quickly from the day they got food; quicker than the group that had always had food! They'd been learning the maze structure all along; they just had no reason to show it until the food arrived. Tolman's hypothesis seemed to be confirmed: the rats had made some model of the maze just by traversing it, without any rewards or reinforcement. Behaviourism could explain the performance differences, but not the learning that occurred without reinforcement. These findings have stood the test of time. Partly thanks to evidence from modern neuroscience, it has become widely accepted that animals form internal representations of the world to some degree. These cognitive maps, also referred to as world models, are now an important premise of psychology.
Translating Psychology to AI
Enough of this old-fashioned psychology research: how does this all relate to AI? World models belong to the sub-field of AI called Reinforcement Learning (RL), or more specifically, model-based RL. This space deals with creating agents that interact with their environment to build an internal model of how it works. Given an accurate model of the world, an agent can use it to predict outcomes and plan ahead from any given state. To make this a bit more concrete, let's put ourselves in a modern-day Tolman's shoes and build a robo-rat. This one, however, will not be made for traversing a maze. It has an actual use: it can cook for you (preferably without pulling your hair). Just like a human, we will have it learn how to cook by, well… cooking. While it's cooking, it jots down what it does, and what it observes as a consequence.
- "I have a pan with hot water. I put salt in it. The water in the pan starts boiling."
- "I have a pan with boiling water. I turn the fire down. The water in the pan stops boiling."
Ta-da, there we have it: our cooking world model. Simple as it is, it does, to some degree, represent the dynamics of our environment, or world, and thus qualifies as one. It is in fact quite cool that this little world model of just two explicit rules already contains some knowledge of how getting water to boil works. Each of the two observations teaches our robo-rat what happens next. This type of prediction is what makes a world model useful. Given 'hot water + salt', our robo-rat can now predict 'boiling water' without having to actually add the salt and wait. But of course, two rules won't allow our robo-rat to cook anything. So let's say we have some more, enough to capture how making a tomato soup works. Our rat could then start to exploit its world model, because it can combine rules to reach situations it has not encountered, but can now imagine in a combinatorial way. For example, it knows that "chopping onions + heating them = caramelized onions" and "adding broth + simmering = soup base." Even if it has never made pumpkin soup before, it can combine these rules: chop the pumpkin, add broth, simmer, and predict the outcome without having to actually try it first. It reuses the exact rules from making tomato soup to make a different kind of soup.
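This explicit, rule-based world model can be sketched in a few lines of code. Everything below is illustrative: the class, method names, and kitchen states are my own inventions, not from any real system.

```python
# A minimal sketch of the robo-rat's explicit, rule-based world model:
# a lookup table mapping (state, action) -> predicted next state.

class RuleWorldModel:
    def __init__(self):
        self.rules = {}  # (state, action) -> next state

    def observe(self, state, action, next_state):
        """Jot down an experienced transition as an explicit rule."""
        self.rules[(state, action)] = next_state

    def predict(self, state, action):
        """Predict an outcome without actually acting in the world."""
        return self.rules.get((state, action))

    def imagine(self, state, actions):
        """Chain rules to imagine a multi-step plan never tried before."""
        for action in actions:
            state = self.predict(state, action)
            if state is None:
                return None  # no rule covers this situation
        return state

model = RuleWorldModel()
model.observe("pan with hot water", "add salt", "pan with boiling water")
model.observe("pan with boiling water", "turn fire down", "pan with hot water")

# Predict 'boiling water' without touching the salt shaker:
print(model.predict("pan with hot water", "add salt"))
# Combine both rules into an imagined two-step future never executed:
print(model.imagine("pan with hot water", ["add salt", "turn fire down"]))
```

The `imagine` method is the combinatorial reuse from the soup example: each step only needs some rule to cover it, so rules learned while making tomato soup transfer directly to pumpkin soup.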

But let's get real. Cooking is an art form that involves quite a bit more than making a soup. There is an endless number of ingredients with different properties and interactions, and all kinds of pots, pans, and tools one has to use. Not to speak of the various techniques like sautéing, braising, and reducing. Quite some dynamics to account for...
From Recipe Books to Real Understanding
To capture all the dynamics of cooking with explicit rules? We'd need billions of them. The computational cost would bankrupt Scrooge McDuck, let alone OpenAI. And heaven forbid the kitchen layout changes or we introduce a new menu: a lot of our collected rules become useless. Clearly, our simple rules aren't the proper way of creating our little chef. We need a more generalizable format, where rules represent abstract concepts. Instead of storing 'salt goes in the blue pot at minute 3' and 'pepper goes in the red pan at minute 5', we should try to have our robo-rat learn a notion of 'season after building your base flavors'. It could better reuse this type of abstract learned experience across other types of recipes than the more explicit rules about salt and pepper. This is a transition which, as you probably guessed, happened naturally in the field of world models as it modernized. The space moved from memorizing specific procedures to trying to get the agent to 'understand' them and how they transfer to new situations. In my opinion, this is much more akin to how a human learns.
From a technical standpoint, this means that instead of writing down states and actions in memory, we now encode states as rows of numbers, or codes. These might not convey the same meaning as human-readable text, but the idea is that similar situations end up with similar codes, giving us semantics in a different way. Perhaps more importantly, it gives us a numerical representation which can more easily be adjusted to fit the task at hand. If you want to read more on this, the term to search for is latent variables.
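To make "similar situations end up with similar codes" concrete, here is a toy sketch. Real world models learn the encoding with neural networks; this one fakes it with a hand-made bag-of-words embedding, and the vocabulary and function names are purely my own illustration.

```python
# Toy illustration of the latent-variable idea: states become vectors,
# and similar situations land close together in that vector space.
import math

VOCAB = ["pan", "water", "hot", "boiling", "onion", "chop", "knife"]

def encode(state):
    """Map a human-readable state to a row of numbers (a 'latent code')."""
    words = state.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def similarity(a, b):
    """Cosine similarity: closer to 1.0 means more similar codes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

hot = encode("pan with hot water")
boiling = encode("pan with boiling water")
chopping = encode("chop onion with knife")

# Two water-heating situations are far more similar to each other
# than either is to chopping onions:
print(similarity(hot, boiling) > similarity(hot, chopping))  # True
```

A learned encoder does the same job without a hand-picked vocabulary, and its codes can be nudged by gradient descent to fit the task at hand.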
The introduction of latent variables was a major breakthrough in model-based RL. One concrete example of it being put to use is a model called MuZero. It mastered Chess, Go, and Atari without anyone teaching it the rules. It just played. Millions and millions of games. Through all that experience, it learned how these game worlds actually work, better than most humans. Quite impressive, to say the least. It's important to note that MuZero was limited to deterministic games, which are easy to simulate, so gathering 'training experience' was not troublesome. Even so, it provides the conceptual framework we still work from in state-of-the-art world models: they still aim to encode the world abstractly, learn its dynamics through interaction, and use that understanding to simulate possible futures.
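That three-part framework - encode abstractly, learn dynamics, simulate futures - can be written out as a skeleton. The function bodies below are stand-in stubs (in MuZero each of the three is a learned neural network), so treat this as a shape of the idea rather than the real thing.

```python
# MuZero-style skeleton: three functions stitched together to plan
# entirely in latent space, without ever consulting the game rules.

def representation(observation):
    """h: encode a raw observation into an abstract latent state."""
    return (observation,)  # stub: a real system outputs a learned vector

def dynamics(latent_state, action):
    """g: predict the next latent state and the immediate reward."""
    return latent_state + (action,), 0.0  # stub transition, zero reward

def prediction(latent_state):
    """f: from a latent state, predict a policy and a value estimate."""
    return {"left": 0.5, "right": 0.5}, 0.0  # stub policy and value

def imagine_rollout(observation, actions):
    """Score one possible future simulated entirely inside the model."""
    state = representation(observation)
    total_reward = 0.0
    for action in actions:
        state, reward = dynamics(state, action)
        total_reward += reward
    _, value = prediction(state)
    return total_reward + value
```

A planner (MuZero uses Monte Carlo tree search) calls `imagine_rollout` on many candidate action sequences and picks the most promising one, which is exactly "using the model to simulate possible futures."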
The Reality of World Models Today
Unlike the world models I'm describing, I can't predict the future. So let me stick to describing the current state of affairs. Unfortunately, there are no robot versions of Remy walking around as of yet. As of now, world models are most widely used in Autonomous Driving (AD). This has proved to be a great fit for a multitude of reasons. Driving is a safety-critical, long-horizon task: before committing to an action, a system must reason about what would happen if it steered, braked, or accelerated differently. World models enable this by predicting the consequences of actions over time, allowing planning without having to physically execute every possibility. Moreover, data collection in AD is expensive, because one has to have cars driven around, filming trajectories along with the actions taken by the driver. Not only that: given the different types of roads, weather conditions, traffic conditions, and everything else one needs to account for, there are just a ton of situations data would have to be collected for in a supervised learning setting. Being able to transfer concepts learned in one situation to another is thus very valuable. World models are used here, with relatively few samples, to predict what will happen to the car when it performs a certain action in a certain position. Doing this repeatedly over time is what enables planning for autonomous driving.
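The "plan without physically executing every possibility" loop can be sketched as follows. The one-dimensional dynamics here are a toy stand-in for a learned driving model, and every name and number is my own illustrative choice.

```python
# Sketch of model-based planning for driving: roll each candidate
# maneuver forward through the (toy) world model, and commit only to
# a plan whose imagined future stays short of the obstacle.

def world_model(position, speed, action):
    """Toy learned dynamics: predict the next (position, speed)."""
    if action == "brake":
        speed = max(0.0, speed - 2.0)
    elif action == "accelerate":
        speed += 1.0
    return position + speed, speed  # "coast" leaves speed unchanged

def imagine(position, speed, plan):
    """Roll a candidate action sequence forward without executing it."""
    for action in plan:
        position, speed = world_model(position, speed, action)
    return position

def choose_plan(position, speed, obstacle_at, candidate_plans):
    """Among plans that stop short of the obstacle, pick the furthest."""
    safe = [p for p in candidate_plans
            if imagine(position, speed, p) < obstacle_at]
    return max(safe, key=lambda p: imagine(position, speed, p), default=None)

plans = [["accelerate"] * 3, ["coast"] * 3, ["brake"] * 3]
print(choose_plan(position=0.0, speed=5.0, obstacle_at=12.0,
                  candidate_plans=plans))  # ['brake', 'brake', 'brake']
```

Only the braking plan's imagined trajectory stays short of the obstacle, so it is the one committed to; nothing dangerous ever had to be tried on a real road.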
Robotics is another space worth mentioning. Although fewer world models are deployed here in practice, research on them is very prevalent. V-JEPA 2, trained on over a million hours of video, learns a world model of physical dynamics to enable robotic planning without explicit programming. It is used to have a robot learn notions like "if I push this type of object, it moves this way". I relate to this way of learning very much, given the amount of time I spend watching YouTube.

One final approach to implementing world models I'll mention comes from the company World Labs. With Marble, they have built a way to generate, from 2D images or textual prompts, a navigable 3D world. This type of spatial modeling currently lacks interactivity, but if I were able to predict the future - and I am not - I would say that adding it is the clear next step toward creating intelligent agents. If we can generate a 3D world and then also make it interactive, we will have focused and unlimited data to, for instance, train a robot in simulation.
Clearly, there is enough to be excited about. To me, world models provide a technology more akin to the way humans develop a skill than some of the more popular supervised approaches. Instead of being shown an immense amount of labeled examples of "correct" behavior, they learn by doing and learn about cause and effect in the process. That said, honesty demands acknowledging their current limitations. World models remain fundamentally predictive systems, constrained by their training experiences and prone to the same statistical pattern matching that limits other AI approaches. They're not yet as practically useful as model-free RL or LLMs, and significant challenges remain - among others, finding exploitation strategies that avoid collapse, and bridging the gap between simulated success and real-world embodied systems. But these limitations don't diminish their potential for more human-like learning, better sample efficiency, and stronger planning capabilities compared to other approaches; if anything, they make the field more interesting to follow. Hopefully I've given you enough of a foundation to follow the developments you'll inevitably come across, and to evaluate them yourself, with excitement and a healthy dose of skepticism.