
What LLMs have in common with multi-armed bandits, and why it matters

As a machine learning researcher, I’ve studied reinforcement learning and the multi-armed bandit problem over the course of my career. With support from Seek, my team just had a poster accepted at NeurIPS 2023 on the problem of planning over a series of related tasks in a meta-reinforcement learning setting.

Theoretical problems like these may appear to exist in a vacuum (and they are certainly dense to read about!). But they can help us understand challenges and opportunities in the world of applied ML. In this article, we’ll explore how multi-armed bandits can be a useful lens for thinking about how we use LLMs today—and in the future.

What are multi-armed bandits?

Imagine a gambler in a room of slot machines (nicknamed “one-armed bandits”), where some of the machines might give better rewards than others; the gambler has no way of knowing without trying each one. How much should the gambler explore new slot machines vs. stick with ones that seem to be yielding high rewards?

In a nutshell, this is the multi-armed bandit problem. You need to explore to learn by trial and error, while also focusing on exploiting rewards once you have enough information. You’ll see this explore/exploit tradeoff problem everywhere there are decisions to be made and uncertainty, e.g., in commercial applications like recommendation systems.
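To make the explore/exploit tradeoff concrete, here is a minimal sketch of the classic epsilon-greedy strategy. The machines, their hidden payout probabilities, and the `epsilon` value are all made up for illustration: with probability epsilon the gambler explores a random machine, and otherwise exploits the machine that looks best so far.

```python
import random

def pull(payout_prob):
    """Simulate one pull: reward 1 with the machine's hidden payout probability."""
    return 1 if random.random() < payout_prob else 0

def epsilon_greedy(payout_probs, epsilon=0.1, n_pulls=10_000):
    n_machines = len(payout_probs)
    counts = [0] * n_machines          # pulls per machine so far
    estimates = [0.0] * n_machines     # running average reward per machine
    total_reward = 0
    for _ in range(n_pulls):
        if random.random() < epsilon:
            # Explore: try a machine at random.
            arm = random.randrange(n_machines)
        else:
            # Exploit: pick the machine with the best estimate so far.
            arm = max(range(n_machines), key=lambda i: estimates[i])
        reward = pull(payout_probs[arm])
        counts[arm] += 1
        # Incremental update of the running average.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# Three hypothetical machines with hidden payout probabilities 0.2, 0.5, 0.8.
estimates, total = epsilon_greedy([0.2, 0.5, 0.8])
```

After enough pulls, the estimates converge toward the true payout probabilities, and most pulls go to the best machine while a small fraction keeps exploring the others.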

Going deeper than simple explore/exploit, the multi-armed bandit problem is also about learning to take the path that brings better outcomes in the long term, not just immediately. This is the domain of reinforcement learning. DeepMind famously used reinforcement learning techniques to create a program that could beat human experts at the complex game of Go. In Go, not every single move brings an immediate reward; rather, it’s your ability to plan a long sequence of decisions in advance that’s important.

Non-stationary bandits and lifelong reinforcement learning

Imagine a roomful of slot machines that can break down, or give different rewards as time passes. We call this problem non-stationary bandits. In this scenario, you need to learn when your strategy isn’t working anymore because the world has shifted, and you need to adapt.
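One standard way to adapt in this setting is to weight recent rewards more heavily than old ones. The sketch below is illustrative, with made-up payout probabilities and step size: a constant-step-size update keeps reacting when the machine’s behavior shifts, while a plain lifetime average becomes slower and slower to change.

```python
import random

def track(rewards, step_size=0.1):
    """Exponential recency-weighted average: newer rewards count more than older ones."""
    estimate = 0.0
    for r in rewards:
        estimate += step_size * (r - estimate)
    return estimate

# A hypothetical machine pays ~0.9 for a while, then "breaks" and pays ~0.1.
before = [1 if random.random() < 0.9 else 0 for _ in range(500)]
after = [1 if random.random() < 0.1 else 0 for _ in range(500)]
rewards = before + after

recency_weighted = track(rewards)            # tracks the shift, ends near 0.1
plain_average = sum(rewards) / len(rewards)  # stuck near the lifetime mean, ~0.5
```

The recency-weighted estimate effectively forgets the machine’s old behavior, which is exactly what you want when the world has shifted under you.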

Adaptation is something humans are pretty good at: We notice when there is a discrepancy between our expectations of the world and our present experience, and we can translate our prior knowledge to new problems. Machines, however, are not as good at this. They’re built to solve one task, and usually struggle to extrapolate what they’ve learned to new scenarios.

To try to model this human behavior in a machine context, we study the problem of lifelong reinforcement learning: How do you continuously learn and update your beliefs when the world is changing, and the only source of information is trial and error (in other words, explore/exploit)?

We can look at how humans learn for some clues. For instance, humans are continually learning high-level skills that we can reuse, while not forgetting what we’ve learned in the past. We are also careful not to overfit: We spend time gathering feedback about the current state of the world, but at some point, we put exploration aside and focus on solving the task at hand.

The implications for LLMs

What does reinforcement learning, and lifelong reinforcement learning in particular, have to do with LLMs? Well, if we look at how LLMs work today, what we see is that OpenAI and others have essentially taken everything that exists on the internet (simplifying a bit here!) and trained a language model, and then they repeat this process periodically, e.g., every year. Meanwhile, millions of people are generating content online using LLMs like ChatGPT.  

What this means is that LLMs one year from now will be trained on an environment that their predecessors’ outputs helped shape, much like an agent that learns by trying out choices and seeing how the environment changes as a result of its decisions. It’s similar to reinforcement learning, except on a very long timescale! But unlike making a move in the game of Go and waiting for the adversary to respond, it’s as if LLMs have given us a tool that the world is using to modify the game itself for the next language model. This resembles the problem of non-stationary bandits and lifelong reinforcement learning.

Non-stationarity exists in multiple ways. In addition to natural changes from the external world, one year from now, the internet will contain a vast amount of data that was generated by the LLM itself. Millions of people are using tools like ChatGPT, and we’re only at the beginning of a cycle where these models will be increasingly trained on their own data. It’s important to think about the long-term impact of these aspects of machine training, and consider questions like:

  • Should we include machine-generated data in future training? If so, how can we learn to recognize this data?
  • In applications such as code generation, how will the code quality or accuracy be impacted by retraining on machine-generated code?


Even though we are applying LLMs to solve countless practical problems today, there’s much about them that we don’t understand, especially when we extrapolate years into the future. The multi-armed bandit problem explores how we can learn in the long term through trial and error, and how these choices impact our environment and thus our future learning. Reinforcement learning, and particularly lifelong reinforcement learning, can be a useful lens for thinking about LLMs, even if it’s not a common one (yet).

And with LLMs, it’s particularly interesting to consider the many problems that arise when you start learning from your own data. This is an area where reinforcement learning has just started to scratch the surface, but these questions are only going to become more urgent and important as we use LLMs more.

One of the main reasons we study machine learning theory is to inform machine learning practice. But these theoretical problems are fascinating on their own, too—if you’re interested in a math-heavy read on reinforcement learning, check out our poster at NeurIPS!  
