0:47
Today I'm chatting with Richard Sutton, who is one of the founding fathers of
0:52
reinforcement learning and inventor of many of the main techniques used there,
0:55
like TD learning and policy gradient methods. For that, he received this year's Turing Award
1:00
which, if you don’t know, is the Nobel Prize for computer science. Richard, congratulations.
1:05
Thank you, Dwarkesh. Thanks for coming on the podcast.
1:08
It's my pleasure. First question. My audience and I are
1:12
familiar with the LLM way of thinking about AI. Conceptually, what are we missing in terms of
1:18
thinking about AI from the RL perspective? It's really quite a different point of view.
1:26
It can easily get separated and lose the ability to talk to each other.
1:32
Large language models have become such a big thing, generative AI in general a big thing.
1:38
Our field is subject to bandwagons and fashions, so we lose track of the basic things.
1:46
I consider reinforcement learning to be basic AI. What is intelligence?
1:52
The problem is to understand your world. Reinforcement learning is about understanding
1:58
your world, whereas large language models are about mimicking people,
2:02
doing what people say you should do. They're not about figuring out what to do.
2:08
You would think that to emulate the trillions of tokens in the corpus of Internet text,
2:14
you would have to build a world model. In fact, these models do seem to have
2:17
very robust world models. They're the best world models
2:21
we've made to date in AI, right? What do you think is missing?
2:26
I would disagree with most of the things you just said.
2:30
To mimic what people say is not really to build a model of the world at all.
2:36
You're mimicking things that have a model of the world: people.
2:40
I don't want to approach the question in an adversarial way, but I would question the
2:47
idea that they have a world model. A world model would enable you
2:51
to predict what would happen. They have the ability to predict
2:55
what a person would say. They don't have the
2:57
ability to predict what will happen. What we want, to quote Alan Turing, is a machine
3:04
that can learn from experience, where experience is the things that actually happen in your life.
3:09
You do things, you see what happens, and that's what you learn from.
3:16
The large language models learn from something else.
3:18
They learn from "here's a situation, and here's what a person did".
3:22
Implicitly, the suggestion is you should do what the person did.
3:26
I guess maybe the crux, and I'm curious if you disagree with this, is that some people
3:30
will say that imitation learning has given us a good prior, or given these models a good prior,
3:36
of reasonable ways to approach problems. As we move towards the era of experience, as
3:42
you call it, this prior is going to be the basis on which we teach these models from experience,
3:49
because this gives them the opportunity to get answers right some of the time.
3:54
Then on this, you can train them on experience. Do you agree with that perspective?
4:00
No. I agree that it's the large language model perspective.
4:04
I don't think it's a good perspective. To be a prior for something,
4:10
there has to be a real thing. A prior bit of knowledge should be
4:15
the basis for actual knowledge. What is actual knowledge? There's no definition of actual
4:20
knowledge in that large-language framework. What makes an action a good action to take?
4:29
You recognize the need for continual learning. If you need to learn continually,
4:34
continually means learning during the normal interaction with the world.
4:39
There must be some way during the normal interaction to tell what's right.
4:47
Is there any way to tell in the large language model setup what's the right thing to say?
4:54
You will say something and you will not get feedback about what the right thing to say is,
4:58
because there's no definition of what the right thing to say is. There's no
5:02
goal. If there's no goal, then there's one thing to say, another thing to say.
5:07
There's no right thing to say. There's no ground truth. You can't have prior knowledge
5:12
if you don't have ground truth, because the prior knowledge is supposed to be a hint or
5:17
an initial belief about what the truth is. There isn't any truth. There's no right thing to say.
5:24
In reinforcement learning, there is a right thing to say, a right thing to do, because the right
5:29
thing to do is the thing that gets you reward. We have a definition of what's the right thing
5:33
to do, so we can have prior knowledge or knowledge provided by people about
5:39
what the right thing to do is. Then we can check it to see,
5:43
because we have a definition of what the actual right thing to do is.
5:47
An even simpler case is when you're trying to make a model of the world.
5:50
When you predict what will happen, you predict and then you see what happens. There's ground
5:56
truth. There's no ground truth in large language models because you don't have
6:02
a prediction about what will happen next. If you say something in your conversation,
6:09
the large language models have no prediction about what the person will say in response
6:14
to that or what the response will be. I think they do. You can literally ask them,
6:19
"What would you anticipate a user might say in response?" They’ll have a prediction.
6:23
No, they will respond to that question, right. But they have no prediction in the substantive
6:29
sense: they won't be surprised by what happens.
6:32
If something happens that isn't what you might say they predicted, they will not
6:36
change because an unexpected thing has happened. To learn that, they'd have to make an adjustment.
6:43
I think a capability like this does exist in context.
6:49
It's interesting to watch a model do chain of thought.
6:53
Suppose it's trying to solve a math problem. It'll say, "Okay, I'm going to approach this
6:56
problem using this approach first." It'll write this out and be like,
7:00
"Oh wait, I just realized this is the wrong conceptual way to approach the problem.
7:03
I'm going to restart with another approach." That flexibility does exist in context, right?
7:10
Do you have something else in mind or do you just think that you need
7:12
to extend this capability across longer horizons? I'm just saying they don't have in any meaningful
7:20
sense a prediction of what will happen next. They will not be surprised by what happens next.
7:25
They'll not make any changes if something happens, based on what happens.
7:30
Isn't that literally what next token prediction is?
7:32
Prediction about what's next and then updating on the surprise?
7:35
The next token is what they should say, what the actions should be.
7:39
It's not what the world will give them in response to what they do.
7:42
Let's go back to their lack of a goal. For me, having a goal is
7:48
the essence of intelligence. Something is intelligent if it can achieve goals.
7:53
I like John McCarthy's definition that intelligence is the computational part
7:57
of the ability to achieve goals. You have to have goals or you're
8:03
just a behaving system. You're not anything special,
8:08
you're not intelligent. You agree that large language
8:11
models don't have goals? No, they have a goal.
8:14
What's the goal? Next token prediction.
8:17
That's not a goal. It doesn't change the world. Tokens come at you,
8:24
and if you predict them, you don't influence them. Oh yeah. It's not a goal about the external world.
8:31
It's not a goal. It's not a substantive goal. You can't look at a system and say it has a goal
8:38
if it's just sitting there predicting and being happy with itself that it's predicting accurately.
8:43
The bigger question I want to understand is why you don't think doing RL on
8:48
top of LLMs is a productive direction. We seem to be able to give these models
8:52
the goal of solving difficult math problems. They are in many ways at the very peak of
8:58
human-level capacity to solve math Olympiad-type problems. They got gold at the
9:04
IMO. So it seems like the model which got gold at the International Math Olympiad does
9:09
have the goal of getting math problems right. Why can't we extend this to different domains?
9:15
The math problems are different. Making a model of the physical world and carrying
9:22
out the consequences of mathematical assumptions or operations, those are very different things.
9:29
The empirical world has to be learned. You have to learn the consequences.
9:36
Whereas the math is more computational, it's more like standard planning.
9:44
There they can have a goal to find the proof, and they are in some way
9:54
given that goal to find the proof. It's interesting because you wrote
9:59
this essay in 2019 titled "The Bitter Lesson," and this is the most influential
10:04
essay, perhaps, in the history of AI. But people have used that as a justification for
10:13
scaling up LLMs because, in their view, this is the one scalable way we have found to pour ungodly
10:21
amounts of compute into learning about the world. It's interesting that your perspective is that
10:26
the LLMs are not "bitter lesson"-pilled. It's an interesting question whether large
10:32
language models are a case of the bitter lesson. They are clearly a way of using massive
10:42
computation, things that will scale with computation up to the limits of the Internet.
10:51
But they're also a way of putting in lots of human knowledge. This is an interesting question.
11:01
It's a sociological or industry question. Will they reach the limits of the data and
11:13
be superseded by things that can get more data just from experience rather than from people?
11:24
In some ways it's a classic case of the bitter lesson.
11:29
The more human knowledge we put into the large language models, the better they
11:32
can do. So it feels good. Yet, I expect there to be systems that can learn from experience.
11:44
Which could perform much better and be much more scalable.
11:49
In which case, it will be another instance of the bitter lesson, that the things that used human
11:56
knowledge were eventually superseded by things that just trained from experience and computation.
12:05
I guess that doesn't seem like the crux to me. I think those people would also agree that the
12:11
overwhelming amount of compute in the future will come from learning from experience.
12:17
They just think that the scaffold or the basis of that, the thing you'll start with in order to pour
12:22
in the compute to do this future experiential learning or on-the-job learning, will be LLMs.
12:31
I still don't understand why this is the wrong starting point altogether.
12:36
Why do we need a whole new architecture to begin doing experiential, continual learning?
12:43
Why can't we start with LLMs to do that? In every case of the bitter lesson you
12:48
could start with human knowledge and then do the scalable things. That's always the case. There's
12:56
never any reason why that has to be bad. But in fact, and in practice,
13:02
it has always turned out to be bad. People get locked into the human
13:07
knowledge approach, and they psychologically… Now I'm speculating why it is, but this is
13:13
what has always happened. They get their lunch eaten
13:20
by the methods that are truly scalable. Give me a sense of what the scalable method is.
13:24
The scalable method is you learn from experience. You try things, you see what works.
13:33
No one has to tell you. First of all, you have a goal.
13:37
Without a goal, there's no sense of right or wrong or better or worse.
13:41
Large language models are trying to get by without having a goal or a sense of better or worse.
13:48
That's just exactly starting in the wrong place. Maybe it's interesting to compare this to humans.
13:55
In both the case of learning from imitation versus experience and on the question of goals,
14:02
I think there's some interesting analogies. Kids will initially learn from imitation.
14:10
You don't think so? No, of course not.
14:14
Really? I think kids just watch people. They try to say the same words…
14:19
How old are these kids? What about the first six months?
14:24
I think they're imitating things. They're trying to make their mouth sound the way
14:28
they see their mother's mouth sound. Then they'll say the same words without
14:31
understanding what they mean. As they get older, the complexity
14:33
of the imitation they do increases. You're imitating maybe the skills that
14:41
people in your band are using to hunt down the deer or something.
14:44
Then you go into the learning from experience RL regime.
14:47
But I think there's a lot of imitation learning happening with humans.
14:51
It's surprising you can have such a different point of view.
14:55
When I see kids, I see kids just trying things and waving their
15:00
hands around and moving their eyes around. There's no imitation for how they move their
15:10
eyes around or even the sounds they make. They may want to create the same sounds,
15:14
but the actions, the thing that the infant actually does, there are no targets for that.
15:23
There are no examples for that. I agree. That doesn't explain everything infants
15:26
do, but I think it guides a learning process. Even an LLM, when it's trying to predict the next
15:31
token early in training, it will make a guess. It'll be different from what it actually sees.
15:36
In some sense, it's very short-horizon RL, where it's making this guess,
15:40
"I think this token will be this." It's this other thing, similar to how a kid
15:43
will try to say a word. It comes out wrong. The large language models are learning
15:47
from training data. It's not learning from experience. It's learning from something that
15:52
will never be available during its normal life. There's never any training data that says you
15:59
should do this action in normal life. I think this is more of a semantic
16:05
distinction. What do you call school? Is that not training data?
16:10
School is much later. Okay, I shouldn't have said never.
16:15
I don’t know, I think I would even say that about school.
16:17
But formal schooling is the exception. But there are phases of learning where
16:25
there's the programming in your biology early on, when you're not that useful.
16:29
Then the reason you exist is to understand the world and learn how to interact with it.
16:34
It seems like a training phase. I agree that then there's a more
16:39
gradual… There's not a sharp cutoff to training to deployment, but there
16:44
seems to be this initial training phase right? There's nothing where you have training of what
16:49
you should do. There's nothing. You see things that happen. You're not told what to do. Don't
16:59
be difficult. I mean this is obvious. You're literally taught what to do.
17:03
This is where the word training comes from, from humans.
17:07
I don't think learning is really about training. I think learning is about learning,
17:13
it's about an active process. The child tries things and sees what happens.
17:22
We don't think about training when we think of an infant growing up.
17:27
These things are actually rather well understood. If you look at how psychologists think about
17:32
learning, there's nothing like imitation. Maybe there are some extreme cases where humans
17:40
might do that or appear to do that, but there's no basic animal learning process called imitation.
17:46
There are basic animal learning processes for prediction and for trial-and-error control.
17:53
It's really interesting how sometimes the hardest things to see are the obvious ones.
17:58
It's obvious—if you look at animals and how they learn, and you look at psychology and our
18:04
theories of them—that supervised learning is not part of the way animals learn.
18:13
We don't have examples of desired behavior. What we have are examples of things that happen,
18:20
one thing that followed another. We have examples of,
18:24
"We did something and there were consequences." But there are no examples of supervised learning.
18:32
Supervised learning is not something that happens in nature.
18:38
Even if that were the case with school, we should forget about it because that's
18:42
some special thing that happens in people. It doesn't happen broadly in nature. Squirrels
18:48
don't go to school. Squirrels can learn all about the world.
18:51
It's absolutely obvious, I would say, that supervised learning doesn't happen in animals.
18:59
I interviewed this psychologist and anthropologist, Joseph Henrich,
19:05
who has done work about cultural evolution, basically what distinguishes humans and
19:12
how humans pick up knowledge. Why are you trying to distinguish
19:15
humans? Humans are animals. What we have in common is more interesting.
19:22
What distinguishes us, we should be paying less attention to.
19:26
We're trying to replicate intelligence. If you want to understand what it is that enables humans
19:31
to go to the moon or to build semiconductors, I think the thing we want to understand is what
19:37
makes that happen. No animal can go
19:38
to the moon or make semiconductors. We want to understand what makes humans special.
19:42
I like the way you consider that obvious, because I consider the opposite obvious.
19:50
We have to understand how we are animals. If we understood a squirrel, I think we'd
19:57
be almost all the way there to understanding human intelligence.
20:01
The language part is just a small veneer on the surface. This is great. We're finding out the
20:08
very different ways that we're thinking. We're not arguing. We're trying to share our different
20:15
ways of thinking with each other. I think argument is useful.
20:21
I do want to complete this thought. Joseph Henrich has this interesting
20:24
theory about a lot of the skills that humans have had to master in order to be successful.
20:33
We're not talking about the last thousand years or the last 10,000 years,
20:35
but hundreds of thousands of years. The world is really complicated. It's not possible to
20:42
reason through how to, let’s say, hunt a seal if you're living in the Arctic.
20:50
There's this many, many-step, long process of how to make the bait and how to find the seal,
20:57
and then how to process the food in a way that makes sure you won't get poisoned.
21:02
It's not possible to reason through all of that. Over time, there's this larger process of whatever
21:09
analogy you want to use—maybe RL, something else—where culture as a whole has figured out
21:14
how to find and kill and eat seals. In his view, what is happening when
21:23
this knowledge is transmitted through generations, is that you have to imitate
21:29
your elders in order to learn that skill. You can't think your way through how to
21:34
hunt and kill and process a seal. You have to watch other people,
21:38
maybe make tweaks and adjustments, and that's how knowledge accumulates.
21:43
The initial step of the cultural gain has to be imitation.
21:46
But maybe you think about it a different way? No, I think about it the same way.
21:50
Still, it's a small thing on top of basic trial-and-error learning, prediction learning.
21:58
It's what distinguishes us, perhaps, from many animals. But we're an animal
22:05
first. We were an animal before we had language and all those other things.
22:13
I do think you make a very interesting point that continual learning is a
22:17
capability that most mammals have. I guess all mammals have it.
22:22
It's quite interesting that we have something that all mammals have, but our AI systems don't have.
22:29
Whereas the ability to understand math and solve difficult math problems—depends on how
22:33
you define math—is a capability that our AIs have, but that almost no animal has.
22:40
It's quite interesting what ends up being difficult and what ends up being easy.
22:45
Moravec's paradox. That’s right, that’s right.
23:58
This alternative paradigm that you're imagining… The experiential paradigm. Let's
24:02
lay it out a little bit. It says that experience, action,
24:08
sensation—well, sensation, action, reward—this happens on and on and on for your life.
24:15
It says that this is the foundation and the focus of intelligence.
24:20
Intelligence is about taking that stream and altering the actions to
24:25
increase the rewards in the stream. Learning then is from the stream,
24:32
and learning is about the stream. That second part is particularly telling.
24:40
What you learn, your knowledge, is about the stream.
24:44
Your knowledge is about if you do some action, what will happen.
24:48
Or it's about which events will follow other events. It's about the stream. The content of
24:55
the knowledge is statements about the stream. Because it's a statement about the stream,
25:01
you can test it by comparing it to the stream, and you can learn it continually.
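A minimal sketch of that stream in code, assuming a hypothetical three-armed bandit as a stand-in for the world. The agent's knowledge (its action-value estimates) is learned from the stream and is about the stream, and its actions are adjusted to increase the reward in the stream:

```python
import random

# Hypothetical stand-in for the world: a 3-armed bandit with hidden reward means.
true_means = [0.2, 0.5, 0.8]

def world(action):
    # The stream: take an action, sense a noisy reward back from the world.
    return random.gauss(true_means[action], 1.0)

q = [0.0, 0.0, 0.0]   # learned estimates of each action's reward: knowledge about the stream
n = [0, 0, 0]
epsilon = 0.1          # occasionally try things just to see what works

for t in range(10_000):      # one long life: no separate training and deployment phases
    if random.random() < epsilon:
        a = random.randrange(3)
    else:
        a = max(range(3), key=lambda i: q[i])
    r = world(a)
    n[a] += 1
    q[a] += (r - q[a]) / n[a]   # incremental average: knowledge tested against the stream, continually

print([round(v, 2) for v in q])  # estimates approach the hidden means; actions shift toward higher reward
```

Nothing here is specific to bandits; the same loop, with richer sensations and learned models, is the experiential paradigm being described.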
25:06
When you're imagining this future continual learning agent…
25:10
They're not "future". Of course, they exist all the time.
25:13
This is what the reinforcement learning paradigm is, learning from experience.
25:17
Yeah, I guess what I meant to say is a general human-level,
25:20
general continual learning agent. What is the reward function? Is it just predicting the world?
25:26
Is it then having a specific effect on it? What would the general reward function be?
25:34
The reward function is arbitrary. If you're playing chess, it's to win the game of chess.
25:42
If you're a squirrel, maybe the reward has to do with getting nuts.
25:51
In general, for an animal, you would say the reward is to avoid pain and to acquire pleasure.
26:04
I think there also should be a component having to do with your
26:08
increasing understanding of your environment. That would be sort of an intrinsic motivation.
26:14
I see. With this AI, lots of people would want it to be doing lots of different kinds of things.
26:24
It's performing the task people want, but at the same time, it's learning
26:28
about the world from doing that task. Let’s say we get rid of this paradigm
26:35
where there's training periods and then there's deployment periods.
26:40
Do we also get rid of this paradigm where there's the model and then instances of the model or
26:46
copies of the model that are doing certain things? How do you think about the fact that we'd
26:53
want this thing to be doing different things? We'd want to aggregate the knowledge that it's
26:56
gaining from doing those different things. I don't like the word "model"
27:00
when used the way you just did. I think a better word would be "the network"
27:05
because I think you mean the network. Maybe there are many networks. Anyway, things would
27:11
be learned. You'd have copies and many instances. Sure, you'd want to share knowledge across the
27:20
instances. There would be
27:21
lots of possibilities for doing that. Today, you have one child grow up and
27:28
learn about the world, and then every new child has to repeat that process.
27:33
Whereas with AIs, with a digital intelligence, you could hope to do it once and then copy it
27:38
into the next one as a starting place. This would be a huge savings.
27:44
I think it'd be much more important than trying to learn from people.
27:49
I agree that the kind of thing you're talking about is necessary regardless
27:54
of whether you start from LLMs or not. If you want human or animal-level intelligence,
28:00
you're going to need this capability. Suppose a human is trying to make a startup.
28:05
This is a thing where the reward comes on a horizon of about 10 years.
28:08
Once in 10 years you might have an exit where you get paid out a billion dollars.
28:12
But humans have this ability to make intermediate auxiliary rewards, or have some way of… Even when
28:18
they have extremely sparse rewards, they can still mark intermediate steps, having an
28:23
understanding of how the next thing they're doing leads to this grander goal we have.
28:27
How do you imagine such a process might play out with AIs?
28:31
This is something we know very well. The basis of it is temporal difference
28:35
learning, where the same thing happens on a less grandiose scale.
28:41
When you learn to play chess, you have the long-term goal of winning the game.
28:46
Yet you want to be able to learn from shorter-term things like taking your opponent's pieces.
28:55
You do that by having a value function which predicts the long-term outcome.
28:59
Then if you take the guy's pieces, your prediction about the long-term outcome is changed.
29:05
It goes up, you think you're going to win. Then that increase in your belief immediately
29:11
reinforces the move that led to taking the piece. We have this long-term 10-year goal of making a
29:20
startup and making a lot of money. When we make progress, we say, "Oh,
29:24
I'm more likely to achieve the long-term goal," and that rewards the steps along the way.
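A rough sketch of that mechanism, temporal-difference learning, on a toy problem (a hypothetical five-state random walk rather than chess or a startup). The value function V predicts the long-term outcome, and each step's change in that prediction is an immediate learning signal, so you don't have to wait for the end of the game, or ten years, to learn:

```python
import random

# Hypothetical 5-state chain: states 0..4, start in the middle, move left/right at random.
# Reward +1 only on reaching the right end (state 4); states 0 and 4 are terminal.
N_STATES, LEFT_END, RIGHT_END = 5, 0, 4
alpha, gamma = 0.1, 1.0
V = [0.0] * N_STATES          # value function: predicted long-term outcome from each state

for episode in range(2000):
    s = 2
    while s not in (LEFT_END, RIGHT_END):
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == RIGHT_END else 0.0
        # TD error: how much the long-term prediction changed after this one step.
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
        # In an actor-critic agent, a positive td_error would also immediately
        # reinforce the move just made (the chess analogue of taking a piece).
        s = s_next

print([round(v, 2) for v in V])  # interior values approach 0.25, 0.5, 0.75: the chance of reaching the goal
```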
29:34
You also want some ability to retain the information that you're learning.
29:39
One of the things that makes humans quite different from these LLMs is that if you're
29:43
onboarding on a job, you're picking up so much context and information.
29:47
That's what makes you useful at the job. You're learning everything from how your
29:51
client has preferences to how the company works, everything.
29:56
Is the bandwidth of information that you get from a procedure like TD learning high
30:01
enough to have this huge pipe of context and tacit knowledge that
30:06
you need to be picking up in the way humans do when they're just deployed?
30:14
I’m not sure but I think at the crux of this, the big world hypothesis seems very relevant.
30:20
The reason why humans become useful on the job is because they are encountering
30:25
their particular part of the world. It can't have been anticipated and
30:31
can't all have been put in in advance. The world is so huge that you can't.
30:38
The dream of large language models, as I see it, is you can teach the agent everything.
30:45
It will know everything and won't have to learn anything online, during its life.
30:52
Your examples are all saying, "Well, really you do have to learn it along the way," because you can teach it the general things,
30:58
but there's all the little idiosyncrasies of the particular life they're leading and the
31:02
particular people they're working with and what they like, as opposed to what average people like.
31:08
That's just saying the world is really big, and you're going to have to learn it along the way.
31:14
It seems to me you need two things. One is some way of converting this long-run
31:19
goal reward into smaller auxiliary predictive rewards of the future reward, or the future
31:27
reward that leads to the final reward. But initially, it seems to me,
31:35
I need to hold on to all this context that I'm gaining as I'm working in the world.
31:42
I'm learning about my clients, my company, and all this information.
31:50
I would say you're just doing regular learning. Maybe you're using "context"
31:54
because in large language models all that information has to go into the context window.
31:58
But in a continual learning setup, it just goes into the weights.
32:02
Maybe context is the wrong word to use because I mean a more general thing.
32:06
You learn a policy that's specific to the environment that you're finding yourself in.
32:12
The question I'm trying to ask is, you need some way of getting…How many bits per second is a human
32:20
picking up when they're out in the world? If you're just interacting over Slack
32:25
with your clients and everything. Maybe you're trying to ask the question of,
32:28
it seems like the reward is too small of a thing to do all the learning that we need to do.
32:33
But we have the sensations, we have all the other information we can learn from.
32:41
We don't just learn from the reward. We learn from all the data.
32:45
What is the learning process which helps you capture that information?
32:52
Now I want to talk about the base common model of the agent with the four parts. We
32:59
need a policy. The policy says, "In the situation I'm in, what should I do?" We
33:04
need a value function. The value function is the thing that is learned with TD learning,
33:09
and the value function produces a number. The number says how well it's going.
33:13
Then you watch if that's going up and down and use that to adjust your policy.
33:19
So you have those two things. Then there's also the perception
33:24
component, which is construction of your state representation, your sense of where you are now.
33:30
The fourth one is what we're really getting at, most transparently anyway.
33:34
The fourth one is the transition model of the world.
33:38
That's why I am uncomfortable just calling everything "models," because I want to
33:41
talk about the model of the world, the transition model of the world.
33:45
Your belief that if you do this, what will happen? What will be the consequences of what you do?
33:50
Your physics of the world. But it's not just physics, it's also abstract models,
33:55
like your model of how you traveled from California up to Edmonton for this podcast.
34:00
That was a model, and that's a transition model. That would be learned. It's not
34:05
learned from reward. It's learned from, "You did things, you saw what happened,
34:08
you made that model of the world." That will be learned very richly
34:13
from all the sensation that you receive, not just from the reward.
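A structural sketch of those four parts, with hypothetical interfaces just to make the decomposition concrete (not a real library, and not Sutton's exact formulation):

```python
from typing import Protocol, Tuple, Any

class Perception(Protocol):
    def state(self, observation: Any, last_state: Any) -> Any:
        """Construct the state representation: your sense of where you are now."""
        ...

class Policy(Protocol):
    def action(self, state: Any) -> Any:
        """In the situation I'm in, what should I do?"""
        ...

class ValueFunction(Protocol):
    def value(self, state: Any) -> float:
        """A number saying how well it's going; learned with TD learning."""
        ...

class TransitionModel(Protocol):
    def predict(self, state: Any, action: Any) -> Tuple[Any, float]:
        """If I do this, what will happen: the next state and the reward."""
        ...
```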
34:17
It has to include the reward as well, but that's a small part of the whole
34:22
model, a small, crucial part of the whole model. One of my friends, Toby Ord, pointed out that if
34:27
you look at the MuZero models that Google DeepMind deployed to learn Atari games, these models were
34:36
initially not themselves a general intelligence, but a general framework for training specialized
34:42
intelligences to play specific games. That is to say that you couldn't,
34:46
using that framework, train a policy to play both chess and Go and some other game.
34:53
You had to train each one in a specialized way. He was wondering whether that implies
34:58
that with reinforcement learning generally, because of this information constraint,
35:03
you can only learn one thing at a time? The density of information isn't that high?
35:08
Or whether it was just specific to the way that MuZero was done.
35:11
If it's specific to AlphaZero, what needed to be changed about that approach so that
35:18
it could be a general learning agent? The idea is totally general. I do use
35:24
all the time, as my canonical example, the idea that an AI agent is like a person.
35:32
People, in some sense, have just one world they live in.
35:38
That world may involve chess and it may involve Atari games, but those are
35:43
not a different task or a different world. Those are different states they encounter.
35:47
So the general idea is not limited at all. Maybe it would be useful to explain what was
35:54
missing in that architecture, or that approach, which this continual learning AGI would have.
36:04
They just set it up that way. It was not their ambition to have one agent across those games.
36:13
If we want to talk about transfer, we should talk about transfer not across games or
36:18
across tasks, but transfer between states. I guess I’m curious if historically, have we
36:26
seen the level of transfer using RL techniques that would be needed to build this kind of…
36:35
Good. Good. We're not seeing transfer anywhere. Critical to good performance is that you can
36:42
generalize well from one state to another state. We don't have any methods that are good at that.
36:47
What we have are people trying different things and they settle on something, a representation
36:56
that transfers well or generalizes well. But we have very few automated techniques
37:05
to promote transfer, and none of them are used in modern deep learning.
37:11
Let me paraphrase to make sure that I understood that correctly.
37:17
It sounds like you're saying that when we do have generalization in these models,
37:22
that is a result of some sculpted… Humans did it. The researchers did it.
37:31
Because there's no other explanation. Gradient descent will not make you generalize well.
37:35
It will make you solve the problem. It will not make you, if you get
37:39
new data, generalize in a good way. Generalization means to train on one thing
37:45
that'll affect what you do on other things. We know deep learning is really bad at this.
37:50
For example, we know that if you train on some new thing, it will often catastrophically interfere
37:56
with all the old things that you knew. This is exactly bad generalization. Generalization,
38:02
as I said, is some kind of influence of training on one state on other states.
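A compact illustration of that failure mode, catastrophic interference: train one shared model on task A, then on task B with no revisiting of A, and the training on B wipes out what was learned about A. The tasks and model here are hypothetical toy choices (random linear regression problems), just to show the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n=200, d=20):
    # A random linear regression task with its own hidden target weights.
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    return X, X @ w_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, steps=500, lr=0.01):
    # Plain gradient descent on this task only: nothing asks it to preserve old knowledge.
    for _ in range(steps):
        w = w - lr * (2 / len(y)) * X.T @ (X @ w - y)
    return w

Xa, ya = make_task()
Xb, yb = make_task()

w = np.zeros(20)
w = train(w, Xa, ya)                      # learn task A
error_on_A_before = mse(w, Xa, ya)
w = train(w, Xb, yb)                      # then learn task B, never seeing A again
error_on_A_after = mse(w, Xa, ya)

print(f"error on A after learning A: {error_on_A_before:.4f}")
print(f"error on A after learning B: {error_on_A_after:.4f}")   # typically far worse: interference
```

Gradient descent here happily overwrites task A; nothing in the algorithm asks the training on B to influence A in a good way.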
38:11
The fact that you generalize is not necessarily good or bad.
38:13
You can generalize poorly, you can generalize well.
38:17
Generalization always will happen, but we need algorithms that will cause the
38:23
generalization to be good rather than bad. I'm not trying to kickstart this initial
38:30
crux again, but I'm just genuinely curious because I think I might be using the term differently.
38:35
One way to think about these LLMs is that they’re increasing the scope of
38:39
generalization from earlier systems, which could not really even do a basic math problem,
38:44
to now where they can do anything in this class of Math Olympiad-type problems.
38:50
You initially start with them being able to generalize among addition problems.
38:54
Then they can generalize among problems which require use of different kinds of mathematical
39:02
techniques and theorems and conceptual categories, which is what the Math Olympiad requires.
39:08
It sounds like you don't think of being able to solve any problem within that
39:12
category as an example of generalization. Let me know if I'm misunderstanding that.
39:18
Large language models are so complex. We don't really know what
39:23
information they have had prior. We have to guess because they've been fed so much.
39:30
This is one reason why they're not a good way to do science.
39:34
It's just so uncontrolled, so unknown. But if you come up with an entirely new…
39:39
They're getting a bunch of things right, perhaps. The question is why. Well, maybe they don't
39:46
need to generalize to get them right, because the only way to get some of them right is to
39:51
form something which gets all of them right. If there's only one answer and you find it,
39:58
that's not called generalization. It's just it's the only way to solve it,
40:02
and so they find the only way to solve it. But generalization is when it could be this way,
40:06
it could be that way, and they do it the good way.
40:08
My understanding is that this is working more and more, better and better, with coding agents.
40:15
With engineers, obviously if you're trying to program a library, there are many
40:21
different ways you could achieve the end spec. An initial frustration with these models has
40:25
been that they'll do it in a way that's sloppy. Over time they're getting better and better at
40:31
coming up with the design architecture and the abstractions that developers find more satisfying.
40:37
It seems like an example of what you're talking about.
40:41
There's nothing in them which will cause it to generalize well.
40:46
Gradient descent will cause them to find a solution to the problems they've seen.
40:52
If there's only one way to solve them, they'll do that.
40:55
But if there are many ways to solve it, some which generalize well, some which generalize poorly,
40:59
there's nothing in the algorithms that will cause them to generalize well.
41:03
But people, of course, are evolved and if it's not working out they fiddle with
41:08
it until they find a way, perhaps until they find a way which generalizes well.
42:17
I want to zoom out and ask about being in the field of AI for longer than almost anybody who
42:25
is commentating on it, or working in it now. I'm curious about what the
42:29
biggest surprises have been. How much new stuff do you feel like is coming out?
42:34
Or does it feel like people are just playing with old ideas?
42:39
Zooming out, you got into this even before deep learning was popular.
42:43
So how do you see the trajectory of this field over time and how new ideas have come about and
42:49
everything? What's been surprising? I thought a little bit about this.
42:57
There are a handful of things. First, the large language models are surprising.
43:03
It's surprising how effective artificial neural networks are at language tasks.
43:12
That was a surprise, it wasn't expected. Language seemed different. So that's impressive. There's a
43:19
long-standing controversy in AI about simple basic principle methods, the general-purpose
43:28
methods like search and learning, compared to human-enabled systems like symbolic methods.
43:41
In the old days, it was interesting because things like search and learning were called
43:44
weak methods because they're just using general principles, they're not using
43:48
the power that comes from imbuing a system with human knowledge. The human-knowledge systems were called strong methods. I think
43:56
the weak methods have just totally won. That's the biggest question from the
44:06
old days of AI, what would happen. Learning and search have just won the day.
44:13
There's a sense in which that was not surprising to me because I was always hoping or rooting
44:18
for the simple basic principles. Even with the large language models,
44:23
it's surprising how well it worked, but it was all good and gratifying.
44:30
AlphaGo was surprising, how well that was able to work, AlphaZero in particular.
44:40
But it's all very gratifying because again, simple basic principles are winning the day.
44:46
Whenever the public conception has been changed because some new application was
44:54
developed— for example, when AlphaZero became this viral sensation—to you as somebody who
44:59
has literally come up with many of the techniques that were used, did it feel
45:03
to you like new breakthroughs were made? Or did it feel like, "Oh, we've had these
45:08
techniques since the '90s and people are simply combining them and applying them now"?
45:14
The whole AlphaGo thing had a precursor, which is TD-Gammon.
45:18
Gerry Tesauro did reinforcement learning, temporal difference learning methods, to play backgammon.
45:28
It beat the world's best players and it worked really well.
45:33
In some sense, AlphaGo was merely a scaling up of that process.
45:38
But it was quite a bit of scaling up and there was also an additional innovation
45:43
in how the search was done. But it made sense. It wasn't surprising in that sense.
45:49
AlphaGo actually didn't use TD learning. It waited to see the final outcomes. But
45:56
AlphaZero used TD. AlphaZero was applied to all the other games and it did extremely well.
46:04
I've always been very impressed by the way AlphaZero plays chess because I'm a
46:09
chess player and it just sacrifices material for positional advantages.
46:15
It's just content and patient to sacrifice that material for a long period of time.
46:22
That was surprising that it worked so well, but also gratifying and it fit into my worldview.
46:31
This has led me where I am. I'm in some sense a contrarian or
46:36
someone thinking differently than the field is. I'm personally just content being out of sync
46:43
with my field for a long period of time, perhaps decades, because
46:47
occasionally I have been proved right in the past. The other thing I do—to help me not feel I'm out
46:56
of sync and thinking in a strange way—is to look not at my local environment or my local field,
47:04
but to look back in time and into history and to see what people have thought classically about
47:12
the mind in many different fields. I don't feel I'm out of sync with
47:15
the larger traditions. I really view myself as
47:18
a classicist rather than as a contrarian. I go to what the larger community of thinkers
47:26
about the mind have always thought. Some sort of left-field questions
47:30
for you if you'll tolerate them. The way I read the bitter lesson is
47:35
that it's not necessarily saying that human artisanal researcher tuning doesn't work,
47:42
but that it obviously scales much worse than compute, which is growing exponentially.
47:49
So you want techniques which leverage the latter. Yep.
47:52
Once we have AGI, we'll have researchers which scale linearly with compute.
47:59
We'll have this avalanche of millions of AI researchers.
48:02
Their stock will be growing as fast as compute. So maybe this will mean that it is rational
48:09
or it will make sense to have them doing good old-fashioned
48:13
AI and doing these artisanal solutions. As a vision of what happens after AGI in
48:21
terms of how AI research will evolve, I wonder if that's still compatible with a bitter lesson.
48:25
How did we get to this AGI? You want to presume that it's been done.
48:30
Suppose it started with general methods, but now we've got the AGI.
48:34
And now we want to go… Then we're done.
48:38
Interesting. You don't think that there's anything above AGI?
48:44
But you're using it to get AGI again. Well, I'm using it to get superhuman levels
48:48
of intelligence or competence at different tasks. These AGIs, if they're not superhuman already,
48:54
then the knowledge that they might impart would not be superhuman.
49:00
I guess there are different gradations. I'm not sure your idea makes sense because
49:05
it seems to presume the existence of AGI and that we've already worked that out.
49:12
Maybe one way to motivate this is, AlphaGo was superhuman. It beat any Go player. AlphaZero
49:18
would beat AlphaGo every single time. So there are ways to get more
49:22
superhuman than even superhuman. It was also a different architecture.
49:27
So it seems possible to me that the agent that's able to generally learn across all domains,
49:33
there would be ways to give it better architecture for learning, just the same way that AlphaZero was
49:38
an improvement upon AlphaGo and MuZero was an improvement upon AlphaZero.
49:41
And the way AlphaZero was an improvement was that it did not use human knowledge but just went from
49:48
experience. Right.
49:49
So why do you say, "Bring in other agents' expertise to teach it",
49:57
when it's worked so well from experience and not by help from another agent?
50:04
I agree that in that particular case that it was moving to more general methods.
50:10
I meant to use that particular example to illustrate that it's possible to go
50:12
superhuman to superhuman++, to superhuman+++. I'm curious if you think those gradations will
50:19
continue to happen by just making the method simpler.
50:22
Or, because we'll have the capability of these millions of minds who can then add complexity
50:27
as needed, will that continue to be a false path, even when you have billions of AI researchers or
50:34
trillions of AI researchers? It’s more interesting
50:37
just to think about that case. When you have many AIs, will they help each
50:44
other the way cultural evolution works in people? Maybe we should talk about that.
50:50
The bitter lesson, who cares about that? That's an empirical observation about a particular
50:55
period in history. 70 years in history, it doesn't necessarily have to apply to the next 70 years.
51:01
An interesting question is, you're an AI, you get some more computer power.
51:04
Should you use it to make yourself more computationally capable?
51:08
Or should you use it to spawn off a copy of yourself to go learn something interesting
51:13
on the other side of the planet or on some other topic and then report back to you?
51:18
I think that's a really interesting question that will only arise in
51:24
the age of digital intelligences. I'm not sure what the answer is.
51:29
More questions, will it be possible to really spawn it off, send it out, learn something new,
51:35
something perhaps very new, and then will it be able to be reincorporated into the original?
51:40
Or will it have changed so much that it can't really be done?
51:47
Is that possible or is that not? You could carry this to its limit as I saw
51:53
one of your videos the other night. It suggests that it could. You spawn off many, many copies,
51:58
do different things, highly decentralized, but report back to the central master.
52:05
This will be such a powerful thing. This is my attempt to add something to this view.
52:14
A big issue will become corruption. If you really could just get information
52:21
from anywhere and bring it into your central mind, you could become more and more powerful.
52:27
It's all digital and they all speak some internal digital language.
52:31
Maybe it'll be easy and possible. But it will not be as easy as you're
52:37
imagining because you can lose your mind this way. If you pull in something from the outside
52:43
and build it into your inner thinking, it could take over you, it could change you,
52:48
it could be your destruction rather than your increment in knowledge.
52:55
I think this will become a big concern, particularly when you're like, "Oh,
53:00
he's figured out all about how to play some new game or he's studied Indonesia,
53:04
and you want to incorporate that into your mind." You could think, "Oh, just read it all in,
53:12
and that'll be fine." But no, you've just read a whole
53:14
bunch of bits into your mind, and they could have viruses in them, they could have hidden goals,
53:23
they can warp you and change you. This will become a big thing.
53:27
How do you have cybersecurity in the age of digital spawning and re-forming again?
54:35
I guess this brings us to the topic of AI succession.
54:39
You have a perspective that's quite different from a lot of people that
54:42
I've interviewed and a lot of people generally. I also think it's a very interesting perspective.
54:47
I want to hear about it. I do think succession to digital
54:55
intelligence or augmented humans is inevitable. I have a four-part argument. Step one is,
55:05
there's no government or organization that gives humanity a unified point of
55:12
view that dominates and that can arrange... There's no consensus about how the world
55:18
should be run. Number two,
55:21
we will figure out how intelligence works. The researchers will figure it out eventually.
55:26
Number three, we won't stop just with human-level intelligence. We
55:29
will reach superintelligence. Number four, it's inevitable over time that the most intelligent
55:39
things around would gain resources and power. Put all that together and it's sort of inevitable.
55:50
You're going to have succession to AI or to AI-enabled, augmented humans.
55:59
Those four things seem clear and sure to happen. But within that set of possibilities,
56:07
there could be good outcomes as well as less good outcomes, bad outcomes.
56:14
I'm just trying to be realistic about where we are and ask how we should feel about it.
56:21
I agree with all four of those arguments and the implication.
56:25
I also agree that succession contains a wide variety of possible futures.
56:34
Curious to get more thoughts on that. I do encourage people to
56:37
think positively about it. First of all, it's something we humans have
56:42
always tried to do for thousands of years, try to understand ourselves, trying to make ourselves
56:47
think better, just understanding ourselves. This is a great success for science, humanities.
56:58
We're finding out what this essential part of humanness is, what it means to be intelligent.
57:06
Then what I usually say is that this is all human-centric.
57:10
But if we step aside from being a human and just take the point of view of the universe,
57:17
this is I think a major stage in the universe, a major transition, a transition from replicators.
57:24
We humans and animals, plants, we're all replicators.
57:28
That gives us some strengths and some limitations. We're entering the age of design
57:34
because our AIs are designed. Our physical objects are designed, our buildings
57:39
are designed, our technology is designed. We're designing AIs now, things that can
57:46
be intelligent themselves and that are themselves capable of design.
57:51
This is a key step in the world and in the universe.
57:57
It's the transition from the world in which most of the
57:59
interesting things that exist are replicated. Replicated means you can make copies of them,
58:07
but you don't really understand them. Right now we can make more intelligent beings,
58:11
more children, but we don't really understand how intelligence works.
58:15
Whereas we're reaching now to having designed intelligence,
58:20
intelligence that we do understand how it works. Therefore we can change it in different
58:25
ways and at different speeds than otherwise. In our future, they may not be replicated at all.
58:32
We may just design AIs, and those AIs will design other AIs, and
58:38
everything will be done by design and construction rather than by replication.
58:43
I mark this as one of the four great stages of the universe.
58:48
First there's dust, it ends with stars. Stars make planets. The planets can give rise to life.
58:55
Now we're giving rise to designed entities. I think we should be proud that we are giving
59:07
rise to this great transition in the universe. It's an interesting thing. Should we consider them
59:15
part of humanity or different from humanity? It's our choice. It's our choice whether we should say,
59:20
"Oh, they are our offspring and we should be proud of them and we should celebrate
59:24
their achievements."Or we could say, "Oh no, they're not us and we should be horrified."
59:29
It's interesting that it feels to me like a choice.
59:33
Yet it's such a strongly held thing that, how could it be a choice?
59:38
I like these sort of contradictory implications of thought.
59:42
It is interesting to consider if we are just designing another generation of humans.
59:48
Maybe design is the wrong word. But we know a future generation of humans is going
59:51
to come up. Forget about AI. We just know in the long run, humanity will be more capable and more
59:58
numerous, maybe more intelligent. How do we feel about that?
1:00:02
I do think there are potential worlds with future humans that we would be quite concerned about.
1:00:08
Are you thinking like, maybe we are like the Neanderthals that gave rise to Homo sapiens.
1:00:13
Maybe Homo sapiens will give rise to a new group of people.
1:00:17
Something like that. I'm basically taking the example you're giving.
1:00:20
Even if we consider them part of humanity, I don't think that necessarily means that we should feel
1:00:26
super comfortable. Kinship.
1:00:28
Like Nazis were humans, right? If we thought, "Oh, the future generation will be Nazis,
1:00:33
I think we'd be quite concerned about just handing off power to them."
1:00:37
So I agree that this is not super dissimilar to worrying about more capable future humans,
1:00:44
but I don't think that addresses a lot of the concerns people might have about this
1:00:49
level of power being attained this fast with entities we don't fully understand.
1:00:54
I think it's relevant to point out that for most of humanity,
1:01:00
they don't have much influence on what happens. Most of humanity doesn't influence who can control
1:01:11
the atom bombs or who controls the nation states. Even as a citizen, I often feel that we don't
1:01:21
control the nation states very much. They're out of control. A lot of it
1:01:25
has to do with just how you feel about change. If you think the current situation is really good,
1:01:32
then you're more likely to be suspicious of change and averse to change than if you think
1:01:40
it's imperfect. I think it's imperfect. In fact, I think it's pretty bad. So I’m
1:01:47
open to change. I think humanity has not had a super good track record.
1:01:54
Maybe it's the best thing that there has been, but it's far from perfect.
1:01:59
I guess there are different varieties of change. The Industrial Revolution was change,
1:02:06
the Bolshevik Revolution was also change. If you were around in Russia in the 1900s and
1:02:11
you were like, "Look, things aren't going well, the tsar is kind of messing things up, we need
1:02:16
change", I'd want to know what kind of change you wanted before signing on the dotted line.
1:02:23
Similarly with AI, where I'd want to understand, and, to the extent that it's
1:02:27
possible, change the trajectory of AI such that the change is positive for humans.
1:02:35
We should be concerned about our future, the future.
1:02:39
We should try to make it good. We should also though recognize
1:02:45
the limit, our limits. I think we want to avoid
1:02:51
the feeling of entitlement, avoid the feeling of, "Oh, we are here first,
1:02:55
we should always have it in a good way." How should we think about the future?
1:03:01
How much control should a particular species on a particular planet have over it?
1:03:08
How much control do we have? A counterbalance to our limited control
1:03:12
over the long-term future of humanity should be how much control do we have over our own lives.
1:03:21
We have our own goals. We have our families. Those things are much more controllable than
1:03:28
trying to control the whole universe. I think it's appropriate for us to
1:03:39
really work towards our own local goals. It's kind of aggressive for us to say, "Oh, the
1:03:47
future has to evolve this way that I want it to." Because then we'll have arguments where different
1:03:52
people think the global future should evolve in different ways, and then they
1:03:56
have conflict. We want to avoid that. Maybe a good analogy here would be this.
1:04:03
Suppose you are raising your own children. It might not be appropriate to have extremely
1:04:09
tight goals for their own life, or also have some sense of like, "I want my children to go out
1:04:14
there in the world and have this specific impact. My son's going to become president and my daughter
1:04:19
is going to become CEO of Intel. Together they're going to have
1:04:21
this effect on the world." But people do have the sense—and
1:04:26
I think this is appropriate—of saying, "I'm going to give them good robust values such
1:04:32
that if and when they do end up in positions of power, they do reasonable, prosocial things."
1:04:39
Maybe a similar attitude towards AI makes sense, not in the sense of we can predict everything that
1:04:44
they will do, or we have this plan about what the world should look like in a hundred years.
1:04:50
But it's quite important to give them robust and steerable and prosocial values.
1:04:58
Prosocial values? Maybe that's the wrong word.
1:05:02
Are there universal values that we can all agree on?
1:05:06
I don't think so, but that doesn't prevent us from giving our kids a good education, right?
1:05:12
Like we have some sense of wanting our children to be a certain way.
1:05:15
Maybe prosocial is the wrong word. High integrity is maybe a better word.
1:05:18
If there's a request or if there's a goal that seems harmful, they will refuse to engage in it.
1:05:25
Or they'll be honest, things like that. We have some sense that we can teach our
1:05:32
children things like this, even if we don't have some sense of what true morality is,
1:05:36
where everybody doesn't agree on that. Maybe that's a reasonable target for AI as well.
1:05:41
So we're trying to design the future and the principles by
1:05:47
which it will evolve and come into being. The first thing you're saying is, "Well,
1:05:51
we try to teach our children general principles which will promote more likely evolutions."
1:06:01
Maybe we should also seek for things to be voluntary.
1:06:04
If there is change, we want it to be voluntary rather than imposed on people.
1:06:09
I think that's a very important point. That's all good. I think this is the big one, or one of
1:06:19
the really big human enterprises: designing society. It's been ongoing for thousands of years, again.
1:06:28
The more things change, the more they stay the same.
1:06:31
We still have to figure out how to be. The children will still come up with different
1:06:36
values that seem strange to their parents and their grandparents. Things will evolve.
1:06:43
"The more things change, the more they stay the same" also seems like
1:06:46
a good capsule into the AI discussion. The AI discussion we were having was
1:06:49
about how techniques, which were invented even before their application to deep
1:06:56
learning and backpropagation was evident, are central to the progression of AI today.
1:07:01
Maybe that's a good place to wrap up the conversation.
1:07:05
Okay. Thank you very much. Awesome. Thank you for coming on.
1:07:07
My pleasure.