My name is Mike Wooldridge. I'm a professor of computer science and currently head of the Department of Computer Science at the University of Oxford, and I would like to welcome you all to this term's Strachey Lecture. The Strachey Lectures are the distinguished lectures in computer science that the Department of Computer Science offers. We do not usually host our Strachey Lectures in the Sheldonian Theatre; the fact that we are able to do so today is because of the generous support of Oxford Asset Management. We are very, very grateful to Oxford Asset Management, who literally have made this event possible on a scale and of a type that would not have been possible otherwise. So we thank them very much for that and for their continuing, ongoing support.

Let me say a few words about today's speaker. It's an enormous pleasure to be able to welcome Demis Hassabis of Google DeepMind to be this term's Strachey Lecturer. Demis gained his undergraduate degree from Cambridge in computer science. He then went on and was for some time a successful computer games programmer, designing a number of games which went on to achieve a degree of success in the marketplace. He did a PhD at UCL in cognitive neuroscience, and then in 2009 he was a co-founder of a company called DeepMind. All through that time, I think it's probably fair to say, he didn't hit the front pages of any newspapers. All of that changed in 2014, when DeepMind were acquired by Google for the very un-British sum of, I believe, 400 million, which is not a figure that comes up that much in any part of British industry. That got him and DeepMind onto the front pages of the international press, and computer science professionals like myself were all agog to see what this company — which, I think it's probably fair to say, had been operating in stealth mode or something like it for a number of years — had to offer. Well, we didn't have to wait very long. Very soon after that, the first results became public from DeepMind, and it became clear what Google were interested in. I'm not going to spoil the show by telling you about those results now, but they were some very impressive results to do with learning to play video games. And being on the front pages of the national press just once or twice was not enough for them: they got onto the front pages of the international press last year with some incredible achievements in the area of computer programs playing the game of Go.
And we all now have the opportunity to hear Demis speak at a really special time for DeepMind, because, as Demis is about to tell us, they are heading towards a competition to play Go with some of the world's leading players. So we're going to get an insight into a company that's doing remarkable things, at one of the most remarkable points in its trajectory. So it is with very great pleasure that I introduce you and welcome you to give this term's Strachey Lecture. Over to you. Thank you.

Good evening, ladies and gentlemen. Welcome to the Sheldonian Theatre. Before the lecture begins, would you ensure your mobile phones are switched off. I would also remind you that unauthorised photography and recording are prohibited. In the interests of safety, would you ensure emergency exits, walkways and window ledges are kept clear of personal belongings. Guests in the upper gallery are asked to leave using the stairs at the sides, and not to use the steps. Thank you.

Okay. Well, thanks to the voice from the sky. And thanks, Mike, for that very generous introduction. It's a huge honour and a real pleasure to be giving this lecture, and to have been invited to give it in these auspicious surroundings. What I'm going to try and do today in my talk is give you a whirlwind tour of what's happening at the cutting edge of artificial intelligence, and end with some of the latest breakthroughs we've been making at DeepMind. Then I'll talk a little bit about the bigger picture of artificial intelligence and where I think it's heading in the future, and then we can go into the Q&A.

So, artificial intelligence. AI is basically the science of making machines smart. Now, DeepMind: we founded it in 2010 and, as Mike mentioned, we joined Google in 2014 to accelerate our mission. The way we think about DeepMind is as a sort of Apollo programme effort for AI. We have about 200 research scientists and engineers now, so I think it's probably one of the biggest collections of talent anywhere in the world focusing on this topic. And not only is this a very ambitious research programme, but we also try to think about a new, more efficient and productive way of organising science and scientific research. In terms of the environment we've created, we've tried to build a unique environment that's a blend between the best of academia — how academia should function in an ideal world — and the best from the top Silicon Valley start-ups.
So: blue-sky thinking from academia and collaborative, interdisciplinary research, combined with the focus, the energy, the buzz and the resources that really successful start-ups have. We try to fuse this together into an environment that's uniquely suited to research.

So, our mission at DeepMind — we articulate it, or at least I do, in this way. Step one: try to fundamentally solve intelligence. Step two: use that to solve everything else. Now, this step two may seem quite fantastical to you, but I hope that by the end of this talk you'll be convinced that it actually follows naturally from solving step one.

More prosaically, how are we going to attempt to do this? Well, at DeepMind, what we're interested in doing is building what we call general-purpose learning algorithms. The key thing about everything we do is that our algorithms learn how to master certain tasks: they learn automatically from raw inputs, or raw data, and they're not pre-programmed or handcrafted in any way. The second important notion we have is this idea of generality. This is the idea that the same system, or the same set of algorithms, can operate out of the box across a wide range of environments and tasks. We call this kind of AI, internally at DeepMind, artificial general intelligence — AGI — and the hallmark of AGI is that from the ground up it's built to be flexible, adaptive and inventive; it can deal gracefully with the unexpected.

Now, compare that with most AI that's out there today, which we term narrow AI to distinguish it from AGI. Most of the AI you interact with every day is handcrafted and special-cased to a particular single task. What that often means is that these systems are quite brittle: if you do something unexpected, or something unexpected happens that the programmers of that system didn't cater for, it will fail catastrophically. You can see that with things like Siri on your phone. It works fine if you stick to the templates that have been pre-programmed, but as soon as you start going off-piste with your conversation, the holes in the algorithms quickly become apparent.

So still today, probably the greatest achievement — one of the greatest achievements — in AI was Deep Blue beating Garry Kasparov at chess in the late nineties. Of course, this was a huge technical achievement and an absolute watershed moment for AI research. But having said that, the question is: was Deep Blue truly intelligent?
And I think even the designers of Deep Blue — and certainly we — would argue that it isn't really. One easy way to see that intuitively is the fact that Deep Blue couldn't even play a strictly much simpler game, like noughts and crosses, without being totally reprogrammed from scratch. There was nothing in the knowledge, or in the algorithms that Deep Blue was running, that would help it play any other game, let alone do anything else. So I actually came away — I remember this match very distinctly; I was studying at Cambridge — more impressed by Garry Kasparov's mind than by the computer, because here was Garry Kasparov able to compete on more or less level terms with this brute of a machine, and yet, of course, Garry can do many other things: speak several languages, drive cars, tie shoelaces. So, in a way, it's quite amazing what the human mind is capable of.

So instead of that kind of regime, how do we think about artificial intelligence? Well, I would say the core of what we're doing at DeepMind focuses on what's called reinforcement learning, and that's how we think about intelligence at DeepMind. So I'm just going to quickly explain, with the help of a simple diagram, what reinforcement learning is. We start off with the agent — the AI system — and the agent finds itself in some kind of environment, trying to achieve a goal. Now, that environment could be a real-world environment, in which case the agent would be a robot, or it could be a virtual environment, in which case the agent would be an avatar. In fact, for most of our research, as you'll see, we use virtual environments.

Now, the agent interacts with the environment in just two ways. Firstly, it gets observations through its sensory apparatus. We mostly use vision currently, but we're starting to think about other modalities. One of the jobs of the agent is to build the best possible model of the world out there — the environment out there — based only on these incomplete and noisy observations it's receiving in real time, and in real time it has to keep updating that model in the face of new evidence. The second job of the agent, once it has built this model of the world, is to use that model to make predictions about what's going to happen next. And if you can make predictions about the world, then you can start planning what to do. So if you're trying to achieve a goal, the agent will have a set of actions available to it at that moment.
And the decision-making problem is to pick which action will be the best one to take right now to get you towards your goal. Once the agent has decided that, based on its model and its planning trajectories, it executes the action as its output, and that action may or may not make some change to the environment, which then drives a new observation. And that's really it — that's the heart of reinforcement learning. But although this diagram is very simple, those of you who know about reinforcement learning will understand that there's huge complexity hidden behind it. We do know, though, that if we could solve all the issues behind reinforcement learning and make this work perfectly, that would be sufficient for general human-level intelligence. And the reason we know that is because biological systems learn using reinforcement learning, including the human brain. In fact, there were some seminal studies done in the late nineties on monkeys, which showed that the dopamine neurones in the brain implement a form of reinforcement learning called TD learning.

So that's reinforcement learning, and it's at the core of what we do at the moment. The second big philosophical thing we committed to at the founding of DeepMind was this idea of grounded cognition. This is the idea that a true thinking machine has to be grounded in a rich sensorimotor reality, or data stream. Now, when people commit to this sort of sentiment, often they then start working on real robots, because, after all, real robots are actually situated in the real world, and of course, through their sensory apparatus, they're getting real-world data. But we actually made a different decision on this. We decided to use virtual environments and games, and we think they're the perfect platform, if used correctly, for developing and testing AI algorithms.

One of the important things you have to avoid when you use virtual environments is that, if you wanted to, you could allow your agent access to all kinds of internal state of the game that it couldn't actually sense directly through its normal sensory apparatus. That's something you have to avoid; otherwise you'll think you're making progress with your algorithms when actually you'd be cheating in some way. So you have to be very disciplined about the interface you allow between the virtual environment and the agent, and really treat the agent as if it were a virtual robot, only getting the information that would be available to it through its sensors.
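To make that loop concrete, here is a minimal sketch of the agent-environment interaction in Python. The tiny corridor environment, the two-action set and the learning constants are hypothetical stand-ins, purely for illustration; a real agent would replace the little table of values with learned components such as neural networks.

```python
# Minimal sketch of the reinforcement-learning loop: observe, act, get a reward,
# update the agent's internal estimates, repeat. Everything here is illustrative.
import random

class GridWorld:
    """Toy stand-in environment: walk right along a corridor to reach a goal."""
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos                        # the observation the agent receives

    def step(self, action):                    # action: -1 (left) or +1 (right)
        self.pos = max(0, min(self.length, self.pos + action))
        reward = 1.0 if self.pos == self.length else 0.0
        done = self.pos == self.length
        return self.pos, reward, done

def choose_action(obs, q_values, epsilon=0.2):
    """Pick the action with the best current estimate, exploring occasionally."""
    if random.random() < epsilon:
        return random.choice([-1, +1])
    return max([-1, +1], key=lambda a: q_values.get((obs, a), 0.0))

env, q_values, alpha, gamma = GridWorld(), {}, 0.5, 0.9
for episode in range(200):
    obs, done = env.reset(), False
    for t in range(100):                       # cap episode length
        action = choose_action(obs, q_values)
        next_obs, reward, done = env.step(action)
        best_next = max(q_values.get((next_obs, a), 0.0) for a in (-1, +1))
        old = q_values.get((obs, action), 0.0)
        # Simple temporal-difference update towards reward + discounted future value.
        q_values[(obs, action)] = old + alpha * (reward + gamma * best_next - old)
        obs = next_obs
        if done:
            break
```

The update at the end is a simple temporal-difference (TD) step — the same family of learning rule that the dopamine studies mentioned above suggest the brain implements.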
Now, if you use games like that, there are many advantages. Of course, you can create as much training data as you like. This was very important when we were a small, independent company and didn't have access to a lot of data, but it's still vital now, even though we're at Google. There's also no testing bias. One of the biggest things, I think, that held back the AI research field was that often the researchers are also the ones creating the tests, and that can lead to unconscious biases about the sorts of tests you design: we end up designing tests that, subconsciously at least, our algorithms are well suited to. And of course, if we're talking about virtual agents in virtual environments, we can test thousands, perhaps even millions, of these agent systems in parallel. Games are also very convenient in that a lot of them have scores, or quite easily identifiable goals, so it's very easy to measure incremental progress and see how your algorithms are doing as you incrementally improve them. And that's very key for us. Benchmarking is a hugely important thing — we have a whole team who work on it — because when you've got a very ambitious long-term goal, it's even more important to have short-term directional waypoints that tell you whether you're heading in the right direction towards this big, ambitious, long-term goal.

So, putting all this together, this brings us to the nub: the notion of what we call end-to-end learning agents. This is the idea of going all the way from the pixels, the raw data input, to making a decision about what action to take. And AI, in my view, should mean tackling that entire stack of problems — everything from perceptual processing to decision making and all the things in between.

So our first attempt at doing this, which really scaled to something challenging, we call deep reinforcement learning. The essence here was combining deep neural networks — what's called deep learning these days — with reinforcement learning. What this allows reinforcement learning to do is actually scale up to work on very challenging problems. Until we came up with this paradigm, reinforcement learning had been around for many decades, but it had usually only been used on relatively toy grid-world problems; it had been hard to scale it up to anything challenging with high-dimensional sensory inputs.

So I'm going to show you a few videos of this agent working, but before I do, I just want to explain clearly what it is you're going to see. We started off with really the first iconic console, the Atari 2600 from the eighties.
This has the benefit that there are hundreds of different classic games, many of which are iconic and everyone will recognise, but it's still quite a challenging sensory data stream. The agents here only get the raw pixels from the screen as inputs — that's around 30,000 numbers per frame, because the screen is about 200 by 150 pixels in size — and the goal is simply to maximise the score. The agent has to learn everything else from scratch, from first principles. It doesn't know what it's controlling; it doesn't know what the object of the game is; it doesn't know how it gets points; it doesn't even know that pixels next to each other are correlated in time. It has to find all this structure out for itself. And then there's an additional constraint, or requirement, we put on the system, which is this idea of generality again: a single system has to play all the different games without any changes, with the same hyperparameter settings and other settings.

Now let me show you a couple of videos. The first one is Space Invaders, and there are two parts to it: one where the agent has had no training — so literally the first time it's seen the data stream — and then after a day or two's worth of training. Initially, you see, it's controlling the green rocket at the bottom of the screen. It's losing its three lives immediately, because obviously it has no idea at this point what it's supposed to be doing, or even that it's controlling that collection of pixels at the bottom of the screen. Now, after training by playing the game overnight for 24 hours, you come back and the system is at superhuman level — it can play Space Invaders better than any human can. You see here, every single bullet hits something. It has learned that the pink mothership coming across the top of the screen there — which it hits with this amazing shot — is worth the most points. And, as those of you who remember Space Invaders will know, the fewer of them there are, the faster they go. So just watch the last, sort of predictive, shot it makes to get the last one. So it has built up these very accurate, implicit models of what's happening in this game from this data stream.

Let me show you another video now. This is Breakout — my favourite video. Here you control the bat and ball, and you're trying to break through this rainbow-coloured wall. At the beginning, after 100 games, you can see the agent is not very good; it's missing the ball most of the time.
But it's starting to get the hang of the idea that the bat should go towards the ball. Now, after 300 games, it's about as good as any human can play this, and it gets the ball back pretty much every time, even when it's coming back at very fast, acute angles. We thought, well, that's pretty cool, but we left the system playing for another 200 games and it did this amazing thing: it found that the optimal strategy was to dig a tunnel around the side and put the ball around the back of the wall. And you can see how incredibly accurately it can send the ball around the back. The funny thing about that is that obviously the researchers working on this are amazing AI developers and programmers, but they're not so good at Breakout, and they didn't actually know about that strategy. So they learned something from their own system, which is pretty funny and quite instructive, I think, about the potential for general AI.

I'll show you a final video here of Atari, which is really a medley of many different games, just to give you a feeling that this system, which we called DQN, really is a general AI within the constraints of Atari games. Here is the same system you just saw playing those other games, now playing an early racing game called Enduro. Here it is playing a game called River Raid, which is a fighter-pilot game. Then one of the very early 3D games, called Battlezone. Here's the classic, Pong — it's controlling the green bat here, and it wins 21-nil every time; you can't get a point off it. Seaquest, a submarine game — so you can see the absolute diversity of the graphics and also of the objectives. Here's Boxing: it's controlling the boxer on the left; it does a bit of sparring, then once it has the computer opponent trapped against the side it just racks up an unlimited number of points, and it's quite happy to carry on doing that forever. So a very, very diverse range of games, and the same system, out of the box, mastering all of them.

Now, if you want to read more about that, it was featured in our Nature article at the beginning of last year. We also released the code, so if you want to play around with this system yourself you can — it's freely available on the internet.
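To give a feel for what that system is doing under the hood, here is a highly simplified sketch of the deep Q-learning idea: a convolutional network maps a stack of raw screen frames to one value per joystick action, and the network is trained from randomly replayed past transitions. This is a sketch only, not the released DQN code; the layer sizes, action count and hyperparameters below are illustrative placeholders.

```python
# Schematic sketch of deep Q-learning on raw pixels (illustrative, not the released code).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Stack of game frames in, one estimated value per joystick action out."""
    def __init__(self, n_actions, frames=4):
        super().__init__()
        self.conv1 = nn.Conv2d(frames, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(32 * 9 * 9, 256)
        self.out = nn.Linear(256, n_actions)

    def forward(self, x):                         # x: (batch, frames, 84, 84) pixels
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return self.out(x)

n_actions = 6                                     # illustrative action count
q_net = QNetwork(n_actions)
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
replay = deque(maxlen=100_000)                    # experience replay buffer
gamma, epsilon = 0.99, 0.1
# After each environment step, a transition would be stored with:
#   replay.append((state, action, reward, next_state, done))

def act(state):
    """Epsilon-greedy action from the network's Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())

def learn(batch_size=32):
    """One Q-learning update on a random minibatch of past transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    states = torch.stack([b[0] for b in batch])
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])
    dones = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * q_net(next_states).max(1).values * (1 - dones)
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The same network and settings would be used for every game, with only the score as feedback — which is what the generality constraint described above amounts to in practice.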
So we took that — that was around a year ago — and we've now taken it further: we've started looking at 3D games, and at using simulators like robot simulators, and eventually we would like to start thinking about real robotics. I'm just going to quickly show you a couple of 3D videos now, with effectively the same deep reinforcement learning system, with a few tweaks, now coping with this 3D data stream. So here is a DQN-like algorithm driving a racing car very fast around the track, again just from raw pixel inputs. That's how it has learnt, and it's driving at around 200 kilometres an hour; it figures out overtaking manoeuvres and it can also recover from spins — all sorts of things. So it has really amazing performance now in these kinds of driving games, just from the vision.

We've also started looking at problems in 3D mazes: collecting objects, finding your way out, remembering where you've got to go — the sort of things that maybe a rodent like a rat would be able to do. So you can think of where we're going next as trying to build a kind of rat-level intelligence — and rats are actually pretty smart; they can do quite a lot of things. Here it is again, just from the visuals on the screen, just from the raw pixels, finding these green apples, which are rewarding, and then trying to find the exit, which is this little red floating object, and efficiently navigating around. So that's where we are on 3D, and there will be big announcements about that later this year.

So, I've talked about reinforcement learning, and I've talked about grounded cognition. Another thing that I think is pretty unique to DeepMind's approach to AI is taking systems neuroscience seriously as a source of inspiration for new algorithmic ideas, but also as a kind of validation or testing, if you like. So if you have your own pet favourite algorithm, or algorithmic technique, and you're not sure whether it can scale up to become a component of general AI, how much effort should you put into it? Should you spend five years on it? Ten years? How many people should be working on it? These are very difficult decisions if you're running a lab or a department or a company working on this kind of thing. Now, if you can point to something in the brain — as with reinforcement learning, where, as I said earlier, we know that the brain implements TD learning through the dopamine system — that gives you confidence that, in the limit, this has to be sufficient. So, for reinforcement learning, it's not crazy to think of it as a component — a vital component — of the general AI solution. And that can be very important directionally when you're thinking about four- or five-year research programmes.

But when I say neuroscience, I should be very clear:
we're thinking about systems neuroscience. We mean the algorithms, the representations and the architectures the brain uses, rather than something like the Human Brain Project, which is more interested in the low-level synaptic implementation details of how the brain achieves things with spiking neural networks. That's too low-level for us; we're more interested in the computational level of the brain.

So we're using many ideas from neuroscience. I haven't got time to go into them today, but here are some of the things we're looking at: memory, attention, concepts, planning, navigation, imagination. All of these are areas we're actively researching right now and have very interesting prototypes in. In fact, I'll just mention one thing. For my PhD I studied an area of the brain called the hippocampus, and I studied memory and imagination in the human brain. It turns out the hippocampus — which is shown here in pink, and which is in the centre of your brain — is actually critical for many of these capabilities, especially things like memory, navigation and imagination. And the hippocampus has a very different structure to the cortex. So it's quite interesting: when people talk about intelligence in the brain, they usually talk about the cerebral cortex, but actually there are other structures in the brain that are equally critical to this whole question of intelligence.

So now I'm going to talk a little bit about our newest work, AlphaGo. The reason we took on this project — and I'll explain a lot more about what it is in a second — is that AlphaGo really combines pattern recognition with planning. What you've seen so far with the Atari games is really a kind of stimulus-response system. It's very smart, but it's stimulus-response: it learns how to process Atari screens and, generally speaking, what to do in that moment, in terms of an action that will maximise its score, but there isn't a lot of long-term planning.

Now contrast that with a game like Go. For those of you who don't know how to play it, or don't know what it is, this is a picture of a Go board. Go is the sort of pinnacle of board games — pretty much the most complex game ever devised by man that's played professionally. The way it's played is on a 19-by-19 grid; there are two sides, black and white, and you put down these pieces called stones. The stones are placed on the vertices of the board; black goes first, and the players take turns placing one stone at a time. Once the stones are placed, they don't move. Now, the rules of Go are actually incredibly simple.
I'm going to teach you how to play Go in two slides. But these simple rules lead to incredible, profound complexity, which is why it's considered to be one of the most — in fact, the most — elegant games ever invented.

Now, a quick history of Go for those of you who don't know about it. It originated in China over 3,000 years ago, and it has an incredibly rich tradition in Asia. In China, Japan, Korea and other Asian countries, this is what they play instead of chess. But in those countries it's regarded as more than just a game: Go is sort of elevated to the status of poetry or art. In fact, Confucius wrote about Go, and it was considered one of the four essential arts to be mastered by any true scholar.

Japan also has a rich history around Go. During the Edo period — roughly 250 years, from about 1600 to the 1800s — annual games called castle games were played in front of the Shogun. Each clan would send its top Go player to play in the castle game, for the honour of the whole clan, and some real legendary players came out of this. There's one player, Shusaku, who won 19 years in a row and has gone down in legend with the nickname "the Invincible". So they were really absolute heroes in that period, and Go has this incredibly rich history intertwined with the culture of Japan.

But it's not just an ancient game. Today there are over 40 million active players, and in many of these countries — Korea, for example — it's taught as part of the school curriculum, and there are specialist Go schools. If you show talent at Go at a young age, then you will go to one of these Go schools from about the age of ten, instead of going to normal school. So it's taken very, very seriously.

Now, as I said, and as I'm going to show you in a second, there are actually just two rules for Go, but huge complexity arises out of these very basic rules. One quick, easy way of illustrating the complexity is the fact that there are ten to the power 170 possible board configurations — and in fact ten to the 700 different possible games — and that's more than the number of atoms in the universe. So there's no way you can solve Go through exhaustive search, or even play Go well through exhaustive search; the brute-force search space is just too large.

So how do you play Go? Well, rule one is called the capture rule. Here is a position from a game of Go, and we're just going to zoom in to the bottom right of the board so I can show you how capture works.
So let's look at this little part of the board. If you see that white stone there, surrounded by the three black stones: the empty vertices adjacent to a stone are called liberties. When a stone or group runs out of liberties, those stones are removed from the board. So here, the white stone surrounded by the three black stones has only one liberty left — that empty vertex above it. If it's Black to move and Black were to play into that final empty vertex, taking away the last liberty, then the white stone would be captured: it would have no liberties left and would be taken off the board as a prisoner. That's the capture rule, and you can capture whole large groups of stones like this, not just one at a time.

The second rule is that a repeated board position is not allowed. This is called the ko rule. Here's another little zoomed-in part of the board — I'll just replicate it to the right so you can see how this is going to become a repeated board position. Let's imagine it's White's turn: they play here and capture that black stone, just as I showed you. Now it's Black's turn, and you might think, well, Black could just play back where that stone was captured and recapture the white stone that was just put down. So why can't Black go there? It's actually not allowed, because if Black were to go back and recapture that white stone, you'd see that the new position is the same as the original position. So that recapture would not be allowed: Black would have to play somewhere else before recapturing. And that's it — those are the rules.

The objective of the game is not only to capture your opponent's stones, but also to wall off and surround empty territory — empty vertices. You can see here a picture of a Go board at the end of a game. You just total up the number of spaces you've surrounded, add that to the number of stones you've taken off the board, and the player with the highest total is the winner. So here's the white territory and the black territory; in fact, this is a very close game, and White wins by one point. So that's how you play Go.
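For the programmers in the audience, the capture rule is easy to state in code: flood-fill a group of connected stones, collect its liberties, and remove the group if none remain. The board representation below — a dictionary mapping occupied vertices to colours — is just an illustrative choice.

```python
# Sketch of the capture rule: a connected group with no liberties is removed.
# Board is represented as {(row, col): 'B' or 'W'}; purely illustrative.

def neighbours(p, size=19):
    r, c = p
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < size and 0 <= c + dc < size]

def group_and_liberties(board, p):
    """Flood-fill the group containing stone p; return (group, its liberties)."""
    colour, group, liberties, frontier = board[p], set(), set(), [p]
    while frontier:
        q = frontier.pop()
        if q in group:
            continue
        group.add(q)
        for n in neighbours(q):
            if n not in board:
                liberties.add(n)              # empty vertex adjacent to the group
            elif board[n] == colour and n not in group:
                frontier.append(n)
    return group, liberties

def play(board, p, colour):
    """Place a stone, then remove any opposing groups left with no liberties."""
    board[p] = colour
    for n in neighbours(p):
        if n in board and board[n] != colour:
            group, libs = group_and_liberties(board, n)
            if not libs:                       # captured: no liberties remain
                for q in group:
                    del board[q]
    return board

# Example: a white stone with one liberty is captured when Black fills it.
board = {(3, 3): 'W', (2, 3): 'B', (4, 3): 'B', (3, 2): 'B'}
play(board, (3, 4), 'B')
assert (3, 3) not in board                     # the white stone has been removed
```

A real implementation would also enforce the ko rule (no repeated positions) and handle self-capture, but this is the basic mechanic.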
So why is it hard for computers to play Go? Well, as I've just been illustrating, the complexity makes brute-force exhaustive search intractable, and there are two main challenges: the branching factor is huge, and writing an evaluation function to determine who is winning was thought to be impossible. An evaluation function is a function that tells you whether the black or the white side is winning, and for Go this is very difficult. Let me unpack that for you by comparing it to the next most complex game, chess.

In chess, in an average position, there are about 20 possible moves — that's referred to as the branching factor. In Go, by contrast, in an average position there are around 200 possible moves. So the branching factor in Go is an order of magnitude bigger than it is for chess.

The second issue, which is related to the evaluation function, is that Go is really primarily a game of intuition rather than brute calculation. If you ask a great Go player why they played a certain move, often they'll just tell you "it felt right" — and they'll use those words. Whereas if you ask a great chess player that, they'll never say that: they'll tell you exactly the reasons, how they calculated that that move was the right one to play. And what we know about computers is that, traditionally, they are generally not good at what we think of as intuition, but they're very good at what we think of as calculation. So one of the challenges of making computers good at Go is to replicate this kind of intuition that humans use to play.

That's also the issue with writing an evaluation function, and why it was thought to be impossible for Go. For Deep Blue, or any chess program, what you can do is write a set of handcrafted, pre-programmed heuristics or rules. In fact, as a first approximation for chess, if you just count up the value of the pieces on each side, that gives you a very rough and ready but reasonable estimate of which side, black or white, is winning. That is, of course, impossible for Go, because all the pieces are worth the same — they're just stones — so there isn't any notion of material.

So Go, then — and this is why we've taken it on as a challenge — combines intuitive pattern recognition with logical planning and search. I'm just going to take you through the technicalities of how we did this. What we did is we trained two deep neural networks to deal with this intuitive part of Go. The first thing we did was download 100,000 games played by relatively expert humans — still amateurs, playing on internet Go servers, but pretty strong club players.
We took those hundred thousand games and trained our first neural network, which we call the policy network. This was done through supervised learning: we got this network to try to mimic the moves — to copy and predict what move, in a particular position, the human amateur expert would play. So this network was trying to copy those expert players. That was the first step.

Once we had the first version of that, we then allowed it to play against itself many millions of times and improve its capability through reinforcement learning. So it learned through trial and error, from its own mistakes, and that would modify the neural network to make it incrementally better over time. Once we finished this self-play process, the new policy network could beat the original policy network 80% of the time.

Then we freeze this final reinforcement-learning policy network and let it play itself a further 30 million times on the Google cloud, and that generates a new dataset. We take one position from each of those 30 million games, so we have 30 million positions. Now we finally have a dataset that might be big enough to try to learn an evaluation function. So we take these 30 million positions, together with the end result of each game, and we can try to learn the correlation between a position and who ends up winning. We then train a final network, which we call the value network, and this value network learns to predict, from a particular position, who is winning the game and, by its estimate, by how much.

This is really the core of the breakthrough with AlphaGo: the value network is this fabled evaluation function. But instead of writing it out by hand — as with something like Deep Blue, where expert chess players wrote out, by hand, a big database of rules for evaluating a position — we instead have a neural network that learns it for itself, directly from the data.
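Schematically, the two networks look something like the sketch below: one maps a board position to a probability distribution over the 361 points, the other maps a position to a single win estimate. The layer sizes and input features here are simplified placeholders, not the actual AlphaGo architecture, which used much deeper convolutional networks over richer board features.

```python
# Schematic sketch of the two networks and their training signals (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19  # 19 x 19 Go board

class PolicyNet(nn.Module):
    """Board position in, probability distribution over the 361 points out."""
    def __init__(self, planes=8):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 1)                 # one logit per board point

    def forward(self, x):                               # x: (batch, planes, 19, 19)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return F.softmax(self.head(x).flatten(1), dim=1)  # (batch, 361) move probabilities

class ValueNet(nn.Module):
    """Board position in, a single number out: estimated probability that Black wins."""
    def __init__(self, planes=8):
        super().__init__()
        self.conv = nn.Conv2d(planes, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * BOARD * BOARD, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.conv(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return torch.sigmoid(self.fc2(x))               # 0 = White winning, 1 = Black winning

# Supervised step: train the policy net to predict the human expert's move.
def supervised_step(policy, optimizer, positions, expert_moves):
    probs = policy(positions)                            # (batch, 361)
    loss = F.nll_loss(torch.log(probs + 1e-8), expert_moves)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Regression step: train the value net on (position, final winner) pairs
# sampled from the self-play games, with winners encoded as 0.0 or 1.0.
def value_step(value, optimizer, positions, winners):
    pred = value(positions).squeeze(1)
    loss = F.mse_loss(pred, winners)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The self-play reinforcement-learning stage sits between the two training steps sketched here: it adjusted the policy network from the outcomes of its own games, which is what produced the version that beat the original supervised network 80% of the time.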
So we take these networks forward — we now have our two networks. The policy network, the network in green I was showing you earlier, takes the board position (shown here in blue) as the input, and its output is a probability distribution over the possible moves. You can see that the height of the green bars is the probability mass assigned to each particular move by the policy network. What this means is that our AlphaGo system doesn't have to consider all 200 or so possible moves every time it reaches a decision point; it can maybe just look at the top three or four most sensible, most likely, moves. The second network, the network in pink, is the value network. It also takes the board position as input, but this time the output is a single number — a real number between zero and one — where zero means White is winning by a huge margin, one means Black is winning, and 0.5 means the game is even. So it estimates who is winning the game, and by how much.

We now take this forward and combine it with search, and I'm going to show you that in a second. But first I just want to illustrate pictorially why using these two networks helps make playing Go tractable. Imagine we're searching through the game of Go, and each of these little nodes — represented by these mini boards — is a position in a particular game we're playing. The tree of possibilities branches out almost to infinity: a huge number of possibilities that are completely intractable to search. So what we do is, firstly, take the policy network, the network in green, and what that does is reduce the breadth of the search, so we can home in on only the moves that are plausible and sensible. Then the value network you can think of as reducing the depth of the search: instead of having to search through the entire game tree to the end of the game to tell which side is winning, we can call the value network at any time and estimate which side is winning, so we can truncate the search at whatever depth we want. So you can see that by using these two networks in tandem, we've cut that enormous search space down to something much more tractable.

Now I'm going to show you how we do our search. We use Monte Carlo tree search, and we also use another thing called rollout policies, and we combine those with the two networks I've just shown you. So let's imagine we're making a decision: AlphaGo is in the middle of thinking about what move it should make next. It has done a bit of searching from the current position — the node at the top of the tree — and it has found a couple of promising moves. The value of each move is represented by the letter Q here, the action value of each move, and what we're trying to do is find the move that, in essence, has the maximum Q. So what we might do is follow a trajectory that has quite a high Q value — you can see this in the bold black arrows — and we follow that trajectory until we hit a node that has not been explored yet in the game tree.
That's here, on the left-hand side of this tree we're unfolding. Now, once we hit that new node, the first thing we do is call our policy network, the green network, and ask it to expand the tree at that point — but only with the few moves it thinks are most probable, those with the highest prior probability P. Once that's expanded, we call the second neural network, the value network, to evaluate that position and give an estimate of who is winning. We also do a second thing: if we have time, we do rollouts to the end of the game — maybe a few thousand of them — to collect statistics about who ends up winning the game from that position. We then combine both of these estimates — the estimate from the value network and the estimate from the rollout policy — to give a final evaluation of how promising that branch of the tree is. Once we have this new Q value, we back it up the tree and update the connections and the decision points. And then finally, once we run out of time for searching and thinking and we have to make a decision, we pick the move that has the most promising Q value associated with it.

So, once we had built AlphaGo, how did we evaluate it? Well, the first thing we did, back in April last year, was play against the strongest other Go programs available at that time. So we played against Crazy Stone and Zen, which are the strongest programs out there other than AlphaGo. Let me just explain the scale we're going to show on these bar charts. On the right-hand side are the dan and kyu levels, which are the ratings you get when you play Go. When you're a beginner you go from about 25 kyu down to 1 kyu; then as an amateur you go from 1 dan up to about 6 or 7 dan; and then you can become a professional, if you pass certain exams, and you start again from 1 dan up to 9 dan at the professional level. So that's what the three bandings are — yellow for beginner, orange for amateur, red for professional — and on the left-hand side are numerical equivalents of those dan ratings. We call them Elo ratings, and this is our rating scale, from zero to about 3,500. The way to think about it is that if you have an Elo rating difference between two players of 200 to 250 points, that translates to about an 80% win rate for the higher-rated player. So it's a kind of Bayesian comparison between the strengths of the different players.
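Before coming to the results, here is a much-simplified sketch of the search loop just described: selection down the tree by action value, expansion with the policy network's few plausible moves, evaluation by a mix of the value network and rollouts, and backing the result up the tree. The node structure, the selection formula, the fifty-fifty mixing weight and the policy/value/rollout interfaces are illustrative placeholders rather than AlphaGo's actual implementation.

```python
# Much-simplified sketch of Monte Carlo tree search guided by the two networks.
# Hypothetical interfaces, for illustration only: policy_net(pos) yields (move, prior)
# pairs, value_net(pos) and rollout(pos) return a number in [0, 1], and
# pos.play(move) returns the next position.
import math

class Node:
    def __init__(self, position, prior=1.0):
        self.position = position
        self.prior = prior            # P: prior probability from the policy network
        self.children = {}            # move -> Node
        self.visits = 0
        self.total_value = 0.0        # sum of evaluations backed up through this node

    def q(self):                      # mean action value Q
        return self.total_value / self.visits if self.visits else 0.0

def select(node, c_puct=1.0):
    """Follow the child with high Q plus an exploration bonus weighted by its prior."""
    return max(node.children.items(),
               key=lambda mc: mc[1].q() + c_puct * mc[1].prior *
                              math.sqrt(node.visits + 1) / (1 + mc[1].visits))

def search(root, policy_net, value_net, rollout, n_simulations=1600, mix=0.5):
    for _ in range(n_simulations):
        node, path = root, [root]
        while node.children:                              # 1. selection down the tree
            _, node = select(node)
            path.append(node)
        for move, prior in policy_net(node.position):     # 2. expansion with only the
            node.children[move] = Node(node.position.play(move), prior)  # plausible moves
        value = mix * value_net(node.position) \
              + (1 - mix) * rollout(node.position)        # 3. value net + rollout estimate
        for n in path:                                    # 4. back the value up the tree
            n.visits += 1
            n.total_value += value
    # Finally, pick the move with the most promising Q value, as described above.
    return max(root.children.items(), key=lambda mc: mc[1].q())[0]
```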
And what we found is that AlphaGo, when we played it against these other programs, could beat them more than 99% of the time — in fact, nearly 100% of the time — and there was a huge margin between AlphaGo and the next best program, Crazy Stone, of around 1,200 Elo points. Some of you who've been following this may know that Facebook also have their own program that they're working on, called Dark Forest, but that's not even as strong as Zen or Crazy Stone — in fact, it lost to Zen in an online tournament last month — so it's estimated to be around the same level as them. So there's around a 1,200 Elo point difference between AlphaGo and these other programs.

So we needed a greater challenge, and we thought, well, we're ready to play a top human professional. So we contacted Fan Hui, who is the reigning three-times European champion. He's a 2-dan professional; he started playing Go at the age of seven, back in China, where he grew up, and he turned professional in China at the age of 16 — and China is one of the most competitive places to try to become a professional. So this was very exciting for us, back in October, and we didn't really know how well we were going to do. We knew we were much stronger than the other commercially available programs, and obviously AlphaGo was a lot better than any of us on the team, but we didn't know how strong it would be against a human opponent. So this is what happened.

[Video clip of Fan Hui:] "I think after the first game, maybe — a little like fight, and play slowly. So that's why, come the second game, I fight with it. I say, maybe I'm right — that's why I play another game, I fight all the time. It's... no, it's not nice, but I lose all my games."

He's a really great guy, actually — he's a really good sport. So AlphaGo won 5-0, which was very surprising to us. We were hoping to win at least one game, but 5-0 was pretty amazing. And the story ends well, don't worry: he looks anguished here, but we actually then hired him as a consultant for our team, ready for the next match, so in the end he's on the side of the computers now. One interesting thing is that he has since played a few more games against AlphaGo informally, and he feels it has actually improved his own play. Very recently he won the European professional championship again, with a full score — he beat every single other professional in Europe. So he feels he has got stronger by training against AlphaGo, which is quite interesting.

So anyway, here's where he is on this measure.
He's around 2900 ELO and AlphaGo at that time was around 3100. 454 00:46:51,890 --> 00:47:00,830 Again, this is covered in a Nature paper that came out a couple of weeks ago, on the front cover, and it's caused a huge stir in the AI community. 455 00:47:01,460 --> 00:47:07,580 And I encourage you to read that if you want much more of the technical detail, which is outlined in that paper. 456 00:47:09,410 --> 00:47:14,990 So I just want to take a minute to explain the critical difference here between AlphaGo and Deep Blue. 457 00:47:16,340 --> 00:47:25,460 So this is a big achievement: beating a professional player at Go is a long-standing grand challenge of AI research. 458 00:47:26,030 --> 00:47:30,530 And many smart people have been working on this for over a decade. 459 00:47:31,520 --> 00:47:37,310 And in fact, this happened about a decade earlier than many experts in the field, 460 00:47:37,460 --> 00:47:43,520 including the top programmers of the other Go programs, for example, thought it would, even as recently as last year. 461 00:47:44,570 --> 00:47:48,620 But the key thing for us is not that we did it, but how we did it. 462 00:47:49,190 --> 00:47:55,160 So we've used general purpose algorithms: deep learning, reinforcement learning, tree search. 463 00:47:55,370 --> 00:48:01,700 These are general purpose algorithms, and we've put them together in a way that learns how to play Go. 464 00:48:02,030 --> 00:48:06,290 It's not a handcrafted set of rules and heuristics like Deep Blue or chess programs, 465 00:48:07,520 --> 00:48:13,100 and it's also a modular system that combines pattern recognition with planning algorithms. 466 00:48:13,370 --> 00:48:18,560 So that's another thing: deep learning is hugely popular right now, very fashionable. 467 00:48:18,770 --> 00:48:25,400 And we think it's critical; of course, we have a huge deep learning team, many amazing deep learning researchers at DeepMind. 468 00:48:26,480 --> 00:48:32,690 But we don't think that's the whole story on its own. We think other things are going to be required, 469 00:48:32,690 --> 00:48:40,250 like reinforcement learning and memory and other advances, combined with deep learning, to reach full intelligence. 470 00:48:41,330 --> 00:48:45,409 And because of the way we trained AlphaGo, many people have commented, 471 00:48:45,410 --> 00:48:52,010 many professional players have commented, on how human-like its playing style is and how it thinks. 472 00:48:52,310 --> 00:48:57,140 And if you think about it, AlphaGo has been trained in a way like a human expert player: 473 00:48:57,260 --> 00:49:03,900 it starts off by studying professional games and learning from those, and then improves 474 00:49:03,920 --> 00:49:09,930 through practice, by playing games of Go. So for us, what's the next step? 475 00:49:09,930 --> 00:49:17,880 Now, as Mike alluded to, we're actually only about a week and a half away from this: the next step is to take on the world's best player, 476 00:49:18,540 --> 00:49:21,540 Lee Sedol, and he's from South Korea. 477 00:49:21,870 --> 00:49:27,240 He's a legend there. He's sort of like the David Beckham of South Korea, believe it or not. 478 00:49:27,480 --> 00:49:33,270 And I describe him as the Roger Federer of Go, because he's been at the top of the game for a decade, 479 00:49:33,270 --> 00:49:42,270 but he's still one of the top three players in the world.
And he's won 18 international titles, kind of like Grand Slams, over the last decade. 480 00:49:42,810 --> 00:49:49,020 And we're challenging him to a $1,000,000, five-game match in Seoul from March 8th to 15th. 481 00:49:49,020 --> 00:49:54,660 And you can follow that on a YouTube live stream. And, you know, he's taking this pretty seriously. 482 00:49:54,780 --> 00:50:01,470 Obviously, there's the money on the line, and his reputation. But when he was asked by the South Korean press how he felt about the game, 483 00:50:01,800 --> 00:50:06,180 he said, I'm not sure if I represent the whole of humanity, but I think I am. 484 00:50:07,110 --> 00:50:10,530 So it's good that he's confident that he's going to win the match. 485 00:50:11,010 --> 00:50:15,360 So he's actually a lovely guy as well. And I'm really looking forward to going out there. 486 00:50:15,360 --> 00:50:17,430 And it's crazy out there. 487 00:50:17,550 --> 00:50:26,470 We did a press conference yesterday via video call, and there were over 300 journalists, including live TV cameras, for a video call. 488 00:50:26,490 --> 00:50:31,140 So it's pretty crazy. So we're very excited to see what it's going to be like when we go there. 489 00:50:32,190 --> 00:50:36,959 But Lee Sedol, on our ELO measures, is significantly better than Fan Hui; 490 00:50:36,960 --> 00:50:41,820 he's a couple of notches better than Fan Hui, who is at kind of a grandmaster level. 491 00:50:42,030 --> 00:50:45,150 But there's another level beyond that, sort of the world elite. 492 00:50:45,420 --> 00:50:53,070 So he's at least 600 ELO points stronger. So we have some way to go if we want to beat him from where AlphaGo was back in October. 493 00:50:54,590 --> 00:50:58,730 So my final slide on Go is about how we do this testing. 494 00:50:59,060 --> 00:51:06,740 Well, we have our own internal testing, where we have different versions of our program running 24/7, playing against each other. 495 00:51:06,980 --> 00:51:10,940 And we can make accurate estimates of how strong we think our program 496 00:51:10,940 --> 00:51:14,900 is from this continual live tournament that's going on in the cloud. 497 00:51:15,470 --> 00:51:22,010 But every now and again, we have to calibrate those internal tests with external testing. 498 00:51:22,310 --> 00:51:27,020 So we need to test against these external benchmarks. 499 00:51:27,320 --> 00:51:31,130 So in April, we tested against Zen and Crazy Stone, and we won over 99% of games. 500 00:51:31,790 --> 00:51:37,400 Then in October, our new version, our October version, could beat our April version 100% of the time. 501 00:51:37,790 --> 00:51:46,490 And obviously, we were playing Fan Hui, who we also knew could beat these other top commercial programs 100% of the time if he were to play them. 502 00:51:47,420 --> 00:51:52,280 And so we knew we were at least roughly matched. But in the end, as you saw, we won five nil. 503 00:51:53,390 --> 00:51:57,890 So now we're coming up to March, and we're playing Lee Sedol. 504 00:51:58,100 --> 00:52:04,220 And Lee Sedol, on the ELO ratings, you would expect to win around 97% of the time against Fan Hui. 505 00:52:04,520 --> 00:52:08,120 So it's a huge step up. And our number is obviously confidential 506 00:52:08,120 --> 00:52:11,030 until the match, which is what we've got on the left hand side. 507 00:52:11,270 --> 00:52:15,620 And obviously the million dollar question is what's going to happen when we play him?
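As a quick aside on the arithmetic behind those percentages: the figures quoted here, roughly 80% for a 200 to 250 point gap and roughly 97% for a 600 point gap, are consistent with the standard logistic ELO formula. Which exact curve the internal ratings use is not stated, so treating it as the standard one is an assumption, but a small check looks like this:

```python
def expected_win_rate(rating_diff):
    """Probability the higher-rated player wins, under the standard logistic ELO curve."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

print(round(expected_win_rate(240), 2))   # ~0.80, a 200-250 point gap
print(round(expected_win_rate(600), 2))   # ~0.97, roughly the Lee Sedol vs Fan Hui gap
```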
508 00:52:16,230 --> 00:52:23,990 So it's going to be very exciting to see. So I just want to give a big shout out to the amazing team that's worked on AlphaGo, 509 00:52:23,990 --> 00:52:30,470 led by David Silver and Aja Huang as the team leads; some incredible work has gone into this. 510 00:52:32,380 --> 00:52:38,230 Now, of course, playing games is great fun and it's very efficient for advancing our A.I. research. 511 00:52:38,410 --> 00:52:41,170 But we also want to apply these technologies to the real world, 512 00:52:41,530 --> 00:52:48,700 and we plan to make some announcements about this over the next year in health care, in robotics, and in smart assistants. 513 00:52:49,450 --> 00:52:50,499 In all these different areas, 514 00:52:50,500 --> 00:52:58,240 we feel that extensions of, and components of, what we're building for things like AlphaGo can be used very powerfully. 515 00:53:00,100 --> 00:53:04,479 So I just want to end the talk with a couple of high level thoughts on why I've 516 00:53:04,480 --> 00:53:08,320 been so obsessed with AI for my entire career and why I think it's so important. 517 00:53:08,800 --> 00:53:19,540 I see two big challenges facing society today. The first is information overload, which deluges us, as users and as scientists, with data everywhere. 518 00:53:19,660 --> 00:53:24,610 Big data from genomics, entertainment, every sphere of human life. 519 00:53:25,060 --> 00:53:28,390 Now, personalisation might be one technology to try and combat that, 520 00:53:28,600 --> 00:53:36,430 but unfortunately it doesn't work very well today, because it's mostly based on averaging over crowds rather than actually adapting to you as a person, 521 00:53:36,610 --> 00:53:37,930 as an individual. 522 00:53:38,950 --> 00:53:47,410 Then secondly, the systems that we would like to master are so complex today, from climate to disease to energy, macroeconomics and high energy physics. 523 00:53:47,740 --> 00:53:52,210 So, you know, you have to think that maybe the complexity of these systems is so great that 524 00:53:52,420 --> 00:53:56,050 it's difficult to imagine how even an Einstein, someone at that level, 525 00:53:56,230 --> 00:54:00,910 could master these systems within their own lifetime and still leave enough time for innovation. 526 00:54:01,960 --> 00:54:06,220 So we think at DeepMind that solving AI in a fundamental way, like we're trying 527 00:54:06,220 --> 00:54:09,580 to do, is potentially a kind of meta-solution to all these other problems. 528 00:54:10,030 --> 00:54:14,740 If we can solve A.I. in this way, we can bring it to bear on all the other issues that we would like to solve. 529 00:54:15,100 --> 00:54:22,480 So the dream, really, for me anyway, is to use this kind of AI to create A.I. scientists or A.I.-assisted science. 530 00:54:24,380 --> 00:54:26,090 And finally, I should mention a word about ethics. 531 00:54:26,510 --> 00:54:32,000 As with all powerful new technologies, they have to be used ethically and responsibly, and AI is no different. 532 00:54:32,240 --> 00:54:36,740 And even though human-level general AI is decades away, we should start the debate now. 533 00:54:37,880 --> 00:54:44,540 And as a neuroscientist, I think that trying to distil intelligence into an algorithmic construct and then comparing it to the human mind, 534 00:54:44,870 --> 00:54:50,750 which is really the journey we're on, will be one of the best ways to better understand the mysteries of our own minds.
535 00:54:51,080 --> 00:54:58,850 And it may shed light on things like dreaming, creativity, and perhaps even the ultimate question of consciousness. 536 00:54:59,480 --> 00:55:00,110 Thanks for listening.