My name is Mike Wooldridge. I'm a professor of computer science and currently head of the Department of Computer Science at the University of Oxford, and I would like to welcome you all to this term's Strachey Lecture. The Strachey Lectures are the distinguished lectures in computer science that the Department of Computer Science offers. We do not usually host our Strachey Lectures in the Sheldonian Theatre; the fact that we are able to do so today is because of the generous support of Oxford Asset Management. We are very, very grateful to Oxford Asset Management, who literally have made this event possible on a scale and of a type that would not have been possible otherwise. So we thank them very much for that and for their continuing, ongoing support.

Let me say a few words about today's speaker. It's an enormous pleasure to be able to welcome Demis Hassabis of Google DeepMind to be this term's Strachey Lecturer. Demis gained his undergraduate degree from Cambridge in computer science. He then went on and was for some time a successful computer games programmer, designing a number of games which went on to achieve a degree of success in the marketplace. He did a PhD at UCL in cognitive neuroscience, and then in 2009 he was a co-founder of a company called DeepMind. All through that time, I think it's probably fair to say, he didn't hit the front pages of any newspapers. All of that changed in 2014, when DeepMind were acquired by Google for the very un-British sum of, I believe, 400 million, which is not a figure that comes up that much in any part of British industry. That got him and DeepMind onto the front pages of the international press, and computer science professionals like myself were all agog to see what this company — which, I think it's probably fair to say, had been operating in stealth mode or something like it for a number of years — had to offer. Well, we didn't have to wait very long. Very soon after that, the first results became public from DeepMind, and it became clear what Google were interested in. I'm not going to spoil the show by telling you about those results now, but they were some very impressive results to do with learning to play video games. And being on the front pages of the national press just once or twice was not enough for them: they got onto the front pages of the international press last year with some incredible achievements in the area of computer programs playing the game of Go.
And we all now have the opportunity to hear Demis speak at a really special time for DeepMind, because, as Demis is about to tell us, they are heading towards a competition to play Go with some of the world's leading players. So we're going to get an insight into a company that's doing remarkable things, at one of the most remarkable points in its trajectory. So it is with very great pleasure that I introduce you and welcome you to give this term's Strachey Lecture. Over to you. Thank you.

Good evening, ladies and gentlemen. Welcome to the Sheldonian Theatre. Before the lecture begins, would you ensure your mobile phones are switched off. I would also remind you that unauthorised photography and recording are prohibited. In the interests of safety, would you ensure emergency exits, walkways and window ledges are kept clear of personal belongings. Guests in the upper gallery are asked to leave using the stairs at the sides, and not to use the steps. Thank you.

Okay. Well, thanks to the voice from the sky. And thanks, Mike, for that very generous introduction. It's a huge honour and a real pleasure to be giving this lecture, and to have been invited to give it in these auspicious surroundings. What I'm going to try and do today in my talk is give you a whirlwind tour of what's happening at the cutting edge of artificial intelligence, and end with some of the latest breakthroughs we've been making at DeepMind. Then I'll talk a little bit about the bigger picture of artificial intelligence and where I think it's heading in the future, and then we can go into the Q&A.

So, artificial intelligence. AI is basically the science of making machines smart. Now, DeepMind: we founded it in 2010 and, as Mike mentioned, we joined Google in 2014 to accelerate our mission. The way we think about DeepMind is as a sort of Apollo programme effort for AI. We have about 200 research scientists and engineers now, so I think it's probably one of the biggest collections of talent anywhere in the world focusing on this topic. And not only is this a very ambitious research programme, but we also try to think about a new, more efficient and productive way of organising science and scientific research. In terms of the environment we've created, we've tried to build a unique environment that's a blend between the best of academia — how academia should function in an ideal world — and the best from the top Silicon Valley start-ups.
So: blue-sky thinking from academia and collaborative, interdisciplinary research, combined with the focus, the energy, the buzz and the resources that really successful start-ups have. We try to fuse this together into an environment that's uniquely suited to research.

So, our mission at DeepMind — we articulate it, or at least I do, in this way. Step one: try to fundamentally solve intelligence. Step two: use that to solve everything else. Now, this step two may seem quite fantastical to you, but I hope that by the end of this talk you'll be convinced that it actually follows naturally from solving step one.

More prosaically, how are we going to attempt to do this? Well, at DeepMind, what we're interested in doing is building what we call general-purpose learning algorithms. The key thing about everything we do is that our algorithms learn how to master certain tasks: they learn automatically from raw inputs, or raw data, and they're not pre-programmed or handcrafted in any way. The second important notion we have is this idea of generality. This is the idea that the same system, or the same set of algorithms, can operate out of the box across a wide range of environments and tasks. We call this kind of AI, internally at DeepMind, artificial general intelligence — AGI — and the hallmark of AGI is that from the ground up it's built to be flexible, adaptive and inventive; it can deal gracefully with the unexpected.

Now, compare that with most AI that's out there today, which we term narrow AI to distinguish it from AGI. Most of the AI you interact with every day is handcrafted and special-cased to a particular single task. What that often means is that these systems are quite brittle: if you do something unexpected, or something unexpected happens that the programmers of that system didn't cater for, it will fail catastrophically. You can see that with things like Siri on your phone. It works fine if you stick to the templates that have been pre-programmed, but as soon as you start going off-piste with your conversation, the holes in the algorithms quickly become apparent.

So still today, probably the greatest achievement — one of the greatest achievements — in AI was Deep Blue beating Garry Kasparov at chess in the late nineties. Of course, this was a huge technical achievement and an absolute watershed moment for AI research. But having said that, the question is: was Deep Blue truly intelligent?
And I think even the designers of Deep Blue — and certainly we — would argue that it isn't really. One easy way to see that intuitively is the fact that Deep Blue couldn't even play a strictly much simpler game, like noughts and crosses, without being totally reprogrammed from scratch. There was nothing in the knowledge, or in the algorithms that Deep Blue was running, that would help it play any other game, let alone do anything else. So I actually came away — I remember this match very distinctly; I was studying at Cambridge — more impressed by Garry Kasparov's mind than by the computer, because here was Garry Kasparov able to compete on more or less level terms with this brute of a machine, and yet, of course, Garry can do many other things: speak several languages, drive cars, tie shoelaces. So, in a way, it's quite amazing what the human mind is capable of.

So instead of that kind of regime, how do we think about artificial intelligence? Well, I would say the core of what we're doing at DeepMind focuses on what's called reinforcement learning, and that's how we think about intelligence at DeepMind. So I'm just going to quickly explain, with the help of a simple diagram, what reinforcement learning is. We start off with the agent — the AI system — and the agent finds itself in some kind of environment, trying to achieve a goal. Now, that environment could be a real-world environment, in which case the agent would be a robot, or it could be a virtual environment, in which case the agent would be an avatar. In fact, for most of our research, as you'll see, we use virtual environments.

Now, the agent interacts with the environment in just two ways. Firstly, it gets observations through its sensory apparatus. We mostly use vision currently, but we're starting to think about other modalities. One of the jobs of the agent is to build the best possible model of the world out there — the environment out there — based only on these incomplete and noisy observations it's receiving in real time, and in real time it has to keep updating that model in the face of new evidence. The second job of the agent, once it has built this model of the world, is to use that model to make predictions about what's going to happen next. And if you can make predictions about the world, then you can start planning what to do. So if you're trying to achieve a goal, the agent will have a set of actions available to it at that moment.
And the decision-making problem is to pick which action will be the best one to take right now to get you towards your goal. Once the agent has decided that, based on its model and its planning trajectories, it executes the action as its output, and that action may or may not make some change to the environment, which then drives a new observation. And that's really it — that's the heart of reinforcement learning. But although this diagram is very simple, those of you who know about reinforcement learning will understand that there's huge complexity hidden behind it. We do know, though, that if we could solve all the issues behind reinforcement learning and make this work perfectly, that would be sufficient for general human-level intelligence. And the reason we know that is because biological systems learn using reinforcement learning, including the human brain. In fact, there were some seminal studies done in the late nineties on monkeys, which showed that the dopamine neurones in the brain implement a form of reinforcement learning called TD learning.

So that's reinforcement learning, and it's at the core of what we do at the moment. The second big philosophical thing we committed to at the founding of DeepMind was this idea of grounded cognition. This is the idea that a true thinking machine has to be grounded in a rich sensorimotor reality, or data stream. Now, when people commit to this sort of sentiment, often they then start working on real robots, because, after all, real robots are actually situated in the real world, and of course, through their sensory apparatus, they're getting real-world data. But we actually made a different decision on this. We decided to use virtual environments and games, and we think they're the perfect platform, if used correctly, for developing and testing AI algorithms.

One of the important things you have to avoid when you use virtual environments is that, if you wanted to, you could allow your agent access to all kinds of internal state of the game that it couldn't actually sense directly through its normal sensory apparatus. That's something you have to avoid; otherwise you'll think you're making progress with your algorithms when actually you'd be cheating in some way. So you have to be very disciplined about the interface you allow between the virtual environment and the agent, and really treat the agent as if it were a virtual robot, only getting the information that would be available to it through its sensors.
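To make that loop concrete, here is a minimal sketch of the agent-environment interaction in Python. The tiny corridor environment, the two-action set and the learning constants are hypothetical stand-ins, purely for illustration; a real agent would replace the little table of values with learned components such as neural networks.

```python
# Minimal sketch of the reinforcement-learning loop: observe, act, get a reward,
# update the agent's internal estimates, repeat. Everything here is illustrative.
import random

class GridWorld:
    """Toy stand-in environment: walk right along a corridor to reach a goal."""
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos                        # the observation the agent receives

    def step(self, action):                    # action: -1 (left) or +1 (right)
        self.pos = max(0, min(self.length, self.pos + action))
        reward = 1.0 if self.pos == self.length else 0.0
        done = self.pos == self.length
        return self.pos, reward, done

def choose_action(obs, q_values, epsilon=0.2):
    """Pick the action with the best current estimate, exploring occasionally."""
    if random.random() < epsilon:
        return random.choice([-1, +1])
    return max([-1, +1], key=lambda a: q_values.get((obs, a), 0.0))

env, q_values, alpha, gamma = GridWorld(), {}, 0.5, 0.9
for episode in range(200):
    obs, done = env.reset(), False
    for t in range(100):                       # cap episode length
        action = choose_action(obs, q_values)
        next_obs, reward, done = env.step(action)
        best_next = max(q_values.get((next_obs, a), 0.0) for a in (-1, +1))
        old = q_values.get((obs, action), 0.0)
        # Simple temporal-difference update towards reward + discounted future value.
        q_values[(obs, action)] = old + alpha * (reward + gamma * best_next - old)
        obs = next_obs
        if done:
            break
```

The update at the end is a simple temporal-difference (TD) step — the same family of learning rule that the dopamine studies mentioned above suggest the brain implements.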
Now, if you use games like that, there are many advantages. Of course, you can create as much training data as you like. This was very important when we were a small, independent company and didn't have access to a lot of data, but it's still vital now, even though we're at Google. There's also no testing bias. One of the biggest things, I think, that held back the AI research field was that often the researchers are also the ones creating the tests, and that can lead to unconscious biases about the sorts of tests you design: we end up designing tests that, subconsciously at least, our algorithms are well suited to. And of course, if we're talking about virtual agents in virtual environments, we can test thousands, perhaps even millions, of these agent systems in parallel. Games are also very convenient in that a lot of them have scores, or quite easily identifiable goals, so it's very easy to measure incremental progress and see how your algorithms are doing as you incrementally improve them. And that's very key for us. Benchmarking is a hugely important thing — we have a whole team who work on it — because when you've got a very ambitious long-term goal, it's even more important to have short-term directional waypoints that tell you whether you're heading in the right direction towards this big, ambitious, long-term goal.

So, putting all this together, this brings us to the nub: the notion of what we call end-to-end learning agents. This is the idea of going all the way from the pixels, the raw data input, to making a decision about what action to take. And AI, in my view, should mean tackling that entire stack of problems — everything from perceptual processing to decision making and all the things in between.

So our first attempt at doing this, which really scaled to something challenging, we call deep reinforcement learning. The essence here was combining deep neural networks — what's called deep learning these days — with reinforcement learning. What this allows reinforcement learning to do is actually scale up to work on very challenging problems. Until we came up with this paradigm, reinforcement learning had been around for many decades, but it had usually only been used on relatively toy grid-world problems; it had been hard to scale it up to anything challenging with high-dimensional sensory inputs.

So I'm going to show you a few videos of this agent working, but before I do, I just want to explain clearly what it is you're going to see. We started off with really the first iconic console, the Atari 2600 from the eighties.
This has the benefit that there are hundreds of different classic games, many of which are iconic and everyone will recognise, but it's still quite a challenging sensory data stream. The agents here only get the raw pixels from the screen as inputs — that's around 30,000 numbers per frame, because the screen is about 200 by 150 pixels in size — and the goal is simply to maximise the score. The agent has to learn everything else from scratch, from first principles. It doesn't know what it's controlling; it doesn't know what the object of the game is; it doesn't know how it gets points; it doesn't even know that pixels next to each other are correlated in time. It has to find all this structure out for itself. And then there's an additional constraint, or requirement, we put on the system, which is this idea of generality again: a single system has to play all the different games without any changes, with the same hyperparameter settings and other settings.

Now let me show you a couple of videos. The first one is Space Invaders, and there are two parts to it: one where the agent has had no training — so literally the first time it's seen the data stream — and then after a day or two's worth of training. Initially, you see, it's controlling the green rocket at the bottom of the screen. It's losing its three lives immediately, because obviously it has no idea at this point what it's supposed to be doing, or even that it's controlling that collection of pixels at the bottom of the screen. Now, after training by playing the game overnight for 24 hours, you come back and the system is at superhuman level — it can play Space Invaders better than any human can. You see here, every single bullet hits something. It has learned that the pink mothership coming across the top of the screen there — which it hits with this amazing shot — is worth the most points. And, as those of you who remember Space Invaders will know, the fewer of them there are, the faster they go. So just watch the last, sort of predictive, shot it makes to get the last one. So it has built up these very accurate, implicit models of what's happening in this game from this data stream.

Let me show you another video now. This is Breakout — my favourite video. Here you control the bat and ball, and you're trying to break through this rainbow-coloured wall. At the beginning, after 100 games, you can see the agent is not very good; it's missing the ball most of the time.
But it's starting to get the hang of the idea that the bat should go towards the ball. Now, after 300 games, it's about as good as any human can play this, and it gets the ball back pretty much every time, even when it's coming back at very fast, acute angles. We thought, well, that's pretty cool, but we left the system playing for another 200 games and it did this amazing thing: it found that the optimal strategy was to dig a tunnel around the side and put the ball around the back of the wall. And you can see how incredibly accurately it can send the ball around the back. The funny thing about that is that obviously the researchers working on this are amazing AI developers and programmers, but they're not so good at Breakout, and they didn't actually know about that strategy. So they learned something from their own system, which is pretty funny and quite instructive, I think, about the potential for general AI.

I'll show you a final video here of Atari, which is really a medley of many different games, just to give you a feeling that this system, which we called DQN, really is a general AI within the constraints of Atari games. Here is the same system you just saw playing those other games, now playing an early racing game called Enduro. Here it is playing a game called River Raid, which is a fighter-pilot game. Then one of the very early 3D games, called Battlezone. Here's the classic, Pong — it's controlling the green bat here, and it wins 21-nil every time; you can't get a point off it. Seaquest, a submarine game — so you can see the absolute diversity of the graphics and also of the objectives. Here's Boxing: it's controlling the boxer on the left; it does a bit of sparring, then once it has the computer opponent trapped against the side it just racks up an unlimited number of points, and it's quite happy to carry on doing that forever. So a very, very diverse range of games, and the same system, out of the box, mastering all of them.

Now, if you want to read more about that, it was featured in our Nature article at the beginning of last year. We also released the code, so if you want to play around with this system yourself you can — it's freely available on the internet.
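To give a feel for what that system is doing under the hood, here is a highly simplified sketch of the deep Q-learning idea: a convolutional network maps a stack of raw screen frames to one value per joystick action, and the network is trained from randomly replayed past transitions. This is a sketch only, not the released DQN code; the layer sizes, action count and hyperparameters below are illustrative placeholders.

```python
# Schematic sketch of deep Q-learning on raw pixels (illustrative, not the released code).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Stack of game frames in, one estimated value per joystick action out."""
    def __init__(self, n_actions, frames=4):
        super().__init__()
        self.conv1 = nn.Conv2d(frames, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(32 * 9 * 9, 256)
        self.out = nn.Linear(256, n_actions)

    def forward(self, x):                         # x: (batch, frames, 84, 84) pixels
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return self.out(x)

n_actions = 6                                     # illustrative action count
q_net = QNetwork(n_actions)
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
replay = deque(maxlen=100_000)                    # experience replay buffer
gamma, epsilon = 0.99, 0.1
# After each environment step, a transition would be stored with:
#   replay.append((state, action, reward, next_state, done))

def act(state):
    """Epsilon-greedy action from the network's Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())

def learn(batch_size=32):
    """One Q-learning update on a random minibatch of past transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    states = torch.stack([b[0] for b in batch])
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])
    dones = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * q_net(next_states).max(1).values * (1 - dones)
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The same network and settings would be used for every game, with only the score as feedback — which is what the generality constraint described above amounts to in practice.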
So we took that — that was around a year ago — and we've now taken it further: we've started looking at 3D games, and at using simulators like robot simulators, and eventually we would like to start thinking about real robotics. I'm just going to quickly show you a couple of 3D videos now, with effectively the same deep reinforcement learning system, with a few tweaks, now coping with this 3D data stream. So here is a DQN-like algorithm driving a racing car very fast around the track, again just from raw pixel inputs. That's how it has learnt, and it's driving at around 200 kilometres an hour; it figures out overtaking manoeuvres and it can also recover from spins — all sorts of things. So it has really amazing performance now in these kinds of driving games, just from the vision.

We've also started looking at problems in 3D mazes: collecting objects, finding your way out, remembering where you've got to go — the sort of things that maybe a rodent like a rat would be able to do. So you can think of where we're going next as trying to build a kind of rat-level intelligence — and rats are actually pretty smart; they can do quite a lot of things. Here it is again, just from the visuals on the screen, just from the raw pixels, finding these green apples, which are rewarding, and then trying to find the exit, which is this little red floating object, and efficiently navigating around. So that's where we are on 3D, and there will be big announcements about that later this year.

So, I've talked about reinforcement learning, and I've talked about grounded cognition. Another thing that I think is pretty unique to DeepMind's approach to AI is taking systems neuroscience seriously as a source of inspiration for new algorithmic ideas, but also as a kind of validation or testing, if you like. So if you have your own pet favourite algorithm, or algorithmic technique, and you're not sure whether it can scale up to become a component of general AI, how much effort should you put into it? Should you spend five years on it? Ten years? How many people should be working on it? These are very difficult decisions if you're running a lab or a department or a company working on this kind of thing. Now, if you can point to something in the brain — as with reinforcement learning, where, as I said earlier, we know that the brain implements TD learning through the dopamine system — that gives you confidence that, in the limit, this has to be sufficient. So, for reinforcement learning, it's not crazy to think of it as a component — a vital component — of the general AI solution. And that can be very important directionally when you're thinking about four- or five-year research programmes.

But when I say neuroscience, I should be very clear:
we're thinking about systems neuroscience. We mean the algorithms, the representations and the architectures the brain uses, rather than something like the Human Brain Project, which is more interested in the low-level synaptic implementation details of how the brain achieves things with spiking neural networks. That's too low-level for us; we're more interested in the computational level of the brain.

So we're using many ideas from neuroscience. I haven't got time to go into them today, but here are some of the things we're looking at: memory, attention, concepts, planning, navigation, imagination. All of these are areas we're actively researching right now and have very interesting prototypes in. In fact, I'll just mention one thing. For my PhD I studied an area of the brain called the hippocampus, and I studied memory and imagination in the human brain. It turns out the hippocampus — which is shown here in pink, and which is in the centre of your brain — is actually critical for many of these capabilities, especially things like memory, navigation and imagination. And the hippocampus has a very different structure to the cortex. So it's quite interesting: when people talk about intelligence in the brain, they usually talk about the cerebral cortex, but actually there are other structures in the brain that are equally critical to this whole question of intelligence.

So now I'm going to talk a little bit about our newest work, AlphaGo. The reason we took on this project — and I'll explain a lot more about what it is in a second — is that AlphaGo really combines pattern recognition with planning. What you've seen so far with the Atari games is really a kind of stimulus-response system. It's very smart, but it's stimulus-response: it learns how to process Atari screens and, generally speaking, what to do in that moment, in terms of an action that will maximise its score, but there isn't a lot of long-term planning.

Now contrast that with a game like Go. For those of you who don't know how to play it, or don't know what it is, this is a picture of a Go board. Go is the sort of pinnacle of board games — pretty much the most complex game ever devised by man that's played professionally. The way it's played is on a 19-by-19 grid; there are two sides, black and white, and you put down these pieces called stones. The stones are placed on the vertices of the board; black goes first, and the players take turns placing one stone at a time. Once the stones are placed, they don't move. Now, the rules of Go are actually incredibly simple.
I'm going to teach you how to play Go in two slides. But these simple rules lead to incredible, profound complexity, which is why it's considered to be one of the most — in fact, the most — elegant games ever invented.

Now, a quick history of Go for those of you who don't know about it. It originated in China over 3,000 years ago, and it has an incredibly rich tradition in Asia. In China, Japan, Korea and other Asian countries, this is what they play instead of chess. But in those countries it's regarded as more than just a game: Go is sort of elevated to the status of poetry or art. In fact, Confucius wrote about Go, and it was considered one of the four essential arts to be mastered by any true scholar.

Japan also has a rich history around Go. During the Edo period — roughly 250 years, from about 1600 to the 1800s — annual games called castle games were played in front of the Shogun. Each clan would send its top Go player to play in the castle game, for the honour of the whole clan, and some real legendary players came out of this. There's one player, Shusaku, who won 19 years in a row and has gone down in legend with the nickname "the Invincible". So they were really absolute heroes in that period, and Go has this incredibly rich history intertwined with the culture of Japan.

But it's not just an ancient game. Today there are over 40 million active players, and in many of these countries — Korea, for example — it's taught as part of the school curriculum, and there are specialist Go schools. If you show talent at Go at a young age, then you will go to one of these Go schools from about the age of ten, instead of going to normal school. So it's taken very, very seriously.

Now, as I said, and as I'm going to show you in a second, there are actually just two rules for Go, but huge complexity arises out of these very basic rules. One quick, easy way of illustrating the complexity is the fact that there are ten to the power 170 possible board configurations — and in fact ten to the 700 different possible games — and that's more than the number of atoms in the universe. So there's no way you can solve Go through exhaustive search, or even play Go well through exhaustive search; the brute-force search space is just too large.

So how do you play Go? Well, rule one is called the capture rule. Here is a position from a game of Go, and we're just going to zoom in to the bottom right of the board so I can show you how capture works.
So let's look at this little part of the board. If you see that white stone there, surrounded by the three black stones: the empty vertices adjacent to a stone are called liberties. When a stone or group runs out of liberties, those stones are removed from the board. So here, the white stone surrounded by the three black stones has only one liberty left — that empty vertex above it. If it's Black to move and Black were to play into that final empty vertex, taking away the last liberty, then the white stone would be captured: it would have no liberties left and would be taken off the board as a prisoner. That's the capture rule, and you can capture whole large groups of stones like this, not just one at a time.

The second rule is that a repeated board position is not allowed. This is called the ko rule. Here's another little zoomed-in part of the board — I'll just replicate it to the right so you can see how this is going to become a repeated board position. Let's imagine it's White's turn: they play here and capture that black stone, just as I showed you. Now it's Black's turn, and you might think, well, Black could just play back where that stone was captured and recapture the white stone that was just put down. So why can't Black go there? It's actually not allowed, because if Black were to go back and recapture that white stone, you'd see that the new position is the same as the original position. So that recapture would not be allowed: Black would have to play somewhere else before recapturing. And that's it — those are the rules.

The objective of the game is not only to capture your opponent's stones, but also to wall off and surround empty territory — empty vertices. You can see here a picture of a Go board at the end of a game. You just total up the number of spaces you've surrounded, add that to the number of stones you've taken off the board, and the player with the highest total is the winner. So here's the white territory and the black territory; in fact, this is a very close game, and White wins by one point. So that's how you play Go.
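For the programmers in the audience, the capture rule is easy to state in code: flood-fill a group of connected stones, collect its liberties, and remove the group if none remain. The board representation below — a dictionary mapping occupied vertices to colours — is just an illustrative choice.

```python
# Sketch of the capture rule: a connected group with no liberties is removed.
# Board is represented as {(row, col): 'B' or 'W'}; purely illustrative.

def neighbours(p, size=19):
    r, c = p
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < size and 0 <= c + dc < size]

def group_and_liberties(board, p):
    """Flood-fill the group containing stone p; return (group, its liberties)."""
    colour, group, liberties, frontier = board[p], set(), set(), [p]
    while frontier:
        q = frontier.pop()
        if q in group:
            continue
        group.add(q)
        for n in neighbours(q):
            if n not in board:
                liberties.add(n)              # empty vertex adjacent to the group
            elif board[n] == colour and n not in group:
                frontier.append(n)
    return group, liberties

def play(board, p, colour):
    """Place a stone, then remove any opposing groups left with no liberties."""
    board[p] = colour
    for n in neighbours(p):
        if n in board and board[n] != colour:
            group, libs = group_and_liberties(board, n)
            if not libs:                       # captured: no liberties remain
                for q in group:
                    del board[q]
    return board

# Example: a white stone with one liberty is captured when Black fills it.
board = {(3, 3): 'W', (2, 3): 'B', (4, 3): 'B', (3, 2): 'B'}
play(board, (3, 4), 'B')
assert (3, 3) not in board                     # the white stone has been removed
```

A real implementation would also enforce the ko rule (no repeated positions) and handle self-capture, but this is the basic mechanic.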
So why is it hard for computers to play Go? Well, as I've just been illustrating, the complexity makes brute-force exhaustive search intractable, and there are two main challenges: the branching factor is huge, and writing an evaluation function to determine who is winning was thought to be impossible. An evaluation function is a function that tells you whether the black or the white side is winning, and for Go this is very difficult. Let me unpack that for you by comparing it to the next most complex game, chess.

In chess, in an average position, there are about 20 possible moves — that's referred to as the branching factor. In Go, by contrast, in an average position there are around 200 possible moves. So the branching factor in Go is an order of magnitude bigger than it is for chess.

The second issue, which is related to the evaluation function, is that Go is really primarily a game of intuition rather than brute calculation. If you ask a great Go player why they played a certain move, often they'll just tell you "it felt right" — and they'll use those words. Whereas if you ask a great chess player that, they'll never say that: they'll tell you exactly the reasons, how they calculated that that move was the right one to play. And what we know about computers is that, traditionally, they are generally not good at what we think of as intuition, but they're very good at what we think of as calculation. So one of the challenges of making computers good at Go is to replicate this kind of intuition that humans use to play.

That's also the issue with writing an evaluation function, and why it was thought to be impossible for Go. For Deep Blue, or any chess program, what you can do is write a set of handcrafted, pre-programmed heuristics or rules. In fact, as a first approximation for chess, if you just count up the value of the pieces on each side, that gives you a very rough and ready but reasonable estimate of which side, black or white, is winning. That is, of course, impossible for Go, because all the pieces are worth the same — they're just stones — so there isn't any notion of material.

So Go, then — and this is why we've taken it on as a challenge — combines intuitive pattern recognition with logical planning and search. I'm just going to take you through the technicalities of how we did this. What we did is we trained two deep neural networks to deal with this intuitive part of Go. The first thing we did was download 100,000 games played by relatively expert humans — still amateurs, playing on internet Go servers, but pretty strong club players.
We took those hundred thousand games and trained our first neural network, which we call the policy network. This was done through supervised learning: we got this network to try to mimic the moves — to copy and predict what move, in a particular position, the human amateur expert would play. So this network was trying to copy those expert players. That was the first step.

Once we had the first version of that, we then allowed it to play against itself many millions of times and improve its capability through reinforcement learning. So it learned through trial and error, from its own mistakes, and that would modify the neural network to make it incrementally better over time. Once we finished this self-play process, the new policy network could beat the original policy network 80% of the time.

Then we freeze this final reinforcement-learning policy network and let it play itself a further 30 million times on the Google cloud, and that generates a new dataset. We take one position from each of those 30 million games, so we have 30 million positions. Now we finally have a dataset that might be big enough to try to learn an evaluation function. So we take these 30 million positions, together with the end result of each game, and we can try to learn the correlation between a position and who ends up winning. We then train a final network, which we call the value network, and this value network learns to predict, from a particular position, who is winning the game and, by its estimate, by how much.

This is really the core of the breakthrough with AlphaGo: the value network is this fabled evaluation function. But instead of writing it out by hand — as with something like Deep Blue, where expert chess players wrote out, by hand, a big database of rules for evaluating a position — we instead have a neural network that learns it for itself, directly from the data.
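Schematically, the two networks look something like the sketch below: one maps a board position to a probability distribution over the 361 points, the other maps a position to a single win estimate. The layer sizes and input features here are simplified placeholders, not the actual AlphaGo architecture, which used much deeper convolutional networks over richer board features.

```python
# Schematic sketch of the two networks and their training signals (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19  # 19 x 19 Go board

class PolicyNet(nn.Module):
    """Board position in, probability distribution over the 361 points out."""
    def __init__(self, planes=8):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 1)                 # one logit per board point

    def forward(self, x):                               # x: (batch, planes, 19, 19)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return F.softmax(self.head(x).flatten(1), dim=1)  # (batch, 361) move probabilities

class ValueNet(nn.Module):
    """Board position in, a single number out: estimated probability that Black wins."""
    def __init__(self, planes=8):
        super().__init__()
        self.conv = nn.Conv2d(planes, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * BOARD * BOARD, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.conv(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return torch.sigmoid(self.fc2(x))               # 0 = White winning, 1 = Black winning

# Supervised step: train the policy net to predict the human expert's move.
def supervised_step(policy, optimizer, positions, expert_moves):
    probs = policy(positions)                            # (batch, 361)
    loss = F.nll_loss(torch.log(probs + 1e-8), expert_moves)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Regression step: train the value net on (position, final winner) pairs
# sampled from the self-play games, with winners encoded as 0.0 or 1.0.
def value_step(value, optimizer, positions, winners):
    pred = value(positions).squeeze(1)
    loss = F.mse_loss(pred, winners)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The self-play reinforcement-learning stage sits between the two training steps sketched here: it adjusted the policy network from the outcomes of its own games, which is what produced the version that beat the original supervised network 80% of the time.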
So we take these networks forward — we now have our two networks. The policy network, the network in green I was showing you earlier, takes the board position (shown here in blue) as the input, and its output is a probability distribution over the possible moves. You can see that the height of the green bars is the probability mass assigned to each particular move by the policy network. What this means is that our AlphaGo system doesn't have to consider all 200 or so possible moves every time it reaches a decision point; it can maybe just look at the top three or four most sensible, most likely, moves. The second network, the network in pink, is the value network. It also takes the board position as input, but this time the output is a single number — a real number between zero and one — where zero means White is winning by a huge margin, one means Black is winning, and 0.5 means the game is even. So it estimates who is winning the game, and by how much.

We now take this forward and combine it with search, and I'm going to show you that in a second. But first I just want to illustrate pictorially why using these two networks helps make playing Go tractable. Imagine we're searching through the game of Go, and each of these little nodes — represented by these mini boards — is a position in a particular game we're playing. The tree of possibilities branches out almost to infinity: a huge number of possibilities that are completely intractable to search. So what we do is, firstly, take the policy network, the network in green, and what that does is reduce the breadth of the search, so we can home in on only the moves that are plausible and sensible. Then the value network you can think of as reducing the depth of the search: instead of having to search through the entire game tree to the end of the game to tell which side is winning, we can call the value network at any time and estimate which side is winning, so we can truncate the search at whatever depth we want. So you can see that by using these two networks in tandem, we've cut that enormous search space down to something much more tractable.

Now I'm going to show you how we do our search. We use Monte Carlo tree search, and we also use another thing called rollout policies, and we combine those with the two networks I've just shown you. So let's imagine we're making a decision: AlphaGo is in the middle of thinking about what move it should make next. It has done a bit of searching from the current position — the node at the top of the tree — and it has found a couple of promising moves. The value of each move is represented by the letter Q here, the action value of each move, and what we're trying to do is find the move that, in essence, has the maximum Q. So what we might do is follow a trajectory that has quite a high Q value — you can see this in the bold black arrows — and we follow that trajectory until we hit a node that has not been explored yet in the game tree.
That's here, on the left-hand side of this tree we're unfolding. Now, once we hit that new node, the first thing we do is call our policy network, the green network, and ask it to expand the tree at that point — but only with the few moves it thinks are most probable, those with the highest prior probability P. Once that's expanded, we call the second neural network, the value network, to evaluate that position and give an estimate of who is winning. We also do a second thing: if we have time, we do rollouts to the end of the game — maybe a few thousand of them — to collect statistics about who ends up winning the game from that position. We then combine both of these estimates — the estimate from the value network and the estimate from the rollout policy — to give a final evaluation of how promising that branch of the tree is. Once we have this new Q value, we back it up the tree and update the connections and the decision points. And then finally, once we run out of time for searching and thinking and we have to make a decision, we pick the move that has the most promising Q value associated with it.

So, once we had built AlphaGo, how did we evaluate it? Well, the first thing we did, back in April last year, was play against the strongest other Go programs available at that time. So we played against Crazy Stone and Zen, which are the strongest programs out there other than AlphaGo. Let me just explain the scale we're going to show on these bar charts. On the right-hand side are the dan and kyu levels, which are the ratings you get when you play Go. When you're a beginner you go from about 25 kyu down to 1 kyu; then as an amateur you go from 1 dan up to about 6 or 7 dan; and then you can become a professional, if you pass certain exams, and you start again from 1 dan up to 9 dan at the professional level. So that's what the three bandings are — yellow for beginner, orange for amateur, red for professional — and on the left-hand side are numerical equivalents of those dan ratings. We call them Elo ratings, and this is our rating scale, from zero to about 3,500. The way to think about it is that if you have an Elo rating difference between two players of 200 to 250 points, that translates to about an 80% win rate for the higher-rated player. So it's a kind of Bayesian comparison between the strengths of the different players.
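Before coming to the results, here is a much-simplified sketch of the search loop just described: selection down the tree by action value, expansion with the policy network's few plausible moves, evaluation by a mix of the value network and rollouts, and backing the result up the tree. The node structure, the selection formula, the fifty-fifty mixing weight and the policy/value/rollout interfaces are illustrative placeholders rather than AlphaGo's actual implementation.

```python
# Much-simplified sketch of Monte Carlo tree search guided by the two networks.
# Hypothetical interfaces, for illustration only: policy_net(pos) yields (move, prior)
# pairs, value_net(pos) and rollout(pos) return a number in [0, 1], and
# pos.play(move) returns the next position.
import math

class Node:
    def __init__(self, position, prior=1.0):
        self.position = position
        self.prior = prior            # P: prior probability from the policy network
        self.children = {}            # move -> Node
        self.visits = 0
        self.total_value = 0.0        # sum of evaluations backed up through this node

    def q(self):                      # mean action value Q
        return self.total_value / self.visits if self.visits else 0.0

def select(node, c_puct=1.0):
    """Follow the child with high Q plus an exploration bonus weighted by its prior."""
    return max(node.children.items(),
               key=lambda mc: mc[1].q() + c_puct * mc[1].prior *
                              math.sqrt(node.visits + 1) / (1 + mc[1].visits))

def search(root, policy_net, value_net, rollout, n_simulations=1600, mix=0.5):
    for _ in range(n_simulations):
        node, path = root, [root]
        while node.children:                              # 1. selection down the tree
            _, node = select(node)
            path.append(node)
        for move, prior in policy_net(node.position):     # 2. expansion with only the
            node.children[move] = Node(node.position.play(move), prior)  # plausible moves
        value = mix * value_net(node.position) \
              + (1 - mix) * rollout(node.position)        # 3. value net + rollout estimate
        for n in path:                                    # 4. back the value up the tree
            n.visits += 1
            n.total_value += value
    # Finally, pick the move with the most promising Q value, as described above.
    return max(root.children.items(), key=lambda mc: mc[1].q())[0]
```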
And what we found is that AlphaGo, when we played it against these other programs, could beat them more than 99% of the time — in fact, nearly 100% of the time — and there was a huge margin between AlphaGo and the next best program, Crazy Stone, of around 1,200 Elo points. Some of you who've been following this may know that Facebook also have their own program that they're working on, called Dark Forest, but that's not even as strong as Zen or Crazy Stone — in fact, it lost to Zen in an online tournament last month — so it's estimated to be around the same level as them. So there's around a 1,200 Elo point difference between AlphaGo and these other programs.

So we needed a greater challenge, and we thought, well, we're ready to play a top human professional. So we contacted Fan Hui, who is the reigning three-times European champion. He's a 2-dan professional; he started playing Go at the age of seven, back in China, where he grew up, and he turned professional in China at the age of 16 — and China is one of the most competitive places to try to become a professional. So this was very exciting for us, back in October, and we didn't really know how well we were going to do. We knew we were much stronger than the other commercially available programs, and obviously AlphaGo was a lot better than any of us on the team, but we didn't know how strong it would be against a human opponent. So this is what happened.

[Video clip of Fan Hui:] "I think after the first game, maybe — a little like fight, and play slowly. So that's why, come the second game, I fight with it. I say, maybe I'm right — that's why I play another game, I fight all the time. It's... no, it's not nice, but I lose all my games."

He's a really great guy, actually — he's a really good sport. So AlphaGo won 5-0, which was very surprising to us. We were hoping to win at least one game, but 5-0 was pretty amazing. And the story ends well, don't worry: he looks anguished here, but we actually then hired him as a consultant for our team, ready for the next match, so in the end he's on the side of the computers now. One interesting thing is that he has since played a few more games against AlphaGo informally, and he feels it has actually improved his own play. Very recently he won the European professional championship again, with a full score — he beat every single other professional in Europe. So he feels he has got stronger by training against AlphaGo, which is quite interesting.

So anyway, here's where he is on this measure.
He's around 2900 ELO and AlphaGo at that time was around 3100. 454 00:46:51,890 --> 00:47:00,830 Again, this is covered in a Nature paper that came out a couple of weeks ago, on the front cover, and it's caused a huge stir in the AI community. 455 00:47:01,460 --> 00:47:07,580 And I encourage you to read that if you want much more of the technical detail, which is outlined in that paper. 456 00:47:09,410 --> 00:47:14,990 So I just want to take a minute to explain the critical difference here between AlphaGo and Deep Blue. 457 00:47:16,340 --> 00:47:25,460 So this is a big achievement: beating a professional player at Go is a long-standing grand challenge of AI research. 458 00:47:26,030 --> 00:47:30,530 And many smart people have been working on this for over a decade. 459 00:47:31,520 --> 00:47:37,310 And in fact, this happened about a decade earlier than many experts in the field, 460 00:47:37,460 --> 00:47:43,520 including the top programmers of the other Go programs, for example, thought it would, even as recently as last year. 461 00:47:44,570 --> 00:47:48,620 But the key thing for us is not that we did it, but how we did it. 462 00:47:49,190 --> 00:47:55,160 So we've used general purpose algorithms: deep learning, reinforcement learning, tree search. 463 00:47:55,370 --> 00:48:01,700 These are general purpose algorithms, and we've put them together in a way that learns how to play Go. 464 00:48:02,030 --> 00:48:06,290 It's not a handcrafted set of rules and heuristics like Deep Blue or chess programs, 465 00:48:07,520 --> 00:48:13,100 and it's also a modular system that combines pattern recognition with planning algorithms. 466 00:48:13,370 --> 00:48:18,560 So that's another thing: deep learning is hugely popular right now, very fashionable. 467 00:48:18,770 --> 00:48:25,400 And we think it's critical; of course, we have a huge deep learning team, many amazing deep learning researchers at DeepMind. 468 00:48:26,480 --> 00:48:32,690 But we don't think that's the whole story on its own. We think other things are going to be required, 469 00:48:32,690 --> 00:48:40,250 like reinforcement learning and memory and other advances, combined with deep learning, to reach full intelligence. 470 00:48:41,330 --> 00:48:45,409 And because of the way we trained AlphaGo, many people have commented, 471 00:48:45,410 --> 00:48:52,010 many professional players have commented, on how human-like its playing style is and how it thinks. 472 00:48:52,310 --> 00:48:57,140 And if you think about it, AlphaGo has been trained in a way like a human expert player: 473 00:48:57,260 --> 00:49:03,900 it starts off by studying professional games and learning from those, and then improves 474 00:49:03,920 --> 00:49:09,930 through practice, by playing games of Go. So for us, what's the next step? 475 00:49:09,930 --> 00:49:17,880 Now, as Mike alluded to, we're actually only about a week and a half away from this: the next step is to take on the world's best player, 476 00:49:18,540 --> 00:49:21,540 Lee Sedol, and he's from South Korea. 477 00:49:21,870 --> 00:49:27,240 He's a legend there. He's sort of like the David Beckham of South Korea, believe it or not. 478 00:49:27,480 --> 00:49:33,270 And I describe him as the Roger Federer of Go, because he's been at the top of the game for a decade, 479 00:49:33,270 --> 00:49:42,270 but he's still one of the top three players in the world.
And he's won 18 international titles, kind of like Grand Slams, over the last decade. 480 00:49:42,810 --> 00:49:49,020 And we're challenging him to a $1,000,000, five-game match in Seoul from March 8th to 15th. 481 00:49:49,020 --> 00:49:54,660 And you can follow that on a YouTube live stream. And, you know, he's taking this pretty seriously. 482 00:49:54,780 --> 00:50:01,470 Obviously, there's the money on the line, and his reputation. But when he was asked by the South Korean press how he felt about the game, 483 00:50:01,800 --> 00:50:06,180 he said, I'm not sure if I represent the whole of humanity, but I think I am. 484 00:50:07,110 --> 00:50:10,530 So it's good that he's confident that he's going to win the match. 485 00:50:11,010 --> 00:50:15,360 So he's actually a lovely guy as well. And I'm really looking forward to going out there. 486 00:50:15,360 --> 00:50:17,430 And it's crazy out there. 487 00:50:17,550 --> 00:50:26,470 We did a press conference yesterday via video call, and there were over 300 journalists, including live TV cameras, for a video call. 488 00:50:26,490 --> 00:50:31,140 So it's pretty crazy. So we're very excited to see what it's going to be like when we go there. 489 00:50:32,190 --> 00:50:36,959 But Lee Sedol, on our ELO measures, is significantly better than Fan Hui; 490 00:50:36,960 --> 00:50:41,820 he's a couple of notches better than Fan Hui, who is at kind of a grandmaster level. 491 00:50:42,030 --> 00:50:45,150 But there's another level beyond that, sort of the world elite. 492 00:50:45,420 --> 00:50:53,070 So he's at least 600 ELO points stronger. So we have some way to go if we want to beat him from where AlphaGo was back in October. 493 00:50:54,590 --> 00:50:58,730 So my final slide on Go is about how we do this testing. 494 00:50:59,060 --> 00:51:06,740 Well, we have our own internal testing, where we have different versions of our program running 24/7, playing against each other. 495 00:51:06,980 --> 00:51:10,940 And we can make accurate estimates of how strong we think our program 496 00:51:10,940 --> 00:51:14,900 is from this continual live tournament that's going on in the cloud. 497 00:51:15,470 --> 00:51:22,010 But every now and again, we have to calibrate those internal tests with external testing. 498 00:51:22,310 --> 00:51:27,020 So we need to test against these external benchmarks. 499 00:51:27,320 --> 00:51:31,130 So in April, we tested against Zen and Crazy Stone, and we won over 99% of games. 500 00:51:31,790 --> 00:51:37,400 Then in October, our new version, our October version, could beat our April version 100% of the time. 501 00:51:37,790 --> 00:51:46,490 And obviously, we were playing Fan Hui, who we also knew could beat these other top commercial programs 100% of the time if he were to play them. 502 00:51:47,420 --> 00:51:52,280 And so we knew we were at least roughly matched. But in the end, as you saw, we won five nil. 503 00:51:53,390 --> 00:51:57,890 So now we're coming up to March, and we're playing Lee Sedol. 504 00:51:58,100 --> 00:52:04,220 And Lee Sedol, on the ELO ratings, you would expect to win around 97% of the time against Fan Hui. 505 00:52:04,520 --> 00:52:08,120 So it's a huge step up. And our number is obviously confidential 506 00:52:08,120 --> 00:52:11,030 until the match, which is what we've got on the left hand side. 507 00:52:11,270 --> 00:52:15,620 And obviously the million dollar question is what's going to happen when we play him?
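As a quick aside on the arithmetic behind those percentages: the figures quoted here, roughly 80% for a 200 to 250 point gap and roughly 97% for a 600 point gap, are consistent with the standard logistic ELO formula. Which exact curve the internal ratings use is not stated, so treating it as the standard one is an assumption, but a small check looks like this:

```python
def expected_win_rate(rating_diff):
    """Probability the higher-rated player wins, under the standard logistic ELO curve."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

print(round(expected_win_rate(240), 2))   # ~0.80, a 200-250 point gap
print(round(expected_win_rate(600), 2))   # ~0.97, roughly the Lee Sedol vs Fan Hui gap
```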
508 00:52:16,230 --> 00:52:23,990 So it's going to be very exciting to see. So I just want to give a big shout out to the amazing team that's worked on AlphaGo, 509 00:52:23,990 --> 00:52:30,470 led by David Silver and Aja Huang as the team leads; some incredible work has gone into this. 510 00:52:32,380 --> 00:52:38,230 Now, of course, playing games is great fun and it's very efficient for advancing our A.I. research. 511 00:52:38,410 --> 00:52:41,170 But we also want to apply these technologies to the real world, 512 00:52:41,530 --> 00:52:48,700 and we plan to make some announcements about this over the next year in health care, in robotics, and in smart assistants. 513 00:52:49,450 --> 00:52:50,499 In all these different areas, 514 00:52:50,500 --> 00:52:58,240 we feel that extensions of, and components of, what we're building for things like AlphaGo can be used very powerfully. 515 00:53:00,100 --> 00:53:04,479 So I just want to end the talk with a couple of high level thoughts on why I've 516 00:53:04,480 --> 00:53:08,320 been so obsessed with AI for my entire career and why I think it's so important. 517 00:53:08,800 --> 00:53:19,540 I see two big challenges facing society today. The first is information overload, which deluges us, as users and as scientists, with data everywhere. 518 00:53:19,660 --> 00:53:24,610 Big data from genomics, entertainment, every sphere of human life. 519 00:53:25,060 --> 00:53:28,390 Now, personalisation might be one technology to try and combat that, 520 00:53:28,600 --> 00:53:36,430 but unfortunately it doesn't work very well today, because it's mostly based on averaging over crowds rather than actually adapting to you as a person, 521 00:53:36,610 --> 00:53:37,930 as an individual. 522 00:53:38,950 --> 00:53:47,410 Then secondly, the systems that we would like to master are so complex today, from climate to disease to energy, macroeconomics and high energy physics. 523 00:53:47,740 --> 00:53:52,210 So, you know, you have to think that maybe the complexity of these systems is so great that 524 00:53:52,420 --> 00:53:56,050 it's difficult to imagine how even an Einstein, someone at that level, 525 00:53:56,230 --> 00:54:00,910 could master these systems within their own lifetime and still leave enough time for innovation. 526 00:54:01,960 --> 00:54:06,220 So we think at DeepMind that solving AI in a fundamental way, like we're trying 527 00:54:06,220 --> 00:54:09,580 to do, is potentially a kind of meta-solution to all these other problems. 528 00:54:10,030 --> 00:54:14,740 If we can solve A.I. in this way, we can bring it to bear on all the other issues that we would like to solve. 529 00:54:15,100 --> 00:54:22,480 So the dream, really, for me anyway, is to use this kind of AI to create A.I. scientists or A.I.-assisted science. 530 00:54:24,380 --> 00:54:26,090 And finally, I should mention a word about ethics. 531 00:54:26,510 --> 00:54:32,000 As with all powerful new technologies, they have to be used ethically and responsibly, and AI is no different. 532 00:54:32,240 --> 00:54:36,740 And even though human-level general AI is decades away, we should start the debate now. 533 00:54:37,880 --> 00:54:44,540 And as a neuroscientist, I think that trying to distil intelligence into an algorithmic construct and then comparing it to the human mind, 534 00:54:44,870 --> 00:54:50,750 which is really the journey we're on, will be one of the best ways to better understand the mysteries of our own minds.
535 00:54:51,080 --> 00:54:58,850 And it may shed light on things like dreaming, creativity, and perhaps even the ultimate question of consciousness. 536 00:54:59,480 --> 00:55:00,110 Thanks for listening.