So one of the interesting things is the way that deep learning, machine learning, has changed things even in this physics department. On Mondays we've just started an informal lecture series: one of our former postdocs is giving basic introductions to machine learning. We advertised it only to the physics faculty, the graduate students and our final-year students, and we packed out the lecture theatre. That wouldn't have happened two or three years ago. So there's been an enormous change in the interest inside physics, as well as a huge change in interest worldwide.

So I'm going to do a very basic introduction. Very pedestrian: if you know a lot about machine learning, I'm not going to say anything new until the end. I apologise for that, but I'm setting up the next two speakers, Andre Lukas and Elliot Benton, who'll be giving you some exciting cutting-edge applications of machine learning in physics.

I want to start by giving a kind of potted and partial history of the field. And it starts with this great man, Alan Turing, who wrote this unbelievably important paper, "Computing Machinery and Intelligence", which is kind of the founding paper of the field of artificial intelligence. He made many important contributions there, including the Turing test. But he also said: instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. We have thus divided the problem into two parts: the child programme and the education process. I'm going to tell you today about the child programme and the education process.

This paper stimulated enormous amounts of work, and when we talk about machine learning it's important that we look at the history of machine learning and artificial intelligence; I'll come back to the ambiguity of those words a little bit later.

Here you have a little timeline from 1950 down to here (the arrow's the wrong way around). The Turing test, 1950. Not that long afterwards the first automated translators came out, and they kind of worked, but they didn't work very well. There was enormous interest, huge amounts of money went into it, and then it kind of died out. In 1957 a man called Rosenblatt invented the perceptron, which is a very basic model of a neurone; we'll come across it a bit further on in the talk. Then there was a piece in The New York Times in 1957.
That's only seven years after Turing's paper, and the journalist wrote: the Navy revealed the embryo of an electronic computer that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. So there's nothing new about hype; we're used to hype cycles in lots of fields. But in 1957 this was the feeling, and huge amounts of money poured into this area.

By the end of the sixties the money started to dry up, because the promise was not really being kept, and the very famous 1969 book by Minsky and Papert on the perceptron showed that, in fact, these perceptrons were not nearly as powerful as people had initially thought: they couldn't do some basic logical operations like XOR, and so on. Interest in these things started to die down. And in 1973 came the Lighthill report: the British government commissioned a big report on AI from the very famous mathematician James Lighthill, who wrote that there was really going to be no progress in this. His argument was that these problems are typically so complex, with a combinatorial explosion of possibilities, that the field would only ever work on toy problems and never anything else. And then came the first AI winter: the funding completely dropped, people's careers ground to a halt, and nobody wanted to fund this anymore.

Then in the 1980s we had a comeback with expert systems, and a few of you may remember these. All kinds of companies jumped on the bandwagon; hundreds of millions of dollars and pounds were poured into research, and great promises were made about what this would all mean, how this would transform our lives. But by the end of the 80s these systems turned out not to work as well as people had thought; they were very expensive to maintain and they were very brittle. And so all the funding stopped again.

As recently as 2007, in The Economist, I found the following quote: investors were put off by the term voice recognition which, like artificial intelligence, is associated with systems that have all too often failed to live up to their promises. So this was something close to the end of the second AI winter, when there was very little funding in this area.

Now, everything has obviously changed in the last decade; we wouldn't be here if that were not the case. And I want to locate the big change in 2012. It's somewhat arbitrary, but I'll explain to you what changed in 2012, and what made Google CEO Sundar Pichai say something like: AI is one of the most profound things we're working on as humanity, more profound than fire and electricity.
Now, I can safely say that that is probably hype, but he's excited, OK? People are excited, and enormous amounts of money, billions and billions, are being poured into this area. I have a friend who recently listed a company on the stock exchange, and he told me that by putting "artificial intelligence" in the title he doubled the valuation of the company. So there's a lot of interest in this, and I think for good reason. There's also a lot of hype: if you open a newspaper you find all kinds of wild claims, you know, AI will take away your jobs, there won't be any doctors anymore, and so on. A lot of that is hype cycle and probably not true.

But there are also extremely exciting things that have happened. One that you may be familiar with: in 2016 a computer programme from Google DeepMind called AlphaGo beat Lee Sedol, who was an 18-time world champion in Go. That was a big step forward, because as opposed to something like chess, which is a relatively constrained game, Go is unconstrained, has an enormous number of possibilities, and was thought to be an unsolvable game. So this new age of deep learning had done something really amazing.

Perhaps even more interesting, in the summer of 2017 DeepMind released a programme called AlphaGo Zero, which didn't know anything about Go. It just learnt the rules, OK; given the rules, it played against itself, so it didn't look at any games that were played by experts. And after 40 days it was able to outperform AlphaGo, the version that had beaten Lee Sedol and had been trained on all the expert games. It had taught itself how to play Go much better than other computers, and certainly much better than humans.

And the same thing happened for chess: they made a chess programme that taught itself how to play chess. Interestingly, it started out as a poor chess player; then it went through all the standard opening moves, learnt them and discarded them one by one until it became better than the world's best, better than the best alternative chess programmes, which are built with a lot of expert information. So this is a programme that teaches itself how to play the game, and that, without a doubt, is an extraordinary achievement and a very exciting thing.

So that's why people are excited, but how do these programmes work? Well, I'm going to give you a very potted, simple introduction to the basic technology behind this, and I'm going to tell you why I think 2012 is the start of what we call the deep learning era. A very important part of what changed things is the accessibility of huge amounts of new data.
This is Fei-Fei Li, a computer scientist at Stanford. I put her in because she was a physicist originally; actually, a lot of people who were physicists moved into this field over the last few decades and transformed things. She introduced a competition called ImageNet, where there are now 14 million images in about 20,000 categories: a cat, a dog, a bicycle. And the competition was: take your best computer programme, learn from a bunch of images, and then we'll give you new ones that you haven't seen before, and you predict whether each is a cat or a dog or a bicycle.

In 2012 a team from the University of Toronto won; Alex Krizhevsky was the main person, along with Geoffrey Hinton, actually the most famous person in that field. Their network, which is called AlexNet, had 60 million parameters, and it beat all the other entrants by a huge margin: roughly 40 percent lower error. That was a big step forward, and now we're down to about a two percent error, all based on these machine learning networks. Here I have a little schematic picture of AlexNet; it still looks pretty complicated, OK, it's a slightly messy system. But it got people really interested.

Immediately Google started pouring huge amounts of money into this, and so did Facebook and Microsoft; these companies are now rebranding themselves as AI companies. And it's all based on this basic technology called deep learning. Although it's not complicated (I'll explain in a minute more or less how it works), in the academic world this grew incredibly. Take Geoffrey Hinton, whom I'll mention a few more times. If you look on Google Scholar you can see his citations: this is 2011, 2012, and it has just exploded. Last year alone he had seventy-three thousand citations. For comparison, world-leading scientists, even Nobel prize winners, often have fewer citations than that in their whole lifetime; that's how many he got in a single year. This gives you a sense of how many people are working in this field. Three of the five most cited papers in Nature from 2019 were on deep learning, so there's enormous excitement about this.

And the three great experts, the three great founders of this field, Yoshua Bengio, Geoffrey Hinton and Yann LeCun, won the Turing prize, which is the equivalent of the Nobel prize in computer science, last year. What's interesting is that these three pioneers worked more or less in obscurity, at the fringe, for many years.
Just a couple of months ago I saw a quote from Hinton, where he talked about trying to submit a paper to an AI conference years before, and the referee said: Hinton has been working on this idea for seven years and no one is interested; it's time to move on. So for many years (Hinton is in his 70s now) he worked on these ideas, and everybody told him he was crazy. These three worked through the big AI winters, in spite of the fact that everyone told them this was wrong; they couldn't get funding, they couldn't get published, but they kept at it. And now the revolution has come. This is a good lesson for us: very often great innovations come from outside, and ideas that look wrong at one time may turn out to be revolutionary at a different time.

What changed was the availability of large amounts of data, like ImageNet, on which to train things, plus of course lots of proprietary and commercial data; and, of course, large computers. Those two things, plus some algorithmic innovations, allowed an idea which actually goes back to the nineteen-fifties and sixties, these neural networks, to revolutionise the way we do artificial intelligence today.

And what's new about them? I'll explain a few things that are new. One is that, unlike traditional AI systems, they're simple to use: secondary school students can use them now and train things. So it's a huge revolution. In physics, Physics World ran an article just last year asking whether machine learning will revolutionise physics. Well, we'll see; one should be careful about calling things revolutions. There's definitely a bandwagon, there's hype. I was talking to a colleague this week who said, you know, one of the tricks in science is to jump on bandwagons and then jump off them before they crash. I don't think this one is going to crash, because we're seeing enormous and exciting applications.

There's actually been a long history of applications in data analysis. Particle physicists have been using neural networks for a very long time to analyse their data, and that's one of the reasons I think there's a natural link between physics and this kind of machine learning based on neural networks. There's a huge amount of work in image analysis: we've got people here working on biological physics who are using these machines to analyse images of cells dynamically.
These networks, by looking at images, are recognising patterns in the data that humans seemed unable to see, and that's being exploited. They're even being used to represent quantum many-body wave functions and to calculate the energy levels of molecules and many-body systems to much higher accuracy than was thought possible before. There are beautiful experiments by some colleagues here, including Séamus Davis and others in physics, where they took data from quantum magnet experiments, looked at it again with a machine learning algorithm, and discovered a bunch of new patterns in there. So there are really exciting ways of using this in image analysis and in controlling experiments, and in the last talk we'll hear about that and much, much more; an enormous number of cool examples.

Just to give you one last bit of big picture, a big-picture explanation: what is artificial intelligence? We billed this day as artificial intelligence in physics, but what we're really talking about is something called machine learning, and I'll mostly talk about a subset of that called deep learning. Artificial intelligence is a catch-all phrase for all kinds of computational methods that make a computer intelligent in one way or another; there are lots of methods based on symbolic manipulation, a really wide range of techniques. Inside that is a smaller subset of techniques called machine learning. When you saw the quote from Turing, he was speaking about training a computer like a child: machine learning uses data to train the parameters of a machine. There are many different machine learning techniques, a wide variety of them. And the one that really caught everyone's attention is the method called deep learning: that big AlexNet you saw, which won the ImageNet competition in 2012 (and deep learning has dominated the ImageNet competitions ever since), is deep learning.

So what is deep learning? This is pretty basic. Here's the quote from Turing again, with its second part: we've got two parts to the problem, a child programme and an education process. Deep learning is based on the following idea.
I have some input nodes, with weights, some kind of relationships (which I'll describe in two minutes) connecting them to another set of hidden nodes, and then an output layer. The input might be pictures of cats and pictures of dogs, and the output might be "this is a cat" or "this is a dog". Very, very loosely, this is inspired by neurones in the brain. Neurones in the brain can fire or not: firing corresponds to a node having a large value, not firing to a small value. And a node connects to the other nodes in the same way a neurone in your brain is connected to many other neurones; the weights take that into account.

So I want to draw one or two things here on the board; I realise you can only see a small amount of the board, but that's OK. I'm going to give you a really, really simple example. Say I have three input nodes, and I have a problem I'm interested in, something like: I'm a doctor, and I ask you three questions to figure out if you're ill, or whether you need to be quarantined, for example. The first question might be: do you have a cough? This could be one or zero, right, yes or no. Question two: do you have a sniffle? Again one or zero, yes or no. And question three could be: did you just come back from Italy? One or zero.

OK. Then there's some kind of logic I would apply to this. For example, if none of these is true, the answer would be that you don't need to be quarantined, OK? And maybe if you've just been to Italy, I would quarantine you no matter what. And you can see how this goes on, right: maybe if you have a sniffle but haven't been to Italy, I'll just keep an eye on you, but if you have a sniffle and you've been to Italy, I will definitely quarantine you, et cetera. I'll just go through it really quickly; at some point my made-up logic breaks down, so let's not pretend this is exactly how a doctor would do it. But what you see is that for each of these eight possible inputs I have an output: those are my inputs and outputs.

So what I've actually done in this particular problem is define a little function, right? A function that takes this set of inputs and gives, in this case, a set of outputs.
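To make that concrete, here is the kind of truth table I have in mind, written out as a minimal Python sketch. The exact outputs I wrote on the board aren't fully recoverable from the recording, so treat the mapping below as an illustrative assumption, not the precise table:

```python
# Inputs: (cough, sniffle, italy), each 0 or 1.  Output: 1 = quarantine, 0 = don't.
# The particular outputs here are an assumed, plausible version of the board's table.
QUARANTINE = {
    (0, 0, 0): 0,  # no symptoms, no travel: no quarantine
    (0, 0, 1): 1,  # been to Italy: quarantine no matter what
    (0, 1, 0): 0,  # just a sniffle: keep an eye on you, but no quarantine
    (0, 1, 1): 1,  # sniffle and been to Italy: definitely quarantine
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 1,  # cough and sniffle together: quarantine
    (1, 1, 1): 1,
}

def quarantine(cough, sniffle, italy):
    """The doctor's decision: a function from three bits to one bit."""
    return QUARANTINE[(cough, sniffle, italy)]

print(quarantine(0, 1, 1))  # sniffle + Italy -> 1 (quarantine)
```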
If I were a well-trained doctor deciding whether to quarantine you or not, this would be my input-output function; or perhaps the government would hand me this advice. That's the function I then have. And so I might want to train a computer to be able to learn this function; it's a relatively simple function. So how would I do this?

Well, I've got three input nodes, one, two, three, each of which can be zero or one. Then I'm going to have a layer here of further nodes, maybe lots of nodes; I'll call these layer one, layer two, and so on, and then I have an output node. What I'm going to do is draw lines that say this input node connects to all the nodes in the first layer; this one does the same; this one does the same. OK, incredibly simple. And those nodes do the same to the next layer, et cetera. I could keep drawing forever, and at the end the nodes all converge on this final decision node. I'm not going to draw all the lines, because there are tons and tons of them, and I can have many layers.

Then what I do is put a weight on each of these lines. So for node number one of layer one (maybe I'll colour them so you can see them), the weights going in are weights one-one, two-one and three-one: there are three of them, one from each input.

I couldn't quite figure out how to make the board go up, so I'm going to jump to the side and hide this little truth table here for a minute or two. Sorry. All right, so what is the value of this node going to be? Well, node one in layer one is going to have a value which is some function f of the inputs. What are the inputs? It's x_1 times weight w_11, plus x_2 times w_21, plus x_3 times w_31, plus maybe an offset:

a_1 = f( x_1 w_11 + x_2 w_21 + x_3 w_31 + b )

And this function f will typically be something like the sigmoid, f(z) = 1 / (1 + e^(-z)), which is a classic choice. This simple function looks like this: for large negative z it's zero, at z = 0 it's a half, and for large positive z it's equal to one. So if the input to the node is large and negative, my node goes to zero; if the input is large and positive, it goes to plus one; otherwise it takes a value in between.
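Here is that single-node computation as a minimal numpy sketch; the particular weight and offset values are arbitrary example numbers, not anything from the board:

```python
import numpy as np

def sigmoid(z):
    """Zero for large negative z, one half at z = 0, one for large positive z."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])   # the three binary input values
w = np.array([0.4, -1.3, 2.0])  # weights w_11, w_21, w_31 (arbitrary example values)
b = -0.5                        # the offset

a1 = sigmoid(x @ w + b)         # value of node 1 in layer 1
print(a1)                       # some number between 0 and 1
```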
The sigmoid is just one of the many activation functions you could have. And notice what this really is: it's f(x · w + b), a function of an inner product. Now I do that for this node, for that node, and for all the other nodes in the layer; then I take those node values and use them as inputs for the next layer, and I play the same game, and the same game again at the end. Finally I look at the last node, and I might say: if its value is larger than one half, I should quarantine you; if not, I won't. OK. So that's basically the way this works, and it's incredibly simple. It's so simple that you wonder: if I put enough of these together, how could that learn how to play Go, for example, or learn how to recognise images? It's quite a striking thing, because mathematically it's unbelievably simple: you're just adding up numbers, multiplying them, and then applying a little nonlinearity, the threshold, at the end.

That nonlinearity is important, though, because without it this model is just a linear model: you can write it down as a series of matrix multiplications, and a product of matrices is just a matrix, so the whole thing collapses to a linear model, which can't do very much. So the nonlinearity turns out to be really important.

All right, so that's the first part of the basics. The second part is that I need some education process, and there are basically three main classes of education process. The first one is called supervised learning. In supervised learning I have two sets of data. Here's a very famous example called MNIST: these are handwritten digits, and they were used to train some of the earlier neural networks, before 2012, that were used for automatic scanning of cheques, for example. These are the zeros, those are the ones, those are the twos. So I might take this first subset as inputs to my machine, and then I play with these weights until, for each of these images, if it's this I get a zero, if it's this I get a one, two, three, four or five. I train it until, hopefully, I get zero error on my training set.

Or I can make it even simpler with our little example. Let me pick these rows as my training set: I'm going to give you these four inputs with the correct outputs, and I'm going to ask an automated optimiser to find weights such that, given those inputs, the network gives me those outputs.
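Here is a minimal sketch of that training loop. The talk doesn't specify the optimiser at this point, so plain gradient descent on a mean-squared error stands in, and the truth table is the assumed one from above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# All 8 inputs and the (assumed) quarantine outputs from the truth table above.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], float)
y = np.array([0, 1, 0, 1, 0, 1, 1, 1], float)
train = [0, 2, 5, 7]       # four rows used for training; the other four are the test set

# One hidden layer of 5 nodes feeding one output node.
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

lr = 2.0
for step in range(5000):
    h = sigmoid(X[train] @ W1 + b1)       # hidden layer, exactly as on the board
    out = sigmoid(h @ W2 + b2).ravel()    # the final decision node
    # gradient of the mean squared error, pushed back through the sigmoids
    d_out = ((out - y[train]) * out * (1 - out))[:, None] / len(train)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;        b2 -= lr * d_out.sum(0)
    W1 -= lr * X[train].T @ d_h;   b1 -= lr * d_h.sum(0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
print((pred > 0.5).astype(int))  # decisions on all 8 inputs, trained and untrained alike
```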
Then I've trained my system, and then I'll test it. OK, so I might test it on something more complicated, this image: I would take these ones and see how well I do. Or, with our extremely simple toy example, I would train it first on those four rows and then test it on the rest. And once I'm happy that it gives me the right result with high enough accuracy, then I unleash it on the public. So that's the first type: supervised learning, and most of what I'll be speaking about today is supervised learning. Does that make sense? It's a super, super simple thing. One way of thinking about it is that these things are a kind of computer programme that fits some kind of function, except you haven't quite written the programme yet, because you have to figure out what these weights are. So the training process is a way of having the machine write its own programme.

Another kind of learning is called reinforcement learning, which is somewhat similar: you're trying to train these weights, but rather than using labelled data where you know the answers in advance, a bunch of curated stuff, you have an agent going through some kind of process in some environment. Every time the agent does well, you say this is a good set of weights, keep something like that; and if it does badly, you penalise it. You keep doing this reward-penalisation procedure until eventually the system learns. That's what AlphaGo Zero and AlphaZero do: they play against themselves, and every time one side wins it says those weights are good, and when it loses it says they're bad. It's very roughly akin to what we do with our children, though it doesn't always work quite as well as we'd like.

The last method, which is also super important but which I won't talk about much today, is unsupervised learning. In supervised learning I might have, you know, two variables that tell me some property of objects, and all the red ones are here and the blue ones are there, and I'm trying to learn some decision boundary: some line that tells me, if my parameters are here it's red, there it's blue. In unsupervised learning I have no idea what the axes mean; I'm looking for patterns in the data. I might see that these points are clustered together and those are clustered together, and so I start picking out features of the data.
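As a concrete illustration, here is a minimal sketch of one of the simplest unsupervised techniques, k-means clustering; the talk doesn't name a particular algorithm, so this choice is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 unlabelled 2-d points forming two blobs; we pretend we don't know that.
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(3.0, 0.5, size=(50, 2))])

# k-means with k = 2: alternately assign each point to its nearest centre,
# then move each centre to the mean of the points assigned to it.
centres = data[[0, 50]].copy()  # initialise the centres at two of the data points
for _ in range(20):
    labels = ((data[:, None, :] - centres) ** 2).sum(-1).argmin(1)
    centres = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(centres)  # ends up near (0, 0) and (3, 3): the two hidden clusters
```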
That's extremely important as well: if you've got a lot of data and you're not quite sure what the right way of thinking about it is, unsupervised learning techniques let you find patterns that you wouldn't otherwise see.

So those are the basics of how these things work. And I think it's absolutely remarkable that a system this simple, like what I just showed you, maybe souped up with more layers or some tricks, but effectively this, can achieve these amazing feats. And so the big question, I think the super interesting question for us today, is: why do deep neural networks work so well? It's completely remarkable. Why do they give us such incredibly good predictions for all kinds of amazing things? You're going to see methods of this type being used to find patterns in string theory. They've been used, obviously, to play games. They've been used in all kinds of image recognition: about six or seven years ago Eric Schmidt from Google said we should stop training radiologists, because they're all going to become irrelevant, because image recognition is going to look at your scans and tell you whether you have cancer or not. And already there are some studies that show these things can work roughly that well; they can do better than the best-trained human radiologists at recognising cancers. The reason we have not stopped training radiologists is, first, that there's a lot more to radiology than just looking at images, and also that these systems turn out to be more brittle than you might expect. A classic example: you train your system on hospital A and get very high accuracy, then you take it to hospital B and the accuracy drops by a fairly large amount, and nobody quite knows why. Those are the kinds of problems we have. And until we solve these fundamental questions about why they work so well in the first place, we won't understand why they have these funny ways of failing in other settings and at other times.

The classic example that's not really well understood is adversarial examples. I can show a computer a picture of a panda, and by tweaking a few of the pixels cleverly, it will be completely confused and categorise it as a gibbon, for example. There are more frightening examples. There's a recent exploit by a group at the Chinese company Tencent, who did this on Tesla cars. Tesla cars, like many self-driving systems in the industry, use some kind of machine learning to recognise, for example, the lane that you're in.
287 00:30:29,830 --> 00:30:38,200 And they were able to put a few small stickers on the lane and confuse the network so that it thought that it was in the wrong in the wrong lane, 288 00:30:38,200 --> 00:30:45,310 it swerved very quickly. Now before you, before you sell your Tesla very quickly, 289 00:30:45,310 --> 00:30:51,070 the way this works is that you get paid a lot of money by Tesla if you find these kinds of errors. 290 00:30:51,070 --> 00:30:55,360 So you there's there's a whole industry of people that try to find these errors in Tesla. 291 00:30:55,360 --> 00:31:02,920 Tesla will then pay them and then tell them, give them a timeline after which the ramp would be public about it or not. 292 00:31:02,920 --> 00:31:07,750 And so this group did that, but it was quite striking because a human would never do that. 293 00:31:07,750 --> 00:31:10,360 So one of the questions is why do these machines? 294 00:31:10,360 --> 00:31:16,120 Why this is sceptical to things like like adversarial examples, and there's many other interesting questions. 295 00:31:16,120 --> 00:31:24,070 And so although although. We know that these things work very well. 296 00:31:24,070 --> 00:31:29,380 There's really a question of why they work so well, so I'm going to unpack that in a very particular way. 297 00:31:29,380 --> 00:31:33,850 So one really fascinating thing that we've known for quite a while about deep neural networks, 298 00:31:33,850 --> 00:31:38,350 neural networks is that there are, in fact, universal function approximations. 299 00:31:38,350 --> 00:31:43,210 So here's the most recent, probably the most recent theorem by Boris huntin, 300 00:31:43,210 --> 00:31:52,060 but there's a long history of theories before that's really exist for any kind of function from some high dimensional space to the real numbers. 301 00:31:52,060 --> 00:31:57,360 So, for example, this would be from a hydrogen space to a real output, which is either one or zero. 302 00:31:57,360 --> 00:32:05,290 It's there exists a fully connected rather network revenue is just a fancy word for one of these activation functions. 303 00:32:05,290 --> 00:32:09,190 And it's important technically, but not that important for you to understand. 304 00:32:09,190 --> 00:32:17,470 It's basically something like that. And with if you have if you have a width of the of the network, 305 00:32:17,470 --> 00:32:27,430 so if the number of nodes in this layer has to simply be less than and plus four inches, four is the size of dimension of your space. 306 00:32:27,430 --> 00:32:28,990 So you were a three dimensional space. 307 00:32:28,990 --> 00:32:36,940 So if I have a seven dimensional one, I should be able to to produce completely every function in of this of this type. 308 00:32:36,940 --> 00:32:44,120 It's a very powerful kind of theorem. And so that tells us that the whole networks are extremely highly expressive. 309 00:32:44,120 --> 00:32:48,830 That means they can they can fit almost anything you throw at them. 310 00:32:48,830 --> 00:32:57,800 Now why is that kind of interesting? Because it gives us a conundrum, OK, if they're so highly expressive that music can fit any function to the data? 311 00:32:57,800 --> 00:33:02,840 Why do they pick the right function? So let me give you the example here, right? 312 00:33:02,840 --> 00:33:07,050 So I've got. Three bits. All right. 313 00:33:07,050 --> 00:33:13,560 That gives me. That gives me two to the end is a number of different bits I have. 
So three bits gives me 2^3 = 8 different possible inputs. And how many different functions are there? A function assigns an output bit to each of the eight inputs, so its output string has length eight, and there are 2^(2^n) = 2^8 = 256 possible functions this network can produce: 256 different ways I can map these bits to outputs. OK. Now suppose I train my neural network on four of these inputs and it gets them all correct. Then there are four inputs left over, so there are 2^4 = 16 possible functions that are all consistent with being correct on the training set. The system can express all 16 of them; so when I train it, how does it know which one to pick? Why does it pick a good one, which is what it typically does? Does that make sense? It's an extremely simple question.

And just to give you a sense of how these numbers grow: the number of functions grows as 2^(2^n), really quickly. If I have seven inputs, it's 2^128, which is about 10^38 functions. With eight inputs it's 10^38 squared, and with nine it's 10^38 squared squared; it grows incredibly fast. I worked out roughly that if I have nine inputs and look at all possible functions, there are about 10^154 of them. That's a lot more than the number of bits of storage we currently think the universe could hold, even if you used everything in the universe to store them. So for a relatively small system, nine yes-or-no questions, the number of possible combinations of answers is more than the whole universe can store. This is what Lighthill was trying to say with the combinatorial explosion: even a relatively small problem becomes unbelievably big and intractable. And so he said you're never going to get beyond toy problems, which seems reasonable given this kind of argument.

Now, fascinatingly, a neural network will be able to reproduce all of that very large number of functions, because of the universal approximation theorem. So why does it pick a good function? This is a really interesting question.
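The counting is easy to check by brute force for n = 3; the particular four training constraints below are arbitrary, chosen just to make the 2^4 = 16 count visible:

```python
from itertools import product

n = 3
inputs = list(product((0, 1), repeat=n))          # 2^3 = 8 possible inputs
functions = list(product((0, 1), repeat=2 ** n))  # each function = one 8-bit output string
print(len(functions))                             # 2^(2^3) = 256

# Pin down the outputs on four of the inputs (an arbitrary 'training set'):
train = {0: 0, 2: 1, 5: 1, 7: 1}                  # input index -> required output

consistent = [f for f in functions
              if all(f[i] == out for i, out in train.items())]
print(len(consistent))                            # 2^4 = 16 functions remain
```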
In the field, this question was sharpened by a very famous paper by Zhang and collaborators, which has been cited thousands of times in the last four years; just extraordinary. And they did the following experiment. They took this image dataset called CIFAR, a Canadian dataset: aeroplanes and automobiles and birds, cats and deer, et cetera. And they simply permuted the labels: rather than labelling this one "aeroplane", they just picked a random label. And what they showed was that a neural network could learn this corrupted data relatively quickly, with not that much more work, and could reproduce the training set with 100 percent accuracy, zero error. That's very striking: it can simply memorise the data. Obviously, if you then give it new correctly-labelled images, it has essentially no predictive accuracy, because it hasn't learnt anything; it has just memorised the links between labels and images. So it can memorise, which we knew theoretically it could; but they showed that it finds that memorising solution really quickly, which is in itself exciting and interesting.

And the question is: why does it generalise well? Why does it give us good solutions? Given that, when handed the correct labels, it could just memorise them and have no predictive power, why does it not do that when given correct labels, whereas it happily does exactly that when given incorrect labels? This is a very big question in the field, and a very interesting one.

And as physicists, of course, we're very nervous about these high-dimensional systems. I told you there are 60 million parameters in AlexNet from 2012; there are now models with billions of parameters, and they work extremely well. So why are we worried? Well, as physicists we're told that you should never have too many parameters, right? There's a very famous story by Freeman Dyson. He had a high-energy physics model with, I think, five parameters in it, and he went to see Fermi. Fermi asked how much data he had, and he didn't have that much data, and Fermi quoted a claim that von Neumann had made to him: with four parameters I can fit an elephant, and with five I can make it wiggle its trunk. In other words, you may fit the data well, but your model is probably not right. And so Dyson went home to Cornell with his tail between his legs. That's an intuition we teach our students all the time: never use too many parameters. Incidentally, a group about ten years ago worked out that von Neumann was indeed right: you can fit an elephant and make it wiggle its trunk with five parameters, and four is not enough. He was just a great genius.
So I'll give you a very simple toy problem that visualises this in a really simple way. Here I have ten data points, and I can fit those ten data points by a polynomial. What we tell our students is: if you've got ten data points, don't fit a high-order polynomial to them. Here, this dashed line is a fifth-order polynomial. If I fit a 20th-order polynomial instead, I get zero error, because I can fit the data extremely well, but it starts to show very odd behaviour which, intuitively, is probably not going to generalise well: if I give it new data points, it's going to make very bad predictions, right? That's what von Neumann and Fermi were trying to tell Dyson.

Now, here's the fascinating thing. We've trained a bunch of neural networks here, with one, two and five hidden layers, and with different widths; it doesn't seem to matter very much. They all fit the data like this, the green curve; they actually lie on top of each other on this scale. So the question is: why do these networks, with thousands of parameters, fit the data so smoothly? To our eye this looks much better. This is a kind of central conundrum in the field: why do they work so well?

In fact, there was a famous spat between Ali Rahimi and Yann LeCun, two important figures in this field, where Rahimi said: you know, machine learning is alchemy. It works, and we have no idea why; we have no idea why it generalises so well. Now, I wouldn't be setting all this up this way if I didn't think I had something to say about it. So in the time that remains I'll explain a new idea from our group, which actually generalises something that's been around for 50 years in the field of algorithmic information theory. That is the study of the complexity of single objects: it's an information theory, but as opposed to Shannon information, which is about distributions, it's about single objects. The central quantity in this theory is Kolmogorov complexity, which is formally defined as the length of the shortest programme that will generate a particular string on a universal Turing machine.

I'll illustrate this with a very simple example. Imagine a monkey typing on a typewriter, or I should say a word processor, for the younger people in the audience who have never seen a typewriter. To make it simple, say it's a binary keyboard, only zeros and ones, and the monkey types on this.
How likely is this monkey to type the following 100-digit-long sequence? Well, the monkey will type it with probability one half to the power 100. Monkeys are not truly random, but let's assume we have a random monkey. (There was actually an experiment at Paignton Zoo where they gave a typewriter to a bunch of monkeys, and what they found is that the monkeys kept tapping the same key many, many times, with a preference for one particular letter, and then they defecated on the keyboard. That was the end of the experiment. So ours is a hypothetical monkey.) OK, so on a binary keyboard the probability of any given 100-digit output is one half to the power 100.

But what if the monkey were typing into some kind of computer programming environment? Then it might accidentally type something like "print 01 fifty times". Let's ignore the exact size of the keyboard for a moment; the point is that the probability of producing the sequence this way scales as one half to the power of the length of that programme, which is much less than 100 characters. (It doesn't quite work in binary arithmetic like this, but I'm phrasing it this way because I'll need the binary picture later.) The point being: if you're typing into some kind of programming environment, outputs with a short description become much more likely. And Kolmogorov showed that this, the shortest description length, is mathematically the right way of describing the complexity of a single object.

The reason this hasn't been applied much is that Kolmogorov complexity is linked to the halting problem for Turing machines, which links to Gödel's undecidability theorem. So there's a fundamental uncomputability: you can never know for sure that you have the shortest programme. But you can approximate it. So what we did is say: let's assume we can approximate it anyway. And then the claim is that the probability that you get a certain output should be something like one half to the power of the length of the shortest programme that gives you that output, with some constants in there which we can ignore; to first order, qualitatively, that's what we're saying. And we're saying this might be a much more general property of input-output maps: feed a map random inputs, and low-complexity outputs should be exponentially more likely.
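Kolmogorov complexity itself is uncomputable, but a standard compressor gives a rough, computable upper bound. Here is a sketch using zlib as a crude stand-in; take this proxy as an illustration of the idea, not necessarily the measure used in the work described:

```python
import random
import zlib

def complexity(s: str) -> int:
    """Compressed length in bytes: a crude, computable upper bound on
    the Kolmogorov complexity of the string s."""
    return len(zlib.compress(s.encode()))

random.seed(0)
regular = "01" * 50                                       # 'print 01 fifty times'
noise = "".join(random.choice("01") for _ in range(100))  # the random monkey

print(complexity(regular), complexity(noise))  # the regular string compresses far better
```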
So the question then is: how can we think of a neural network as an input-output map? And we have to be careful here, because there's the obvious map from data inputs to outputs, but you can also think of a different kind of input-output map, the parameter-function map. The inputs are now the parameters, which I can set arbitrarily, and the outputs are the functions that those parameters pick out, right? So the "inputs" here are not the inputs to the network; they're the parameters that I choose. And if this coding theorem idea is correct, then upon randomly picking parameters I should mostly get functions that have low Kolmogorov complexity.

Now, one of the great things about being in Oxford is that you have unbelievably talented undergraduates. An undergraduate worked for us over a summer programme; he was actually working on reinforcement learning, and then one day he came to me and said: I've been thinking, and I think I've proved something. It turns out that he hadn't quite proved it, but he had almost proved something which we have now proven, which is the following. For a very simple neural network called the perceptron, with only inputs, weights and an output, like a zero-hidden-layer neural network, you can say this about these binary problems: upon randomly sampling the weights, we can't compute the probability of each particular function, but we can prove that the total probability of getting functions with a given number of ones is the same for every such class. So the probability of getting the function with zero ones, which is the all-zeros function, equals the combined probability of all the functions that have four ones. But as you can see, four ones can be permuted in lots of ways: there are many functions with four ones, while there's only one function that is all zeros. So I'm much more likely to get the all-zeros function than any particular function with four ones, because the latter is one of a big class of functions. We proved that exactly.

This is for a very simple system, so we then took a neural network with multiple layers and gave it seven inputs, so 2^7 = 128 possible input strings and 2^128, about 10^38, possible functions, and we randomly sampled the parameters. We repeated this many, many times and asked: how often do you see the same function appear again and again?
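Here is a miniature of that random-sampling experiment, using the simple perceptron (no hidden layer, three inputs) rather than the larger networks of the actual study; the Gaussian sampling distribution is an assumption made for illustration:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
inputs = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], float)

counts = Counter()
samples = 100_000
for _ in range(samples):
    w, b = rng.normal(size=3), rng.normal()       # randomly sampled parameters
    f = tuple((inputs @ w + b > 0).astype(int))   # the Boolean function they define
    counts[f] += 1

# The most frequent functions are the simplest ones (all zeros, all ones,
# copies of a single input...), not 'typical' balanced functions.
for f, c in counts.most_common(4):
    print(f, c / samples)
```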
447 00:45:56,350 --> 00:46:03,160 You'd never get the same function twice in the age of our universe, even with the fastest computers, by random sampling. 448 00:46:03,160 --> 00:46:10,930 What we find here is that some functions appear almost 10 percent of the time, some one percent of the time, some one in a thousand. 449 00:46:10,930 --> 00:46:17,260 That's thirty-five, thirty-six orders of magnitude more likely than you'd expect under the null model. 450 00:46:17,260 --> 00:46:21,770 And so there are these functions that appear with high probability when I randomly pick these parameters. 451 00:46:21,770 --> 00:46:25,780 And the question is, do I know which ones they are? Well, I have this theory here, 452 00:46:25,780 --> 00:46:32,500 this theorem that says the probability of getting a function should scale with two to the minus the complexity of the function. 453 00:46:32,500 --> 00:46:38,110 So I plot the probability on a log scale here versus the complexity of the function. 454 00:46:38,110 --> 00:46:42,040 By complexity I basically mean how compressible the function's output string is: 455 00:46:42,040 --> 00:46:47,410 a function which is zero one zero one zero one repeated would be a simpler function, and all zeros would be simpler still. 456 00:46:47,410 --> 00:46:49,240 So here I've plotted the complexity of the function, 457 00:46:49,240 --> 00:46:57,040 and this line is actually the upper bound that we calculate from theory, and it predicts exactly what you'd expect. 458 00:46:57,040 --> 00:47:03,850 In other words, you're very likely to get these simple functions and very unlikely to get complicated functions. 459 00:47:03,850 --> 00:47:07,240 We've shown this holds much more generally for neural networks. 460 00:47:07,240 --> 00:47:12,940 And so this is exciting, because we show that they have this fundamental bias towards simplicity. 461 00:47:12,940 --> 00:47:17,170 And what's interesting is you can play other games. So I can do training, OK: 462 00:47:17,170 --> 00:47:22,310 I can train a neural network on a complicated function or on a simple function, and then ask myself, how well does it do? 463 00:47:22,310 --> 00:47:26,800 So I'll pick half of my inputs to train on and the other half to test on. 464 00:47:26,800 --> 00:47:31,270 And what you see is that as the target function, the function it's trying to learn, gets more complicated, 465 00:47:31,270 --> 00:47:38,170 my error goes up. So one thing we see is that these networks do well on simple tasks, but poorly on complex tasks. 466 00:47:38,170 --> 00:47:44,410 And this curve at the top here is a random learner, where you just pick a random function that fits the training data. 467 00:47:44,410 --> 00:47:49,180 And the vast majority of functions that you pick randomly are terrible at generalising, as you'd expect. 468 00:47:49,180 --> 00:47:55,160 And so the random learner basically behaves as if it might as well not learn at all. 469 00:47:55,160 --> 00:48:02,390 And what's interesting is, if you train the network on a simple function, it finds functions with similar complexity and low error, 470 00:48:02,390 --> 00:48:06,770 so it tends to bunch around the one that you're trying to find, which is not what the random learner does. 471 00:48:06,770 --> 00:48:10,880 But if I give it a complicated function, it behaves more or less the same as a random learner. 472 00:48:10,880 --> 00:48:15,620 So that's what you'd expect. Networks are not magic, right? They just have a bias towards simple functions.
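A rough sketch of that train-and-test game, using scikit-learn for convenience; the architecture and the two target functions are my own illustrative choices, not the ones from the actual experiments.

    import itertools
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = np.array(list(itertools.product([0.0, 1.0], repeat=7)))
    idx = rng.permutation(len(X))
    train, test = idx[:64], idx[64:]   # half the inputs to train, half to test

    targets = {
        "simple target (copy bit 0)": X[:, 0].astype(int),      # low complexity
        "complex target (random truth table)": rng.integers(0, 2, len(X)),
    }

    for name, y in targets.items():
        net = MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=5000, random_state=0)
        net.fit(X[train], y[train])
        # Expectation: near-zero test error on the simple target, but roughly
        # coin-flip test error on the random one, just like a random learner.
        print(name, "test error:", round(1 - net.score(X[test], y[test]), 3))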
473 00:48:15,620 --> 00:48:19,850 So if I give them simple functions to learn, they should do well. If I give them a complicated function, 474 00:48:19,850 --> 00:48:23,450 they're not going to do any better than any other kind of random guessing. 475 00:48:23,450 --> 00:48:29,210 So you might think that people would find this very exciting, if they believed this is the reason why these things generalise so well: 476 00:48:29,210 --> 00:48:35,460 they generalise well because of this bias towards simple functions, and simple functions give you good generalisation. 477 00:48:35,460 --> 00:48:39,150 But what we find, and I think for good reasons, 478 00:48:39,150 --> 00:48:42,720 is that people are sceptical, and the reason they're sceptical is because you don't train a neural 479 00:48:42,720 --> 00:48:46,530 network by randomly picking parameters; that would be really silly and very slow. 480 00:48:46,530 --> 00:48:48,990 Instead, what you do is you define a loss function: 481 00:48:48,990 --> 00:48:56,640 you calculate how far your function is from the function that you want, and then you use a gradient descent method, 482 00:48:56,640 --> 00:49:02,940 stochastic gradient descent, which drives you down your loss landscape to a minimum. 483 00:49:02,940 --> 00:49:09,400 And that's an optimisation algorithm which is extremely effective and efficient. 484 00:49:09,400 --> 00:49:14,000 And the kind of consensus in the field, or the most common argument in the field, 485 00:49:14,000 --> 00:49:18,760 would be that there's something special about SGD that gives you good solutions. 486 00:49:18,760 --> 00:49:25,060 So people push back on this: just because the a priori probability of a function is high, 487 00:49:25,060 --> 00:49:31,760 that doesn't mean that you're going to find it by going down gradient descent. Which I think is a good argument. 488 00:49:31,760 --> 00:49:36,820 However, we have a kind of qualitative argument that says: in this space, you've got lots of different possible functions 489 00:49:36,820 --> 00:49:42,790 you could find that all fit the data. If a function has a large basin, so there are many different ways you can find it, 490 00:49:42,790 --> 00:49:48,770 you'd expect that, starting from high loss, you're still going to be likely to fall into that basin and go down into it. 491 00:49:48,770 --> 00:49:52,120 Right. So here are some functions with large basin sizes; you're probably going to go down into one of these, 492 00:49:52,120 --> 00:49:58,120 whereas this one has a very small a priori probability, and so its basin is going to be small. 493 00:49:58,120 --> 00:50:06,190 So Chris, this undergraduate, has done a second piece of work, together with a bunch of other undergraduates, actually. 494 00:50:06,190 --> 00:50:11,260 So I've got three undergraduates working on this project, with myself and Guillermo and a few others. 495 00:50:11,260 --> 00:50:18,850 He trained on this image dataset called MNIST, and he worked out how likely SGD was to find a particular function. 496 00:50:18,850 --> 00:50:25,660 So these are the functions with one error out of 100, so a one percent error; there's a function with two percent error; 497 00:50:25,660 --> 00:50:27,610 and this is a function with zero error. 498 00:50:27,610 --> 00:50:35,440 And what we found is that the probability that a function is found a priori is very close to the probability that it is found by stochastic 499 00:50:35,440 --> 00:50:41,290 gradient descent.
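A toy version of that comparison, scaled far down from the actual MNIST experiment; the target, the architecture and the number of restarts are all assumptions for illustration. The idea is to retrain from many random initialisations, record which function SGD lands on each time, and compare those frequencies with the a priori sampling frequencies.

    import itertools
    from collections import Counter
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X = np.array(list(itertools.product([0.0, 1.0], repeat=7)))
    y = X[:, 0].astype(int)                         # a simple target function
    idx = np.random.default_rng(0).permutation(len(X))
    train, test = idx[:64], idx[64:]

    sgd_counts = Counter()
    for seed in range(200):
        # Fresh random initialisation each run; the function SGD finds is
        # identified by its truth table on the held-out test inputs.
        net = MLPClassifier(hidden_layer_sizes=(40,), solver="sgd",
                            max_iter=3000, random_state=seed)
        net.fit(X[train], y[train])
        sgd_counts["".join(map(str, net.predict(X[test])))] += 1

    # The claim under test: these SGD frequencies track the a priori
    # probabilities obtained by randomly sampling parameters (earlier sketch).
    print(sgd_counts.most_common(3))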
And in fact, that means SGD occasionally finds poor functions, with a probability that we can predict more or less. 500 00:50:41,290 --> 00:50:47,500 And so this is exciting, because we think we can predict, more or less, the probability with which these functions are found. 501 00:50:47,500 --> 00:50:51,430 So this is the first time I've presented this data; it came in last week. 502 00:50:51,430 --> 00:50:56,230 OK, so it's very hot off the press, but we're excited by it. We actually have a bunch of other examples of this, 503 00:50:56,230 --> 00:50:59,290 and we think that this may therefore be the explanation. 504 00:50:59,290 --> 00:51:05,770 So this a priori bias seems to be, qualitatively at least, what SGD is tracking upon training. 505 00:51:05,770 --> 00:51:09,940 If that's true, then this explains why neural networks work so well. 506 00:51:09,940 --> 00:51:12,670 Or it doesn't explain everything, but it explains part of it: they are biased 507 00:51:12,670 --> 00:51:16,870 towards simple functions, in the same way that Occam's razor is biased towards simplicity. 508 00:51:16,870 --> 00:51:22,580 You know, you learn very early in your degree that you should be biased towards simple solutions. 509 00:51:22,580 --> 00:51:27,980 And that raises the deep question, which is: why are the things that we look at in nature 510 00:51:27,980 --> 00:51:33,140 simple? In physics we're used to looking at things that are simple, and so we intuitively think that must be true. 511 00:51:33,140 --> 00:51:34,670 But why would that be true in general? 512 00:51:34,670 --> 00:51:44,030 Well, for this dataset, actually: it's a grid of twenty-eight by twenty-eight pixels, so a 784-dimensional space. 513 00:51:44,030 --> 00:51:51,620 But you can work out the effective dimension of the data by looking at how things scale with size, for example how the number of neighbouring images grows with distance, and it seems to be about 14 or 15 dimensional (a sketch of one such estimator follows at the end). 514 00:51:51,620 --> 00:51:55,430 So these images live on a very low-dimensional manifold in a very high-dimensional space. 515 00:51:55,430 --> 00:51:59,930 They're therefore simple. So a bias for simplicity is good for this kind of image recognition. 516 00:51:59,930 --> 00:52:06,650 But more generally, I think it's a wide-open question about the nature of the universe, the nature of the kinds of things we're studying. 517 00:52:06,650 --> 00:52:13,730 Why are they so complicated, or why are they simple? And we're claiming that there is a bias for simplicity, which is good. 518 00:52:13,730 --> 00:52:18,290 And if it's good, it should explain exactly why you see what you see. 519 00:52:18,290 --> 00:52:23,570 So, conclusions. I think machine learning is transforming physics; it's not just hype. 520 00:52:23,570 --> 00:52:27,710 I can't make that point fully myself, but it will be made later today, and I think it's very important; 521 00:52:27,710 --> 00:52:33,620 it's the main message, hopefully, of today. And then what I've tried to argue to you is that deep learning may work, 522 00:52:33,620 --> 00:52:39,170 that deep learning techniques work, because they have a natural bias towards simple functions, a kind of inbuilt Occam's razor. 523 00:52:39,170 --> 00:52:45,376 And with that, I thank you very much for your attention.
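The dimensionality estimate mentioned above can be sketched with one standard scaling estimator, the two-nearest-neighbour method of Facco et al. (2017); whether this is the estimator used by the speaker is an assumption, and for a self-contained example it is run here on scikit-learn's small 8x8 digits set rather than MNIST itself.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neighbors import NearestNeighbors

    X = load_digits().data                       # 1797 images, 64 pixels each
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    ok = dists[:, 1] > 0                         # drop exact duplicate images
    mu = dists[ok, 2] / dists[ok, 1]             # 2nd- to 1st-neighbour distance ratio
    # If the data lie on a d-dimensional manifold, mu follows a Pareto law
    # with exponent d; the maximum-likelihood estimate of d is:
    print("intrinsic dimension ≈", ok.sum() / np.log(mu).sum())

The estimate comes out far below the 64 ambient pixel dimensions, which is the same qualitative point made for MNIST: the images occupy a low-dimensional manifold in a high-dimensional space.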