Okay. So I'm going to introduce our speaker for the Strachey Lecture this term. Firstly, I'd like to say a big thank you to Oxford Asset Management, our sponsor, who makes this series of lectures possible. I've also been asked to draw your attention to our hashtag, prominently placed, in case you would like to tweet. And anyone interested in our software engineering programme can find brochures outside and people to talk to.

But on to the main business. It's my pleasure to introduce Zoubin Ghahramani, who will give our Hilary term Strachey Lecture. Zoubin is Professor of Information Engineering at Cambridge and a Fellow of the Royal Society. I think it's fair to say that his machine learning group in Cambridge has been one of the most influential over the last decade; it's hard to go to any major machine learning academic group or industry lab without finding Zoubin's ex-students or postdocs. More recently, Zoubin founded Geometric Intelligence, a start-up, and after its acquisition he's now the co-director of Uber AI Labs. Maybe if you ask nicely, he might tell you a bit about that; we'll see. Zoubin has made contributions across machine learning, particularly probabilistic inference, and even deep learning, which is not surprising: he was Mike Jordan's student and did his postdoctoral work with Geoff Hinton, so a pretty amazing pedigree there. Zoubin's seminal work is in Bayesian nonparametrics, where he has been leading the idea that it's not enough for machine learning research to aim for accurate predictions: we also need to be able to quantify uncertainty, and we need to be able to talk about causation. If we really want machine learning and AI to have an impact in industry, we need to be able to tackle those things. I'm sure he will tell you all about them. So please welcome Zoubin.

Thanks, Phil, for that great introduction, and a great thank you to the Department of Computer Science for inviting me. Okay, can you all hear me? Good. So I'm going to talk about probabilistic machine learning, which is my passion, the thing I'm really excited about. I'll start from basics, and as the talk goes on we'll get into more and more current research, more of what we're actually doing these days. That's why the subtitle is "Foundations and Frontiers": foundations is the motivation and background material, but if you're bored by that, don't worry, it'll get more technical later on.

Okay, so let's start from the basics. Machine learning.
Well, what is machine learning? It's just a term, and there are many other related terms. Depending on the community you come from, you might think of data mining, artificial intelligence, statistical modelling, neural networks, or pattern recognition, which is a somewhat more old-fashioned term. All these terms are related. I'll focus on the term machine learning, but keep that context in mind. In terms of academic disciplines, this is a very interdisciplinary area: we draw on ideas from computer science, engineering, statistics and applied mathematics, and we get a lot of inspiration from cognitive science and economics, and even tools from physics and neuroscience.

Why are people interested in machine learning these days? It used to be an interesting academic field where you played around and tried to get computers to learn from data, and most people didn't care much about it. Now suddenly lots of people care, and the reason is that there are many, many applications of machine learning. I like to think of it as the invisible thing behind a lot of the more visible applications that involve computers learning from data.

So let's go through some of those applications, just to motivate. Speech and language technologies is an area that has been transformed by the use of machine learning: automatic speech recognition, machine translation, question answering, dialogue systems. Every year we seem to get more and more advances in these sorts of tools. Computer vision, again, is a field that has been around for a very long time, but with the advent of large amounts of data and more powerful computational tools we're now able to do interesting things: not just object, face and handwriting recognition, but image captioning, going from an image to a bit of text that's meant to describe the image. These examples are from a very famous paper. You can pick them apart, in the sense that you could say they were hand-chosen to make the algorithm look good. But "man in black shirt is playing guitar": it seems pretty amazing that a computer could take an image like this and produce that description. It doesn't always work that brilliantly, but I would say most of us in the field were stunned when we saw this happen for the first time, when we could actually get a system to produce reasonable descriptions of images. Of course, we all have cameras in our pockets that put boxes around people's faces.
If you ever ask yourself how that works: that's a bit of machine learning running on all of your camera devices.

Moving into the sciences, a lot of the sciences have become very data-heavy. Fields like bioinformatics and genomics in the medical sciences, but also astronomy, are areas where we can now collect much more data than any human being could sit down and analyse manually. So machine learning and AI tools have become very important in scientific data analysis, and that's something I'll talk about a little later on as well. Recommender systems we all know: the "customers who bought this item also bought" kind of thing is driven by machine learning. Self-driving cars are something I'm now much more involved in. This is not a totally new thing: this self-driving car, ALVINN, was around about 30 years ago, and it used neural networks to drive at 70 miles per hour on highways. That's what it says on this slide, which I took from about 30 years ago, and it's very scary. I would not want to be anywhere close to that truck driving at 70 miles an hour on a highway, driven by a neural network that's about this big. But things have moved on, and we now have pretty good self-driving systems that are getting better every year.

Robotics: I just love the dogs playing football. This particular sort of RoboCup system isn't necessarily driven by machine learning, but there are a lot of excellent uses of machine learning in robotics. Automated trading and financial prediction. Computer games you're all familiar with: the landmark DeepMind results, first learning to play Atari games at a human or superhuman level, then more recently beating the world's best player at Go. And who knows what this is? This is Libratus, a system that recently won a poker championship. It played against a whole bunch of humans; the numbers in parentheses are how much money the humans lost to the computer. The very interesting thing is that poker is quite a complicated game. Think about what it involves: trying to understand the state of mind of the other player, bluffing, and things like that. To be a good poker player you have to be able to do those things, and so now we have good machine poker players as well.

So what is machine learning? If I had to define it, I would use a sentence like this: it's an interdisciplinary field that develops both the mathematical foundations and practical applications of systems that learn from data.
Here are some of the main conferences and so on associated with the field.

So that's the motivation from applications. But when you actually look at machine learning systems, most of the time they are trying to solve one of a few canonical problems, so I'll go through those in this introductory part of the lecture. Probably the most canonical problem is classification: you have some data and you want to classify it into two or more classes. The task is to predict discrete class labels from input data. That has lots and lots of applications, and there are a lot of buzzwords for the different methods that can be used; these are just different ways of trying to do classification from data. Regression is trying to predict some continuous quantity y from some inputs x. Obviously this has lots of applications as well, and there are lots of methods, some of which you might say are not machine learning methods: linear regression has been around for over a hundred years. But again, remember, this is all in the context of everything that's been going on in all of these neighbouring fields, and there's nothing that says this is a machine learning method and that's not. If it's making predictions and decisions from data, it is a machine learning method at some level.

Clustering: the task here is to group data together so that similar points are put in the same group. Many applications, and again many different methods. Dimensionality reduction: when you have very high-dimensional data, you might want to find a low-dimensional representation of that data that preserves the important information. Another canonical machine learning problem is semi-supervised learning, where you might have a few labelled points, like these two labelled minuses and these three labelled pluses, and you want to leverage the fact that you also have a lot of unlabelled data. Semi-supervised learning combines labelled and unlabelled data to get better predictions. And reinforcement learning, which is related to sequential decision making and adaptive control: the task there is to learn to interact with an environment, making sequential decisions to maximise future rewards. It's an interactive setting where you have an agent producing actions or decisions in an environment.
There may be some hidden state in both the agent and the environment; the agent receives some observed sensory inputs and has to act in the environment to maximise its rewards.

Okay, so those are the canonical problems. It is actually quite bewildering to start reading the machine learning literature if you're not an expert, because there are many, many different methods, and every paper seems to present a new one. Here is a very crude way of organising a bunch of machine learning methods, but don't give it too much weight. For the first few minutes I'm going to focus on one bubble here, the neural networks and deep learning one. The reason should be pretty obvious to anyone familiar with the field: these methods have been really revolutionary, involved in some of the most spectacular breakthroughs of the last few years.

So what are they? Well, a neural network, and I'm going to focus on a feedforward neural network just for simplicity (there are other kinds), is in its most standard form essentially just a function approximator. It takes some inputs, call them x, and produces some outputs, call them y. The way it produces them is through a sequence of transformations organised in layers. But all of that is, in a sense, a bit of a detail: it's just a way of representing a function that maps from x to y via tunable parameters called weights; I'm using theta to denote the parameters of the network. One of the important aspects of neural nets is that they're nonlinear functions, often nonlinear both in the inputs and in the parameters, so optimising them to minimise some objective function tends to be slightly complicated. The other defining characteristic of neural networks is that they represent the function from x to y in layers, which is essentially just a composition of functions. So here is a multilayer neural network with one hidden layer, represented as a function that maps from x to y through some parameters, and the superscripts (1) and (2) denote the two layers of parameters.
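To make the composition-of-functions view concrete, here is a minimal sketch in Python with NumPy. The tanh hidden nonlinearity, the linear output layer and the layer sizes are illustrative choices of mine, not taken from the slides:

```python
import numpy as np

def mlp(x, theta1, theta2):
    """A one-hidden-layer feedforward net: y = theta2 @ tanh(theta1 @ x).

    The network is just a composition of two parameterised functions:
    a nonlinear hidden layer followed by a linear read-out.
    """
    h = np.tanh(theta1 @ x)   # hidden layer, nonlinear in x and theta1
    return theta2 @ h         # output layer, linear in h

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(5, 3))   # 3 inputs -> 5 hidden units
theta2 = rng.normal(size=(2, 5))   # 5 hidden units -> 2 outputs
print(mlp(rng.normal(size=3), theta1, theta2))
```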
These neural networks are usually trained to maximise some likelihood, so they fall very squarely within the world of statistical models, using some variant of stochastic gradient descent; this is where we start using tools from optimisation theory. So that's one slide on neural networks. These things have been around for many decades; in fact, they are what got me excited about AI back in the eighties, when I was an undergraduate thinking about what to do with my life.

But something dramatic has happened between the 1980s and now. One thing that has changed is the terminology: people now call these deep learning systems, because they have many more layers. But there are other, more interesting, dramatic changes. The deep learning systems involved in a lot of these very impressive benchmark results are very similar to the neural net architectures of the eighties and nineties, with some important architectural and algorithmic innovations: the ability to use many layers, particular nonlinearities such as the ReLU, particular ways of regularising them like dropout, and very useful tricks for dealing with time series. They're also trained on vastly larger data sets, really web-scale data sets. To do that you need vastly larger compute resources: GPUs, GPUs on clouds, and so on. Importantly, there has been a major effort to democratise the software tools, so that it's now quite easy to train a neural network; we have much better software, things like Torch and TensorFlow. And of course there has been vastly increased industry investment and media hype. What that has meant is a huge influx of people trying out different variations of neural networks on different problems. Stepping back, I think of this a little as the community of machine learning researchers running a kind of genetic algorithm, trying out lots of different ideas and variations to improve on the performance on existing benchmarks.

Okay, so that's deep learning in a nutshell. There's a huge amount more to say about it, and there are many people better placed than me to say it. But one thing I do want to talk about is the limitations of deep learning. Let's step back from the excitement, acknowledge it, and ask: where do we go next? What do we need to focus on? I would argue that there are a few limitations we really need to think about.
One of them is that neural nets are very data-hungry: you often need millions of examples to train these large models. That should not be surprising if you know a bit of statistics; perhaps the surprising thing is that you don't need many millions of examples to train models with millions of parameters. People would have thought that was crazy, and it is surprising that you can get away with relatively small amounts of data, even though it's large by the standards of the eighties and nineties. They're also very compute-intensive to train and deploy. They're poor at representing uncertainty, which is something I'm particularly interested in. There are some great studies showing that neural nets and deep learning systems can be easily fooled by adversarial examples: you can construct examples that make the network very confidently give the wrong answer, and that should be worrying. It relates to the uncertainty point: it's okay for a system to make mistakes, but it's not okay for it to make mistakes really confidently, because then you don't know when to trust the answers, and you can't really build mission-critical systems, say in the healthcare domain or in self-driving cars, if you can't trust the confidences of your model. They're finicky to optimise: the optimisation is non-convex, and there are many different parametric and architectural choices that need to be made. And they're generally uninterpretable black boxes, lacking in transparency and difficult to trust. Of course people are working on all of these things, but I wanted to put them on one slide to motivate us to move towards the interesting challenges we have.

A particular area I'm really interested in, which Phil mentioned in the introduction, is thinking about machine learning as probabilistic modelling. So let's go beyond deep learning (I'll come back to neural nets and deep learning in a minute, in the context of probabilistic modelling) and talk about a view of machine learning grounded in the idea that we want systems that build probabilistic models from data. What do I mean by a model? The term gets used by many people in different contexts. What I mean is that a model describes data that one could observe from a system. Models should be able to make predictions; they should make statements about observable data.
If a model doesn't do that, it's very difficult to know whether you have a good model, whether you have a falsifiable model, for example. Now, if a model is making statements about possible data that could be observed, then what we're going to do is use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model. Think about a simple model, say a model that forecasts tomorrow's weather. That's not necessarily a simple model, but one could certainly build a simple version of it. You don't want models that make forecasts without telling you how uncertain they are, and you have to consider all the different sources of uncertainty you could have in predicting tomorrow's weather. You might have uncertainty coming from noise in the sensor data you collected. You might have uncertainty coming from unpredictable effects that your model did not consider. Your model might have parameters, and you might be uncertain about what the right parameters are. All of those sources of uncertainty need to be dealt with somehow, and we're going to use the language of probability theory to express them. To me, that is as fundamental as saying that we use calculus as the language to express rates of change: probability theory is the language of uncertainty. The good news is that we then don't have to invoke anything else. We can stay within the framework of probability theory to infer aspects of the model from data, to adapt our model to data, to make predictions, and so on. It all ends up being very, very simple.

And here is what it looks like. Here is Bayes' rule, the engine that drives learning from data. I'm colour-coding things into two classes: data and hypotheses. By data I mean anything that's actually measured, a measured quantity. By hypotheses I mean everything else. The world, from a Bayesian point of view, is divided into two kinds of things: stuff you're measuring and stuff you're not measuring. The stuff you're measuring, you've measured, so you kind of know what it is; it could be noisy, but you've measured it. And the stuff you're not measuring, you had better represent your uncertainty about, because you didn't measure it. All of those things we call hypotheses. But that's not the only thing on the slide.
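Written out, the rule on the slide takes the standard form:

```latex
\underbrace{P(\text{hypothesis} \mid \text{data})}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{P(\text{data} \mid \text{hypothesis})}^{\text{likelihood}}
        \;\overbrace{P(\text{hypothesis})}^{\text{prior}}}
       {P(\text{data})}
```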
I said that these hypotheses are how we express models of data, and we're going to use probability theory to express our models. So basically, for every potential configuration of our hypotheses, we should be able to describe the probability of the observed data under that hypothesis. That term is called the likelihood, and it's actually what drives most neural network learning: maximising likelihood, or penalised likelihood of some kind. But forget about neural nets; now we're talking much more generally. We have the likelihood, which gives you the probability of the data given the hypothesis, and we have a term called the prior. The prior is our representation of our uncertainty about everything we haven't observed, before we get our data.

So the game goes like this. Before we have our data, we have to place our bets on all the unobserved things, and we use the language of probability theory to do that: we put a probability distribution over our space of hypotheses. Then we observe the data. Aha! That's the beautiful moment where we can compute the likelihood, the probability of the data given the hypotheses, and the simple rules of probability tell you to multiply these two and renormalise over all the hypotheses you've been considering. What you get is your new state of knowledge: the posterior distribution over hypotheses given the data. And that is the prior you would use if you got any more data; there's nothing fundamentally different between the prior and the posterior. Each is just the representation of your state of knowledge at some point in the process, given the data you've observed so far. So learning and prediction can be seen as forms of inference using this rule.

And here is the one-slide description of Bayesian machine learning that I always use; apologies to those who've seen it. The point is that even the Bayes' rule I had on the previous slide is not a fundamental rule. The fundamental rules of probability theory are these two simple rules, the sum rule and the product rule. The sum rule tells you that the probability of some unknown quantity x is the sum, over some other unknown quantity y, of the joint probability; this is also sometimes called the marginalisation rule. And the product rule says that the joint probability of x and y can be factored into the probability of x times the probability of y given x, or the other way around.
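In symbols, for discrete variables (for continuous variables the sum becomes an integral):

```latex
\text{Sum rule (marginalisation):} \quad P(x) = \sum_{y} P(x, y)

\text{Product rule:} \quad P(x, y) = P(x)\,P(y \mid x) = P(y)\,P(x \mid y)
```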
From these two simple rules, substituting data and hypotheses for x and y, we can get the Bayes' rule of the previous slide. If we use the symbols theta to represent the parameters of our model, D to represent the observed data, and m to represent the model class we have assumed, then we get Bayes' rule applied to the parameters of our model. What would the parameters be? In a neural net, for example, they would be the weights; in linear regression, the regression coefficients; and so on. Every model in this world has parameters. One term is the prior, another is the likelihood, and the normalising constant, which is itself quite interesting, is called the marginal likelihood.

Prediction also follows from the sum and product rules. If you want to make predictions about any unknown quantity x given the data, the sum and product rules tell you there is only one valid way under this framework: you consider the predictions made by every possible parameter value, and you weight them by the posterior probability of the parameters given the data and the model class. So the act of forecasting or predicting any unknown quantity given the observed data is, by the sum and product rules, an averaging process: you have to average over all the hypotheses you've considered. You don't pick the best one, or your favourite one, and you don't flip a coin; you average over the space of hypotheses in this particular way. And if you want to compare different model classes, you apply Bayes' rule at the level of model classes, and there the marginal likelihood, the term in red, appears in the numerator and the denominator. None of this is mysterious; it all follows from those two rules.
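The three levels just described, written out in standard notation:

```latex
\text{Learning:}\qquad
P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}

\text{Prediction:}\qquad
P(x \mid D, m) = \int P(x \mid \theta, D, m)\, P(\theta \mid D, m)\, d\theta

\text{Model comparison:}\qquad
P(m \mid D) = \frac{P(D \mid m)\, P(m)}{P(D)}
```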
What do I mean by model comparison? The story might go like this. Say I'm a biologist. I do an experiment, and my colleague says, "I believe this transcription factor regulates these genes." And I say, "No, I have a different model: I believe it doesn't, and that this other one does," or something like that. So my colleague and I have two different models. We could argue about it in words, but if we follow this probabilistic framework, what we should do is both write down our models to the level of specification where they can make predictions about observable data, so that we can assign a probability to the observable data. Then we observe the data D, and now we can settle the argument. We basically ask: what marginal likelihood does your model give to the data, and what marginal likelihood does my model give? Now, both of our models had some free parameters. Maybe your model had 17 free parameters and mine had three, so my model is simpler somehow, and I get nervous; that seems unfair. If my colleague goes and optimises those 17 parameters, then sure enough she can fit the data much better than I can. But that's not the game. Optimisation doesn't follow from the sum rule and the product rule. It doesn't matter that my colleague has 17 parameters and I have three: if we can both compute the marginal likelihood, we can settle the argument.

I actually really strongly believe that in an ideal world, science would be done like this. People wouldn't just publish their papers in open journals and share their data in an open manner; they would also write down their models in a way that could be evaluated against future data, maybe as probabilistic programs, which I'll talk about later. Then we could do an objective (well, actually subjective, but principled) comparison of models embodying different subjective opinions about what the hypotheses are.

Okay, so that was one slide on Bayesian machine learning. Why should we care about all this? We've had a revolution in machine learning with wonderful, fantastic deep learning methods that never mention Bayes anywhere. So why should we care about all this Bayesian stuff? Well, the reason I care is that I'd really like models with calibrated senses of uncertainty. I want to be able to trust my system: if it says the probability of there being a pedestrian in front of my car is 0.1, I want that to mean 10%, so that I can take actions that correspond to that calibrated probability. Getting systems that know when they don't know is, I feel, very important. There's also a very beautiful thing about all of this, which speaks to that unease about 17 parameters versus three parameters, or different model structures: this framework gives you automatic tools to compare models of different complexity and to automate the learning of models from data. This is called the Bayesian Occam's razor, and it's something I will use in the latter part of my talk.
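Here is a toy numerical illustration of that point (my own example, not one from the talk): a simple model with no free parameters against a more flexible one-parameter model, compared by their marginal likelihoods rather than by their best fits.

```python
import numpy as np

# Model A: a fair coin, no free parameters.
# Model B: a coin with unknown bias p and a uniform prior on p.
# The marginal likelihood averages the likelihood over the prior,
# so Model B is automatically penalised for its extra flexibility.

def marginal_likelihood_fair(heads, tails):
    return 0.5 ** (heads + tails)

def marginal_likelihood_biased(heads, tails, grid=100_001):
    p = np.linspace(0.0, 1.0, grid)
    lik = p**heads * (1.0 - p)**tails   # likelihood at each parameter value
    return lik.mean()                   # average over the uniform prior

for heads, tails in [(5, 5), (9, 1)]:
    ma = marginal_likelihood_fair(heads, tails)
    mb = marginal_likelihood_biased(heads, tails)
    print(f"{heads}H/{tails}T: P(D|fair)={ma:.2e}  P(D|biased)={mb:.2e}  "
          f"ratio={ma/mb:.2f}")

# 5H/5T: the simpler model wins (ratio > 1), even though the flexible
# model's *optimised* likelihood would be at least as good.
# 9H/1T: the data now really support a bias, and the flexible model wins.
```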
Okay. So let's go back to our neural networks, just to ground the discussion a little bit. Here's a neural network that maps from x to y, and there are different sources of uncertainty here. One of them is parameter uncertainty: we have weights in the neural network, and given any finite amount of data we're not sure what those weights should be, so we need to represent our uncertainty about them. But we also have structural uncertainty: we've made some structural choices, like the architecture, the number of hidden units, and the choice of activation functions, and that's also a source of uncertainty. It would be great if we could represent all of that. And that's not a new idea; none of these are really new ideas. In fact, the idea of doing a Bayesian analysis of neural networks has been around since the early nineties, and actually the late eighties. Here's a bit of a history of a few different methods. And here is a depiction of what we'd really like: this is a system trained to do regression on some data, and the behaviour we want is that outside the range of its training data it says, "I don't really know." There are many ways of achieving that (these are all different ways), and we had a nice workshop at NIPS on Bayesian deep learning where we brought that history together and looked at some of the current state of the art. A toy sketch of the underlying idea, averaging predictions over uncertain weights, follows below.

This field of machine learning often has camps, and people think you have to be in one camp or another. You don't. You have to understand what the tools are in all the different camps, and there's a lot of fertile ground at the intersection of these camps; this is one example of that.
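The sketch below is my own illustration of the mechanics, not code from the talk: given samples of the weights (faked here with random draws standing in for samples from a real posterior), the prediction is an average over the networks those samples define, and the spread of their outputs is the model's uncertainty.

```python
import numpy as np

def mlp(x, theta1, theta2):
    """The same one-hidden-layer net as before."""
    return theta2 @ np.tanh(theta1 @ x)

# Pretend these are posterior samples of the weights given training data.
# (Faked with random draws; a real Bayesian neural network would obtain
# them from an inference algorithm such as MCMC or variational inference.)
rng = np.random.default_rng(1)
posterior_samples = [(rng.normal(size=(8, 1)), rng.normal(size=(1, 8)))
                     for _ in range(200)]

def predict(x):
    """Posterior predictive: average predictions over weight samples."""
    ys = np.array([mlp(x, t1, t2) for t1, t2 in posterior_samples])
    return ys.mean(), ys.std()   # mean prediction and its uncertainty

for x in (0.0, 1.0, 5.0):
    mean, std = predict(np.array([x]))
    print(f"x = {x}: prediction {mean:+.2f} +/- {std:.2f}")
```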
So when do we need probabilities? We need them when the learning or intelligence problem we're solving depends crucially on representing uncertainty. I've sort of said that already, but let me describe some examples. Any time we're doing forecasting, and that could be financial forecasting, weather forecasting, or forecasting demand at Uber or for Amazon products or whatever, we need to represent our uncertainty. Decision making: generally, when you make decisions you're thinking about the consequences of your actions into the future, and it's really useful to represent uncertainty there; it's hard to imagine not doing that at some level. Learning from limited, noisy and missing data: imagine dealing with medical records. If you're trying to do machine learning on medical records, you have patients, and each of them has lots of things that are unobserved; maybe only a few medical tests have been done on each patient. Looked at that way, most of the data is actually missing. Learning complex personalised models: again, whether in a medical domain or a retail domain, you might think you have a huge data set, but actually for every patient or every customer you only have a little bit of data. So it's not really a big-data problem, and you need to represent uncertainty about that individual. The whole field of data compression is based on probabilistic modelling. And a lot of my interest in automatic model discovery and experiment design is really based on uncertainty.

Now, over the last three months I've been involved in setting up Uber's AI Labs, and I'll just mention that in one slide. Why would Uber care about any of this? Well, if you look at many of the problems that a large technology company has to solve, they are problems that deal with uncertainty, decision making, personalisation and so on. There are a huge number of problems, and a huge number of opportunities, around any of the major technology companies for learning from data and for using uncertainty. And, fairly obviously, if you're trying to build a very complicated system that makes decisions in the real world, like a self-driving car, you'd really like to have calibrated uncertainties in that system.

Okay. So here is the one-slide picture of my current passions, my current research interests. In the next few minutes I'm going to touch on a few of these topics, and I'll leave a few minutes for questions at the end; it's fairly modular, so I can stop to give us time for questions. I want to put this slide up because, well, actually because I had this slide. And the reason I had it is that I was asked to give a talk about a year ago and was told: summarise your work in one slide. That forced me to produce this slide, and when I produced it, I thought it was actually kind of a useful exercise.
The useful exercise is that it crystallised in my mind the thing that really drives me. And it's not that I'm a Bayesian and I just love probabilities or anything like that. It turns out the thing that really drives me is that I like stuff that's automated. I want things to be systematic and automated, and computer scientists are very good at that. If you put your computer science hat on and you do something three times, you think, "Oh, I need to write a computer program to do that for me; three times was two times too many." And the sorry state of machine learning is that a lot of it is not really automated. There are still tremendous amounts of human labour, arbitrary decision making and tweaking involved in deploying machine learning systems. Which is ironic: the whole field is about getting systems to learn from data, but there are a lot of well-paid researchers and engineers tweaking those systems that learn from data. So let's think about automating these things. This is what drives me.

So if you look at some of these topics, which I'm going to talk about: the automatic statistician, which I'll talk about in a couple of minutes, is about automating the process of model discovery from data, searching for a good model. Probabilistic programming, something Frank Wood, who is at Oxford, is a world expert in, is about automating the process of doing inference in a very general probabilistic model. We also want to automate optimisation. Optimisation is actually a sequential decision problem: an optimiser trying to optimise a function is making decisions about where to evaluate the function next, collecting some data, and then moving on to another point, and so on. People don't usually think about optimisation that way; they think, "here's an algorithm, and here's something I can prove about the algorithm." But optimisation is very much like bandit problems and reinforcement learning, sequential decision making under uncertainty. And that drives the last topic, because we want to automate the allocation of computational resources. Especially now, machine learning systems are very complex: they use a lot of memory and a lot of CPU, and the data sets are very big, so we can't afford to just tinker about and run a few experiments on a single computer.
And when we run major experiments, we actually have to worry about the fact that they're running on a big cloud of computers, and that uses energy; energy costs money, and using energy like that is not good for the world. So: optimising resource allocation. These are the things that drive me these days, and I'm going to talk about a couple of them very quickly.

Probabilistic programming is one of them. The problem here is that developing probabilistic models and deriving inference algorithms is generally a very time-consuming and error-prone process. The solution is to develop probabilistic programming languages. So what are these things? This is a very beautiful marriage between the probabilistic modelling and programming languages worlds. The idea is that a probabilistic programming language is a way of expressing probabilistic models, and the modern ones, the ones that people like Frank Wood and myself are very interested in these days, are completely general, Turing-complete programming languages that can express any computable probability distribution.

That's the expression part. But what do you do with that? Well, first of all, how do you do it? You express your model as a simulator, a simulator that would generate data. That's one canonical way of doing it, and it's a very natural concept. You could say: I have a model for the weather; well, that's actually kind of a simulator, and I write it as a computer program. I have a model for my gene expression network, and that's going to be a simulator that simulates gene expression data. That's the modelling part. But then you have some data: you have a simulator and you have some data, and what you're really interested in is inferring, or learning, the parameters of your simulator, of your model, given the data. And the very incredible thing is that we can actually come up with universal inference engines: inference engines that, in principle, can compute the probability distribution over the hidden variables in our computer program given the data. It's basically running Bayes' rule on computer programs. We're all used to running computer programs in the forward direction: you take some inputs and produce some outputs. But this is kind of doing it backwards. You have a computer program that takes some inputs, makes some calls to random number generators, and produces some random outputs; that's the data. And now we ask: what should the inputs and the calls to the random number generators have been, for the program to have produced this observed output? That's Bayes' rule on the program.
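To give a feel for what a universal inference engine does, here is a minimal sketch of my own (a toy, not one of the languages mentioned below). The model is an ordinary function that makes random choices, and inference about its hidden variable is done generically by likelihood weighting: run the program forwards many times and weight each run by the probability it assigns to the observed data.

```python
import random

def weather_model(rng):
    """A toy simulator: a hidden state generates a noisy observation.

    Hidden variable: whether it is raining (prior probability 0.3).
    Observable: whether the pavement looks wet.
    """
    raining = rng.random() < 0.3
    p_wet = 0.9 if raining else 0.2
    return raining, p_wet              # hidden state, P(wet | state)

def infer(model, observed_wet, n=100_000, seed=0):
    """Generic likelihood weighting over any simulator of this shape."""
    rng = random.Random(seed)
    total = rain_weight = 0.0
    for _ in range(n):
        raining, p_wet = model(rng)
        w = p_wet if observed_wet else 1.0 - p_wet   # likelihood weight
        total += w
        rain_weight += w * raining
    return rain_weight / total         # posterior P(raining | observation)

# Seeing a wet pavement raises P(raining) from 0.3 to about 0.66.
print(infer(weather_model, observed_wet=True))
```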
And there are many languages now. Anglican is the one that Frank Wood's team has been developing, one of the state-of-the-art languages. Our group in Cambridge has a language called Turing, which is much less developed but also exciting; it's based on Julia. There are many different languages developed by different groups, and many different inference algorithms that can generally run on models written in those languages. Here, for example, is a hidden Markov model written in Turing. It's fairly easy to read, and if you uncomment one line of this model you go from a regular hidden Markov model to a Bayesian hidden Markov model. So changing models around is as easy as adding and removing a few lines of your probabilistic program. And I really think that if our vision actually plays out, this could really revolutionise scientific modelling: if people were actually willing to write probabilistic programs for all of their models and they shared them, then people could take somebody else's model, run it on their own data, improve it, and so on. A few resources here.

I'll just give you a few examples; these are slides from my postdoc, Hong. There's a little bit about Turing, which I'll skip through. That's our HMM example, but much bigger. This is a Bayesian neural network; most of this code is specifying the prior on the weights, and then this is the actual Bayesian neural network, just the neural network function and so on. And then you can just run inference using, say, Hamiltonian Monte Carlo; you don't even have to know what that is, because the language abstracts the model specification away from the inference. And our language, Turing, is pretty competitive: it's in the same ballpark as Anglican, occasionally a bit faster, though I know the Anglican team keeps improving their language as well.

Another topic I want to talk about is Bayesian optimisation, and I have basically a couple of slides on that. The problem here is that you want to find, ideally, a global optimum (maybe that's too much to ask) of some black-box function that is expensive to evaluate. So you can't just evaluate it in lots and lots of places; you need to think about where you're going to evaluate your function next. And we don't want to do that manually; we want to automate the algorithm that thinks about it. The solution is to treat the problem as sequential decision making under uncertainty, where what we're uncertain about is the actual function. And this has a huge number of applications.
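One common recipe, sketched below under my own assumptions (the talk doesn't commit to a specific method): model the unknown function with a Gaussian-process surrogate and pick the next evaluation point by expected improvement, which trades off a low predicted mean against high predictive uncertainty.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_f(x):
    """Stand-in for an expensive black-box function (to be minimised)."""
    return np.sin(3 * x) + 0.5 * x**2

X = np.array([[-2.0], [0.5], [2.0]])          # a few initial evaluations
y = expensive_f(X).ravel()
candidates = np.linspace(-3, 3, 300).reshape(-1, 1)

for _ in range(10):
    # Surrogate model of the function, with predictive uncertainty.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, std = gp.predict(candidates, return_std=True)

    # Expected improvement over the best value seen so far.
    imp = y.min() - mu
    z = imp / np.maximum(std, 1e-9)
    ei = imp * norm.cdf(z) + std * norm.pdf(z)

    x_next = candidates[np.argmax(ei)]        # the sequential decision
    X = np.vstack([X, [x_next]])
    y = np.append(y, expensive_f(x_next))

print(f"best x = {X[np.argmin(y)].item():.3f}, f = {y.min():.3f}")
```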
So I'll say a couple of words about the automatic statistician, but I do want to leave some time for questions. The automatic statistician is trying to automate model discovery. The idea is that we'd like a system where we can just give it data, and it searches over a large space of models, evaluating them according to some principled metric that trades off model complexity against the amount of data you have; the marginal likelihood, which I described, is one such metric. It produces a model, and then, interestingly, it translates that into a report that is interpretable by a human being. So this is the opposite of a black box: we really want a transparent box, something the human will be able to understand. And again, I'll skip over most of this because I do want to leave time for questions. We do a search over models; this is the automatic statistician applied to some time series. It finds a good model, then it comes up with a description of that model, producing the text itself. This is the executive summary; the full text is in the form of these reports, which are 5 to 10 pages long. Here is the report-writing demo, which we could run, and this is a slightly different version which also does clustering: it tries to visualise things, it tells you what it has found, and so on. And it tends to perform well at prediction, because being systematic actually pays off. We've applied this to classification as well, to regression, to clustering and so on. And we're going to have a release of it. I keep saying "very soon", but this time I really mean it: very soon means in a couple of months, I think.

Okay. So I'm going to wrap up there. This probabilistic modelling framework isn't the only way to do machine learning, but it's a really useful organising principle. There are many layers to it, and it's completely compatible with your choice of models, whether you like deep learning or even logic and other frameworks. We really can hybridise a lot of these methods to produce interesting systems that reason about uncertainty and learn from data.
I've briefly reviewed three topics; this is the review paper I wrote a couple of years ago that summarises this line of work. And I wanted to end by thanking a whole bunch of collaborators I've had.