Okay. So I'm going to introduce our speaker for the Strachey Lecture this term. Firstly, I'd like to say a big thank you to Oxford Asset Management, our sponsor, who makes this series of lectures possible. I've also been asked to draw your attention to our hashtag, prominently placed, in case you would like to tweet. And anyone interested in our software engineering programme can find brochures outside and people to talk to.

But on to the main business. It's my pleasure to introduce Zoubin Ghahramani, who will give our Hilary term Strachey Lecture. Zoubin is Professor of Information Engineering at Cambridge and a Fellow of the Royal Society. I think it's fair to say that his machine learning group in Cambridge has been one of the most influential over the last decade; it's hard to go to any major machine learning academic group or industry lab without finding Zoubin's ex-students or postdocs. More recently, Zoubin founded Geometric Intelligence, a start-up, and after its acquisition he's now the co-director of Uber AI Labs. Maybe if you ask nicely, he might tell you a bit about that; we'll see. Zoubin has made contributions across machine learning, particularly probabilistic inference, and even deep learning, which is not surprising: he was Mike Jordan's student and did his postdoctoral work with Geoff Hinton, so a pretty amazing pedigree there. Zoubin's seminal work is in Bayesian nonparametrics, where he has been leading the idea that it's not enough for machine learning research to aim for accurate predictions: we also need to be able to quantify uncertainty, and we need to be able to talk about causation. If we really want machine learning and AI to have an impact in industry, we need to be able to tackle those things. I'm sure he will tell you all about them. So please welcome Zoubin.

Thanks, Phil, for that great introduction, and a great thank you to the Department of Computer Science for inviting me. Okay, can you all hear me? Good. So I'm going to talk about probabilistic machine learning, which is my passion, the thing I'm really excited about. I'll start from basics, and as the talk goes on we'll get into more and more current research, more of what we're actually doing these days. That's why the subtitle is "Foundations and Frontiers": foundations is the motivation and background material, but if you're bored by that, don't worry, it'll get more technical later on.

Okay, so let's start from the basics. Machine learning.
Well, what is machine learning? It's just a term, and there are many other related terms. Depending on the community you come from, you might think of data mining, artificial intelligence, statistical modelling, neural networks, or pattern recognition, which is a somewhat more old-fashioned term. All these terms are related. I'll focus on the term machine learning, but keep that context in mind. In terms of academic disciplines, this is a very interdisciplinary area: we draw on ideas from computer science, engineering, statistics and applied mathematics, and we get a lot of inspiration from cognitive science and economics, and even tools from physics and neuroscience.

Why are people interested in machine learning these days? It used to be an interesting academic field where you played around and tried to get computers to learn from data, and most people didn't care much about it. Now suddenly lots of people care, and the reason is that there are many, many applications of machine learning. I like to think of it as the invisible thing behind a lot of the more visible applications that involve computers learning from data.

So let's go through some of those applications, just to motivate. Speech and language technologies is an area that has been transformed by the use of machine learning: automatic speech recognition, machine translation, question answering, dialogue systems. Every year we seem to get more and more advances in these sorts of tools. Computer vision, again, is a field that has been around for a very long time, but with the advent of large amounts of data and more powerful computational tools we're now able to do interesting things: not just object, face and handwriting recognition, but image captioning, going from an image to a bit of text that's meant to describe the image. These examples are from a very famous paper. You can pick them apart, in the sense that you could say they were hand-chosen to make the algorithm look good. But "man in black shirt is playing guitar": it seems pretty amazing that a computer could take an image like this and produce that description. It doesn't always work that brilliantly, but I would say most of us in the field were stunned when we saw this happen for the first time, when we could actually get a system to produce reasonable descriptions of images. Of course, we all have cameras in our pockets that put boxes around people's faces.
If you ever ask yourself how that works: that's a bit of machine learning running on all of your camera devices.

Moving into the sciences, a lot of the sciences have become very data-heavy. Fields like bioinformatics and genomics in the medical sciences, but also astronomy, are areas where we can now collect much more data than any human being could sit down and analyse manually. So machine learning and AI tools have become very important in scientific data analysis, and that's something I'll talk about a little later on as well. Recommender systems we all know: the "customers who bought this item also bought" kind of thing is driven by machine learning. Self-driving cars are something I'm now much more involved in. This is not a totally new thing: this self-driving car, ALVINN, was around about 30 years ago, and it used neural networks to drive at 70 miles per hour on highways. That's what it says on this slide, which I took from about 30 years ago, and it's very scary. I would not want to be anywhere close to that truck driving at 70 miles an hour on a highway, driven by a neural network that's about this big. But things have moved on, and we now have pretty good self-driving systems that are getting better every year.

Robotics: I just love the dogs playing football. This particular sort of RoboCup system isn't necessarily driven by machine learning, but there are a lot of excellent uses of machine learning in robotics. Automated trading and financial prediction. Computer games you're all familiar with: the landmark DeepMind results, first learning to play Atari games at a human or superhuman level, then more recently beating the world's best player at Go. And who knows what this is? This is Libratus, a system that recently won a poker championship. It played against a whole bunch of humans; the numbers in parentheses are how much money the humans lost to the computer. The very interesting thing is that poker is quite a complicated game. Think about what it involves: trying to understand the state of mind of the other player, bluffing, and things like that. To be a good poker player you have to be able to do those things, and so now we have good machine poker players as well.

So what is machine learning? If I had to define it, I would use a sentence like this: it's an interdisciplinary field that develops both the mathematical foundations and practical applications of systems that learn from data.
Here are some of the main conferences and so on associated with the field.

So that's the motivation from applications. But when you actually look at machine learning systems, most of the time they are trying to solve one of a few canonical problems, so I'll go through those in this introductory part of the lecture. Probably the most canonical problem is classification: you have some data and you want to classify it into two or more classes. The task is to predict discrete class labels from input data. That has lots and lots of applications, and there are a lot of buzzwords for the different methods that can be used; these are just different ways of trying to do classification from data. Regression is trying to predict some continuous quantity y from some inputs x. Obviously this has lots of applications as well, and there are lots of methods, some of which you might say are not machine learning methods: linear regression has been around for over a hundred years. But again, remember, this is all in the context of everything that's been going on in all of these neighbouring fields, and there's nothing that says this is a machine learning method and that's not. If it's making predictions and decisions from data, it is a machine learning method at some level.

Clustering: the task here is to group data together so that similar points are put in the same group. Many applications, and again many different methods. Dimensionality reduction: when you have very high-dimensional data, you might want to find a low-dimensional representation of that data that preserves the important information. Another canonical machine learning problem is semi-supervised learning, where you might have a few labelled points, like these two labelled minuses and these three labelled pluses, and you want to leverage the fact that you also have a lot of unlabelled data. Semi-supervised learning combines labelled and unlabelled data to get better predictions. And reinforcement learning, which is related to sequential decision making and adaptive control: the task there is to learn to interact with an environment, making sequential decisions to maximise future rewards. It's an interactive setting where you have an agent producing actions or decisions in an environment.
There may be some hidden state in both the agent and the environment; the agent receives some observed sensory inputs and has to act in the environment to maximise its rewards.

Okay, so those are the canonical problems. It is actually quite bewildering to start reading the machine learning literature if you're not an expert, because there are many, many different methods, and every paper seems to present a new one. Here is a very crude way of organising a bunch of machine learning methods, but don't give it too much weight. For the first few minutes I'm going to focus on one bubble here, the neural networks and deep learning one. The reason should be pretty obvious to anyone familiar with the field: these methods have been really revolutionary, involved in some of the most spectacular breakthroughs of the last few years.

So what are they? Well, a neural network, and I'm going to focus on a feedforward neural network just for simplicity (there are other kinds), is in its most standard form essentially just a function approximator. It takes some inputs, call them x, and produces some outputs, call them y. The way it produces them is through a sequence of transformations organised in layers. But all of that is, in a sense, a bit of a detail: it's just a way of representing a function that maps from x to y via tunable parameters called weights; I'm using theta to denote the parameters of the network. One of the important aspects of neural nets is that they're nonlinear functions, often nonlinear both in the inputs and in the parameters, so optimising them to minimise some objective function tends to be slightly complicated. The other defining characteristic of neural networks is that they represent the function from x to y in layers, which is essentially just a composition of functions. So here is a multilayer neural network with one hidden layer, represented as a function that maps from x to y through some parameters, and the superscripts (1) and (2) denote the two layers of parameters.
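To make the composition-of-functions view concrete, here is a minimal sketch in Python with NumPy. The tanh hidden nonlinearity, the linear output layer and the layer sizes are illustrative choices of mine, not taken from the slides:

```python
import numpy as np

def mlp(x, theta1, theta2):
    """A one-hidden-layer feedforward net: y = theta2 @ tanh(theta1 @ x).

    The network is just a composition of two parameterised functions:
    a nonlinear hidden layer followed by a linear read-out.
    """
    h = np.tanh(theta1 @ x)   # hidden layer, nonlinear in x and theta1
    return theta2 @ h         # output layer, linear in h

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(5, 3))   # 3 inputs -> 5 hidden units
theta2 = rng.normal(size=(2, 5))   # 5 hidden units -> 2 outputs
print(mlp(rng.normal(size=3), theta1, theta2))
```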
These neural networks are usually trained to maximise some likelihood, so they fall very squarely within the world of statistical models, using some variant of stochastic gradient descent; this is where we start using tools from optimisation theory. So that's one slide on neural networks. These things have been around for many decades; in fact, they are what got me excited about AI back in the eighties, when I was an undergraduate thinking about what to do with my life.

But something dramatic has happened between the 1980s and now. One thing that has changed is the terminology: people now call these deep learning systems, because they have many more layers. But there are other, more interesting, dramatic changes. The deep learning systems involved in a lot of these very impressive benchmark results are very similar to the neural net architectures of the eighties and nineties, with some important architectural and algorithmic innovations: the ability to use many layers, particular nonlinearities such as the ReLU, particular ways of regularising them like dropout, and very useful tricks for dealing with time series. They're also trained on vastly larger data sets, really web-scale data sets. To do that you need vastly larger compute resources: GPUs, GPUs on clouds, and so on. Importantly, there has been a major effort to democratise the software tools, so that it's now quite easy to train a neural network; we have much better software, things like Torch and TensorFlow. And of course there has been vastly increased industry investment and media hype. What that has meant is a huge influx of people trying out different variations of neural networks on different problems. Stepping back, I think of this a little as the community of machine learning researchers running a kind of genetic algorithm, trying out lots of different ideas and variations to improve on the performance on existing benchmarks.

Okay, so that's deep learning in a nutshell. There's a huge amount more to say about it, and there are many people better placed than me to say it. But one thing I do want to talk about is the limitations of deep learning. Let's step back from the excitement, acknowledge it, and ask: where do we go next? What do we need to focus on? I would argue that there are a few limitations we really need to think about.
One of them is that neural nets are very data-hungry: you often need millions of examples to train these large models. That should not be surprising if you know a bit of statistics; perhaps the surprising thing is that you don't need many millions of examples to train models with millions of parameters. People would have thought that was crazy, and it is surprising that you can get away with relatively small amounts of data, even though it's large by the standards of the eighties and nineties. They're also very compute-intensive to train and deploy. They're poor at representing uncertainty, which is something I'm particularly interested in. There are some great studies showing that neural nets and deep learning systems can be easily fooled by adversarial examples: you can construct examples that make the network very confidently give the wrong answer, and that should be worrying. It relates to the uncertainty point: it's okay for a system to make mistakes, but it's not okay for it to make mistakes really confidently, because then you don't know when to trust the answers, and you can't really build mission-critical systems, say in the healthcare domain or in self-driving cars, if you can't trust the confidences of your model. They're finicky to optimise: the optimisation is non-convex, and there are many different parametric and architectural choices that need to be made. And they're generally uninterpretable black boxes, lacking in transparency and difficult to trust. Of course people are working on all of these things, but I wanted to put them on one slide to motivate us to move towards the interesting challenges we have.

A particular area I'm really interested in, which Phil mentioned in the introduction, is thinking about machine learning as probabilistic modelling. So let's go beyond deep learning (I'll come back to neural nets and deep learning in a minute, in the context of probabilistic modelling) and talk about a view of machine learning grounded in the idea that we want systems that build probabilistic models from data. What do I mean by a model? The term gets used by many people in different contexts. What I mean is that a model describes data that one could observe from a system. Models should be able to make predictions; they should make statements about observable data.
If a model doesn't do that, it's very difficult to know whether you have a good model, whether you have a falsifiable model, for example. Now, if a model is making statements about possible data that could be observed, then what we're going to do is use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model. Think about a simple model, say a model that forecasts tomorrow's weather. That's not necessarily a simple model, but one could certainly build a simple version of it. You don't want models that make forecasts without telling you how uncertain they are, and you have to consider all the different sources of uncertainty you could have in predicting tomorrow's weather. You might have uncertainty coming from noise in the sensor data you collected. You might have uncertainty coming from unpredictable effects that your model did not consider. Your model might have parameters, and you might be uncertain about what the right parameters are. All of those sources of uncertainty need to be dealt with somehow, and we're going to use the language of probability theory to express them. To me, that is as fundamental as saying that we use calculus as the language to express rates of change: probability theory is the language of uncertainty. The good news is that we then don't have to invoke anything else. We can stay within the framework of probability theory to infer aspects of the model from data, to adapt our model to data, to make predictions, and so on. It all ends up being very, very simple.

And here is what it looks like. Here is Bayes' rule, the engine that drives learning from data. I'm colour-coding things into two classes: data and hypotheses. By data I mean anything that's actually measured, a measured quantity. By hypotheses I mean everything else. The world, from a Bayesian point of view, is divided into two kinds of things: stuff you're measuring and stuff you're not measuring. The stuff you're measuring, you've measured, so you kind of know what it is; it could be noisy, but you've measured it. And the stuff you're not measuring, you had better represent your uncertainty about, because you didn't measure it. All of those things we call hypotheses. But that's not the only thing on the slide.
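Written out, the rule on the slide takes the standard form:

```latex
\underbrace{P(\text{hypothesis} \mid \text{data})}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{P(\text{data} \mid \text{hypothesis})}^{\text{likelihood}}
        \;\overbrace{P(\text{hypothesis})}^{\text{prior}}}
       {P(\text{data})}
```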
I said that these hypotheses are how we express models of data, and we're going to use probability theory to express our models. So basically, for every potential configuration of our hypotheses, we should be able to describe the probability of the observed data under that hypothesis. That term is called the likelihood, and it's actually what drives most neural network learning: maximising likelihood, or penalised likelihood of some kind. But forget about neural nets; now we're talking much more generally. We have the likelihood, which gives you the probability of the data given the hypothesis, and we have a term called the prior. The prior is our representation of our uncertainty about everything we haven't observed, before we get our data.

So the game goes like this. Before we have our data, we have to place our bets on all the unobserved things, and we use the language of probability theory to do that: we put a probability distribution over our space of hypotheses. Then we observe the data. Aha! That's the beautiful moment where we can compute the likelihood, the probability of the data given the hypotheses, and the simple rules of probability tell you to multiply these two and renormalise over all the hypotheses you've been considering. What you get is your new state of knowledge: the posterior distribution over hypotheses given the data. And that is the prior you would use if you got any more data; there's nothing fundamentally different between the prior and the posterior. Each is just the representation of your state of knowledge at some point in the process, given the data you've observed so far. So learning and prediction can be seen as forms of inference using this rule.

And here is the one-slide description of Bayesian machine learning that I always use; apologies to those who've seen it. The point is that even the Bayes' rule I had on the previous slide is not a fundamental rule. The fundamental rules of probability theory are these two simple rules, the sum rule and the product rule. The sum rule tells you that the probability of some unknown quantity x is the sum, over some other unknown quantity y, of the joint probability; this is also sometimes called the marginalisation rule. And the product rule says that the joint probability of x and y can be factored into the probability of x times the probability of y given x, or the other way around.
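In symbols, for discrete variables (for continuous variables the sum becomes an integral):

```latex
\text{Sum rule (marginalisation):} \quad P(x) = \sum_{y} P(x, y)

\text{Product rule:} \quad P(x, y) = P(x)\,P(y \mid x) = P(y)\,P(x \mid y)
```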
From these two simple rules, substituting data and hypotheses for x and y, we can get the Bayes' rule of the previous slide. If we use the symbols theta to represent the parameters of our model, D to represent the observed data, and m to represent the model class we have assumed, then we get Bayes' rule applied to the parameters of our model. What would the parameters be? In a neural net, for example, they would be the weights; in linear regression, the regression coefficients; and so on. Every model in this world has parameters. One term is the prior, another is the likelihood, and the normalising constant, which is itself quite interesting, is called the marginal likelihood.

Prediction also follows from the sum and product rules. If you want to make predictions about any unknown quantity x given the data, the sum and product rules tell you there is only one valid way under this framework: you consider the predictions made by every possible parameter value, and you weight them by the posterior probability of the parameters given the data and the model class. So the act of forecasting or predicting any unknown quantity given the observed data is, by the sum and product rules, an averaging process: you have to average over all the hypotheses you've considered. You don't pick the best one, or your favourite one, and you don't flip a coin; you average over the space of hypotheses in this particular way. And if you want to compare different model classes, you apply Bayes' rule at the level of model classes, and there the marginal likelihood, the term in red, appears in the numerator and the denominator. None of this is mysterious; it all follows from those two rules.
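The three levels just described, written out in standard notation:

```latex
\text{Learning:}\qquad
P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}

\text{Prediction:}\qquad
P(x \mid D, m) = \int P(x \mid \theta, D, m)\, P(\theta \mid D, m)\, d\theta

\text{Model comparison:}\qquad
P(m \mid D) = \frac{P(D \mid m)\, P(m)}{P(D)}
```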
What do I mean by model comparison? The story might go like this. Say I'm a biologist. I do an experiment, and my colleague says, "I believe this transcription factor regulates these genes." And I say, "No, I have a different model: I believe it doesn't, and that this other one does," or something like that. So my colleague and I have two different models. We could argue about it in words, but if we follow this probabilistic framework, what we should do is both write down our models to the level of specification where they can make predictions about observable data, so that we can assign a probability to the observable data. Then we observe the data D, and now we can settle the argument. We basically ask: what marginal likelihood does your model give to the data, and what marginal likelihood does my model give? Now, both of our models had some free parameters. Maybe your model had 17 free parameters and mine had three, so my model is simpler somehow, and I get nervous; that seems unfair. If my colleague goes and optimises those 17 parameters, then sure enough she can fit the data much better than I can. But that's not the game. Optimisation doesn't follow from the sum rule and the product rule. It doesn't matter that my colleague has 17 parameters and I have three: if we can both compute the marginal likelihood, we can settle the argument.

I actually really strongly believe that in an ideal world, science would be done like this. People wouldn't just publish their papers in open journals and share their data in an open manner; they would also write down their models in a way that could be evaluated against future data, maybe as probabilistic programs, which I'll talk about later. Then we could do an objective (well, actually subjective, but principled) comparison of models embodying different subjective opinions about what the hypotheses are.

Okay, so that was one slide on Bayesian machine learning. Why should we care about all this? We've had a revolution in machine learning with wonderful, fantastic deep learning methods that never mention Bayes anywhere. So why should we care about all this Bayesian stuff? Well, the reason I care is that I'd really like models with calibrated senses of uncertainty. I want to be able to trust my system: if it says the probability of there being a pedestrian in front of my car is 0.1, I want that to mean 10%, so that I can take actions that correspond to that calibrated probability. Getting systems that know when they don't know is, I feel, very important. There's also a very beautiful thing about all of this, which speaks to that unease about 17 parameters versus three parameters, or different model structures: this framework gives you automatic tools to compare models of different complexity and to automate the learning of models from data. This is called the Bayesian Occam's razor, and it's something I will use in the latter part of my talk.
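Here is a toy numerical illustration of that point (my own example, not one from the talk): a simple model with no free parameters against a more flexible one-parameter model, compared by their marginal likelihoods rather than by their best fits.

```python
import numpy as np

# Model A: a fair coin, no free parameters.
# Model B: a coin with unknown bias p and a uniform prior on p.
# The marginal likelihood averages the likelihood over the prior,
# so Model B is automatically penalised for its extra flexibility.

def marginal_likelihood_fair(heads, tails):
    return 0.5 ** (heads + tails)

def marginal_likelihood_biased(heads, tails, grid=100_001):
    p = np.linspace(0.0, 1.0, grid)
    lik = p**heads * (1.0 - p)**tails   # likelihood at each parameter value
    return lik.mean()                   # average over the uniform prior

for heads, tails in [(5, 5), (9, 1)]:
    ma = marginal_likelihood_fair(heads, tails)
    mb = marginal_likelihood_biased(heads, tails)
    print(f"{heads}H/{tails}T: P(D|fair)={ma:.2e}  P(D|biased)={mb:.2e}  "
          f"ratio={ma/mb:.2f}")

# 5H/5T: the simpler model wins (ratio > 1), even though the flexible
# model's *optimised* likelihood would be at least as good.
# 9H/1T: the data now really support a bias, and the flexible model wins.
```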
Okay. So let's go back to our neural networks, just to ground the discussion a little bit. Here's a neural network that maps from x to y, and there are different sources of uncertainty here. One of them is parameter uncertainty: we have weights in the neural network, and given any finite amount of data we're not sure what those weights should be, so we need to represent our uncertainty about them. But we also have structural uncertainty: we've made some structural choices, like the architecture, the number of hidden units, and the choice of activation functions, and that's also a source of uncertainty. It would be great if we could represent all of that. And that's not a new idea; none of these are really new ideas. In fact, the idea of doing a Bayesian analysis of neural networks has been around since the early nineties, and actually the late eighties. Here's a bit of a history of a few different methods. And here is a depiction of what we'd really like: this is a system trained to do regression on some data, and the behaviour we want is that outside the range of its training data it says, "I don't really know." There are many ways of achieving that (these are all different ways), and we had a nice workshop at NIPS on Bayesian deep learning where we brought that history together and looked at some of the current state of the art. A toy sketch of the underlying idea, averaging predictions over uncertain weights, follows below.

This field of machine learning often has camps, and people think you have to be in one camp or another. You don't. You have to understand what the tools are in all the different camps, and there's a lot of fertile ground at the intersection of these camps; this is one example of that.
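The sketch below is my own illustration of the mechanics, not code from the talk: given samples of the weights (faked here with random draws standing in for samples from a real posterior), the prediction is an average over the networks those samples define, and the spread of their outputs is the model's uncertainty.

```python
import numpy as np

def mlp(x, theta1, theta2):
    """The same one-hidden-layer net as before."""
    return theta2 @ np.tanh(theta1 @ x)

# Pretend these are posterior samples of the weights given training data.
# (Faked with random draws; a real Bayesian neural network would obtain
# them from an inference algorithm such as MCMC or variational inference.)
rng = np.random.default_rng(1)
posterior_samples = [(rng.normal(size=(8, 1)), rng.normal(size=(1, 8)))
                     for _ in range(200)]

def predict(x):
    """Posterior predictive: average predictions over weight samples."""
    ys = np.array([mlp(x, t1, t2) for t1, t2 in posterior_samples])
    return ys.mean(), ys.std()   # mean prediction and its uncertainty

for x in (0.0, 1.0, 5.0):
    mean, std = predict(np.array([x]))
    print(f"x = {x}: prediction {mean:+.2f} +/- {std:.2f}")
```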
So when do we need probabilities? We need them when the learning or intelligence problem we're solving depends crucially on representing uncertainty. I've sort of said that already, but let me describe some examples. Any time we're doing forecasting, and that could be financial forecasting, weather forecasting, or forecasting demand at Uber or for Amazon products or whatever, we need to represent our uncertainty. Decision making: generally, when you make decisions you're thinking about the consequences of your actions into the future, and it's really useful to represent uncertainty there; it's hard to imagine not doing that at some level. Learning from limited, noisy and missing data: imagine dealing with medical records. If you're trying to do machine learning on medical records, you have patients, and each of them has lots of things that are unobserved; maybe only a few medical tests have been done on each patient. Looked at that way, most of the data is actually missing. Learning complex personalised models: again, whether in a medical domain or a retail domain, you might think you have a huge data set, but actually for every patient or every customer you only have a little bit of data. So it's not really a big-data problem, and you need to represent uncertainty about that individual. The whole field of data compression is based on probabilistic modelling. And a lot of my interest in automatic model discovery and experiment design is really based on uncertainty.

Now, over the last three months I've been involved in setting up Uber's AI Labs, and I'll just mention that in one slide. Why would Uber care about any of this? Well, if you look at many of the problems that a large technology company has to solve, they are problems that deal with uncertainty, decision making, personalisation and so on. There are a huge number of problems, and a huge number of opportunities, around any of the major technology companies for learning from data and for using uncertainty. And, fairly obviously, if you're trying to build a very complicated system that makes decisions in the real world, like a self-driving car, you'd really like to have calibrated uncertainties in that system.

Okay. So here is the one-slide picture of my current passions, my current research interests. In the next few minutes I'm going to touch on a few of these topics, and I'll leave a few minutes for questions at the end; it's fairly modular, so I can stop to give us time for questions. I want to put this slide up because, well, actually because I had this slide. And the reason I had it is that I was asked to give a talk about a year ago and was told: summarise your work in one slide. That forced me to produce this slide, and when I produced it, I thought it was actually kind of a useful exercise.
The useful exercise is that it crystallised in my mind the thing that really drives me. And it's not that I'm a Bayesian and I just love probabilities or anything like that. It turns out the thing that really drives me is that I like stuff that's automated. I want things to be systematic and automated, and computer scientists are very good at that. If you put your computer science hat on and you do something three times, you think, "Oh, I need to write a computer program to do that for me; three times was two times too many." And the sorry state of machine learning is that a lot of it is not really automated. There are still tremendous amounts of human labour, arbitrary decision making and tweaking involved in deploying machine learning systems. Which is ironic: the whole field is about getting systems to learn from data, but there are a lot of well-paid researchers and engineers tweaking those systems that learn from data. So let's think about automating these things. This is what drives me.

So if you look at some of these topics, which I'm going to talk about: the automatic statistician, which I'll talk about in a couple of minutes, is about automating the process of model discovery from data, searching for a good model. Probabilistic programming, something Frank Wood, who is at Oxford, is a world expert in, is about automating the process of doing inference in a very general probabilistic model. We also want to automate optimisation. Optimisation is actually a sequential decision problem: an optimiser trying to optimise a function is making decisions about where to evaluate the function next, collecting some data, and then moving on to another point, and so on. People don't usually think about optimisation that way; they think, "here's an algorithm, and here's something I can prove about the algorithm." But optimisation is very much like bandit problems and reinforcement learning, sequential decision making under uncertainty. And that drives the last topic, because we want to automate the allocation of computational resources. Especially now, machine learning systems are very complex: they use a lot of memory and a lot of CPU, and the data sets are very big, so we can't afford to just tinker about and run a few experiments on a single computer.
And when we run major experiments, we actually have to worry about the fact that they're running on a big cloud of computers, and that uses energy; energy costs money, and using energy like that is not good for the world. So: optimising resource allocation. These are the things that drive me these days, and I'm going to talk about a couple of them very quickly.

Probabilistic programming is one of them. The problem here is that developing probabilistic models and deriving inference algorithms is generally a very time-consuming and error-prone process. The solution is to develop probabilistic programming languages. So what are these things? This is a very beautiful marriage between the probabilistic modelling and programming languages worlds. The idea is that a probabilistic programming language is a way of expressing probabilistic models, and the modern ones, the ones that people like Frank Wood and myself are very interested in these days, are completely general, Turing-complete programming languages that can express any computable probability distribution.

That's the expression part. But what do you do with that? Well, first of all, how do you do it? You express your model as a simulator, a simulator that would generate data. That's one canonical way of doing it, and it's a very natural concept. You could say: I have a model for the weather; well, that's actually kind of a simulator, and I write it as a computer program. I have a model for my gene expression network, and that's going to be a simulator that simulates gene expression data. That's the modelling part. But then you have some data: you have a simulator and you have some data, and what you're really interested in is inferring, or learning, the parameters of your simulator, of your model, given the data. And the very incredible thing is that we can actually come up with universal inference engines: inference engines that, in principle, can compute the probability distribution over the hidden variables in our computer program given the data. It's basically running Bayes' rule on computer programs. We're all used to running computer programs in the forward direction: you take some inputs and produce some outputs. But this is kind of doing it backwards. You have a computer program that takes some inputs, makes some calls to random number generators, and produces some random outputs; that's the data. And now we ask: what should the inputs and the calls to the random number generators have been, for the program to have produced this observed output? That's Bayes' rule on the program.
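To give a feel for what a universal inference engine does, here is a minimal sketch of my own (a toy, not one of the languages mentioned below). The model is an ordinary function that makes random choices, and inference about its hidden variable is done generically by likelihood weighting: run the program forwards many times and weight each run by the probability it assigns to the observed data.

```python
import random

def weather_model(rng):
    """A toy simulator: a hidden state generates a noisy observation.

    Hidden variable: whether it is raining (prior probability 0.3).
    Observable: whether the pavement looks wet.
    """
    raining = rng.random() < 0.3
    p_wet = 0.9 if raining else 0.2
    return raining, p_wet              # hidden state, P(wet | state)

def infer(model, observed_wet, n=100_000, seed=0):
    """Generic likelihood weighting over any simulator of this shape."""
    rng = random.Random(seed)
    total = rain_weight = 0.0
    for _ in range(n):
        raining, p_wet = model(rng)
        w = p_wet if observed_wet else 1.0 - p_wet   # likelihood weight
        total += w
        rain_weight += w * raining
    return rain_weight / total         # posterior P(raining | observation)

# Seeing a wet pavement raises P(raining) from 0.3 to about 0.66.
print(infer(weather_model, observed_wet=True))
```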
And there are many languages now. Anglican is the one that Frank Wood's team has been developing, one of the state-of-the-art languages. Our group in Cambridge has a language called Turing, which is much less developed but also exciting; it's based on Julia. There are many different languages developed by different groups, and many different inference algorithms that can generally run on models written in those languages. Here, for example, is a hidden Markov model written in Turing. It's fairly easy to read, and if you uncomment one line of this model you go from a regular hidden Markov model to a Bayesian hidden Markov model. So changing models around is as easy as adding and removing a few lines of your probabilistic program. And I really think that if our vision actually plays out, this could really revolutionise scientific modelling: if people were actually willing to write probabilistic programs for all of their models and they shared them, then people could take somebody else's model, run it on their own data, improve it, and so on. A few resources here.

I'll just give you a few examples; these are slides from my postdoc, Hong. There's a little bit about Turing, which I'll skip through. That's our HMM example, but much bigger. This is a Bayesian neural network; most of this code is specifying the prior on the weights, and then this is the actual Bayesian neural network, just the neural network function and so on. And then you can just run inference using, say, Hamiltonian Monte Carlo; you don't even have to know what that is, because the language abstracts the model specification away from the inference. And our language, Turing, is pretty competitive: it's in the same ballpark as Anglican, occasionally a bit faster, though I know the Anglican team keeps improving their language as well.

Another topic I want to talk about is Bayesian optimisation, and I have basically a couple of slides on that. The problem here is that you want to find, ideally, a global optimum (maybe that's too much to ask) of some black-box function that is expensive to evaluate. So you can't just evaluate it in lots and lots of places; you need to think about where you're going to evaluate your function next. And we don't want to do that manually; we want to automate the algorithm that thinks about it. The solution is to treat the problem as sequential decision making under uncertainty, where what we're uncertain about is the actual function. And this has a huge number of applications.
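One common recipe, sketched below under my own assumptions (the talk doesn't commit to a specific method): model the unknown function with a Gaussian-process surrogate and pick the next evaluation point by expected improvement, which trades off a low predicted mean against high predictive uncertainty.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_f(x):
    """Stand-in for an expensive black-box function (to be minimised)."""
    return np.sin(3 * x) + 0.5 * x**2

X = np.array([[-2.0], [0.5], [2.0]])          # a few initial evaluations
y = expensive_f(X).ravel()
candidates = np.linspace(-3, 3, 300).reshape(-1, 1)

for _ in range(10):
    # Surrogate model of the function, with predictive uncertainty.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, std = gp.predict(candidates, return_std=True)

    # Expected improvement over the best value seen so far.
    imp = y.min() - mu
    z = imp / np.maximum(std, 1e-9)
    ei = imp * norm.cdf(z) + std * norm.pdf(z)

    x_next = candidates[np.argmax(ei)]        # the sequential decision
    X = np.vstack([X, [x_next]])
    y = np.append(y, expensive_f(x_next))

print(f"best x = {X[np.argmin(y)].item():.3f}, f = {y.min():.3f}")
```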
So I'll say a couple of words about the automatic statistician, but I do want to leave some time for questions. The automatic statistician is trying to automate model discovery. The idea is that we'd like a system where we can just give it data, and it searches over a large space of models, evaluating them according to some principled metric that trades off model complexity against the amount of data you have; the marginal likelihood, which I described, is one such metric. It produces a model, and then, interestingly, it translates that into a report that is interpretable by a human being. So this is the opposite of a black box: we really want a transparent box, something the human will be able to understand. And again, I'll skip over most of this because I do want to leave time for questions. We do a search over models; this is the automatic statistician applied to some time series. It finds a good model, then it comes up with a description of that model, producing the text itself. This is the executive summary; the full text is in the form of these reports, which are 5 to 10 pages long. Here is the report-writing demo, which we could run, and this is a slightly different version which also does clustering: it tries to visualise things, it tells you what it has found, and so on. And it tends to perform well at prediction, because being systematic actually pays off. We've applied this to classification as well, to regression, to clustering and so on. And we're going to have a release of it. I keep saying "very soon", but this time I really mean it: very soon means in a couple of months, I think.

Okay. So I'm going to wrap up there. This probabilistic modelling framework isn't the only way to do machine learning, but it's a really useful organising principle. There are many layers to it, and it's completely compatible with your choice of models, whether you like deep learning or even logic and other frameworks. We really can hybridise a lot of these methods to produce interesting systems that reason about uncertainty and learn from data.
I've briefly reviewed three topics; this is the review paper I wrote a couple of years ago that summarises this line of work. And I wanted to end by thanking a whole bunch of collaborators I've had.