So one of the interesting things is the way that deep learning, machine learning, has changed things even in this physics department. On Mondays we've just started an informal lecture series: one of our former postdocs is giving basic introductions to machine learning. We advertised it only to the physics faculty, the graduate students and our final-year students, and we packed out the lecture theatre. That wouldn't have happened two or three years ago. So there's been an enormous change in the interest inside physics, as well as a huge change in interest worldwide.

So I'm going to do a very basic introduction. Very pedestrian: if you know a lot about machine learning, I'm not going to say anything new until the end. I apologise for that, but I'm setting up the next two speakers, Andre Lukas and Elliot Benton, who'll be giving you some exciting cutting-edge applications of machine learning in physics.

I want to start by giving a kind of potted and partial history of the field. And it starts with this great man, Alan Turing, who wrote this unbelievably important paper, "Computing Machinery and Intelligence", which is kind of the founding paper of the field of artificial intelligence. He made many important contributions there, including the Turing test. But he also said: instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. We have thus divided the problem into two parts: the child programme and the education process. I'm going to tell you today about the child programme and the education process.

This paper stimulated enormous amounts of work, and when we talk about machine learning it's important that we look at the history of machine learning and artificial intelligence; I'll come back to the ambiguity of those words a little bit later.

Here you have a little timeline from 1950 down to here (the arrow's the wrong way around). The Turing test, 1950. Not that long afterwards the first automated translators came out, and they kind of worked, but they didn't work very well. There was enormous interest, huge amounts of money went into it, and then it kind of died out. In 1957 a man called Rosenblatt invented the perceptron, which is a very basic model of a neurone; we'll come across it a bit further on in the talk. Then there was a piece in The New York Times in 1957.
That's only seven years after Turing's paper, and the journalist wrote: the Navy revealed the embryo of an electronic computer that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. So there's nothing new about hype; we're used to hype cycles in lots of fields. But in 1957 this was the feeling, and huge amounts of money poured into this area.

By the end of the sixties the money started to dry up, because the promise was not really being kept, and the very famous 1969 book by Minsky and Papert on the perceptron showed that, in fact, these perceptrons were not nearly as powerful as people had initially thought: they couldn't do some basic logical operations like XOR, and so on. Interest in these things started to die down. And in 1973 came the Lighthill report: the British government commissioned a big report on AI from the very famous mathematician James Lighthill, who wrote that there was really going to be no progress in this. His argument was that these problems are typically so complex, with a combinatorial explosion of possibilities, that the field would only ever work on toy problems and never anything else. And then came the first AI winter: the funding completely dropped, people's careers ground to a halt, and nobody wanted to fund this anymore.

Then in the 1980s we had a comeback with expert systems, and a few of you may remember these. All kinds of companies jumped on the bandwagon; hundreds of millions of dollars and pounds were poured into research, and great promises were made about what this would all mean, how this would transform our lives. But by the end of the 80s these systems turned out not to work as well as people had thought; they were very expensive to maintain and they were very brittle. And so all the funding stopped again.

As recently as 2007, in The Economist, I found the following quote: investors were put off by the term voice recognition which, like artificial intelligence, is associated with systems that have all too often failed to live up to their promises. So this was something close to the end of the second AI winter, when there was very little funding in this area.

Now, everything has obviously changed in the last decade; we wouldn't be here if that were not the case. And I want to locate the big change in 2012. It's somewhat arbitrary, but I'll explain to you what changed in 2012, and what made Google CEO Sundar Pichai say something like: AI is one of the most profound things we're working on as humanity, more profound than fire and electricity.
Now, I can safely say that that is probably hype, but he's excited, OK? People are excited, and enormous amounts of money, billions and billions, are being poured into this area. I have a friend who recently listed a company on the stock exchange, and he told me that by putting "artificial intelligence" in the title he doubled the valuation of the company. So there's a lot of interest in this, and I think for good reason. There's also a lot of hype: if you open a newspaper you find all kinds of wild claims, you know, AI will take away your jobs, there won't be any doctors anymore, and so on. A lot of that is hype cycle and probably not true.

But there are also extremely exciting things that have happened. One that you may be familiar with: in 2016 a computer programme from Google DeepMind called AlphaGo beat Lee Sedol, who was an 18-time world champion in Go. That was a big step forward, because as opposed to something like chess, which is a relatively constrained game, Go is unconstrained, has an enormous number of possibilities, and was thought to be an unsolvable game. So this new age of deep learning had done something really amazing.

Perhaps even more interesting, in the summer of 2017 DeepMind released a programme called AlphaGo Zero, which didn't know anything about Go. It just learnt the rules, OK; given the rules, it played against itself, so it didn't look at any games that were played by experts. And after 40 days it was able to outperform AlphaGo, the version that had beaten Lee Sedol and had been trained on all the expert games. It had taught itself how to play Go much better than other computers, and certainly much better than humans.

And the same thing happened for chess: they made a chess programme that taught itself how to play chess. Interestingly, it started out as a poor chess player; then it went through all the standard opening moves, learnt them and discarded them one by one until it became better than the world's best, better than the best alternative chess programmes, which are built with a lot of expert information. So this is a programme that teaches itself how to play the game, and that, without a doubt, is an extraordinary achievement and a very exciting thing.

So that's why people are excited, but how do these programmes work? Well, I'm going to give you a very potted, simple introduction to the basic technology behind this, and I'm going to tell you why I think 2012 is the start of what we call the deep learning era. A very important part of what changed things is the accessibility of huge amounts of new data.
This is Fei-Fei Li, a computer scientist at Stanford. I put her in because she was a physicist originally; actually, a lot of people who were physicists moved into this field over the last few decades and transformed things. She introduced a competition called ImageNet, where there are now 14 million images in about 20,000 categories: a cat, a dog, a bicycle. And the competition was: take your best computer programme, learn from a bunch of images, and then we'll give you new ones that you haven't seen before, and you predict whether each is a cat or a dog or a bicycle.

In 2012 a team from the University of Toronto won; Alex Krizhevsky was the main person, along with Geoffrey Hinton, actually the most famous person in that field. Their network, which is called AlexNet, had 60 million parameters, and it beat all the other entrants by a huge margin: roughly 40 percent lower error. That was a big step forward, and now we're down to about a two percent error, all based on these machine learning networks. Here I have a little schematic picture of AlexNet; it still looks pretty complicated, OK, it's a slightly messy system. But it got people really interested.

Immediately Google started pouring huge amounts of money into this, and so did Facebook and Microsoft; these companies are now rebranding themselves as AI companies. And it's all based on this basic technology called deep learning. Although it's not complicated (I'll explain in a minute more or less how it works), in the academic world this grew incredibly. Take Geoffrey Hinton, whom I'll mention a few more times. If you look on Google Scholar you can see his citations: this is 2011, 2012, and it has just exploded. Last year alone he had seventy-three thousand citations. For comparison, world-leading scientists, even Nobel prize winners, often have fewer citations than that in their whole lifetime; that's how many he got in a single year. This gives you a sense of how many people are working in this field. Three of the five most cited papers in Nature from 2019 were on deep learning, so there's enormous excitement about this.

And the three great experts, the three great founders of this field, Yoshua Bengio, Geoffrey Hinton and Yann LeCun, won the Turing prize, which is the equivalent of the Nobel prize in computer science, last year. What's interesting is that these three pioneers worked more or less in obscurity, at the fringe, for many years.
Just a couple of months ago I saw a quote from Hinton, where he talked about trying to submit a paper to an AI conference years before, and the referee said: Hinton has been working on this idea for seven years and no one is interested; it's time to move on. So for many years (Hinton is in his 70s now) he worked on these ideas, and everybody told him he was crazy. These three worked through the big AI winters, in spite of the fact that everyone told them this was wrong; they couldn't get funding, they couldn't get published, but they kept at it. And now the revolution has come. This is a good lesson for us: very often great innovations come from outside, and ideas that look wrong at one time may turn out to be revolutionary at a different time.

What changed was the availability of large amounts of data, like ImageNet, on which to train things, plus of course lots of proprietary and commercial data; and, of course, large computers. Those two things, plus some algorithmic innovations, allowed an idea which actually goes back to the nineteen-fifties and sixties, these neural networks, to revolutionise the way we do artificial intelligence today.

And what's new about them? I'll explain a few things that are new. One is that, unlike traditional AI systems, they're simple to use: secondary school students can use them now and train things. So it's a huge revolution. In physics, Physics World ran an article just last year asking whether machine learning will revolutionise physics. Well, we'll see; one should be careful about calling things revolutions. There's definitely a bandwagon, there's hype. I was talking to a colleague this week who said, you know, one of the tricks in science is to jump on bandwagons and then jump off them before they crash. I don't think this one is going to crash, because we're seeing enormous and exciting applications.

There's actually been a long history of applications in data analysis. Particle physicists have been using neural networks for a very long time to analyse their data, and that's one of the reasons I think there's a natural link between physics and this kind of machine learning based on neural networks. There's a huge amount of work in image analysis: we've got people here working on biological physics who are using these machines to analyse images of cells dynamically.
These networks, by looking at images, are recognising patterns in the data that humans seemed unable to see, and that's being exploited. They're even being used to represent quantum many-body wave functions and to calculate the energy levels of molecules and many-body systems to much higher accuracy than was thought possible before. There are beautiful experiments by some colleagues here, including Séamus Davis and others in physics, where they took data from quantum magnet experiments, looked at it again with a machine learning algorithm, and discovered a bunch of new patterns in there. So there are really exciting ways of using this in image analysis and in controlling experiments, and in the last talk we'll hear about that and much, much more; an enormous number of cool examples.

Just to give you one last bit of big picture, a big-picture explanation: what is artificial intelligence? We billed this day as artificial intelligence in physics, but what we're really talking about is something called machine learning, and I'll mostly talk about a subset of that called deep learning. Artificial intelligence is a catch-all phrase for all kinds of computational methods that make a computer intelligent in one way or another; there are lots of methods based on symbolic manipulation, a really wide range of techniques. Inside that is a smaller subset of techniques called machine learning. When you saw the quote from Turing, he was speaking about training a computer like a child: machine learning uses data to train the parameters of a machine. There are many different machine learning techniques, a wide variety of them. And the one that really caught everyone's attention is the method called deep learning: that big AlexNet you saw, which won the ImageNet competition in 2012 (and deep learning has dominated the ImageNet competitions ever since), is deep learning.

So what is deep learning? This is pretty basic. Here's the quote from Turing again, with its second part: we've got two parts to the problem, a child programme and an education process. Deep learning is based on the following idea.
I have some input nodes, with weights, some kind of relationships (which I'll describe in two minutes) connecting them to another set of hidden nodes, and then an output layer. The input might be pictures of cats and pictures of dogs, and the output might be "this is a cat" or "this is a dog". Very, very loosely, this is inspired by neurones in the brain. Neurones in the brain can fire or not: firing corresponds to a node having a large value, not firing to a small value. And a node connects to the other nodes in the same way a neurone in your brain is connected to many other neurones; the weights take that into account.

So I want to draw one or two things here on the board; I realise you can only see a small amount of the board, but that's OK. I'm going to give you a really, really simple example. Say I have three input nodes, and I have a problem I'm interested in, something like: I'm a doctor, and I ask you three questions to figure out if you're ill, or whether you need to be quarantined, for example. The first question might be: do you have a cough? This could be one or zero, right, yes or no. Question two: do you have a sniffle? Again one or zero, yes or no. And question three could be: did you just come back from Italy? One or zero.

OK. Then there's some kind of logic I would apply to this. For example, if none of these is true, the answer would be that you don't need to be quarantined, OK? And maybe if you've just been to Italy, I would quarantine you no matter what. And you can see how this goes on, right: maybe if you have a sniffle but haven't been to Italy, I'll just keep an eye on you, but if you have a sniffle and you've been to Italy, I will definitely quarantine you, et cetera. I'll just go through it really quickly; at some point my made-up logic breaks down, so let's not pretend this is exactly how a doctor would do it. But what you see is that for each of these eight possible inputs I have an output: those are my inputs and outputs.

So what I've actually done in this particular problem is define a little function, right? A function that takes this set of inputs and gives, in this case, a set of outputs.
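To make that concrete, here is the kind of truth table I have in mind, written out as a minimal Python sketch. The exact outputs I wrote on the board aren't fully recoverable from the recording, so treat the mapping below as an illustrative assumption, not the precise table:

```python
# Inputs: (cough, sniffle, italy), each 0 or 1.  Output: 1 = quarantine, 0 = don't.
# The particular outputs here are an assumed, plausible version of the board's table.
QUARANTINE = {
    (0, 0, 0): 0,  # no symptoms, no travel: no quarantine
    (0, 0, 1): 1,  # been to Italy: quarantine no matter what
    (0, 1, 0): 0,  # just a sniffle: keep an eye on you, but no quarantine
    (0, 1, 1): 1,  # sniffle and been to Italy: definitely quarantine
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 1,  # cough and sniffle together: quarantine
    (1, 1, 1): 1,
}

def quarantine(cough, sniffle, italy):
    """The doctor's decision: a function from three bits to one bit."""
    return QUARANTINE[(cough, sniffle, italy)]

print(quarantine(0, 1, 1))  # sniffle + Italy -> 1 (quarantine)
```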
If I were a well-trained doctor deciding whether to quarantine you or not, this would be my input-output function; or perhaps the government would hand me this advice. That's the function I then have. And so I might want to train a computer to be able to learn this function; it's a relatively simple function. So how would I do this?

Well, I've got three input nodes, one, two, three, each of which can be zero or one. Then I'm going to have a layer here of further nodes, maybe lots of nodes; I'll call these layer one, layer two, and so on, and then I have an output node. What I'm going to do is draw lines that say this input node connects to all the nodes in the first layer; this one does the same; this one does the same. OK, incredibly simple. And those nodes do the same to the next layer, et cetera. I could keep drawing forever, and at the end the nodes all converge on this final decision node. I'm not going to draw all the lines, because there are tons and tons of them, and I can have many layers.

Then what I do is put a weight on each of these lines. So for node number one of layer one (maybe I'll colour them so you can see them), the weights going in are weights one-one, two-one and three-one: there are three of them, one from each input.

I couldn't quite figure out how to make the board go up, so I'm going to jump to the side and hide this little truth table here for a minute or two. Sorry. All right, so what is the value of this node going to be? Well, node one in layer one is going to have a value which is some function f of the inputs. What are the inputs? It's x_1 times weight w_11, plus x_2 times w_21, plus x_3 times w_31, plus maybe an offset:

a_1 = f( x_1 w_11 + x_2 w_21 + x_3 w_31 + b )

And this function f will typically be something like the sigmoid, f(z) = 1 / (1 + e^(-z)), which is a classic choice. This simple function looks like this: for large negative z it's zero, at z = 0 it's a half, and for large positive z it's equal to one. So if the input to the node is large and negative, my node goes to zero; if the input is large and positive, it goes to plus one; otherwise it takes a value in between.
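Here is that single-node computation as a minimal numpy sketch; the particular weight and offset values are arbitrary example numbers, not anything from the board:

```python
import numpy as np

def sigmoid(z):
    """Zero for large negative z, one half at z = 0, one for large positive z."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])   # the three binary input values
w = np.array([0.4, -1.3, 2.0])  # weights w_11, w_21, w_31 (arbitrary example values)
b = -0.5                        # the offset

a1 = sigmoid(x @ w + b)         # value of node 1 in layer 1
print(a1)                       # some number between 0 and 1
```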
The sigmoid is just one of the many activation functions you could have. And notice what this really is: it's f(x · w + b), a function of an inner product. Now I do that for this node, for that node, and for all the other nodes in the layer; then I take those node values and use them as inputs for the next layer, and I play the same game, and the same game again at the end. Finally I look at the last node, and I might say: if its value is larger than one half, I should quarantine you; if not, I won't. OK. So that's basically the way this works, and it's incredibly simple. It's so simple that you wonder: if I put enough of these together, how could that learn how to play Go, for example, or learn how to recognise images? It's quite a striking thing, because mathematically it's unbelievably simple: you're just adding up numbers, multiplying them, and then applying a little nonlinearity, the threshold, at the end.

That nonlinearity is important, though, because without it this model is just a linear model: you can write it down as a series of matrix multiplications, and a product of matrices is just a matrix, so the whole thing collapses to a linear model, which can't do very much. So the nonlinearity turns out to be really important.

All right, so that's the first part of the basics. The second part is that I need some education process, and there are basically three main classes of education process. The first one is called supervised learning. In supervised learning I have two sets of data. Here's a very famous example called MNIST: these are handwritten digits, and they were used to train some of the earlier neural networks, before 2012, that were used for automatic scanning of cheques, for example. These are the zeros, those are the ones, those are the twos. So I might take this first subset as inputs to my machine, and then I play with these weights until, for each of these images, if it's this I get a zero, if it's this I get a one, two, three, four or five. I train it until, hopefully, I get zero error on my training set.

Or I can make it even simpler with our little example. Let me pick these rows as my training set: I'm going to give you these four inputs with the correct outputs, and I'm going to ask an automated optimiser to find weights such that, given those inputs, the network gives me those outputs.
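Here is a minimal sketch of that training loop. The talk doesn't specify the optimiser at this point, so plain gradient descent on a mean-squared error stands in, and the truth table is the assumed one from above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# All 8 inputs and the (assumed) quarantine outputs from the truth table above.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], float)
y = np.array([0, 1, 0, 1, 0, 1, 1, 1], float)
train = [0, 2, 5, 7]       # four rows used for training; the other four are the test set

# One hidden layer of 5 nodes feeding one output node.
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

lr = 2.0
for step in range(5000):
    h = sigmoid(X[train] @ W1 + b1)       # hidden layer, exactly as on the board
    out = sigmoid(h @ W2 + b2).ravel()    # the final decision node
    # gradient of the mean squared error, pushed back through the sigmoids
    d_out = ((out - y[train]) * out * (1 - out))[:, None] / len(train)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;        b2 -= lr * d_out.sum(0)
    W1 -= lr * X[train].T @ d_h;   b1 -= lr * d_h.sum(0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
print((pred > 0.5).astype(int))  # decisions on all 8 inputs, trained and untrained alike
```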
Then I've trained my system, and then I'll test it. OK, so I might test it on something more complicated, this image: I would take these ones and see how well I do. Or, with our extremely simple toy example, I would train it first on those four rows and then test it on the rest. And once I'm happy that it gives me the right result with high enough accuracy, then I unleash it on the public. So that's the first type: supervised learning, and most of what I'll be speaking about today is supervised learning. Does that make sense? It's a super, super simple thing. One way of thinking about it is that these things are a kind of computer programme that fits some kind of function, except you haven't quite written the programme yet, because you have to figure out what these weights are. So the training process is a way of having the machine write its own programme.

Another kind of learning is called reinforcement learning, which is somewhat similar: you're trying to train these weights, but rather than using labelled data where you know the answers in advance, a bunch of curated stuff, you have an agent going through some kind of process in some environment. Every time the agent does well, you say this is a good set of weights, keep something like that; and if it does badly, you penalise it. You keep doing this reward-penalisation procedure until eventually the system learns. That's what AlphaGo Zero and AlphaZero do: they play against themselves, and every time one side wins it says those weights are good, and when it loses it says they're bad. It's very roughly akin to what we do with our children, though it doesn't always work quite as well as we'd like.

The last method, which is also super important but which I won't talk about much today, is unsupervised learning. In supervised learning I might have, you know, two variables that tell me some property of objects, and all the red ones are here and the blue ones are there, and I'm trying to learn some decision boundary: some line that tells me, if my parameters are here it's red, there it's blue. In unsupervised learning I have no idea what the axes mean; I'm looking for patterns in the data. I might see that these points are clustered together and those are clustered together, and so I start picking out features of the data.
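As a concrete illustration, here is a minimal sketch of one of the simplest unsupervised techniques, k-means clustering; the talk doesn't name a particular algorithm, so this choice is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 unlabelled 2-d points forming two blobs; we pretend we don't know that.
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(3.0, 0.5, size=(50, 2))])

# k-means with k = 2: alternately assign each point to its nearest centre,
# then move each centre to the mean of the points assigned to it.
centres = data[[0, 50]].copy()  # initialise the centres at two of the data points
for _ in range(20):
    labels = ((data[:, None, :] - centres) ** 2).sum(-1).argmin(1)
    centres = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(centres)  # ends up near (0, 0) and (3, 3): the two hidden clusters
```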
That's extremely important as well: if you've got a lot of data and you're not quite sure what the right way of thinking about it is, unsupervised learning techniques let you find patterns that you wouldn't otherwise see.

So those are the basics of how these things work. And I think it's absolutely remarkable that a system this simple, like what I just showed you, maybe souped up with more layers or some tricks, but effectively this, can achieve these amazing feats. And so the big question, I think the super interesting question for us today, is: why do deep neural networks work so well? It's completely remarkable. Why do they give us such incredibly good predictions for all kinds of amazing things? You're going to see methods of this type being used to find patterns in string theory. They've been used, obviously, to play games. They've been used in all kinds of image recognition: about six or seven years ago Eric Schmidt from Google said we should stop training radiologists, because they're all going to become irrelevant, because image recognition is going to look at your scans and tell you whether you have cancer or not. And already there are some studies that show these things can work roughly that well; they can do better than the best-trained human radiologists at recognising cancers. The reason we have not stopped training radiologists is, first, that there's a lot more to radiology than just looking at images, and also that these systems turn out to be more brittle than you might expect. A classic example: you train your system on hospital A and get very high accuracy, then you take it to hospital B and the accuracy drops by a fairly large amount, and nobody quite knows why. Those are the kinds of problems we have. And until we solve these fundamental questions about why they work so well in the first place, we won't understand why they have these funny ways of failing in other settings and at other times.

The classic example that's not really well understood is adversarial examples. I can show a computer a picture of a panda, and by tweaking a few of the pixels cleverly, it will be completely confused and categorise it as a gibbon, for example. There are more frightening examples. There's a recent exploit by a group at the Chinese company Tencent, who did this on Tesla cars. Tesla cars, like many self-driving systems in the industry, use some kind of machine learning to recognise, for example, the lane that you're in.
287 00:30:29,830 --> 00:30:38,200 And they were able to put a few small stickers on the lane and confuse the network so that it thought that it was in the wrong in the wrong lane, 288 00:30:38,200 --> 00:30:45,310 it swerved very quickly. Now before you, before you sell your Tesla very quickly, 289 00:30:45,310 --> 00:30:51,070 the way this works is that you get paid a lot of money by Tesla if you find these kinds of errors. 290 00:30:51,070 --> 00:30:55,360 So you there's there's a whole industry of people that try to find these errors in Tesla. 291 00:30:55,360 --> 00:31:02,920 Tesla will then pay them and then tell them, give them a timeline after which the ramp would be public about it or not. 292 00:31:02,920 --> 00:31:07,750 And so this group did that, but it was quite striking because a human would never do that. 293 00:31:07,750 --> 00:31:10,360 So one of the questions is why do these machines? 294 00:31:10,360 --> 00:31:16,120 Why this is sceptical to things like like adversarial examples, and there's many other interesting questions. 295 00:31:16,120 --> 00:31:24,070 And so although although. We know that these things work very well. 296 00:31:24,070 --> 00:31:29,380 There's really a question of why they work so well, so I'm going to unpack that in a very particular way. 297 00:31:29,380 --> 00:31:33,850 So one really fascinating thing that we've known for quite a while about deep neural networks, 298 00:31:33,850 --> 00:31:38,350 neural networks is that there are, in fact, universal function approximations. 299 00:31:38,350 --> 00:31:43,210 So here's the most recent, probably the most recent theorem by Boris huntin, 300 00:31:43,210 --> 00:31:52,060 but there's a long history of theories before that's really exist for any kind of function from some high dimensional space to the real numbers. 301 00:31:52,060 --> 00:31:57,360 So, for example, this would be from a hydrogen space to a real output, which is either one or zero. 302 00:31:57,360 --> 00:32:05,290 It's there exists a fully connected rather network revenue is just a fancy word for one of these activation functions. 303 00:32:05,290 --> 00:32:09,190 And it's important technically, but not that important for you to understand. 304 00:32:09,190 --> 00:32:17,470 It's basically something like that. And with if you have if you have a width of the of the network, 305 00:32:17,470 --> 00:32:27,430 so if the number of nodes in this layer has to simply be less than and plus four inches, four is the size of dimension of your space. 306 00:32:27,430 --> 00:32:28,990 So you were a three dimensional space. 307 00:32:28,990 --> 00:32:36,940 So if I have a seven dimensional one, I should be able to to produce completely every function in of this of this type. 308 00:32:36,940 --> 00:32:44,120 It's a very powerful kind of theorem. And so that tells us that the whole networks are extremely highly expressive. 309 00:32:44,120 --> 00:32:48,830 That means they can they can fit almost anything you throw at them. 310 00:32:48,830 --> 00:32:57,800 Now why is that kind of interesting? Because it gives us a conundrum, OK, if they're so highly expressive that music can fit any function to the data? 311 00:32:57,800 --> 00:33:02,840 Why do they pick the right function? So let me give you the example here, right? 312 00:33:02,840 --> 00:33:07,050 So I've got. Three bits. All right. 313 00:33:07,050 --> 00:33:13,560 That gives me. That gives me two to the end is a number of different bits I have. 
So three bits gives me 2^3 = 8 different possible inputs. And how many different functions are there? A function assigns an output bit to each of the eight inputs, so its output string has length eight, and there are 2^(2^n) = 2^8 = 256 possible functions this network can produce: 256 different ways I can map these bits to outputs. OK. Now suppose I train my neural network on four of these inputs and it gets them all correct. Then there are four inputs left over, so there are 2^4 = 16 possible functions that are all consistent with being correct on the training set. The system can express all 16 of them; so when I train it, how does it know which one to pick? Why does it pick a good one, which is what it typically does? Does that make sense? It's an extremely simple question.

And just to give you a sense of how these numbers grow: the number of functions grows as 2^(2^n), really quickly. If I have seven inputs, it's 2^128, which is about 10^38 functions. With eight inputs it's 10^38 squared, and with nine it's 10^38 squared squared; it grows incredibly fast. I worked out roughly that if I have nine inputs and look at all possible functions, there are about 10^154 of them. That's a lot more than the number of bits of storage we currently think the universe could hold, even if you used everything in the universe to store them. So for a relatively small system, nine yes-or-no questions, the number of possible combinations of answers is more than the whole universe can store. This is what Lighthill was trying to say with the combinatorial explosion: even a relatively small problem becomes unbelievably big and intractable. And so he said you're never going to get beyond toy problems, which seems reasonable given this kind of argument.

Now, fascinatingly, a neural network will be able to reproduce all of that very large number of functions, because of the universal approximation theorem. So why does it pick a good function? This is a really interesting question.
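The counting is easy to check by brute force for n = 3; the particular four training constraints below are arbitrary, chosen just to make the 2^4 = 16 count visible:

```python
from itertools import product

n = 3
inputs = list(product((0, 1), repeat=n))          # 2^3 = 8 possible inputs
functions = list(product((0, 1), repeat=2 ** n))  # each function = one 8-bit output string
print(len(functions))                             # 2^(2^3) = 256

# Pin down the outputs on four of the inputs (an arbitrary 'training set'):
train = {0: 0, 2: 1, 5: 1, 7: 1}                  # input index -> required output

consistent = [f for f in functions
              if all(f[i] == out for i, out in train.items())]
print(len(consistent))                            # 2^4 = 16 functions remain
```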
In the field, this question was sharpened by a very famous paper by Zhang and collaborators, which has been cited thousands of times in the last four years; just extraordinary. And they did the following experiment. They took this image dataset called CIFAR, a Canadian dataset: aeroplanes and automobiles and birds, cats and deer, et cetera. And they simply permuted the labels: rather than labelling this one "aeroplane", they just picked a random label. And what they showed was that a neural network could learn this corrupted data relatively quickly, with not that much more work, and could reproduce the training set with 100 percent accuracy, zero error. That's very striking: it can simply memorise the data. Obviously, if you then give it new correctly-labelled images, it has essentially no predictive accuracy, because it hasn't learnt anything; it has just memorised the links between labels and images. So it can memorise, which we knew theoretically it could; but they showed that it finds that memorising solution really quickly, which is in itself exciting and interesting.

And the question is: why does it generalise well? Why does it give us good solutions? Given that, when handed the correct labels, it could just memorise them and have no predictive power, why does it not do that when given correct labels, whereas it happily does exactly that when given incorrect labels? This is a very big question in the field, and a very interesting one.

And as physicists, of course, we're very nervous about these high-dimensional systems. I told you there are 60 million parameters in AlexNet from 2012; there are now models with billions of parameters, and they work extremely well. So why are we worried? Well, as physicists we're told that you should never have too many parameters, right? There's a very famous story by Freeman Dyson. He had a high-energy physics model with, I think, five parameters in it, and he went to see Fermi. Fermi asked how much data he had, and he didn't have that much data, and Fermi quoted a claim that von Neumann had made to him: with four parameters I can fit an elephant, and with five I can make it wiggle its trunk. In other words, you may fit the data well, but your model is probably not right. And so Dyson went home to Cornell with his tail between his legs. That's an intuition we teach our students all the time: never use too many parameters. Incidentally, a group about ten years ago worked out that von Neumann was indeed right: you can fit an elephant and make it wiggle its trunk with five parameters, and four is not enough. He was just a great genius.
So I'll give you a very simple toy problem that visualises this in a really simple way. Here I have ten data points, and I can fit those ten data points by a polynomial. What we tell our students is: if you've got ten data points, don't fit a high-order polynomial to them. Here, this dashed line is a fifth-order polynomial. If I fit a 20th-order polynomial instead, I get zero error, because I can fit the data extremely well, but it starts to show very odd behaviour which, intuitively, is probably not going to generalise well: if I give it new data points, it's going to make very bad predictions, right? That's what von Neumann and Fermi were trying to tell Dyson.

Now, here's the fascinating thing. We've trained a bunch of neural networks here, with one, two and five hidden layers, and with different widths; it doesn't seem to matter very much. They all fit the data like this, the green curve; they actually lie on top of each other on this scale. So the question is: why do these networks, with thousands of parameters, fit the data so smoothly? To our eye this looks much better. This is a kind of central conundrum in the field: why do they work so well?

In fact, there was a famous spat between Ali Rahimi and Yann LeCun, two important figures in this field, where Rahimi said: you know, machine learning is alchemy. It works, and we have no idea why; we have no idea why it generalises so well. Now, I wouldn't be setting all this up this way if I didn't think I had something to say about it. So in the time that remains I'll explain a new idea from our group, which actually generalises something that's been around for 50 years in the field of algorithmic information theory. That is the study of the complexity of single objects: it's an information theory, but as opposed to Shannon information, which is about distributions, it's about single objects. The central quantity in this theory is Kolmogorov complexity, which is formally defined as the length of the shortest programme that will generate a particular string on a universal Turing machine.

I'll illustrate this with a very simple example. Imagine a monkey typing on a typewriter, or I should say a word processor, for the younger people in the audience who have never seen a typewriter. To make it simple, say it's a binary keyboard, only zeros and ones, and the monkey types on this.
How likely is this monkey to type the following 100-digit-long sequence? Well, the monkey will type it with probability one half to the power 100. Monkeys are not truly random, but let's assume we have a random monkey. (There was actually an experiment at Paignton Zoo where they gave a typewriter to a bunch of monkeys, and what they found is that the monkeys kept tapping the same key many, many times, with a preference for one particular letter, and then they defecated on the keyboard. That was the end of the experiment. So ours is a hypothetical monkey.) OK, so on a binary keyboard the probability of any given 100-digit output is one half to the power 100.

But what if the monkey were typing into some kind of computer programming environment? Then it might accidentally type something like "print 01 fifty times". Let's ignore the exact size of the keyboard for a moment; the point is that the probability of producing the sequence this way scales as one half to the power of the length of that programme, which is much less than 100 characters. (It doesn't quite work in binary arithmetic like this, but I'm phrasing it this way because I'll need the binary picture later.) The point being: if you're typing into some kind of programming environment, outputs with a short description become much more likely. And Kolmogorov showed that this, the shortest description length, is mathematically the right way of describing the complexity of a single object.

The reason this hasn't been applied much is that Kolmogorov complexity is linked to the halting problem for Turing machines, which links to Gödel's undecidability theorem. So there's a fundamental uncomputability: you can never know for sure that you have the shortest programme. But you can approximate it. So what we did is say: let's assume we can approximate it anyway. And then the claim is that the probability that you get a certain output should be something like one half to the power of the length of the shortest programme that gives you that output, with some constants in there which we can ignore; to first order, qualitatively, that's what we're saying. And we're saying this might be a much more general property of input-output maps: feed a map random inputs, and low-complexity outputs should be exponentially more likely.
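Kolmogorov complexity itself is uncomputable, but a standard compressor gives a rough, computable upper bound. Here is a sketch using zlib as a crude stand-in; take this proxy as an illustration of the idea, not necessarily the measure used in the work described:

```python
import random
import zlib

def complexity(s: str) -> int:
    """Compressed length in bytes: a crude, computable upper bound on
    the Kolmogorov complexity of the string s."""
    return len(zlib.compress(s.encode()))

random.seed(0)
regular = "01" * 50                                       # 'print 01 fifty times'
noise = "".join(random.choice("01") for _ in range(100))  # the random monkey

print(complexity(regular), complexity(noise))  # the regular string compresses far better
```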
So the question then is: how can we think of a neural network as an input-output map? And we have to be careful here, because there's the obvious map from data inputs to outputs, but you can also think of a different kind of input-output map, the parameter-function map. The inputs are now the parameters, which I can set arbitrarily, and the outputs are the functions that those parameters pick out, right? So the "inputs" here are not the inputs to the network; they're the parameters that I choose. And if this coding theorem idea is correct, then upon randomly picking parameters I should mostly get functions that have low Kolmogorov complexity.

Now, one of the great things about being in Oxford is that you have unbelievably talented undergraduates. An undergraduate worked for us over a summer programme; he was actually working on reinforcement learning, and then one day he came to me and said: I've been thinking, and I think I've proved something. It turns out that he hadn't quite proved it, but he had almost proved something which we have now proven, which is the following. For a very simple neural network called the perceptron, with only inputs, weights and an output, like a zero-hidden-layer neural network, you can say this about these binary problems: upon randomly sampling the weights, we can't compute the probability of each particular function, but we can prove that the total probability of getting functions with a given number of ones is the same for every such class. So the probability of getting the function with zero ones, which is the all-zeros function, equals the combined probability of all the functions that have four ones. But as you can see, four ones can be permuted in lots of ways: there are many functions with four ones, while there's only one function that is all zeros. So I'm much more likely to get the all-zeros function than any particular function with four ones, because the latter is one of a big class of functions. We proved that exactly.

This is for a very simple system, so we then took a neural network with multiple layers and gave it seven inputs, so 2^7 = 128 possible input strings and 2^128, about 10^38, possible functions, and we randomly sampled the parameters. We repeated this many, many times and asked: how often do you see the same function appear again and again?
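Here is a miniature of that random-sampling experiment, using the simple perceptron (no hidden layer, three inputs) rather than the larger networks of the actual study; the Gaussian sampling distribution is an assumption made for illustration:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
inputs = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], float)

counts = Counter()
samples = 100_000
for _ in range(samples):
    w, b = rng.normal(size=3), rng.normal()       # randomly sampled parameters
    f = tuple((inputs @ w + b > 0).astype(int))   # the Boolean function they define
    counts[f] += 1

# The most frequent functions are the simplest ones (all zeros, all ones,
# copies of a single input...), not 'typical' balanced functions.
for f, c in counts.most_common(4):
    print(f, c / samples)
```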
447 00:45:56,350 --> 00:46:03,160 You'd never get the same function twice in the age of our universe, even with the fastest computers, by random sampling. 448 00:46:03,160 --> 00:46:10,930 What we find here is that some functions appear almost 10 percent of the time, some one percent of the time, some one in a thousand. 449 00:46:10,930 --> 00:46:17,260 That's thirty-five, thirty-six orders of magnitude more likely than you'd expect under the null model. 450 00:46:17,260 --> 00:46:21,770 And so there are these functions that appear with high probability when I randomly pick these parameters. 451 00:46:21,770 --> 00:46:25,780 And the question is, do I know which ones they are? Well, I have this theory here, 452 00:46:25,780 --> 00:46:32,500 this theorem that says the probability of getting a function should scale with two to the minus the complexity of the function. 453 00:46:32,500 --> 00:46:38,110 So I plot the probability on a log scale here versus the complexity of the function. 454 00:46:38,110 --> 00:46:42,040 By complexity I basically mean how compressible the function's output string is: 455 00:46:42,040 --> 00:46:47,410 a function which is zero one zero one zero one repeated would be a simpler function, and all zeros would be simpler still. 456 00:46:47,410 --> 00:46:49,240 So here I've plotted the complexity of the function, 457 00:46:49,240 --> 00:46:57,040 and this line is actually the upper bound that we calculate from theory, and it predicts exactly what you'd expect. 458 00:46:57,040 --> 00:47:03,850 In other words, you're very likely to get these simple functions and very unlikely to get complicated functions. 459 00:47:03,850 --> 00:47:07,240 We've shown this holds much more generally for neural networks. 460 00:47:07,240 --> 00:47:12,940 And so this is exciting, because we show that they have this fundamental bias towards simplicity. 461 00:47:12,940 --> 00:47:17,170 And what's interesting is you can play other games. So I can do training, OK: 462 00:47:17,170 --> 00:47:22,310 I can train a neural network on a complicated function or on a simple function, and then ask myself, how well does it do? 463 00:47:22,310 --> 00:47:26,800 So I'll pick half of my inputs to train on and the other half to test on. 464 00:47:26,800 --> 00:47:31,270 And what you see is that as the target function, the function it's trying to learn, gets more complicated, 465 00:47:31,270 --> 00:47:38,170 my error goes up. So one thing we see is that these networks do well on simple tasks, but poorly on complex tasks. 466 00:47:38,170 --> 00:47:44,410 And this curve at the top here is a random learner, where you just pick a random function that fits the training data. 467 00:47:44,410 --> 00:47:49,180 And the vast majority of functions that you pick randomly are terrible at generalising, as you'd expect. 468 00:47:49,180 --> 00:47:55,160 And so the random learner basically behaves as if it might as well not learn at all. 469 00:47:55,160 --> 00:48:02,390 And what's interesting is, if you train the network on a simple function, it finds functions with similar complexity and low error, 470 00:48:02,390 --> 00:48:06,770 so it tends to bunch around the one that you're trying to find, which is not what the random learner does. 471 00:48:06,770 --> 00:48:10,880 But if I give it a complicated function, it behaves more or less the same as a random learner. 472 00:48:10,880 --> 00:48:15,620 So that's what you'd expect. Networks are not magic, right? They just have a bias towards simple functions.
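A rough sketch of that train-and-test game, using scikit-learn for convenience; the architecture and the two target functions are my own illustrative choices, not the ones from the actual experiments.

    import itertools
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = np.array(list(itertools.product([0.0, 1.0], repeat=7)))
    idx = rng.permutation(len(X))
    train, test = idx[:64], idx[64:]   # half the inputs to train, half to test

    targets = {
        "simple target (copy bit 0)": X[:, 0].astype(int),      # low complexity
        "complex target (random truth table)": rng.integers(0, 2, len(X)),
    }

    for name, y in targets.items():
        net = MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=5000, random_state=0)
        net.fit(X[train], y[train])
        # Expectation: near-zero test error on the simple target, but roughly
        # coin-flip test error on the random one, just like a random learner.
        print(name, "test error:", round(1 - net.score(X[test], y[test]), 3))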
473 00:48:15,620 --> 00:48:19,850 So if I give them simple functions to learn, they should do well. If I give them a complicated function, 474 00:48:19,850 --> 00:48:23,450 they're not going to do any better than any other kind of random guessing. 475 00:48:23,450 --> 00:48:29,210 So you might think that people would find this very exciting, if they believed this is the reason why these things generalise so well: 476 00:48:29,210 --> 00:48:35,460 they generalise well because of this bias towards simple functions, and simple functions give you good generalisation. 477 00:48:35,460 --> 00:48:39,150 But what we find, and I think for good reasons, 478 00:48:39,150 --> 00:48:42,720 is that people are sceptical, and the reason they're sceptical is because you don't train a neural 479 00:48:42,720 --> 00:48:46,530 network by randomly picking parameters; that would be really silly and very slow. 480 00:48:46,530 --> 00:48:48,990 Instead, what you do is you define a loss function: 481 00:48:48,990 --> 00:48:56,640 you calculate how far your function is from the function that you want, and then you use a gradient descent method, 482 00:48:56,640 --> 00:49:02,940 stochastic gradient descent, which drives you down your loss landscape to a minimum. 483 00:49:02,940 --> 00:49:09,400 And that's an optimisation algorithm which is extremely effective and efficient. 484 00:49:09,400 --> 00:49:14,000 And the kind of consensus in the field, or the most common argument in the field, 485 00:49:14,000 --> 00:49:18,760 would be that there's something special about SGD that gives you good solutions. 486 00:49:18,760 --> 00:49:25,060 So people push back on this: just because the a priori probability of a function is high, 487 00:49:25,060 --> 00:49:31,760 that doesn't mean that you're going to find it by going down gradient descent. Which I think is a good argument. 488 00:49:31,760 --> 00:49:36,820 However, we have a kind of qualitative argument that says: in this space, you've got lots of different possible functions 489 00:49:36,820 --> 00:49:42,790 you could find that all fit the data. If a function has a large basin, so there are many different ways you can find it, 490 00:49:42,790 --> 00:49:48,770 you'd expect that, starting from high loss, you're still going to be likely to fall into that basin and go down into it. 491 00:49:48,770 --> 00:49:52,120 Right. So here are some functions with large basin sizes; you're probably going to go down into one of these, 492 00:49:52,120 --> 00:49:58,120 whereas this one has a very small a priori probability, and so its basin is going to be small. 493 00:49:58,120 --> 00:50:06,190 So Chris, this undergraduate, has done a second piece of work, together with a bunch of other undergraduates, actually. 494 00:50:06,190 --> 00:50:11,260 So I've got three undergraduates working on this project, with myself and Guillermo and a few others. 495 00:50:11,260 --> 00:50:18,850 He trained on this image dataset called MNIST, and he worked out how likely SGD was to find a particular function. 496 00:50:18,850 --> 00:50:25,660 So these are the functions with one error out of 100, so a one percent error; there's a function with two percent error; 497 00:50:25,660 --> 00:50:27,610 and this is a function with zero error. 498 00:50:27,610 --> 00:50:35,440 And what we found is that the probability that a function is found a priori is very close to the probability that it is found by stochastic 499 00:50:35,440 --> 00:50:41,290 gradient descent.
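A toy version of that comparison, scaled far down from the actual MNIST experiment; the target, the architecture and the number of restarts are all assumptions for illustration. The idea is to retrain from many random initialisations, record which function SGD lands on each time, and compare those frequencies with the a priori sampling frequencies.

    import itertools
    from collections import Counter
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X = np.array(list(itertools.product([0.0, 1.0], repeat=7)))
    y = X[:, 0].astype(int)                         # a simple target function
    idx = np.random.default_rng(0).permutation(len(X))
    train, test = idx[:64], idx[64:]

    sgd_counts = Counter()
    for seed in range(200):
        # Fresh random initialisation each run; the function SGD finds is
        # identified by its truth table on the held-out test inputs.
        net = MLPClassifier(hidden_layer_sizes=(40,), solver="sgd",
                            max_iter=3000, random_state=seed)
        net.fit(X[train], y[train])
        sgd_counts["".join(map(str, net.predict(X[test])))] += 1

    # The claim under test: these SGD frequencies track the a priori
    # probabilities obtained by randomly sampling parameters (earlier sketch).
    print(sgd_counts.most_common(3))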
And in fact, that means SGD occasionally finds poor functions, with a probability that we can predict more or less. 500 00:50:41,290 --> 00:50:47,500 And so this is exciting, because we think we can predict, more or less, the probability with which these functions are found. 501 00:50:47,500 --> 00:50:51,430 So this is the first time I've presented this data; it came in last week. 502 00:50:51,430 --> 00:50:56,230 OK, so it's very hot off the press, but we're excited by it. We actually have a bunch of other examples of this, 503 00:50:56,230 --> 00:50:59,290 and we think that this may therefore be the explanation. 504 00:50:59,290 --> 00:51:05,770 So this a priori bias seems to be, qualitatively at least, what SGD is tracking upon training. 505 00:51:05,770 --> 00:51:09,940 If that's true, then this explains why neural networks work so well. 506 00:51:09,940 --> 00:51:12,670 Or it doesn't explain everything, but it explains part of it: they are biased 507 00:51:12,670 --> 00:51:16,870 towards simple functions, in the same way that Occam's razor is biased towards simplicity. 508 00:51:16,870 --> 00:51:22,580 You know, you learn very early in your degree that you should be biased towards simple solutions. 509 00:51:22,580 --> 00:51:27,980 And that raises the deep question, which is: why are the things that we look at in nature 510 00:51:27,980 --> 00:51:33,140 simple? In physics we're used to looking at things that are simple, and so we intuitively think that must be true. 511 00:51:33,140 --> 00:51:34,670 But why would that be true in general? 512 00:51:34,670 --> 00:51:44,030 Well, for this dataset, actually: it's a grid of twenty-eight by twenty-eight pixels, so a 784-dimensional space. 513 00:51:44,030 --> 00:51:51,620 But you can work out the effective dimension of the data by looking at how things scale with size, for example how the number of neighbouring images grows with distance, and it seems to be about 14 or 15 dimensional (a sketch of one such estimator follows at the end). 514 00:51:51,620 --> 00:51:55,430 So these images live on a very low-dimensional manifold in a very high-dimensional space. 515 00:51:55,430 --> 00:51:59,930 They're therefore simple. So a bias for simplicity is good for this kind of image recognition. 516 00:51:59,930 --> 00:52:06,650 But more generally, I think it's a wide-open question about the nature of the universe, the nature of the kinds of things we're studying. 517 00:52:06,650 --> 00:52:13,730 Why are they so complicated, or why are they simple? And we're claiming that there is a bias for simplicity, which is good. 518 00:52:13,730 --> 00:52:18,290 And if it's good, it should explain exactly why you see what you see. 519 00:52:18,290 --> 00:52:23,570 So, conclusions. I think machine learning is transforming physics; it's not just hype. 520 00:52:23,570 --> 00:52:27,710 I can't make that point fully myself, but it will be made later today, and I think it's very important; 521 00:52:27,710 --> 00:52:33,620 it's the main message, hopefully, of today. And then what I've tried to argue to you is that deep learning may work, 522 00:52:33,620 --> 00:52:39,170 that deep learning techniques work, because they have a natural bias towards simple functions, a kind of inbuilt Occam's razor. 523 00:52:39,170 --> 00:52:45,376 And with that, I thank you very much for your attention.
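The dimensionality estimate mentioned above can be sketched with one standard scaling estimator, the two-nearest-neighbour method of Facco et al. (2017); whether this is the estimator used by the speaker is an assumption, and for a self-contained example it is run here on scikit-learn's small 8x8 digits set rather than MNIST itself.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neighbors import NearestNeighbors

    X = load_digits().data                       # 1797 images, 64 pixels each
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    ok = dists[:, 1] > 0                         # drop exact duplicate images
    mu = dists[ok, 2] / dists[ok, 1]             # 2nd- to 1st-neighbour distance ratio
    # If the data lie on a d-dimensional manifold, mu follows a Pareto law
    # with exponent d; the maximum-likelihood estimate of d is:
    print("intrinsic dimension ≈", ok.sum() / np.log(mu).sum())

The estimate comes out far below the 64 ambient pixel dimensions, which is the same qualitative point made for MNIST: the images occupy a low-dimensional manifold in a high-dimensional space.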