Okay. Welcome, everybody, to the Strachey Lecture. I'd like to start by expressing our gratitude to Oxford Asset Management, which has very generously supported the lecture, so I want to thank them for that. I also want to start with one announcement, which is that there will be refreshments after the lecture just outside, so everybody is invited; please do join us. And then the most pleasant task is introducing Les Valiant, who is going to be speaking to us about whether one can define intelligence as a computational problem.

Les is one of those people who, as many of you know, invents one research field after another, so when I thought about which things to tell you, I just picked a small selection of them. Les did a lot of work in algebraic complexity theory, with the complexity classes VP and VNP, which is still a very active area; the V stands for Valiant. He then moved on and started the complexity of counting and the class #P, and there's a whole community that works on that. Then he went on and founded computational learning theory; the book Probably Approximately Correct was really influential, and it was basically the first rigorous study of what can be learned. After that, as if that were not enough, Les became interested in classically simulating quantum computation, and he discovered holographic algorithms and that whole area. He then went on to write his book Circuits of the Mind, which is a computational approach to studying the human brain. And I think many of these threads will be brought together today.

It's traditional in these talks to mention prizes, but Les has actually won pretty much every prize, so I picked out just four so that we can get on to the talk. He won the Nevanlinna Prize in 1986, became an FRS in 1991, won the Knuth Prize in 1997, and the Turing Award in 2010. And I looked Les up on the Mathematics Genealogy Project and found 109 descendants, but I'd like to point out that 13 of them are right here in our department. So we all owe quite a lot to Les, not just for his intellectual stimulation, but also for a lot of mentoring. Oh, and I forgot: please fill out the questionnaire. And so I'm delighted to introduce Les Valiant.

Well, thank you very much, Leslie, for the very kind introduction, and thank you very much for inviting me here. About 30 years ago I spent a very happy sabbatical year in Oxford; I was treated very well, so I have very happy memories of Oxford and I'm very glad to be back.

So what I'm talking about is a kind of theoretical approach to AI.
In brief summary, it's a way of reconciling machine learning and reasoning. It's a topic which is close to my heart and has been for a long time. But in giving talks, often the hardest thing to understand is why someone is doing this kind of thing at all, so I'll be slightly self-indulgent and try to explain the motivation of this kind of approach.

First, I want to discuss this notion of a computational phenomenon, which not many people discuss. You know, algorithms have been around for a long time: Euclid had a very good algorithm by which, given two numbers, you can find their greatest common divisor efficiently. Whereas if I give you one number and you want a divisor of it — which is factoring it — as far as we know that takes exponential time. So for two numbers you can find a common factor efficiently, but for one number you apparently cannot. So there's something very striking already in what people knew about algorithms a couple of thousand years ago, and many of the best algorithms we know are ancient. So what is computer science contributing in general?

Well, of course, the big change was Turing's paper in 1936, so I'll start by trying to spell out my view of what the big event was. I will discuss this notion of a computational phenomenon: for Turing, the phenomenon was computation itself, and the model of computation, which for him was the Turing machine. And the best way of explaining these ideas is by making an analogy with physics. I'm not trying to say that computer science and physics are the same, but analogies do serve some purpose.

So what do you have in physics? You have some laws, like F = ma and the law of gravitation. You've got some laws which are believed to hold generally, but they are really supported by mathematical theorems which are consequences — deductions from the laws — with which you can really understand the incredible breadth of what the law means. Okay, so we learned about this a long time ago. As a computer scientist, I have sometimes wondered what we are offering that is comparable to what the physicists have been doing. And on reflection, I think what we are doing is what Turing did: what corresponds to a law in physics is a model of computation. He defined a model, the Turing machine, and the general claim was that it captures computation in the real world in a very significant and general sense. This is a big statement, but it's supported by mathematical consequences, exactly as the physicists do it.
So, for example, an important consequence is that there is a universal Turing machine; with this notion he could discuss the non-computable problems; and another very important thing about models of computation is that they are robust — if you make small changes, it shouldn't change the power. So I think this is the main thing computer science offers, and what the rest of us have since been trying to emulate in other ways.

The idea here, at an even more general level, is this: what Newton implied is that you can capture the laws of physics by equations, and I'm claiming the general statement that there are phenomena in computation, and you should capture them by models of computation. Turing's example was that the phenomenon was computation in general and the model was Turing machines, but much of what we've been doing since, in the algorithms area, has been along the same tracks.

So — as a side comment — I'm describing an analogy with physics, and I should point out that there are other analogies other people use. Some people use the analogy that unproved mathematical conjectures, like P not equal to NP, should be treated like physical laws: things people believe but can't prove — let's believe them until someone disproves them. That's fine; I've got no quarrel with that. But I'm really drawing a different analogy, which I try to expand on in a bit more detail in the book than I can here. The physical laws are true but not provable; the analogue is a model of computation, together with the claim that the model of computation is valid for a real phenomenon.

Okay. So, for example, another phenomenon is search, and in fact the best description of this in words is the phrase "mental search", which Turing used already in 1948. The idea is that you're searching for oil in the ground, you're searching for something in your head, or you're searching for factors of a number. And the definition is called NP, where you're searching for solutions which are short compared to the input size, and also, given a solution, you can easily verify it. So this is a formalisation of search, and NP is a rigorous, formal statement of it. We believe that NP is a real phenomenon in computation, which lots of people find useful, and the model of computation is the non-deterministic Turing machine.
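For concreteness, this is the standard way the search class just described is usually written down; the notation is mine rather than from the slides, but it matches the two conditions mentioned — short solutions and easy verification:

```latex
L \in \mathrm{NP}
\iff
\exists\ \text{polynomial-time verifier } V \text{ and polynomial } p \text{ such that}
\quad
x \in L \;\iff\; \exists\, y,\ |y| \le p(|x|),\ V(x,y)=1 .
```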
And again, the interest of this definition by itself is not what is impressive; the interest is in the powerful mathematical statements that go with it, and an important one is that the hardest search problems are the NP-complete problems. So there's a model of computation and some stunning, surprising mathematical statements which make it worthwhile. So that's one.

Okay. Now, by P I mean roughly what we're sure we can compute efficiently in this universe, in polynomial time — I should really include randomisation, so it's a stand-in for that. The computable captures everything — that is Turing's class, the most general — and NP is like a subclass. Another subclass is #P, which some people call "number P", which is the counting version; and again the same story applies.

Yet another one is BQP, which is quantum polynomial time. This is our best effort at describing what you would get if you used quantum theory for computation. And with each of these classes, I think it's the same story: there's a model of computation — this one is bounded-error quantum polynomial time — and, besides the suggestion that we should use quantum theory to compute, there are some mathematical consequences. For example, a very powerful theorem is that there are many ways you could try to use quantum mechanics to compute, and it turns out that they are all equivalent. That's a strong result; by looking at this model of computation you do arrive at the conclusion that there's a real computational phenomenon there. And another result is that BQP is in fact reducible to #P, so the counting problems are at least as powerful as the quantum class.

So we've got these various classes with different power, and there are more. Another phenomenon captures the idea of games — this is your kind of game theory — and again a powerful result is that the complete problems are the hardest members of the game class. Of course, mathematically we don't know: all these classes could collapse for all we know. It may be that everything up to #P can be done in polynomial time, even efficiently. But even if that happens to be the case, one can still discuss these as phenomena, I think — and there's a more extended discussion of that elsewhere.
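A compact way to summarise the relationships mentioned so far is via the standard (unproved-to-be-strict) inclusions below; identifying the "games" class with PSPACE is my gloss, since that is the class usually associated with game problems:

```latex
\mathrm{P} \;\subseteq\; \mathrm{NP} \;\subseteq\; \mathrm{P}^{\#\mathrm{P}} \;\subseteq\; \mathrm{PSPACE},
\qquad
\mathrm{P} \;\subseteq\; \mathrm{BQP} \;\subseteq\; \mathrm{P}^{\#\mathrm{P}} \;\subseteq\; \mathrm{PSPACE}.
```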
So it's with this background, I think, that I have approached these topics. If one looks at machine learning — well, I've tried to formalise the notion of supervised learning. We call that PAC learning, probably approximately correct learning, which is believed to be a proper subclass of P. Roughly, that means there are many things you could write a program for — a program that would fit into your computer, or into the universe — but the belief is that most of these you cannot learn from examples. So learning is harder than just computing. Exactly where the learnable sits within P is — well, there is some important structure there. For example, one observation is that cryptography lives off the difference: public-key cryptography wouldn't exist if any function could easily be learned from examples, because then you could learn all the secrets. So negative results about complexity are used every day, especially by cryptographers.

Okay. So with PAC learning, again, you have a model of computation. This model, which I'll describe in a bit more detail, captures the notion of supervised learning, which is a well-known concept and widely practised, of course. And again, once you have a formalisation, questions of robustness are obviously important: if I define this class in different ways, do I get different classes? It's important that it is robust — many variants give you the same class. And some consequences are, for example, that you can give a rigorous demonstration that a learning algorithm really does generalise. Generalisation used to be a philosophical issue not so many decades ago; now, of course, it is practised by machines. And you can also explain why some algorithms predict well, in a certain sense — there's nothing magical about them.
Okay. So I want to describe this a little, because I'll build on it later. It's a formalisation of supervised learning. There are these terms, supervised learning and unsupervised learning, which I use in very general senses, and one reason for formalising them is that it at least defines what we're discussing. But generally we're talking about learning where there's some feedback. Supervised learning doesn't mean that there's necessarily a supervisor.

For example, I can look around the room and learn something about the average audience at a computer science lecture in Oxford — the average age, or something like that. There's no supervisor telling me things; I'm doing this because, from other knowledge, I can label people myself — I know roughly how old everyone is. So I can learn without an external label. So supervised learning doesn't mean there has to be a supervisor; essentially, if there's any kind of feedback, it's supervised learning. Unsupervised learning is where there's truly no feedback: you see some pattern and somehow you're supposed to draw some conclusion. But certainly, I think the impact of machine learning recently has all come from the feedback kind — the supervised kind.

Okay, so what's the formalisation? The idea is that there's some space of examples — an example might be a flower, and you're trying to classify flowers by which species they come from, so they have types A and B. There's a ground truth F which separates the A's from the B's. And the learner has a hypothesis which also classifies examples. In any world rich enough to be worth talking about, there will be errors. And we do assume it's a very rich world: exponentially many different kinds of examples, maybe infinitely many. We want to talk about something realistic.

So what is this supervised learning phenomenon? It seems amazing; it works; people celebrate it, even in the popular press. The formalisation has three points. The first is an efficiency criterion. It says that there will always be errors, but the more examples you take and the more computation you apply, the more you should be able to reduce your error, fairly fast. It should be rewarding to put more effort into learning: if you double the amount of effort or the number of examples, you should see a decrease in the error. And this is something quantified, and the important thing is that the error goes down algebraically. So if you have n examples, the error may go down like one over the square root of n, or maybe one over the tenth root of n, but it shouldn't be slower than that.
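As a toy illustration of this criterion, the sketch below generates a synthetic error curve of the form err(n) = c·n^(-α), recovers the exponent by a straight-line fit on a log-log scale, and computes how much extra data halving the error would cost. The particular numbers and the exponent are invented purely for illustration, and numpy is assumed to be available.

```python
import numpy as np

# Hypothetical scaling data: error falling as a power of the number of
# examples, err(n) = c * n**(-alpha).  All numbers here are synthetic.
n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
err = 0.5 * n ** (-0.25)            # an invented exponent of 0.25

# On a log-log scale a power law is a straight line, so fit its slope.
slope, _ = np.polyfit(np.log(n), np.log(err), 1)
alpha = -slope
print(f"estimated exponent alpha = {alpha:.2f}")

# Consequence of the criterion: halving the error costs a *fixed*
# multiplicative factor in data, namely 2**(1/alpha).
print(f"data factor needed to halve the error: {2 ** (1 / alpha):.1f}x")
```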
And of course this gain has actually been realised: basically, over the last ten years people have increased the budget in data and computation by factors of maybe thousands, and this has brought really good rewards. And for some simple learning algorithms you can prove that the thing learns, and that it learns this fast.

In pictures, the quantitative aspect is that as you put in more effort, the error goes down as a power of the effort — it may be one over n to the half, say. So if you want to reduce the error by a factor of two, you have to put in some fixed factor of effort, like 100, or 4. And some people have actually verified this experimentally: for tasks such as predicting the next word, various deep learning algorithms do show this polynomial decrease in error. The plots are on a log-log scale, which straightens the curve into a straight line. Also, PAC learning doesn't tell you what the power law should be, and in fact there's evidence that different applications have different power laws — a natural language data set and a vision data set have different exponents. The one shown here is a very slow power — a fixed power, but only about 0.06 — yet good enough. Okay, so that's the efficiency criterion.

The two other aspects are these. One is that we want to be realistic: the world is complicated, so the last thing we want is to assume we know the probabilities. We know that something is an A or something is a B, but in different worlds there may be different probabilities of each kind; if you go from here to China, maybe you find the same flowers, but with different probabilities. So the second requirement is that the learning algorithm works for arbitrary distributions. The secret, of course, is that you learn on a distribution and you have to perform on that same distribution: here you'll be tested on the flowers which are common here; in China they'll test you on something different. And this basically says that, in practice, the successful learning algorithms are very broad-spectrum: they don't just work for the uniform distribution.
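Putting the pieces together, the criterion sketched so far is usually written as follows. The notation is mine, but it is the standard probably-approximately-correct statement: for any distribution D and any target f in the class, from polynomially many examples the learner must, with high probability, output a hypothesis h with small error on that same distribution:

```latex
\Pr_{x \sim D}\bigl[\,h(x) \neq f(x)\,\bigr] \;\le\; \varepsilon
\quad\text{with probability at least } 1-\delta,
\qquad
\text{examples, time} \;=\; \mathrm{poly}\!\left(\tfrac{1}{\varepsilon},\, \tfrac{1}{\delta},\, n\right).
```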
And then the third point, which is a bit more subtle, is this. When you're learning, the learning algorithm gives you a hypothesis, and there is a computational representation of everything on the learner's side; but the thing being learned — the teacher — is just a function. You don't look inside the teacher; it's just a behaviour. In practice the learning algorithm is something you have in your hand — maybe a perceptron, a deep network, some sort of boosting — but then the examples come, and no one guarantees where they come from. You've got no chance of learning everything your representation can represent, and yet you're still successful, and the reason usually is that the examples come from a weaker world: there's something simple about the world you're learning from. So the mystery of why certain heuristics work so well in practice is often that the tasks they're given have some simplicity in them, which is often very hard to identify. Anyway, this is a specification of a formal model of supervised learning. That was by way of introduction.

Okay. So I've got this model of inductive learning, and we know that machine learning, which does roughly this kind of thing, is very successful. But the question is whether this type of problem captures all there is to intelligence — is this all of intelligence? Everyone, or almost everyone, would agree that the answer is no. So what more is there? What we want to do, if we follow this approach, is again to find a model of computation. PAC learning is a model of learning, but it's not enough, because learning alone we don't think is enough. So what should we add? And I do say "add", because I think inductive learning is a pretty powerful phenomenon and we need to add to it rather than start from scratch.

So what do we need to capture? The adage which I've been carrying around for a long time, and using as advertising, is this line from Aristotle, who said that all belief comes from syllogism or induction. By which he means something like: if you have a belief in your head, then either you deduced it — by syllogism, some sort of logical deduction, from something else you knew — or else it came by induction, which means that somehow, from basic empirical evidence, you generalised.
And, of course, he spent 99% of his effort on syllogism and didn't say much about induction. What's happened since, of course, is that syllogism became this big field of mathematical logic and formalised reasoning, while induction became a rather mysterious philosophical field. But I think the issues have been clarified by machine learning and by machine learning theory.

As an example: when I started, the question was, how come children who have seen different examples of chairs, in different parts of the world, nevertheless agree, even on a new chair, about what's a chair and what's not? That was kind of a mystery; there wasn't a good answer to it. But now machines can do this routinely, so asking this question won't mystify anyone living now. And the reason is that machine learning theory gives an answer to what it means to achieve this: you only have to perform well on the distribution you've seen, and it's probabilistic anyway. So we do have a handle on this.

Before I go on, I should say that there are some technological aims here. What I'll be discussing is how one would want to have a unified view of reasoning and of learning, because at the moment they're very different: classical reasoning — classical logic — is a very brittle kind of mathematical theory, whereas machine learning is a robust thing of a different kind. So we do want to unify them. And the grand goal, if you can do that as a foundational technology, is to approach what I believe is the central problem of AI: how you put into a computer knowledge which at the moment is very hard to acquire — common-sense knowledge — and how you enable the computer to use it to reason, to make predictions, deductions, whatever. I can't imagine how you can do the second unless you take some unified view of what reasoning and learning are; if they're two disparate things, it seems a bit difficult.

Now, in modern terms, I suppose there's a debate, and I'll basically be saying that reasoning and learning are both important and we have to reconcile them. Not everyone agrees. For example, at the moment there are some people who are so enthusiastic about machine learning that they think a single black-box machine learning system will do everything and we won't need reasoning. Okay, so that's one view.
And other people may put reasoning high on the pedestal. Putting it more simply, the question is: are there people who actually deny that reasoning is real, or who deny that learning is real? Certainly, I think 30 or 40 years ago there were real learning-deniers — people who thought that intelligence was all about putting in facts and reasoning efficiently with them. They were certainly learning-deniers then, and now there are some reasoning-deniers around. But in this talk I'll take a middle ground.

Okay, so let's try this one: did Aristotle have a cell phone? Most people can answer this question without too much effort. But the question is: did you use pure learning for this, or pure reasoning, or something else? The main contrast I want to draw is that at the moment one has to argue a bit against people who want to do everything with a single black-box machine learning system. The idea there is that if you feed this black box a billion sentences from the web, maybe you can answer every question and the reasoning will go away. But I think common-sense introspection suggests that to answer this question, it's not that we've been exposed to thousands of sentences about Aristotle's possessions; rather, we somehow knew some facts and we chained facts together — so some reasoning was involved. That's introspection, though. Can we ask the same question, of learning versus reasoning, a bit more scientifically?

So we want some experiment to do which tests this kind of issue in a plausible way. And the problem which came to us, which I think is very natural for this, is called the word completion problem. Essentially, I take a phrase from a website or a newspaper — usually a headline — and I delete a word, and you have to guess what the missing word is. This is quite a good test — quite a good IQ test, because it's quite hard to do. These headlines are often quite succinct: just the minimum number of words needed to express what you want to say. And of course it matters where we took these headlines from — maybe from a world where you have no knowledge. The examples I have happen to be from an English-language Chinese newspaper.

Okay, so let's have some examples. This one was: "Whatever the year of the dog holds in store, pet owners will be lavishing more attention than ever on their ___."
So you have to guess what the missing word is, and the question is: could your computer program do it? Anyone? Okay. The answer was "pooches", which is, I believe, an early-twentieth-century American word for a dog. And so the question is: is this a hard problem, say, for a black-box machine learning algorithm? My guess is that this one is easy, because if you search Google for sentences with "pet owners" and "their pooches" in them, you get tens of thousands. So this is an easy problem for black-box machine learning.

Okay, another one: "China rises as a maritime powerhouse after snapping up profitable ___ ___ across the world." Fragrance? Fragrance trade? Yeah — okay, so that's the answer: the two words were "seaport terminals". Good. I reckon this one is slightly harder: you probably have to do some reasoning; you can't just do it by some sort of word association, because there aren't many examples of that phrase around.

The hardest examples for this kind of inductive learning are where there's some kind of news — where, to understand the headline, you have to know what happened yesterday. For example: "___ retail sales up 20.7% in second quarter." The answer here is "Macau". So maybe if you're an expert on conditions in the different parts of China you could do it, but you need lots of knowledge, and maybe recent news, that kind of stuff. And certainly, if things depend on this morning's news, then having a billion sentences in your brain doesn't help you.

Okay. So the interesting thing here is that with this problem you can test your machine learning system on how well it solves it. And I think this problem is not bad — it's a kind of stand-in for the Turing test in certain ways. The Turing test has many aspects, but one aspect is that it measures something: how well you perform compared to something else. And the other important thing about the Turing test is that Turing didn't say that intelligence depends on how well you play chess or how well you know chemistry; it depends on general knowledge of general stuff. This missing-word test is good on that count. And what the learning theory perspective adds is that it emphasises that any kind of performance in a system like this is with respect to a particular distribution.
It's hard to be intelligent if you go somewhere where your knowledge is irrelevant. It also emphasises feasible computation — we're interested in efficient computation, not infeasible computation — and in controlling the error of your prediction, and things like that. Okay, so we'll come back to this problem. I'm suggesting that if you tackle this problem of common-sense knowledge and learning and reasoning, or whatever, this isn't a bad problem to test your system on, because there's a ground truth, and it's about general knowledge.

So what I'm really coming to is my main content, which is my suggestion for having a model of computation which can do both: inductive learning, which I think is an important phenomenon, with reasoning added on to it. And with this combined system, if you do it well — which we haven't yet — you can test it on a problem like this word completion problem, the missing-word problem.

So the question is, how do we add reasoning? What is intelligent thinking? What else do we do besides inductive learning? And how do we make this into a model of computation? Now, these models get kind of complicated — a Turing machine is quite complicated; this is much more complicated — and the justification is that you're capturing something important, maybe, and that other ways of capturing it would boil down to the same thing. Anyway, what follows is more a list of the things you need to capture.

The first feature is this idea, borrowed from cognitive science, of a working memory. There's this amazing thing about our cognition: while we have an enormous store of memories, at each instant we've somehow got this small mind's eye which directs our behaviour — this little world in front of us, what we're aware of — and we use this awareness to plan our lives, what we do next, what we do after the lecture. All our behaviour is channelled through this small window. So what's going on here? The explanation here will be that we need to restrict the window for complexity reasons — for computational complexity reasons — and, as a model, we need to use it to get anywhere. Okay, so roughly, this is how we formulate it.
You wake up in the morning and your mind's eye is blank, but it's got room for two or three tokens, and you fill it up during the day with what you're thinking about. So you fill it up with a scene: you think of your dog, and then you want to feed your dog, so you want to know what the dog likes. And you have a rule in your head which tells you that, in fact, dogs like bones. So somehow, with your background knowledge from your big long-term memory, you can fill up your mind's eye with the missing information. That's roughly what goes on.

But here we come to the first difference between logic and learning. The point is that an implication like that doesn't fit well with PAC learning, or any kind of learning, because when you do machine learning you've got some target function — say you're learning to recognise an elephant — and what you're learning is effectively a criterion for a picture to contain a representation of an elephant. So what you are definitely learning, if you do supervised learning, is an equivalence. You have maybe a big neural network, a perceptron, or a decision tree, and you want to predict whether what's in front of you is an A or not. On the left-hand side is some very rich, incredibly complicated rule — tens of thousands of bits, hard to interpret, possibly — but it does contain a useful criterion for whether what's in front of you is, say, a bone or not. It's a predictor.

So the first step of the proposal is that we are going to learn these things: maybe for each word in the dictionary, you learn a predictor for it in terms of the other words. And this predictor can be whatever your machine learning algorithm can produce — it depends on your computational resources. That's what learned rules are: they are equivalences. But the point is that you can chain these equivalences together to make predictions. Using something like this, if the conditions in your scene predict that this thing is a bone, then it's a very good bet to predict that it is a bone; and once you've predicted it's a bone, you can make further predictions about your scene using these equivalences. So that's basically the idea of how your mind works: your mind is full of not one black-box neural net, but tens of thousands of them, and somehow the predictions of these tens of thousands can be used together in a principled way — that's the rough summary.
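Here is a toy sketch of that chaining idea. The scene encoding, the two hand-written "rules", and the particular facts are all invented stand-ins: in the real proposal each rule would itself be a learned predictor (a perceptron, a network, or whatever), not a hard-coded function.

```python
# Toy sketch of rule chaining over a small working memory ("mind's eye").
# Each rule plays the role of a learned predictor for one concept,
# evaluated only on what is currently in the scene; here the "learned"
# predictors are hand-written stand-ins.

scene = {"dog": True, "cat": False}   # a few tokens; everything else unspecified

def predict_likes_bone(s):
    # Stand-in for a learned equivalence: the scene suggests "likes_bone"
    # whenever a dog is present and nothing contradicts it.
    return s.get("dog", False)

def predict_go_to_shop(s):
    # A second rule, which can fire only after the first has filled the scene in.
    return s.get("likes_bone", False)

rules = {"likes_bone": predict_likes_bone, "go_to_shop": predict_go_to_shop}

# Chaining: repeatedly apply every rule to the current scene, adding any
# newly predicted attributes, until nothing changes.
changed = True
while changed:
    changed = False
    for concept, rule in rules.items():
        if concept not in scene and rule(scene):
            scene[concept] = True
            changed = True

print(scene)   # {'dog': True, 'cat': False, 'likes_bone': True, 'go_to_shop': True}
```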
Okay, so that's the first step. So this "robust logic" is the system — the model of computation. The first aspect of it is that we are going to learn rules which are equivalences, and which predict maybe every concept in the dictionary. That's aspect one.

Aspect two: we will have quantifiers. We have "for all" and "exists" and so on, a bit as in logic, but now they mean something much more grounded than they do in conventional logic. In logic you learn things like "all men are mortal" — but then what does that mean? Has someone checked that claim throughout the universe? Probably not. So those quantifiers are almost a bit embarrassing. In this logic, the quantifiers refer only to the mind's eye: given what you're thinking about, does something exist there in your mind's eye, or is something true for everything in your mind's eye? So it's a very local thing. And I suppose I should point out already that, somehow, predicate calculus logic hasn't worked out too well for AI, and there's almost something embarrassing about it. Certainly this rule is simplistic — obviously there are many things your dog likes, not just bones — so there's something very simplistic and brittle about such logical expressions. But the idea of predicting what a bone is, or plausible variants of that, is not mysterious at all; that's what machine learning technology does for you. Okay, so we have some quantifiers, and so on.

The third thing is: what about consistency? As I said, we're going to learn a lot of these rules. What if, when you chain them together, you get inconsistencies? Logic is certainly hung up on inconsistency. Here we say: don't worry. Learn all these rules and just live with the inconsistencies, and if the inconsistencies are important, you somehow learn your way out of them. The 1960s or seventies version of this problem — no longer much used — is what's called the Nixon triangle. You learn that Quakers are pacifists — that's a rule — and that Republicans are not pacifists. So these are two rules you go around with.
And then there's the example of someone called Richard Nixon, who was both a Quaker and not a pacifist. So then what do you do? Here the answer is: don't worry about it. Go around with your general rules, and if this counterexample becomes worrying enough, then your learning algorithm will learn that all Quakers except Richard Nixon are pacifists. So you learn your way out of an inconsistency if it's important, but you've got no chance of maintaining consistency in a complicated world. That part is easy.

So rules will be learned. Instead of learning to recognise elephants, we're going to learn rules, and these rules will predict inside the mind's eye, in this probably approximately correct sense. And we will look for rules which are highly reliable. So we simply learn rules.

Aspect four — a more subtle issue, actually, about which there's a lot of discussion — is this distribution business. Here we do get rather strange philosophical problems. As I said, the question of how come we agree about chairs, although we've seen different examples, has some history, but in the end it's not so mysterious: we can believe there's one distribution here. But then it gets more mysterious. We've learned that Aristotle lived a long time ago, and so on — so how do we use that to conclude that he didn't have a cell phone? When we learn these general facts, it's not quite clear what the distribution is; it gets a bit lost. But you have to take a stance on this — if you want a model of computation, you have to commit yourself. What follows is just the detailed version, but the short version — and maybe this is the central thing — is this.

The central model is this: you've got a very long-term memory of lots of rules — a very big brain — and a very complicated world outside. What saves us, what makes cognition possible, is that there's a kind of funnel in between, the mind's eye: the examples of the world you see, you summarise as a sketch or caricature, and then within that simple scene you apply your rules. If I look out, I probably see three groups of seats; I can't see every individual. So we apply these rules to simplified scenes.
Okay, so what is the distribution? Very roughly — and just to persuade you that there is a way of committing yourself; persuading you that it's the right way would take more time — the idea is this. For each scene in your mind's eye there are all these features, and each one is true or false; but the whole essence is that, in this game, the description of the world is incomplete. You think "I want to go home", and then somehow you fill in the scene about what a reasonable way of going home would be. In this mind's eye, very little is specified: some things are definitely yes, some are definitely no, and there's an almost endless list of stars — unspecified values. The world sometimes doesn't specify the value of a feature; most of the time it doesn't.

And again, going back to the AI of long ago, the famous paradox was this bird called Tweety. I tell you it's a bird, and I ask you: Tweety is a bird — does it fly? You say yes. Then I tell you that Tweety is a penguin, and you change your mind. So this is some sort of paradox if you think of it in any kind of standard logic. But in this formulation, almost without doing anything, there is no paradox, because if I tell you something is a bird and I don't comment on whether it's a penguin, then in fact it's probably not a penguin — if it were a penguin, I'd probably have bothered to tell you. That is, there's a distribution over the examples you've seen, and whether something is mentioned is itself useful information. So incomplete specifications solve some paradoxes already — that's a comment.

And then the game being played is this: you've got your mind's eye; some things are yes — you're thinking about your dog; no, it's not a cat — and most things are unspecified. And then there's one thing you want to predict: say, what does your dog like? The question mark is what forces a prediction, and there's a ground truth — maybe the ground truth is a probability distribution — and you have to reply. So there is a distribution out there.
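A toy sketch of this point about incomplete scenes: features are true, false, or simply unmentioned, and what is not mentioned carries information through the distribution of scenes seen so far. The counts and feature names below are invented purely for illustration.

```python
# Scenes with features that are True, False, or unmentioned (None).
# Among scenes where "bird" is asserted, "penguin" is rarely mentioned,
# and when it is mentioned, flying fails.  All counts are invented.
scenes = (
    [{"bird": True, "penguin": None, "flies": True}] * 95 +   # ordinary birds
    [{"bird": True, "penguin": True, "flies": False}] * 5      # penguins, flagged as such
)

def prob_flies(given):
    # Empirical probability of "flies" among the scenes matching the query.
    match = [s for s in scenes if all(s.get(k) == v for k, v in given.items())]
    return sum(s["flies"] for s in match) / len(match)

# "Tweety is a bird" (penguin not mentioned): predict it flies.
print(prob_flies({"bird": True, "penguin": None}))   # 1.0
# "Tweety is a penguin": the prediction flips, with no paradox.
print(prob_flies({"bird": True, "penguin": True}))   # 0.0
```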
You're learning many things at the same time, and that brings in the notion of hierarchical learning. You're going to learn this word in the dictionary, and that other word in the dictionary. So what happens if you only half understand one word, and you're learning a second word in terms of the first? It's like going to a maths course where the different concepts build on one another. If you only half understand a concept, is it useful to be shown a new labelled example in which that half-understood concept appears? All the evidence is that if you half understand things, it's not very useful. It's very hard to learn a concept in terms of other concepts before you understand those other concepts well. And this is one reason why, if you just stare at the universe, it's hard to learn a complicated concept such as that the planets go round in ellipses; that's not so easy to spot by looking at the sky.

So, in fact, the way the system works — which at first I thought was a weakness, an embarrassment, but which I now think is probably inevitable — is that the examples do have to come with more or less correct labels. The value of universities is that you go to lectures and someone meticulously gives you exactly labelled examples which are more or less correct; you don't just skim the web and try to learn something complicated. Someone has to label the outputs correctly, and also the features correctly: if I tell you that, yes, this thing is a group, because this operation is commutative, that's not very helpful unless you learned what commutative means at the start. So that's another aspect of this model of computation.

And the last aspect I want to emphasise is that this is different from making a probabilistic model of the world; it's something which avoids some of those complications. The idea of probably approximately correct is that you assume the things you're going to predict are correct with probability close to one. I'm not in the business of estimating probabilities of 0.3 and 0.5 and 0.7 and computing with them; there's little evidence that humans are any good at that, and in trying to understand cognition we somehow have to avoid it.

Okay, so those are the seven features. So we're going to learn the rules using whatever learning algorithm you like — that's the parameter.
So I'll come back to the general features in a second. 482 00:52:00,620 --> 00:52:03,859 Okay. Good. Okay. 483 00:52:03,860 --> 00:52:07,820 So I mentioned this missing-word problem. 484 00:52:08,330 --> 00:52:13,580 A while ago, with Loizos Michael, we did an experiment. 485 00:52:13,580 --> 00:52:18,620 This was ten years ago. A simple experiment: small data set, simple algorithms. 486 00:52:20,090 --> 00:52:23,930 The idea was that we took a natural-language corpus from the Wall Street Journal. 487 00:52:25,040 --> 00:52:28,600 We used some standard machinery from machine learning and from natural language processing. 488 00:52:30,140 --> 00:52:33,590 We used online dictionaries, WordNet, and so on, 489 00:52:34,550 --> 00:52:43,070 and the exercise was that from this corpus we were going to learn rules about the world from single sentences. 490 00:52:45,310 --> 00:52:50,230 To do it properly, we should have been learning from paragraphs or more. 491 00:52:51,220 --> 00:52:54,340 So the idea was that we were trying to learn facts about the world, 492 00:52:55,360 --> 00:53:04,270 which are different from the purely syntactic features you can get just by applying machine learning boxes. 493 00:53:05,530 --> 00:53:13,630 Okay. And the issue was testing my main hypothesis, which is that 494 00:53:13,890 --> 00:53:19,420 even if you can do black-box learning well, there is added value in chaining these learned rules together. 495 00:53:19,700 --> 00:53:29,340 Okay. So here we are testing this hypothesis, and this is an example of the kind of rules we learned. 496 00:53:29,880 --> 00:53:36,630 So this is the Wall Street Journal; it's about business. A typical word is 'price', so take that as the missing word. 497 00:53:37,110 --> 00:53:42,239 And the question is: is the missing word 'price', a typical word you find in the Wall Street Journal? 498 00:53:42,240 --> 00:53:48,090 Maybe. And so for each word, as I said, you have a predictor, 499 00:53:49,210 --> 00:53:51,490 predicting from some enormous mess of features. 500 00:53:52,120 --> 00:54:02,620 Suppose it is predicting for you whether the missing word is 'price'. The machine learning algorithm we used was essentially close to a perceptron. 501 00:54:03,220 --> 00:54:10,270 So we were learning a linear inequality, but the features were these compound features. 502 00:54:12,830 --> 00:54:20,690 And the idea was that if you find a structure in the sentence where there's one word X, 503 00:54:20,690 --> 00:54:29,570 here the word was 'bargain', and the sentence is telling you that this bargain lowers something, 504 00:54:30,440 --> 00:54:33,640 then you should deduce that what it lowers is the price. Okay. 505 00:54:34,580 --> 00:54:41,390 So a bargain lowering something is good evidence for the missing word being 'price', and competition lowering something points to 'price' as well. 506 00:54:42,470 --> 00:54:48,110 So there are lots of independent pieces of evidence which can add up to decide whether the missing word was 'price'. 507 00:54:48,750 --> 00:54:54,590 But anyway, you throw this data set at your learning algorithm, 508 00:54:55,070 --> 00:55:02,540 and the aim was to learn facts about the world, facts which go beyond what you could get from syntax alone. 509 00:55:04,240 --> 00:55:05,890 Okay.
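A rough sketch of the learning setup just described, assuming a perceptron-style learner over binary compound features. The feature names, the training examples and the target word are invented for illustration; this is not the data or code of the actual experiment, just the shape of "one linear inequality per candidate missing word".

# Perceptron learning a linear inequality over binary compound features,
# one predictor per candidate missing word (here the target word is "price").
FEATURES = ["bargain_lowers_X", "competition_lowers_X", "X_was_eaten", "company_raises_X"]

def perceptron(examples, epochs=20):
    w, b = [0.0] * len(FEATURES), 0.0
    for _ in range(epochs):
        for x, y in examples:                   # x: 0/1 feature vector, y: +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                       # mistake-driven update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Hypothetical sentences with the missing word being "price" (+1) or not (-1).
train = [
    ([1, 0, 0, 0], +1),
    ([0, 1, 0, 0], +1),
    ([0, 0, 1, 0], -1),
    ([0, 0, 0, 1], +1),
]
w, b = perceptron(train)
print(w, b)   # learned weights and bias of the linear inequality for the target word "price"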
So you learn facts about the world, 510 00:55:06,070 --> 00:55:12,310 which hopefully you can chain together and then reach conclusions which go beyond simple black-box learning. 511 00:55:13,230 --> 00:55:20,920 Okay, so we did this, and, you know, it's a very small-scale experiment. 512 00:55:21,340 --> 00:55:24,640 We got some results. The main thing was this. 513 00:55:25,360 --> 00:55:31,270 We had about 260 words which were targets, words which occur frequently enough. 514 00:55:32,020 --> 00:55:36,160 This is ordered from left to right, 515 00:55:37,030 --> 00:55:43,210 so the words our methods most favoured are on the right. 516 00:55:44,080 --> 00:55:49,390 Blue was just machine learning; red was machine learning plus reasoning. 517 00:55:50,380 --> 00:55:57,700 And for some words we really did much better, and for some others everything was hidden in the noise. 518 00:55:58,480 --> 00:56:04,570 And the general phenomenon here is that in machine learning, big data is very powerful. 519 00:56:05,530 --> 00:56:10,660 Once you start adding reasoning, you can introduce all kinds of noise. 520 00:56:10,990 --> 00:56:17,290 So pure machine learning is quite something to compete against; it's quite hard to improve on it, but it is still possible. 521 00:56:18,640 --> 00:56:23,530 Okay. So I think I've got maybe two slides which are a bit more technical. 522 00:56:24,460 --> 00:56:30,400 This is what the robust logic thing is. So in our mind's eye 523 00:56:32,700 --> 00:56:35,820 there are some objects, some tokens. 524 00:56:36,600 --> 00:56:44,580 The right-hand sides of the rules aren't just unary predicates like 'bone'; they can be relations too, like 'above' or 'buys' or something like that. 525 00:56:46,200 --> 00:56:54,660 The left-hand side can be the hypothesis of any learning algorithm; we used linear inequalities. 526 00:56:56,870 --> 00:57:03,860 Again, they can have compound features like this one, which is true 527 00:57:04,340 --> 00:57:10,220 if there's an object in your mind's eye such that, for every other object in the mind's eye, various things hold. 528 00:57:11,150 --> 00:57:17,240 So having complicated features makes you learn better. 529 00:57:17,870 --> 00:57:21,370 But you've got enormous numbers of these; you generate them automatically. 530 00:57:21,680 --> 00:57:29,570 So you've got a trade-off. And you can use any learning algorithm you like, because everything becomes propositional. 531 00:57:30,530 --> 00:57:33,380 You just plug in your learning algorithm. That's what you do. 532 00:57:33,980 --> 00:57:41,570 And then what you are guaranteed is that these rules will be learnable, by definition. 533 00:57:42,350 --> 00:57:46,820 And the main promise is about what happens when you chain these rules together. 534 00:57:48,550 --> 00:57:59,470 Very roughly, the main promise is that if you have chained together two rules and each rule is accurate to 95%, 535 00:58:00,520 --> 00:58:04,420 then the conclusion will be correct with probability roughly 90%. 536 00:58:05,120 --> 00:58:12,040 Okay, so you lose accuracy the deeper the chaining, but it gives you a principled way of chaining things together, even just two things. 537 00:58:13,030 --> 00:58:16,150 So that's the kind of main promise.
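One way to read the 95-to-90 per cent claim is as simple error accumulation: if each of the two chained rules errs on at most 5% of scenes drawn from the distribution, the chained conclusion errs on at most about 10% of them, and the bound degrades linearly with the depth of chaining. This is my gloss on the promise, not the formal statement of the theorem; the small simulation below, which assumes the two errors are independent, just illustrates the arithmetic.

# Toy check of the error-accumulation intuition: chain two rules, each wrong
# with probability 0.05, and measure how often the final conclusion is correct.
import random
random.seed(1)

def noisy(value, err=0.05):
    # flip the truth value with probability err, modelling one rule's error
    return value if random.random() > err else not value

trials = 100000
correct = 0
for _ in range(trials):
    truth = True
    step1 = noisy(truth)     # conclusion of the first learned rule
    step2 = noisy(step1)     # second rule applied to the first rule's conclusion
    correct += (step2 == truth)
print(f"chained conclusion matches ground truth in {correct / trials:.1%} of trials")  # roughly 90%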
538 00:58:16,810 --> 00:58:22,719 And the idea is that if you want to do logic on learned knowledge in a principled way, 539 00:58:22,720 --> 00:58:28,000 in a big system, it seems hard to avoid such a requirement. 540 00:58:28,750 --> 00:58:34,150 Okay. And everything will work in polynomial time. 541 00:58:35,590 --> 00:58:36,790 That's how things are defined. 542 00:58:37,000 --> 00:58:46,390 The only restriction you need is that the relations have constant arity, so I can have 'A is above B' or 'A likes B'. 543 00:58:47,380 --> 00:58:53,350 Those are binary, but the cost goes up exponentially with the arity of the relations. 544 00:58:53,350 --> 00:58:57,520 So we have to describe the world in terms of relations of constant arity (a short sketch of this grounding step follows after this passage). 545 00:58:57,850 --> 00:59:01,510 This doesn't worry people, because it's a reasonable requirement. 546 00:59:02,930 --> 00:59:08,330 Otherwise everything is polynomial, in fact, in the number of tokens you have in your mind's eye. 547 00:59:08,480 --> 00:59:13,040 Psychologists tell us it's something like seven plus or minus two. 548 00:59:14,210 --> 00:59:18,070 So things are polynomial in that, and we know it's not exponential. 549 00:59:18,080 --> 00:59:21,890 Maybe you can have 20, maybe 30; you don't have to worry too much about that. 550 00:59:23,670 --> 00:59:28,459 Okay. 551 00:59:28,460 --> 00:59:31,190 So the outcome is that if you build a system on these principles, 552 00:59:31,970 --> 00:59:38,180 you would learn a lot: you use lots of learning boxes, but they interact in a principled way. 553 00:59:39,700 --> 00:59:47,770 And what this would address, I think, is certainly acquiring knowledge which is too hard to acquire by programming; 554 00:59:48,490 --> 00:59:58,600 it has to be done by learning. Building reasoning systems by programming them failed because they were too brittle. 555 00:59:58,630 --> 01:00:02,440 Hopefully learning will get you out of the brittleness. 556 01:00:06,280 --> 01:00:13,569 Okay. 557 01:00:13,570 --> 01:00:18,010 So I think what the reasoning, 558 01:00:18,650 --> 01:00:23,590 the Aristotle example, suggests is that we quite often reason 559 01:00:23,590 --> 01:00:28,840 in cases where there are few direct examples, and that's what this solves. 560 01:00:30,640 --> 01:00:33,760 Maybe a general comment. There's a very general issue everyone discusses 561 01:00:33,760 --> 01:00:37,330 in connection with machine learning: the idea of explanations. 562 01:00:37,780 --> 01:00:42,100 Okay, people don't like black-box machine learning because it gives no explanations. 563 01:00:42,730 --> 01:00:50,620 In this kind of system there is kind of half a solution, because what we're saying is that we're going to have lots of black boxes, 564 01:00:51,130 --> 01:00:55,210 but each black box is going to predict something you understand, some word you chose. 565 01:00:55,270 --> 01:00:57,880 You're going to choose the terms in which you want your problem understood. 566 01:00:58,540 --> 01:01:03,939 And once you've got a prediction, you've still got lots of black boxes in there, 567 01:01:03,940 --> 01:01:10,330 but you understand which features, the ones you care about, are being explained.
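Going back to the constant-arity restriction: the sketch below, with hypothetical relation and token names, shows the grounding step that makes everything propositional, and why the number of ground features is polynomial in the number of tokens for fixed arity but would blow up exponentially with the arity.

from itertools import product

# Hypothetical relations with fixed (constant) arities.
relations = {"bird": 1, "above": 2, "likes": 2}
tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]   # roughly the 7 +/- 2 tokens in the mind's eye

# Ground every relation over all tuples of tokens to get propositional atoms.
atoms = [f"{rel}({', '.join(args)})"
         for rel, arity in relations.items()
         for args in product(tokens, repeat=arity)]

print(len(atoms))   # 7 + 49 + 49 = 105 ground atoms: polynomial in the number of tokens
print(atoms[:2])    # e.g. ['bird(t1)', 'bird(t2)']

With n tokens, a relation of arity k contributes n**k atoms, so for fixed small k the propositional feature space stays manageable, which is what lets any off-the-shelf propositional learner be plugged in.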
568 01:01:10,960 --> 01:01:17,220 And this idea of explanations going down only to a certain level is, I think, 569 01:01:17,220 --> 01:01:21,130 quite appropriate, because I think it's similar with human explanations. 570 01:01:21,670 --> 01:01:27,670 If you ask me why I brought my umbrella, 571 01:01:27,880 --> 01:01:30,940 I'll say I thought it was going to rain. 572 01:01:31,450 --> 01:01:34,360 And you ask why that matters, and I say, well, I don't want to get wet. 573 01:01:34,600 --> 01:01:39,020 But if you keep asking me questions, at some point I'll say: I don't know the answer. 574 01:01:39,040 --> 01:01:45,609 Okay. So our explanations also go only up to a certain level; that is as far as we can explain what we think. 575 01:01:45,610 --> 01:01:53,050 So computers, too, will stop giving explanations at some point. 576 01:01:54,040 --> 01:01:57,880 So this kind of system gives explanations in terms of what you request, 577 01:01:58,510 --> 01:02:02,830 and maybe beyond that it is hopeless anyway. 578 01:02:03,710 --> 01:02:10,620 Okay. So by a machine being educated rather than trained, 579 01:02:11,130 --> 01:02:18,420 I mean that when the machine learns, it doesn't know how what it has learned is going to be used. 580 01:02:19,170 --> 01:02:24,900 When you train one single machine learning box, a lot of knowledge goes into it, 581 01:02:25,470 --> 01:02:30,540 but the only thing this knowledge will be able to do is to predict exactly what you had in mind when you were training it. 582 01:02:31,370 --> 01:02:38,340 Okay. Now, we learn all kinds of stuff in college and elsewhere, and then we can apply it to new situations. 583 01:02:38,900 --> 01:02:40,290 And this is very much like having 584 01:02:40,350 --> 01:02:50,940 many black boxes learned in parallel and having a principled way of using them to make a prediction or an explanation in a new situation. 585 01:02:52,100 --> 01:02:55,610 Okay. 586 01:02:55,610 --> 01:02:59,930 Okay. 587 01:03:00,650 --> 01:03:04,940 So, very quickly, the difficulties. 588 01:03:05,270 --> 01:03:14,120 Well, the main difficulty is getting good training sets. In machine learning, and in the recent developments, good training sets have obviously been very important. 589 01:03:14,120 --> 01:03:17,750 This is a challenge. 590 01:03:18,170 --> 01:03:21,200 Okay, so where do I get training material? 591 01:03:22,070 --> 01:03:29,510 Say I want to know the colour of an elephant. I put the different options into Google and I find this. 592 01:03:31,400 --> 01:03:35,660 Then I decide, well, I want better data than this, so I go to Google Scholar. 593 01:03:36,350 --> 01:03:41,660 Okay, then I find this. Okay, so this is good. 594 01:03:42,330 --> 01:03:48,260 So it seems that getting good data sets 595 01:03:48,260 --> 01:03:51,590 is a problem. Okay, so let's forget about this one. 596 01:03:52,610 --> 01:03:59,870 I think what's needed is a really big experiment, which basically needs big new data sets which can test something like this, 597 01:04:00,260 --> 01:04:04,260 just like the big vision data sets produced six or seven years ago, 598 01:04:04,280 --> 01:04:09,260 which were very important and influential.
What we need is 599 01:04:09,680 --> 01:04:21,960 big enough data sets, with good information, which challenge this requirement of doing reasoning in a broad enough 600 01:04:22,430 --> 01:04:26,300 context to be interesting. Okay. 601 01:04:27,060 --> 01:04:30,440 Okay. So, to conclude, 602 01:04:30,560 --> 01:04:36,080 the general summary is that what we're good at is throwing computational power at something. 603 01:04:37,160 --> 01:04:42,319 And I'm suggesting that we should throw computational power at something where we know there is a real phenomenon; 604 01:04:42,320 --> 01:04:46,969 supervised learning is one. But if we want to broaden it to intelligence, 605 01:04:46,970 --> 01:04:55,310 then we have to first decide what the real phenomenon is, and then throw computational power at it.