Okay. Welcome, everybody, to the Strachey Lecture. I'd like to start by expressing our gratitude to Oxford Asset Management, which has very generously supported the lecture, so I want to thank them for that. I also want to start with one announcement, which is that there will be refreshments after the lecture just outside, so everybody is invited; please do join us. And then the most pleasant task is introducing Les Valiant, who is going to be speaking to us about whether one can define intelligence as a computational problem.

Les is one of those people who, as many of you know, invents one research field after another, so when I thought about which things to tell you, I just picked a small selection of them. Les did a lot of work in algebraic complexity theory, with the complexity classes VP and VNP, which is still a very active area; the V stands for Valiant. He then moved on and started the complexity of counting and the class #P, and there's a whole community that works on that. Then he went on and founded computational learning theory; the book Probably Approximately Correct was really influential, and it was basically the first rigorous study of what can be learned. After that, as if that were not enough, Les became interested in classically simulating quantum computation, and he discovered holographic algorithms and that whole area. He then went on to write his book Circuits of the Mind, which is a computational approach to studying the human brain. And I think many of these threads will be brought together today.

It's traditional in these talks to mention prizes, but Les has actually won pretty much every prize, so I picked out just four so that we can get on to the talk. He won the Nevanlinna Prize in 1986, became an FRS in 1991, won the Knuth Prize in 1997, and the Turing Award in 2010. And I looked Les up on the Mathematics Genealogy Project and found 109 descendants, but I'd like to point out that 13 of them are right here in our department. So we all owe quite a lot to Les, not just for his intellectual stimulation, but also for a lot of mentoring. Oh, and I forgot: please fill out the questionnaire. And so I'm delighted to introduce Les Valiant.

Well, thank you very much, Leslie, for the very kind introduction, and thank you very much for inviting me here. About 30 years ago I spent a very happy sabbatical year in Oxford; I was treated very well, so I have very happy memories of Oxford and I'm very glad to be back.

So what I'm talking about is a kind of theoretical approach to AI.
In brief summary, it's a way of reconciling machine learning and reasoning. It's a topic which is close to my heart and has been for a long time. But in giving talks, often the hardest thing to understand is why someone is doing this kind of thing at all, so I'll be slightly self-indulgent and try to explain the motivation of this kind of approach.

First, I want to discuss this notion of a computational phenomenon, which not many people discuss. You know, algorithms have been around for a long time: Euclid had a very good algorithm by which, given two numbers, you can find their greatest common divisor efficiently. Whereas if I give you one number and you want a divisor of it — which is factoring it — as far as we know that takes exponential time. So for two numbers you can find a common factor efficiently, but for one number you apparently cannot. So there's something very striking already in what people knew about algorithms a couple of thousand years ago, and many of the best algorithms we know are ancient. So what is computer science contributing in general?

Well, of course, the big change was Turing's paper in 1936, so I'll start by trying to spell out my view of what the big event was. I will discuss this notion of a computational phenomenon: for Turing, the phenomenon was computation itself, and the model of computation, which for him was the Turing machine. And the best way of explaining these ideas is by making an analogy with physics. I'm not trying to say that computer science and physics are the same, but analogies do serve some purpose.

So what do you have in physics? You have some laws, like F = ma and the law of gravitation. You've got some laws which are believed to hold generally, but they are really supported by mathematical theorems which are consequences — deductions from the laws — with which you can really understand the incredible breadth of what the law means. Okay, so we learned about this a long time ago. As a computer scientist, I have sometimes wondered what we are offering that is comparable to what the physicists have been doing. And on reflection, I think what we are doing is what Turing did: what corresponds to a law in physics is a model of computation. He defined a model, the Turing machine, and the general claim was that it captures computation in the real world in a very significant and general sense. This is a big statement, but it's supported by mathematical consequences, exactly as the physicists do it.
So, for example, an important consequence is that there is a universal Turing machine; with this notion he could discuss the non-computable problems; and another very important thing about models of computation is that they are robust — if you make small changes, it shouldn't change the power. So I think this is the main thing computer science offers, and what the rest of us have since been trying to emulate in other ways.

The idea here, at an even more general level, is this: what Newton implied is that you can capture the laws of physics by equations, and I'm claiming the general statement that there are phenomena in computation, and you should capture them by models of computation. Turing's example was that the phenomenon was computation in general and the model was Turing machines, but much of what we've been doing since, in the algorithms area, has been along the same tracks.

So — as a side comment — I'm describing an analogy with physics, and I should point out that there are other analogies other people use. Some people use the analogy that unproved mathematical conjectures, like P not equal to NP, should be treated like physical laws: things people believe but can't prove — let's believe them until someone disproves them. That's fine; I've got no quarrel with that. But I'm really drawing a different analogy, which I try to expand on in a bit more detail in the book than I can here. The physical laws are true but not provable; the analogue is a model of computation, together with the claim that the model of computation is valid for a real phenomenon.

Okay. So, for example, another phenomenon is search, and in fact the best description of this in words is the phrase "mental search", which Turing used already in 1948. The idea is that you're searching for oil in the ground, you're searching for something in your head, or you're searching for factors of a number. And the definition is called NP, where you're searching for solutions which are short compared to the input size, and also, given a solution, you can easily verify it. So this is a formalisation of search, and NP is a rigorous, formal statement of it. We believe that NP is a real phenomenon in computation, which lots of people find useful, and the model of computation is the non-deterministic Turing machine.
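For concreteness, this is the standard way the search class just described is usually written down; the notation is mine rather than from the slides, but it matches the two conditions mentioned — short solutions and easy verification:

```latex
L \in \mathrm{NP}
\iff
\exists\ \text{polynomial-time verifier } V \text{ and polynomial } p \text{ such that}
\quad
x \in L \;\iff\; \exists\, y,\ |y| \le p(|x|),\ V(x,y)=1 .
```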
And again, the interest of this definition by itself is not what is impressive; the interest is in the powerful mathematical statements that go with it, and an important one is that the hardest search problems are the NP-complete problems. So there's a model of computation and some stunning, surprising mathematical statements which make it worthwhile. So that's one.

Okay. Now, by P I mean roughly what we're sure we can compute efficiently in this universe, in polynomial time — I should really include randomisation, so it's a stand-in for that. The computable captures everything — that is Turing's class, the most general — and NP is like a subclass. Another subclass is #P, which some people call "number P", which is the counting version; and again the same story applies.

Yet another one is BQP, which is quantum polynomial time. This is our best effort at describing what you would get if you used quantum theory for computation. And with each of these classes, I think it's the same story: there's a model of computation — this one is bounded-error quantum polynomial time — and, besides the suggestion that we should use quantum theory to compute, there are some mathematical consequences. For example, a very powerful theorem is that there are many ways you could try to use quantum mechanics to compute, and it turns out that they are all equivalent. That's a strong result; by looking at this model of computation you do arrive at the conclusion that there's a real computational phenomenon there. And another result is that BQP is in fact reducible to #P, so the counting problems are at least as powerful as the quantum class.

So we've got these various classes with different power, and there are more. Another phenomenon captures the idea of games — this is your kind of game theory — and again a powerful result is that the complete problems are the hardest members of the game class. Of course, mathematically we don't know: all these classes could collapse for all we know. It may be that everything up to #P can be done in polynomial time, even efficiently. But even if that happens to be the case, one can still discuss these as phenomena, I think — and there's a more extended discussion of that elsewhere.
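A compact way to summarise the relationships mentioned so far is via the standard (unproved-to-be-strict) inclusions below; identifying the "games" class with PSPACE is my gloss, since that is the class usually associated with game problems:

```latex
\mathrm{P} \;\subseteq\; \mathrm{NP} \;\subseteq\; \mathrm{P}^{\#\mathrm{P}} \;\subseteq\; \mathrm{PSPACE},
\qquad
\mathrm{P} \;\subseteq\; \mathrm{BQP} \;\subseteq\; \mathrm{P}^{\#\mathrm{P}} \;\subseteq\; \mathrm{PSPACE}.
```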
So it's with this background, I think, that I have approached these topics. If one looks at machine learning — well, I've tried to formalise the notion of supervised learning. We call that PAC learning, probably approximately correct learning, which is believed to be a proper subclass of P. Roughly, that means there are many things you could write a program for — a program that would fit into your computer, or into the universe — but the belief is that most of these you cannot learn from examples. So learning is harder than just computing. Exactly where the learnable sits within P is — well, there is some important structure there. For example, one observation is that cryptography lives off the difference: public-key cryptography wouldn't exist if any function could easily be learned from examples, because then you could learn all the secrets. So negative results about complexity are used every day, especially by cryptographers.

Okay. So with PAC learning, again, you have a model of computation. This model, which I'll describe in a bit more detail, captures the notion of supervised learning, which is a well-known concept and widely practised, of course. And again, once you have a formalisation, questions of robustness are obviously important: if I define this class in different ways, do I get different classes? It's important that it is robust — many variants give you the same class. And some consequences are, for example, that you can give a rigorous demonstration that a learning algorithm really does generalise. Generalisation used to be a philosophical issue not so many decades ago; now, of course, it is practised by machines. And you can also explain why some algorithms predict well, in a certain sense — there's nothing magical about them.
Okay. So I want to describe this a little, because I'll build on it later. It's a formalisation of supervised learning. There are these terms, supervised learning and unsupervised learning, which I use in very general senses, and one reason for formalising them is that it at least defines what we're discussing. But generally we're talking about learning where there's some feedback. Supervised learning doesn't mean that there's necessarily a supervisor.

For example, I can look around the room and learn something about the average audience at a computer science lecture in Oxford — the average age, or something like that. There's no supervisor telling me things; I'm doing this because, from other knowledge, I can label people myself — I know roughly how old everyone is. So I can learn without an external label. So supervised learning doesn't mean there has to be a supervisor; essentially, if there's any kind of feedback, it's supervised learning. Unsupervised learning is where there's truly no feedback: you see some pattern and somehow you're supposed to draw some conclusion. But certainly, I think the impact of machine learning recently has all come from the feedback kind — the supervised kind.

Okay, so what's the formalisation? The idea is that there's some space of examples — an example might be a flower, and you're trying to classify flowers by which species they come from, so they have types A and B. There's a ground truth F which separates the A's from the B's. And the learner has a hypothesis which also classifies examples. In any world rich enough to be worth talking about, there will be errors. And we do assume it's a very rich world: exponentially many different kinds of examples, maybe infinitely many. We want to talk about something realistic.

So what is this supervised learning phenomenon? It seems amazing; it works; people celebrate it, even in the popular press. The formalisation has three points. The first is an efficiency criterion. It says that there will always be errors, but the more examples you take and the more computation you apply, the more you should be able to reduce your error, fairly fast. It should be rewarding to put more effort into learning: if you double the amount of effort or the number of examples, you should see a decrease in the error. And this is something quantified, and the important thing is that the error goes down algebraically. So if you have n examples, the error may go down like one over the square root of n, or maybe one over the tenth root of n, but it shouldn't be slower than that.
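As a toy illustration of this criterion, the sketch below generates a synthetic error curve of the form err(n) = c·n^(-α), recovers the exponent by a straight-line fit on a log-log scale, and computes how much extra data halving the error would cost. The particular numbers and the exponent are invented purely for illustration, and numpy is assumed to be available.

```python
import numpy as np

# Hypothetical scaling data: error falling as a power of the number of
# examples, err(n) = c * n**(-alpha).  All numbers here are synthetic.
n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
err = 0.5 * n ** (-0.25)            # an invented exponent of 0.25

# On a log-log scale a power law is a straight line, so fit its slope.
slope, _ = np.polyfit(np.log(n), np.log(err), 1)
alpha = -slope
print(f"estimated exponent alpha = {alpha:.2f}")

# Consequence of the criterion: halving the error costs a *fixed*
# multiplicative factor in data, namely 2**(1/alpha).
print(f"data factor needed to halve the error: {2 ** (1 / alpha):.1f}x")
```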
And of course this gain has actually been realised: basically, over the last ten years people have increased the budget in data and computation by factors of maybe thousands, and this has brought really good rewards. And for some simple learning algorithms you can prove that the thing learns, and that it learns this fast.

In pictures, the quantitative aspect is that as you put in more effort, the error goes down as a power of the effort — it may be one over n to the half, say. So if you want to reduce the error by a factor of two, you have to put in some fixed factor of effort, like 100, or 4. And some people have actually verified this experimentally: for tasks such as predicting the next word, various deep learning algorithms do show this polynomial decrease in error. The plots are on a log-log scale, which straightens the curve into a straight line. Also, PAC learning doesn't tell you what the power law should be, and in fact there's evidence that different applications have different power laws — a natural language data set and a vision data set have different exponents. The one shown here is a very slow power — a fixed power, but only about 0.06 — yet good enough. Okay, so that's the efficiency criterion.

The two other aspects are these. One is that we want to be realistic: the world is complicated, so the last thing we want is to assume we know the probabilities. We know that something is an A or something is a B, but in different worlds there may be different probabilities of each kind; if you go from here to China, maybe you find the same flowers, but with different probabilities. So the second requirement is that the learning algorithm works for arbitrary distributions. The secret, of course, is that you learn on a distribution and you have to perform on that same distribution: here you'll be tested on the flowers which are common here; in China they'll test you on something different. And this basically says that, in practice, the successful learning algorithms are very broad-spectrum: they don't just work for the uniform distribution.
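Putting the pieces together, the criterion sketched so far is usually written as follows. The notation is mine, but it is the standard probably-approximately-correct statement: for any distribution D and any target f in the class, from polynomially many examples the learner must, with high probability, output a hypothesis h with small error on that same distribution:

```latex
\Pr_{x \sim D}\bigl[\,h(x) \neq f(x)\,\bigr] \;\le\; \varepsilon
\quad\text{with probability at least } 1-\delta,
\qquad
\text{examples, time} \;=\; \mathrm{poly}\!\left(\tfrac{1}{\varepsilon},\, \tfrac{1}{\delta},\, n\right).
```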
And then the third point, which is a bit more subtle, is this. When you're learning, the learning algorithm gives you a hypothesis, and there is a computational representation of everything on the learner's side; but the thing being learned — the teacher — is just a function. You don't look inside the teacher; it's just a behaviour. In practice the learning algorithm is something you have in your hand — maybe a perceptron, a deep network, some sort of boosting — but then the examples come, and no one guarantees where they come from. You've got no chance of learning everything your representation can represent, and yet you're still successful, and the reason usually is that the examples come from a weaker world: there's something simple about the world you're learning from. So the mystery of why certain heuristics work so well in practice is often that the tasks they're given have some simplicity in them, which is often very hard to identify. Anyway, this is a specification of a formal model of supervised learning. That was by way of introduction.

Okay. So I've got this model of inductive learning, and we know that machine learning, which does roughly this kind of thing, is very successful. But the question is whether this type of problem captures all there is to intelligence — is this all of intelligence? Everyone, or almost everyone, would agree that the answer is no. So what more is there? What we want to do, if we follow this approach, is again to find a model of computation. PAC learning is a model of learning, but it's not enough, because learning alone we don't think is enough. So what should we add? And I do say "add", because I think inductive learning is a pretty powerful phenomenon and we need to add to it rather than start from scratch.

So what do we need to capture? The adage which I've been carrying around for a long time, and using as advertising, is this line from Aristotle, who said that all belief comes from syllogism or induction. By which he means something like: if you have a belief in your head, then either you deduced it — by syllogism, some sort of logical deduction, from something else you knew — or else it came by induction, which means that somehow, from basic empirical evidence, you generalised.
And, of course, he spent 99% of his effort on syllogism and didn't say much about induction. What's happened since, of course, is that syllogism became this big field of mathematical logic and formalised reasoning, while induction became a rather mysterious philosophical field. But I think the issues have been clarified by machine learning and by machine learning theory.

As an example: when I started, the question was, how come children who have seen different examples of chairs, in different parts of the world, nevertheless agree, even on a new chair, about what's a chair and what's not? That was kind of a mystery; there wasn't a good answer to it. But now machines can do this routinely, so asking this question won't mystify anyone living now. And the reason is that machine learning theory gives an answer to what it means to achieve this: you only have to perform well on the distribution you've seen, and it's probabilistic anyway. So we do have a handle on this.

Before I go on, I should say that there are some technological aims here. What I'll be discussing is how one would want to have a unified view of reasoning and of learning, because at the moment they're very different: classical reasoning — classical logic — is a very brittle kind of mathematical theory, whereas machine learning is a robust thing of a different kind. So we do want to unify them. And the grand goal, if you can do that as a foundational technology, is to approach what I believe is the central problem of AI: how you put into a computer knowledge which at the moment is very hard to acquire — common-sense knowledge — and how you enable the computer to use it to reason, to make predictions, deductions, whatever. I can't imagine how you can do the second unless you take some unified view of what reasoning and learning are; if they're two disparate things, it seems a bit difficult.

Now, in modern terms, I suppose there's a debate, and I'll basically be saying that reasoning and learning are both important and we have to reconcile them. Not everyone agrees. For example, at the moment there are some people who are so enthusiastic about machine learning that they think a single black-box machine learning system will do everything and we won't need reasoning. Okay, so that's one view.
And other people may put reasoning high on the pedestal. Putting it more simply, the question is: are there people who actually deny that reasoning is real, or who deny that learning is real? Certainly, I think 30 or 40 years ago there were real learning-deniers — people who thought that intelligence was all about putting in facts and reasoning efficiently with them. They were certainly learning-deniers then, and now there are some reasoning-deniers around. But in this talk I'll take a middle ground.

Okay, so let's try this one: did Aristotle have a cell phone? Most people can answer this question without too much effort. But the question is: did you use pure learning for this, or pure reasoning, or something else? The main contrast I want to draw is that at the moment one has to argue a bit against people who want to do everything with a single black-box machine learning system. The idea there is that if you feed this black box a billion sentences from the web, maybe you can answer every question and the reasoning will go away. But I think common-sense introspection suggests that to answer this question, it's not that we've been exposed to thousands of sentences about Aristotle's possessions; rather, we somehow knew some facts and we chained facts together — so some reasoning was involved. That's introspection, though. Can we ask the same question, of learning versus reasoning, a bit more scientifically?

So we want some experiment to do which tests this kind of issue in a plausible way. And the problem which came to us, which I think is very natural for this, is called the word completion problem. Essentially, I take a phrase from a website or a newspaper — usually a headline — and I delete a word, and you have to guess what the missing word is. This is quite a good test — quite a good IQ test, because it's quite hard to do. These headlines are often quite succinct: just the minimum number of words needed to express what you want to say. And of course it matters where we took these headlines from — maybe from a world where you have no knowledge. The examples I have happen to be from an English-language Chinese newspaper.

Okay, so let's have some examples. This one was: "Whatever the year of the dog holds in store, pet owners will be lavishing more attention than ever on their ___."
So you have to guess what the missing word is, and the question is: could your computer program do it? Anyone? Okay. The answer was "pooches", which is, I believe, an early-twentieth-century American word for a dog. And so the question is: is this a hard problem, say, for a black-box machine learning algorithm? My guess is that this one is easy, because if you search Google for sentences with "pet owners" and "their pooches" in them, you get tens of thousands. So this is an easy problem for black-box machine learning.

Okay, another one: "China rises as a maritime powerhouse after snapping up profitable ___ ___ across the world." Fragrance? Fragrance trade? Yeah — okay, so that's the answer: the two words were "seaport terminals". Good. I reckon this one is slightly harder: you probably have to do some reasoning; you can't just do it by some sort of word association, because there aren't many examples of that phrase around.

The hardest examples for this kind of inductive learning are where there's some kind of news — where, to understand the headline, you have to know what happened yesterday. For example: "___ retail sales up 20.7% in second quarter." The answer here is "Macau". So maybe if you're an expert on conditions in the different parts of China you could do it, but you need lots of knowledge, and maybe recent news, that kind of stuff. And certainly, if things depend on this morning's news, then having a billion sentences in your brain doesn't help you.

Okay. So the interesting thing here is that with this problem you can test your machine learning system on how well it solves it. And I think this problem is not bad — it's a kind of stand-in for the Turing test in certain ways. The Turing test has many aspects, but one aspect is that it measures something: how well you perform compared to something else. And the other important thing about the Turing test is that Turing didn't say that intelligence depends on how well you play chess or how well you know chemistry; it depends on general knowledge of general stuff. This missing-word test is good on that count. And what the learning theory perspective adds is that it emphasises that any kind of performance in a system like this is with respect to a particular distribution.
It's hard to be intelligent if you go somewhere where your knowledge is irrelevant. It also emphasises feasible computation — we're interested in efficient computation, not infeasible computation — and in controlling the error of your prediction, and things like that. Okay, so we'll come back to this problem. I'm suggesting that if you tackle this problem of common-sense knowledge and learning and reasoning, or whatever, this isn't a bad problem to test your system on, because there's a ground truth, and it's about general knowledge.

So what I'm really coming to is my main content, which is my suggestion for having a model of computation which can do both: inductive learning, which I think is an important phenomenon, with reasoning added on to it. And with this combined system, if you do it well — which we haven't yet — you can test it on a problem like this word completion problem, the missing-word problem.

So the question is, how do we add reasoning? What is intelligent thinking? What else do we do besides inductive learning? And how do we make this into a model of computation? Now, these models get kind of complicated — a Turing machine is quite complicated; this is much more complicated — and the justification is that you're capturing something important, maybe, and that other ways of capturing it would boil down to the same thing. Anyway, what follows is more a list of the things you need to capture.

The first feature is this idea, borrowed from cognitive science, of a working memory. There's this amazing thing about our cognition: while we have an enormous store of memories, at each instant we've somehow got this small mind's eye which directs our behaviour — this little world in front of us, what we're aware of — and we use this awareness to plan our lives, what we do next, what we do after the lecture. All our behaviour is channelled through this small window. So what's going on here? The explanation here will be that we need to restrict the window for complexity reasons — for computational complexity reasons — and, as a model, we need to use it to get anywhere. Okay, so roughly, this is how we formulate it.
You wake up in the morning and your mind's eye is blank, but it's got room for two or three tokens, and you fill it up during the day with what you're thinking about. So you fill it up with a scene: you think of your dog, and then you want to feed your dog, so you want to know what the dog likes. And you have a rule in your head which tells you that, in fact, dogs like bones. So somehow, with your background knowledge from your big long-term memory, you can fill up your mind's eye with the missing information. That's roughly what goes on.

But here we come to the first difference between logic and learning. The point is that an implication like that doesn't fit well with PAC learning, or any kind of learning, because when you do machine learning you've got some target function — say you're learning to recognise an elephant — and what you're learning is effectively a criterion for a picture to contain a representation of an elephant. So what you are definitely learning, if you do supervised learning, is an equivalence. You have maybe a big neural network, a perceptron, or a decision tree, and you want to predict whether what's in front of you is an A or not. On the left-hand side is some very rich, incredibly complicated rule — tens of thousands of bits, hard to interpret, possibly — but it does contain a useful criterion for whether what's in front of you is, say, a bone or not. It's a predictor.

So the first step of the proposal is that we are going to learn these things: maybe for each word in the dictionary, you learn a predictor for it in terms of the other words. And this predictor can be whatever your machine learning algorithm can produce — it depends on your computational resources. That's what learned rules are: they are equivalences. But the point is that you can chain these equivalences together to make predictions. Using something like this, if the conditions in your scene predict that this thing is a bone, then it's a very good bet to predict that it is a bone; and once you've predicted it's a bone, you can make further predictions about your scene using these equivalences. So that's basically the idea of how your mind works: your mind is full of not one black-box neural net, but tens of thousands of them, and somehow the predictions of these tens of thousands can be used together in a principled way — that's the rough summary.
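Here is a toy sketch of that chaining idea. The scene encoding, the two hand-written "rules", and the particular facts are all invented stand-ins: in the real proposal each rule would itself be a learned predictor (a perceptron, a network, or whatever), not a hard-coded function.

```python
# Toy sketch of rule chaining over a small working memory ("mind's eye").
# Each rule plays the role of a learned predictor for one concept,
# evaluated only on what is currently in the scene; here the "learned"
# predictors are hand-written stand-ins.

scene = {"dog": True, "cat": False}   # a few tokens; everything else unspecified

def predict_likes_bone(s):
    # Stand-in for a learned equivalence: the scene suggests "likes_bone"
    # whenever a dog is present and nothing contradicts it.
    return s.get("dog", False)

def predict_go_to_shop(s):
    # A second rule, which can fire only after the first has filled the scene in.
    return s.get("likes_bone", False)

rules = {"likes_bone": predict_likes_bone, "go_to_shop": predict_go_to_shop}

# Chaining: repeatedly apply every rule to the current scene, adding any
# newly predicted attributes, until nothing changes.
changed = True
while changed:
    changed = False
    for concept, rule in rules.items():
        if concept not in scene and rule(scene):
            scene[concept] = True
            changed = True

print(scene)   # {'dog': True, 'cat': False, 'likes_bone': True, 'go_to_shop': True}
```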
Okay, so that's the first step. So this "robust logic" is the system — the model of computation. The first aspect of it is that we are going to learn rules which are equivalences, and which predict maybe every concept in the dictionary. That's aspect one.

Aspect two: we will have quantifiers. We have "for all" and "exists" and so on, a bit as in logic, but now they mean something much more grounded than they do in conventional logic. In logic you learn things like "all men are mortal" — but then what does that mean? Has someone checked that claim throughout the universe? Probably not. So those quantifiers are almost a bit embarrassing. In this logic, the quantifiers refer only to the mind's eye: given what you're thinking about, does something exist there in your mind's eye, or is something true for everything in your mind's eye? So it's a very local thing. And I suppose I should point out already that, somehow, predicate calculus logic hasn't worked out too well for AI, and there's almost something embarrassing about it. Certainly this rule is simplistic — obviously there are many things your dog likes, not just bones — so there's something very simplistic and brittle about such logical expressions. But the idea of predicting what a bone is, or plausible variants of that, is not mysterious at all; that's what machine learning technology does for you. Okay, so we have some quantifiers, and so on.

The third thing is: what about consistency? As I said, we're going to learn a lot of these rules. What if, when you chain them together, you get inconsistencies? Logic is certainly hung up on inconsistency. Here we say: don't worry. Learn all these rules and just live with the inconsistencies, and if the inconsistencies are important, you somehow learn your way out of them. The 1960s or seventies version of this problem — no longer much used — is what's called the Nixon triangle. You learn that Quakers are pacifists — that's a rule — and that Republicans are not pacifists. So these are two rules you go around with.
And then there's the example of someone called Richard Nixon, who was both a Quaker and not a pacifist. So then what do you do? Here the answer is: don't worry about it. Go around with your general rules, and if this counterexample becomes worrying enough, then your learning algorithm will learn that all Quakers except Richard Nixon are pacifists. So you learn your way out of an inconsistency if it's important, but you've got no chance of maintaining consistency in a complicated world. That part is easy.

So rules will be learned. Instead of learning to recognise elephants, we're going to learn rules, and these rules will predict inside the mind's eye, in this probably approximately correct sense. And we will look for rules which are highly reliable. So we simply learn rules.

Aspect four — a more subtle issue, actually, about which there's a lot of discussion — is this distribution business. Here we do get rather strange philosophical problems. As I said, the question of how come we agree about chairs, although we've seen different examples, has some history, but in the end it's not so mysterious: we can believe there's one distribution here. But then it gets more mysterious. We've learned that Aristotle lived a long time ago, and so on — so how do we use that to conclude that he didn't have a cell phone? When we learn these general facts, it's not quite clear what the distribution is; it gets a bit lost. But you have to take a stance on this — if you want a model of computation, you have to commit yourself. What follows is just the detailed version, but the short version — and maybe this is the central thing — is this.

The central model is this: you've got a very long-term memory of lots of rules — a very big brain — and a very complicated world outside. What saves us, what makes cognition possible, is that there's a kind of funnel in between, the mind's eye: the examples of the world you see, you summarise as a sketch or caricature, and then within that simple scene you apply your rules. If I look out, I probably see three groups of seats; I can't see every individual. So we apply these rules to simplified scenes.
Okay, so what is the distribution? Very roughly — and just to persuade you that there is a way of committing yourself; persuading you that it's the right way would take more time — the idea is this. For each scene in your mind's eye there are all these features, and each one is true or false; but the whole essence is that, in this game, the description of the world is incomplete. You think "I want to go home", and then somehow you fill in the scene about what a reasonable way of going home would be. In this mind's eye, very little is specified: some things are definitely yes, some are definitely no, and there's an almost endless list of stars — unspecified values. The world sometimes doesn't specify the value of a feature; most of the time it doesn't.

And again, going back to the AI of long ago, the famous paradox was this bird called Tweety. I tell you it's a bird, and I ask you: Tweety is a bird — does it fly? You say yes. Then I tell you that Tweety is a penguin, and you change your mind. So this is some sort of paradox if you think of it in any kind of standard logic. But in this formulation, almost without doing anything, there is no paradox, because if I tell you something is a bird and I don't comment on whether it's a penguin, then in fact it's probably not a penguin — if it were a penguin, I'd probably have bothered to tell you. That is, there's a distribution over the examples you've seen, and whether something is mentioned is itself useful information. So incomplete specifications solve some paradoxes already — that's a comment.

And then the game being played is this: you've got your mind's eye; some things are yes — you're thinking about your dog; no, it's not a cat — and most things are unspecified. And then there's one thing you want to predict: say, what does your dog like? The question mark is what forces a prediction, and there's a ground truth — maybe the ground truth is a probability distribution — and you have to reply. So there is a distribution out there.
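A toy sketch of this point about incomplete scenes: features are true, false, or simply unmentioned, and what is not mentioned carries information through the distribution of scenes seen so far. The counts and feature names below are invented purely for illustration.

```python
# Scenes with features that are True, False, or unmentioned (None).
# Among scenes where "bird" is asserted, "penguin" is rarely mentioned,
# and when it is mentioned, flying fails.  All counts are invented.
scenes = (
    [{"bird": True, "penguin": None, "flies": True}] * 95 +   # ordinary birds
    [{"bird": True, "penguin": True, "flies": False}] * 5      # penguins, flagged as such
)

def prob_flies(given):
    # Empirical probability of "flies" among the scenes matching the query.
    match = [s for s in scenes if all(s.get(k) == v for k, v in given.items())]
    return sum(s["flies"] for s in match) / len(match)

# "Tweety is a bird" (penguin not mentioned): predict it flies.
print(prob_flies({"bird": True, "penguin": None}))   # 1.0
# "Tweety is a penguin": the prediction flips, with no paradox.
print(prob_flies({"bird": True, "penguin": True}))   # 0.0
```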
You're learning many things at the same time, and that brings in the notion of hierarchical learning. You're going to learn this word in the dictionary, and that other word in the dictionary. So what happens if you only half understand one word, and you're learning a second word in terms of the first? It's like going to a maths course where the different concepts build on one another. If you only half understand a concept, is it useful to be shown a new labelled example in which that half-understood concept appears? All the evidence is that if you half understand things, it's not very useful. It's very hard to learn a concept in terms of other concepts before you understand those other concepts well. And this is one reason why, if you just stare at the universe, it's hard to learn a complicated concept such as that the planets go round in ellipses; that's not so easy to spot by looking at the sky.

So, in fact, the way the system works — which at first I thought was a weakness, an embarrassment, but which I now think is probably inevitable — is that the examples do have to come with more or less correct labels. The value of universities is that you go to lectures and someone meticulously gives you exactly labelled examples which are more or less correct; you don't just skim the web and try to learn something complicated. Someone has to label the outputs correctly, and also the features correctly: if I tell you that, yes, this thing is a group, because this operation is commutative, that's not very helpful unless you learned what commutative means at the start. So that's another aspect of this model of computation.

And the last aspect I want to emphasise is that this is different from making a probabilistic model of the world; it's something which avoids some of those complications. The idea of probably approximately correct is that you assume the things you're going to predict are correct with probability close to one. I'm not in the business of estimating probabilities of 0.3 and 0.5 and 0.7 and computing with them; there's little evidence that humans are any good at that, and in trying to understand cognition we somehow have to avoid it.

Okay, so those are the seven features. So we're going to learn the rules using whatever learning algorithm you like — that's the parameter.
So I'll come back to the general features in a second. 482 00:52:00,620 --> 00:52:03,859 Okay. Good. Okay. 483 00:52:03,860 --> 00:52:07,820 So I mentioned this missing-word problem. 484 00:52:08,330 --> 00:52:13,580 A while ago, with Loizos Michael, we did an experiment. 485 00:52:13,580 --> 00:52:18,620 This was ten years ago. A simple experiment: small data set, simple algorithms. 486 00:52:20,090 --> 00:52:23,930 The idea was that we took a natural-language corpus from the Wall Street Journal. 487 00:52:25,040 --> 00:52:28,600 We used some standard machinery from machine learning and from natural language processing. 488 00:52:30,140 --> 00:52:33,590 We used online dictionaries, WordNet, and so on, 489 00:52:34,550 --> 00:52:43,070 and the exercise was that from this corpus we were going to learn rules about the world from single sentences. 490 00:52:45,310 --> 00:52:50,230 To do it properly, we should have been learning from paragraphs or more. 491 00:52:51,220 --> 00:52:54,340 So the idea was that we were trying to learn facts about the world, 492 00:52:55,360 --> 00:53:04,270 which are different from the purely syntactic features you can get just by applying machine learning boxes. 493 00:53:05,530 --> 00:53:13,630 Okay. And the issue was testing my main hypothesis, which is that 494 00:53:13,890 --> 00:53:19,420 even if you can do black-box learning well, there is added value in chaining these learned rules together. 495 00:53:19,700 --> 00:53:29,340 Okay. So here we are testing this hypothesis, and this is an example of the kind of rules we learned. 496 00:53:29,880 --> 00:53:36,630 So this is the Wall Street Journal; it's about business. A typical word is 'price', so take that as the missing word. 497 00:53:37,110 --> 00:53:42,239 And the question is: is the missing word 'price', a typical word you find in the Wall Street Journal? 498 00:53:42,240 --> 00:53:48,090 Maybe. And so for each word, as I said, you have a predictor, 499 00:53:49,210 --> 00:53:51,490 predicting from some enormous mess of features. 500 00:53:52,120 --> 00:54:02,620 Suppose it is predicting for you whether the missing word is 'price'. The machine learning algorithm we used was essentially close to a perceptron. 501 00:54:03,220 --> 00:54:10,270 So we were learning a linear inequality, but the features were these compound features. 502 00:54:12,830 --> 00:54:20,690 And the idea was that if you find a structure in the sentence where there's one word X, 503 00:54:20,690 --> 00:54:29,570 here the word was 'bargain', and the sentence is telling you that this bargain lowers something, 504 00:54:30,440 --> 00:54:33,640 then you should deduce that what it lowers is the price. Okay. 505 00:54:34,580 --> 00:54:41,390 So a bargain lowering something is good evidence for the missing word being 'price', and competition lowering something points to 'price' as well. 506 00:54:42,470 --> 00:54:48,110 So there are lots of independent pieces of evidence which can add up to decide whether the missing word was 'price'. 507 00:54:48,750 --> 00:54:54,590 But anyway, you throw this data set at your learning algorithm, 508 00:54:55,070 --> 00:55:02,540 and the aim was to learn facts about the world, facts which go beyond what you could get from syntax alone. 509 00:55:04,240 --> 00:55:05,890 Okay.
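A rough sketch of the learning setup just described, assuming a perceptron-style learner over binary compound features. The feature names, the training examples and the target word are invented for illustration; this is not the data or code of the actual experiment, just the shape of "one linear inequality per candidate missing word".

# Perceptron learning a linear inequality over binary compound features,
# one predictor per candidate missing word (here the target word is "price").
FEATURES = ["bargain_lowers_X", "competition_lowers_X", "X_was_eaten", "company_raises_X"]

def perceptron(examples, epochs=20):
    w, b = [0.0] * len(FEATURES), 0.0
    for _ in range(epochs):
        for x, y in examples:                   # x: 0/1 feature vector, y: +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                       # mistake-driven update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Hypothetical sentences with the missing word being "price" (+1) or not (-1).
train = [
    ([1, 0, 0, 0], +1),
    ([0, 1, 0, 0], +1),
    ([0, 0, 1, 0], -1),
    ([0, 0, 0, 1], +1),
]
w, b = perceptron(train)
print(w, b)   # learned weights and bias of the linear inequality for the target word "price"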
So you learn facts about the world, 510 00:55:06,070 --> 00:55:12,310 which hopefully you can chain together and then reach conclusions which go beyond simple black-box learning. 511 00:55:13,230 --> 00:55:20,920 Okay, so we did this, and, you know, it's a very small-scale experiment. 512 00:55:21,340 --> 00:55:24,640 We got some results. The main thing was this. 513 00:55:25,360 --> 00:55:31,270 We had about 260 words which were targets, words which occur frequently enough. 514 00:55:32,020 --> 00:55:36,160 This is ordered from left to right, 515 00:55:37,030 --> 00:55:43,210 so the words our methods most favoured are on the right. 516 00:55:44,080 --> 00:55:49,390 Blue was just machine learning; red was machine learning plus reasoning. 517 00:55:50,380 --> 00:55:57,700 And for some words we really did much better, and for some others everything was hidden in the noise. 518 00:55:58,480 --> 00:56:04,570 And the general phenomenon here is that in machine learning, big data is very powerful. 519 00:56:05,530 --> 00:56:10,660 Once you start adding reasoning, you can introduce all kinds of noise. 520 00:56:10,990 --> 00:56:17,290 So pure machine learning is quite something to compete against; it's quite hard to improve on it, but it is still possible. 521 00:56:18,640 --> 00:56:23,530 Okay. So I think I've got maybe two slides which are a bit more technical. 522 00:56:24,460 --> 00:56:30,400 This is what the robust logic thing is. So in our mind's eye 523 00:56:32,700 --> 00:56:35,820 there are some objects, some tokens. 524 00:56:36,600 --> 00:56:44,580 The right-hand sides of the rules aren't just unary predicates like 'bone'; they can be relations too, like 'above' or 'buys' or something like that. 525 00:56:46,200 --> 00:56:54,660 The left-hand side can be the hypothesis of any learning algorithm; we used linear inequalities. 526 00:56:56,870 --> 00:57:03,860 Again, they can have compound features like this one, which is true 527 00:57:04,340 --> 00:57:10,220 if there's an object in your mind's eye such that, for every other object in the mind's eye, various things hold. 528 00:57:11,150 --> 00:57:17,240 So having complicated features makes you learn better. 529 00:57:17,870 --> 00:57:21,370 But you've got enormous numbers of these; you generate them automatically. 530 00:57:21,680 --> 00:57:29,570 So you've got a trade-off. And you can use any learning algorithm you like, because everything becomes propositional. 531 00:57:30,530 --> 00:57:33,380 You just plug in your learning algorithm. That's what you do. 532 00:57:33,980 --> 00:57:41,570 And then what you are guaranteed is that these rules will be learnable, by definition. 533 00:57:42,350 --> 00:57:46,820 And the main promise is about what happens when you chain these rules together. 534 00:57:48,550 --> 00:57:59,470 Very roughly, the main promise is that if you have chained together two rules and each rule is accurate to 95%, 535 00:58:00,520 --> 00:58:04,420 then the conclusion will be correct with probability roughly 90%. 536 00:58:05,120 --> 00:58:12,040 Okay, so you lose accuracy the deeper the chaining, but it gives you a principled way of chaining things together, even just two things. 537 00:58:13,030 --> 00:58:16,150 So that's the kind of main promise.
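One way to read the 95-to-90 per cent claim is as simple error accumulation: if each of the two chained rules errs on at most 5% of scenes drawn from the distribution, the chained conclusion errs on at most about 10% of them, and the bound degrades linearly with the depth of chaining. This is my gloss on the promise, not the formal statement of the theorem; the small simulation below, which assumes the two errors are independent, just illustrates the arithmetic.

# Toy check of the error-accumulation intuition: chain two rules, each wrong
# with probability 0.05, and measure how often the final conclusion is correct.
import random
random.seed(1)

def noisy(value, err=0.05):
    # flip the truth value with probability err, modelling one rule's error
    return value if random.random() > err else not value

trials = 100000
correct = 0
for _ in range(trials):
    truth = True
    step1 = noisy(truth)     # conclusion of the first learned rule
    step2 = noisy(step1)     # second rule applied to the first rule's conclusion
    correct += (step2 == truth)
print(f"chained conclusion matches ground truth in {correct / trials:.1%} of trials")  # roughly 90%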
538 00:58:16,810 --> 00:58:22,719 And the idea is that if you want to do logic on learned knowledge in a principled way, 539 00:58:22,720 --> 00:58:28,000 in a big system, it seems hard to avoid such a requirement. 540 00:58:28,750 --> 00:58:34,150 Okay. And everything will work in polynomial time. 541 00:58:35,590 --> 00:58:36,790 That's how things are defined. 542 00:58:37,000 --> 00:58:46,390 The only restriction you need is that the relations have constant arity, so I can have 'A is above B' or 'A likes B'. 543 00:58:47,380 --> 00:58:53,350 Those are binary, but the cost goes up exponentially with the arity of the relations. 544 00:58:53,350 --> 00:58:57,520 So we have to describe the world in terms of relations of constant arity (a short sketch of this grounding step follows after this passage). 545 00:58:57,850 --> 00:59:01,510 This doesn't worry people, because it's a reasonable requirement. 546 00:59:02,930 --> 00:59:08,330 Otherwise everything is polynomial, in fact, in the number of tokens you have in your mind's eye. 547 00:59:08,480 --> 00:59:13,040 Psychologists tell us it's something like seven plus or minus two. 548 00:59:14,210 --> 00:59:18,070 So things are polynomial in that, and we know it's not exponential. 549 00:59:18,080 --> 00:59:21,890 Maybe you can have 20, maybe 30; you don't have to worry too much about that. 550 00:59:23,670 --> 00:59:28,459 Okay. 551 00:59:28,460 --> 00:59:31,190 So the outcome is that if you build a system on these principles, 552 00:59:31,970 --> 00:59:38,180 you would learn a lot: you use lots of learning boxes, but they interact in a principled way. 553 00:59:39,700 --> 00:59:47,770 And what this would address, I think, is certainly acquiring knowledge which is too hard to acquire by programming; 554 00:59:48,490 --> 00:59:58,600 it has to be done by learning. Building reasoning systems by programming them failed because they were too brittle. 555 00:59:58,630 --> 01:00:02,440 Hopefully learning will get you out of the brittleness. 556 01:00:06,280 --> 01:00:13,569 Okay. 557 01:00:13,570 --> 01:00:18,010 So I think what the reasoning, 558 01:00:18,650 --> 01:00:23,590 the Aristotle example, suggests is that we quite often reason 559 01:00:23,590 --> 01:00:28,840 in cases where there are few direct examples, and that's what this solves. 560 01:00:30,640 --> 01:00:33,760 Maybe a general comment. There's a very general issue everyone discusses 561 01:00:33,760 --> 01:00:37,330 in connection with machine learning: the idea of explanations. 562 01:00:37,780 --> 01:00:42,100 Okay, people don't like black-box machine learning because it gives no explanations. 563 01:00:42,730 --> 01:00:50,620 In this kind of system there is kind of half a solution, because what we're saying is that we're going to have lots of black boxes, 564 01:00:51,130 --> 01:00:55,210 but each black box is going to predict something you understand, some word you chose. 565 01:00:55,270 --> 01:00:57,880 You're going to choose the terms in which you want your problem understood. 566 01:00:58,540 --> 01:01:03,939 And once you've got a prediction, you've still got lots of black boxes in there, 567 01:01:03,940 --> 01:01:10,330 but you understand which features, the ones you care about, are being explained.
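Going back to the constant-arity restriction: the sketch below, with hypothetical relation and token names, shows the grounding step that makes everything propositional, and why the number of ground features is polynomial in the number of tokens for fixed arity but would blow up exponentially with the arity.

from itertools import product

# Hypothetical relations with fixed (constant) arities.
relations = {"bird": 1, "above": 2, "likes": 2}
tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]   # roughly the 7 +/- 2 tokens in the mind's eye

# Ground every relation over all tuples of tokens to get propositional atoms.
atoms = [f"{rel}({', '.join(args)})"
         for rel, arity in relations.items()
         for args in product(tokens, repeat=arity)]

print(len(atoms))   # 7 + 49 + 49 = 105 ground atoms: polynomial in the number of tokens
print(atoms[:2])    # e.g. ['bird(t1)', 'bird(t2)']

With n tokens, a relation of arity k contributes n**k atoms, so for fixed small k the propositional feature space stays manageable, which is what lets any off-the-shelf propositional learner be plugged in.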
568 01:01:10,960 --> 01:01:17,220 And this idea of explanations going down only to a certain level is, I think, 569 01:01:17,220 --> 01:01:21,130 quite appropriate, because I think it's similar with human explanations. 570 01:01:21,670 --> 01:01:27,670 If you ask me why I brought my umbrella, 571 01:01:27,880 --> 01:01:30,940 I'll say I thought it was going to rain. 572 01:01:31,450 --> 01:01:34,360 And you ask why that matters, and I say, well, I don't want to get wet. 573 01:01:34,600 --> 01:01:39,020 But if you keep asking me questions, at some point I'll say: I don't know the answer. 574 01:01:39,040 --> 01:01:45,609 Okay. So our explanations also go only up to a certain level; that is as far as we can explain what we think. 575 01:01:45,610 --> 01:01:53,050 So computers, too, will stop giving explanations at some point. 576 01:01:54,040 --> 01:01:57,880 So this kind of system gives explanations in terms of what you request, 577 01:01:58,510 --> 01:02:02,830 and maybe beyond that it is hopeless anyway. 578 01:02:03,710 --> 01:02:10,620 Okay. So by a machine being educated rather than trained, 579 01:02:11,130 --> 01:02:18,420 I mean that when the machine learns, it doesn't know how what it has learned is going to be used. 580 01:02:19,170 --> 01:02:24,900 When you train one single machine learning box, a lot of knowledge goes into it, 581 01:02:25,470 --> 01:02:30,540 but the only thing this knowledge will be able to do is to predict exactly what you had in mind when you were training it. 582 01:02:31,370 --> 01:02:38,340 Okay. Now, we learn all kinds of stuff in college and elsewhere, and then we can apply it to new situations. 583 01:02:38,900 --> 01:02:40,290 And this is very much like having 584 01:02:40,350 --> 01:02:50,940 many black boxes learned in parallel and having a principled way of using them to make a prediction or an explanation in a new situation. 585 01:02:52,100 --> 01:02:55,610 Okay. 586 01:02:55,610 --> 01:02:59,930 Okay. 587 01:03:00,650 --> 01:03:04,940 So, very quickly, the difficulties. 588 01:03:05,270 --> 01:03:14,120 Well, the main difficulty is getting good training sets. In machine learning, and in the recent developments, good training sets have obviously been very important. 589 01:03:14,120 --> 01:03:17,750 This is a challenge. 590 01:03:18,170 --> 01:03:21,200 Okay, so where do I get training material? 591 01:03:22,070 --> 01:03:29,510 Say I want to know the colour of an elephant. I put the different options into Google and I find this. 592 01:03:31,400 --> 01:03:35,660 Then I decide, well, I want better data than this, so I go to Google Scholar. 593 01:03:36,350 --> 01:03:41,660 Okay, then I find this. Okay, so this is good. 594 01:03:42,330 --> 01:03:48,260 So it seems that getting good data sets 595 01:03:48,260 --> 01:03:51,590 is a problem. Okay, so let's forget about this one. 596 01:03:52,610 --> 01:03:59,870 I think what's needed is a really big experiment, which basically needs big new data sets which can test something like this, 597 01:04:00,260 --> 01:04:04,260 just like the big vision data sets produced six or seven years ago, 598 01:04:04,280 --> 01:04:09,260 which were very important and influential.
What we need is 599 01:04:09,680 --> 01:04:21,960 big enough data sets, with good information, which challenge this requirement of doing reasoning in a broad enough 600 01:04:22,430 --> 01:04:26,300 context to be interesting. Okay. 601 01:04:27,060 --> 01:04:30,440 Okay. So, to conclude, 602 01:04:30,560 --> 01:04:36,080 the general summary is that what we're good at is throwing computational power at something. 603 01:04:37,160 --> 01:04:42,319 And I'm suggesting that we should throw computational power at something where we know there is a real phenomenon; 604 01:04:42,320 --> 01:04:46,969 supervised learning is one. But if we want to broaden it to intelligence, 605 01:04:46,970 --> 01:04:55,310 then we have to first decide what the real phenomenon is, and then throw computational power at it.