So hello, everyone, and welcome to the last seminar in Hilary term. Today it is our pleasure to have Benjamin Guedj as the speaker. Ben is a Principal Research Fellow in machine learning at UCL and an Inria research scientist, and he is also a visiting researcher at the Alan Turing Institute. He is a member of our society and a fellow of the Royal Statistical Society. He is a leading researcher in PAC-Bayes; personally, I have read many of his papers on PAC-Bayes during my PhD study. Today he will tell us more about PAC-Bayes and the new research in his group. Thank you very much.

Thank you very much for that very nice introduction and for the invitation, which I'm afraid had to be postponed over the past few months; I certainly wish we were able to meet with all of you in person, but things are like they are. I'm very pleased to present a short overview, really, of what PAC-Bayes is. As I was joking a few minutes ago, it certainly is much more popular than it used to be ten years ago when I started on the topic, but still, I appreciate that. I think it's good to have maybe a kind of refresher on what PAC-Bayes theory is and what it can bring, in particular, to machine learning and statistics, and then I'll also cover a few recent results from my group. I should also say, as a disclaimer: if you do have any questions, please do unmute yourself. I'm afraid I'm not quite sure I can see if you type in the chat, but please feel free to interrupt me if you do have any question.

Right. So you should see my slides moving forward. So, yeah, as I was saying, I'm a researcher affiliated with Inria in France, the national research institute for digital science and mathematics, and with UCL since 2018, where I'm one of the members of the Centre for Artificial Intelligence, which was recently inaugurated and is based in London. I also serve as the scientific director of this joint venture between France and the UK called The Inria London Programme. So if you are even remotely interested in hearing more about that, please feel free to get in touch; I'd be very happy to elaborate.

My research sits at the crossroads of statistics and machine learning, but also, more recently, optimisation, with applications to machine learning. The few keywords I'm highlighting here are statistical learning theory and PAC-Bayes; obviously, as you will see, this is the core of my research contributions over the past years. I also have a keen interest in computational statistics, and much more recently I've taken an interest in the theoretical analysis of deep learning, as you will see.
So we were able to obtain generalisation bounds for some architectures of neural networks. As you will very much gather from my talk, I hope, my personal obsession is absolutely generalisation: what it entails for a series of problems in machine learning and statistics, how it can translate into a better theoretical understanding of some algorithms, and also how it can drive new algorithms. Sorry, I'm trying to deal with, I think, notifications to let people in, but I don't know if you're taking care of this. Yeah, OK. OK, I guess.

All right. Let me start with a very brief introduction, and maybe a kind of thought-provoking entry into the topic of generalisation. I'd like to start with a very bold claim or phrase, which is: learning is to be able to generalise. And you all know this kind of classical picture; this one is taken from Wikipedia, for example. The crux of generalisation is trying to understand, from examples, what an intelligent system can learn about what is underlying. And you would pretty much all agree that if you just memorise what you've already seen, it's usually pretty bad; this is called overfitting in statistics, and generalisation is precisely the opposite of that: your ability to escape overfitting, to understand what is underlying, and to understand what generalising beyond the data set truly means.

So let me start also with a very small appetiser, really food for thought. There's been some evidence in the past few years that deep learning might be conflicting with, or even actually breaking, statistical learning theory as we know it, based on a very simple observation. Most neural networks trained on data actually achieve a fantastic training error: very low training error, basically zero. And as statisticians, this should be a red flag, right? In the sense that it might not bode well for the test performance; this suggests, actually, very strong overfitting. Yet this is not what happens. Surprisingly, they are able to generalise from massive data sets and to achieve actually excellent test errors.

You might also know, of course, this kind of picture; this is very much statistics one-oh-one. This one is taken from the paper by Belkin and co-authors. You all know this kind of trade-off between increasing the complexity of the space of hypotheses versus optimising the test risk. The goal of statistics, and of statistical learning in particular, is really to aim for that sweet spot right over there, where you achieve the ideal trade-off between a good training loss and the generalisation ability.
There is a conjecture, actually, from a series of authors, but starting really with this very influential paper from Mikhail Belkin and co-authors, that this might actually just be half of the picture. So this is what the other half might look like. And the idea that they put forward is really that, rather than that sweet spot, we should aim for some kind of interpolation threshold. So a possible explanation for how massive architectures of networks actually generalise very well indeed is that there might be this second phase, which they called the interpolating regime, in which (and I appreciate this is very handwaving), in a nutshell, you've learnt the data so well that you start to be able to generalise slightly around the few data points that you have already collected. So that's that.

I mean, we can't, we can't hear you now.

No? It was working like ten seconds ago. Can anyone hear me?

Yeah, yeah, we can hear you now. OK.

No. Oh, boy.

OK. Now we can hear you.

It should be back. Can anyone hear me?

Yes, we can.

Fantastic. I'm very sorry about that; my Bluetooth headphones just decided to go off. All right, the joys of not going for a landline, I guess.

OK, so let me just resume on that picture. So this is the kind of conjecture that is being addressed at the very moment by PAC-Bayes. Another thought-provoking example, I think: I'd like to focus on the semantic content of learning, and the way that semantic representations can be used and leveraged, really, to accelerate learning and to try to better understand generalisation from a few examples. This is very much addressed by one of the big projects that I was very fortunate to join when I joined UCL two years ago, which is called Semantic Information Pursuit for Multimodal Data Analysis. There are a lot of partners; I know in particular that the department is one of the partners of that project. And this is very much what we try to address.

The third and last example is what I like to narrate as a tale of two learners. For one specific task, let's assume that you have a deep network with a classical architecture. I'm not going to comment on that; this is all very classical to all of you, surely. But it's widely accepted now that for very simple tasks, such as identifying an item in an image (say you want to identify horses in images), you would typically achieve accuracies which are excellent, basically a hundred percent. And the way it works is usually through training samples: hundreds of thousands, maybe millions of annotated images.
And "annotated" is important here, because you need some kind of way to ensure that the data is curated; we'll see what this actually means in theory, because it translates in many ways into the theory. And then you train this on your favourite GPU, and you get a deployable deep neural network. That's the first contender.

Now for contender number two. Young children are also extremely good at identifying horses in images; actually, they do achieve the same accuracy, they have a hundred percent accuracy, and, surprisingly, they are also able to do other things. For example, they are very good at transferring to other items which resemble horses, for example zebras or unicorns. And my three-year-old daughter is all about unicorns at the moment. This ability is obviously very far from what you would expect from the network, especially with the same training samples: the number of training samples in that particular example is n equal to five or six, basically.

So what I'm saying with this little tale is that there is a striking mismatch between theory and the way we actually use algorithms, and the way we understand how intelligent systems operate. And this is very much where my research has been sitting for the past few years, and where it will sit for the upcoming years for sure: trying to address that gap, trying to better understand how semantic representations can be used to leverage learning, and how we can actually hope to aim for more intelligent algorithms, in the sense that they would be much more frugal, require much fewer resources, and leverage more internal representations.

So, back to that original claim. Learning is indeed to be able to generalise, but not from scratch. If you do consider each learning task as a fresh start, being blind to context, then you are unlikely to be efficient. So this is a call for incorporating structure, semantic information, and what I like to call implicit representations of the world, which we humans, for example, use constantly without even realising it. The hope is that this drives a new series of algorithms, which will probably come in the long run, but which will hopefully be much more frugal in terms of resources. That's very much where my efforts on PAC-Bayes sit, in the sense that my group would like to drive this from the theory and try to translate it into working algorithms.

So that brings me to the body of my talk: a very short primer on what PAC-Bayes is, very much inspired by the tutorial that I gave with John Shawe-Taylor, on the slide; this was at ICML two years ago. So let me start with a very simple setting.
And again, if you do have any questions or remarks, please feel free to unmute yourselves; I'm saying this for those who joined a bit late. Right. So, a very simple setting; and again, bear in mind that pretty much all of what I will be saying also extends to much more complicated settings. For example, I will restrict myself to the supervised case, but there are a few takes on PAC-Bayes, in particular, for unsupervised learning or contrastive learning; I'll touch briefly on this in the last part.

All right, so this is the setting. I have a learning algorithm; let's assume that I have pairs of inputs and outputs. So again, supervised classification notation; I'm not going to comment on that too much. The underlying distribution P is the underlying data-generating distribution, and we just assume that we collect a sample of m examples. The i.i.d. assumption is, of course, extremely central and key in statistics, but you will see in the second part that we actually have a few attempts and successful contributions to relax that assumption, also in the sense of translating this into more practical algorithms under more realistic assumptions.

So let me start briefly with, again, a very broad overview of statistical learning theory. Let me take just this example. You fix an algorithm A, you fix the function class H, and you fix also the sample size m; what you can play with is changing random samples and then just looking at the errors. So you look at the test errors from those samples. Statistical learning theory is all about trying to understand the behaviour of those errors. Rather than focussing on the mean, what makes a lot more sense is to focus on the tail of the distribution of those errors. So what you ultimately want to do in statistical learning theory is find bounds which hold as high-probability statements over random samples. If you compare this to statistical tests, it's basically the same kind of interpretation: if you do assume that you have a 99 percent confidence level, say, the chances of the conclusion of a statistical learning theory bound not being true are less than one percent. This is how you should interpret those probabilistic statements.

And this is the celebrated PAC framework. PAC stands for Probably Approximately Correct, and it can be traced back to the 80s with Valiant. Basically, the way it goes is that you would typically define a confidence parameter delta, so that you can control the probability of making a huge mistake; you can control this by delta.
Then you understand the phrasing "high confidence": if you take the complementary set of that event, then you get the probability of not being in the case of a large mistake, hence "approximately correct" (I'll comment on this in the slides), and this probability is actually arbitrarily close to one. So that's why I also like to claim that statistical learning theory is about controlling what happens to the test error with high confidence.

What can you achieve from that particular sample? You can do basically two things. You can first learn a predictor: you can train a machine learning algorithm which delivers a predictor from this data. And then you can certify its performance. By learning, I mean that you could train using any kind of learning principle; this covers pretty much everything that you could think of. You could have regularised methods, Bayesian techniques, neural networks, and so on and so on. This can be informed or not by prior knowledge. But the crux is that you can also certify the performance of that predictor, and this is typically what is addressed by generalisation bounds, as you will see in a few minutes. Those two goals obviously interact with each other, and my research is very much involved in the interplay between those two goals and how you can use one to leverage the other.

So, now crossing into the actual definition of what generalisation is. Again, this is all very classical; please feel free to ask questions or make remarks, but maybe I'll go slightly faster on this. Very classically, we use a loss function to measure the discrepancy between the actual output Y and what you predict, h evaluated at X; obviously those are random variables. So if you want to be able to make comparisons, you take the expectation: that's what I'm going to call the out-of-sample risk, or R_out, the expectation with respect to the underlying distribution of the data. Obviously you can't compute it; hence the recourse to the empirical counterpart, the in-sample risk R_in. Generalisation is answering the question in the middle of the slide: if you have a predictor which does well on the pairs that you've collected so far, would it still do well on new pairs, if you were to collect new data points? Would it still perform reasonably well? For that, we use the generalisation gap, which is simply the difference between the out-of-sample risk and the in-sample risk. Probabilistic statements, as I was mentioning before, might be in the form of upper bounds like this one. There's also research on lower bounds; I'm not going to touch on that in this talk.
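To fix notation, the quantities just described can be written as follows (a reconstruction from the talk's definitions, assuming a loss function $\ell$ and an i.i.d. sample of size $m$):

$$R_{\mathrm{out}}(h) = \mathbb{E}_{(X,Y)\sim P}\,\ell\big(h(X),Y\big), \qquad R_{\mathrm{in}}(h) = \frac{1}{m}\sum_{i=1}^{m}\ell\big(h(X_i),Y_i\big),$$

and an upper bound is a high-probability statement of the form $\mathbb{P}\big(R_{\mathrm{out}}(h) - R_{\mathrm{in}}(h) \le \varepsilon(m,\delta)\big) \ge 1-\delta$.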
Basically, those upper bounds come with a series of flavours, and this epsilon term is actually the one which is of the utmost interest in those results. The different flavours I'm going to be touching upon are either distribution-free or distribution-dependent, in the sense that the bound might depend on the underlying distribution of the data, meaning that you could kick in additional assumptions on the underlying data distribution. For example, if you assume boundedness, if you assume the existence of k-th moments of the data, if you assume a linear dependence between X and Y with Gaussian noise, and so on and so on: those are all model specifications, and the bound can be made tighter and tighter with those assumptions. But hopefully the bound would also be valid with a minimal set of assumptions, if any, really.

The other kind of flavours bounds come in are the algorithm-free or algorithm-dependent flavours, in the sense that a bound might be true only for a specific h, or it could be true for a very wide range of functions h. And if this is the case (I think you already see this coming), you can optimise the bound for the specific h that you would be more inclined to use.

So, getting back to the PAC framework: in a nutshell, this is the prototypical form of PAC bounds, not yet PAC-Bayes. With probability very close to one, the generalisation gap of this h, which is this blue part right here (so the generalisation gap, as we had before), is at most this epsilon term that we can control, and hopefully, in good cases, can also compute, not just control. You could think of this epsilon as a measure of the complexity of the problem that you're looking at, and this typically scales... I see someone typing in the chat; I do see a question: could you provide an example of distribution-free or algorithm-independent approaches? Yes, absolutely, but that's going to come in a few slides. All right. And feel free to unmute yourselves if you can.

All right, so those statements are really high-confidence statements on the tail of the distribution of this R_out random variable. Again, keep in mind that this is very similar to the statement you would achieve from a statistical test at level one minus delta. So here comes PAC-Bayes. PAC-Bayes is really about generalisation bounds, not just for hypotheses, but rather for distributions over hypotheses; hence the connection to Bayesian statistics, although, as we will see, PAC-Bayes is actually fairly different from Bayesian inference.
So let me also try to convince you, if you're not already convinced at this stage, that as a statistician you should care about generalisation, in the sense that generalisation bounds really are a safety check. The way I see it, they give you a guarantee for data that you have not yet collected, or might never collect: a theoretical guarantee on the performance of one specific algorithm on any unseen data. And as you will see, the most interesting cases of generalisation bounds are when they provide a control which is computable. If you can actually compute numerically the right-hand side of a generalisation bound, then you're in business, basically, because you have the tightest kind of control that you could hope for, for that particular algorithm.

They also explain why some algorithms actually work. If you want to explain why some of them don't work, you're more interested in lower bounds; but as far as upper bounds go, they do explain why some algorithms actually work. And, as I was saying, if you are able to optimise the right-hand side of the bound with respect to the hypothesis, or something even more complex, really, then it can lead to designing new algorithms. I'm going to be commenting on this in a minute.

So at this stage, here is a take-home message that I very much hope will convince you: PAC-Bayes is a framework which is extremely generic, in the sense that, as you will see, it's been applied to a lot of frameworks in statistics and machine learning. The only thing you need to do to apply PAC-Bayesian reasoning is to be able to define distributions over hypotheses; once you've done that, you can basically use the PAC-Bayes machinery and then try to derive bounds. So it very much is a technique to think about what generalisation means for those frameworks. And, as I was briefly saying, it has a Bayesian flavour: it leverages the flexibility of the Bayesian paradigm, as you will see, and you can derive new results from it.

This part is, as I was saying, very much inspired by the material from the tutorial with John Shawe-Taylor, and by the workshop we ran a couple of years before at NeurIPS, where, in particular, we highlighted and discussed the differences between Bayesian inference and PAC-Bayes learning. You were a panellist there, by the way; I see that you're here. And, as I was mentioning as well, if this is something of interest to you, we do have quite a few openings in my group here in London, and I'd be very happy to elaborate on this after the talk or during the questions. So, before jumping into the formal definition of PAC-Bayes,
let me just briefly remind you what generalisation bounds without PAC-Bayes look like, hence also answering the question about distribution-free versus distribution-dependent. This is the very simple building block of generalisation, where you have basically just one h. It's obviously not very realistic, but you have to start somewhere. So this is how the result reads: the generalisation gap, with high probability, will be no more than the square root of the log of one over the confidence level with which you state the inequality, divided by how many points you have in your training set. So this gives us a rate of one over the square root of m. And there is the very simple extension where you have a finite function class, which is, again, in a sense, the very worst-case approach: those bounds are extremely conservative. I remember discussing this with a colleague recently: those bounds really are a worst-case approach, as they consider that the worst situation that could arise yields these rates, while in practice, under the right conditions, performance might actually be even better. So again, this illustrates that mismatch between theory and algorithms, and this is something which has guided me for as long as I can remember, really.

So, you know, you all know those classical results on generalisation, building, from the 60s onwards, on, for example, the VC dimension, and then, just to mention it, the Rademacher complexity: all those techniques actually target specific algorithms, so those are very much algorithm-dependent approaches rather than algorithm-free; the algorithm-free comes more with the PAC-Bayes approach, as you will see. Those techniques take into account, to some extent, the correlations between the different hypotheses, whether you have a finite number of hypotheses or even an infinite, or uncountable, function class, but quite imperfectly, to be honest. And this is where PAC-Bayes also brings in something new. As I was saying, PAC-Bayes considers distributions over hypotheses rather than hypotheses, and so you have a perfect control on the correlations between the different hypotheses.

All right, a question, please.

So, did you just say now that the VC dimension and Rademacher complexity were algorithm-dependent?

No, no, no, I was saying that those bounds on the slides are... no, sorry, I had to choose between distribution-dependence and algorithm-dependence, and those bounds are valid whatever the algorithm.

Yes, OK, yeah.

Sorry, I'm not quite sure what I said; maybe I misspoke. Thanks.
Thanks for catching this. All right, so now coming to what is probably the most thought-provoking slide of my talk, at least the one that ends up being discussed the most, in particular by Bayesians. This is my attempt at summarising the key differences between Bayesian statistics and PAC-Bayes. At the top of the slide, you get a very summarised interpretation of what Bayesian inference is: by defining a prior distribution and specifying a statistical model, typically through the likelihood function, you get in a unique way (up to the normalisation constant, obviously, but in a unique way) the posterior distribution, which is the object of interest in Bayesian statistics. And by "unique" on this slide, what I mean is that if you pick a different prior, then the whole inference machinery is very likely to be fairly different; if you pick a different likelihood as well, this is very likely to lead to much different results. As a matter of fact, the likelihood and the prior are chosen very carefully in Bayesian statistics, and those choices are not independent from each other. There's a very rich literature on eliciting proper priors targeting different statistical models, leading to efficient inference.

In PAC-Bayes, there's a considerable shift. I should say, as a disclaimer, that we do use priors and posteriors in PAC-Bayes, but they have nothing to do, really, with proper Bayesian priors and proper Bayesian posteriors. PAC-Bayes is a model-free approach, in the sense that it does not require you to have a statistical model: there is no likelihood function; the only thing you're using is a loss function. You could argue that the loss induces a likelihood, but you don't even need to define it analytically, and you should never consider this as a proper likelihood. The prior is just a measure to explore the space of hypotheses; this is what I mean by "exploration mechanism". It's just a way to navigate within the hypothesis space, and it doesn't need to be built with a particular inference scheme in mind. The posterior is really the twisted prior, after confronting it with data, just as much as in Bayesian statistics; but, as you will see, there's also a line of work in PAC-Bayes which considers data-dependent priors, for example. So what I'm saying here is that the vocabulary is the same, really, but the objects are fairly different, and this also highlights the differences between Bayesian inference and PAC-Bayes.
So, as you will see, the PAC-Bayes bounds will hold whatever the prior, whatever the posterior, and whatever the data-generating distribution. And this is a massive difference: as I was saying earlier, the choice of the prior and the likelihood function, for example, has a massive impact in Bayesian statistics. You cannot just pick anything and then use the same machinery. So that's a very salient difference between Bayes and PAC-Bayes.

So I'm going to consider, from now on, a prior P and a posterior Q, which only needs to be absolutely continuous with respect to P, as it says right here. The risk, as before, was defined for hypotheses; if you now want to consider the risk of a whole distribution, you basically just integrate those pointwise definitions of the risk with respect to the distribution. And if you want to measure the discrepancy, not between just two hypotheses, but rather between two distributions over hypotheses, then you need to resort to divergences. Most of the literature holds for the Kullback-Leibler divergence, as you will see in a minute; I should also say that there are a few attempts in the literature to extend this and go beyond the KL, to actually much more general classes of divergences.

Here we are. This is now, finally, after half an hour, the first classical PAC-Bayes bound: the prototypical bound from David McAllester, from a series of papers at the end of the 90s. The prototypical form of a PAC-Bayes bound holds, as I was saying, for any prior and any confidence level: the probability for the generalisation gap to be at most this term (you have this big square root right there) is pretty close to one. And again, this holds whatever the posterior. This "for any prior and for any posterior" is actually extremely important in PAC-Bayes because, as you could imagine, it opens the way to optimising the bound for the best posterior distribution, in some sense.

Are there any questions so far? Any remarks?

Excuse me, is this a bound for the bounded loss function, the zero-one loss?

Yes, absolutely. Absolutely. Thank you.

All right. So, touching now on how PAC-Bayes is actually used to derive new learning algorithms. Most PAC-Bayes bounds have this shape: basically, on the left-hand side, the error on unseen data, and on the right-hand side, this empirical error plus a complexity term. You are in a good position if you can compute that complexity term, if it either has an analytical form or you can approximate it reasonably.
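For reference, a reconstruction of the classical bound just presented (one common version of McAllester's bound; the exact constants vary across statements): with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$,

$$R_{\mathrm{out}}(Q) \;\le\; R_{\mathrm{in}}(Q) \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}},$$

which indeed has the shape of an empirical error plus a complexity term.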
Yes, I see a question from the chat: what's the definition of the risk of Q? Oh, sorry, I went a bit fast. The risk of Q, the risk of a distribution (of any distribution, for that matter), is just the integrated version; so the expectation of the risk with respect to Q. Thank you.

Yeah. So if you can actually compute the right-hand side of the bound, it gives you a principled way to derive new algorithms, in the sense that you can just optimise the right-hand side over Q. This gives you, say, a Q-star distribution, and then you sample new hypotheses from Q-star. So it's a randomised procedure: you define the distribution, and then you can just sample from the distribution, or look at the expectation of the distribution, or the mode, or the median, or whatever. What it tells you is the optimal distribution to make the bound the tightest, really; so, by using an algorithm sampling from that star distribution, you have the guarantee that the right-hand side of the generalisation bound will be the tightest. So it's an optimisation problem, and there have been a few attempts at trying to solve it over the years: people had mostly been working with Markov chain Monte Carlo techniques and algorithms back in my early days as a student; there's also a great line of contributions using variational inference, including from lots of people in the department, obviously; and mostly what I would call gradient-descent-flavoured methods. They come in a variety of shapes and colours, really, but there are a few techniques out there to try to solve that problem, or at least approximate it.

One remark, because I think it closes the loop in the literature very nicely: a few very classical algorithms in machine learning have all been found to be minimisers of PAC-Bayes bounds. And I think it's remarkable, because those algorithms are actually much more ancient than PAC-Bayes, which is twenty-five years old. For example, SVMs, or exponential weights for that matter, have been found to be minimisers of specific PAC-Bayes bounds, and so this also gives a very nice way to justify the use of those algorithms.

I think I'm running a bit low on time, so maybe I'll skip this. This is a central result in PAC-Bayes on the KL divergence, which gives you an explicit formula for the distribution which actually maximises this quantity. And this is the Gibbs distribution, which is, again, very classical in statistics, and in statistical mechanics in particular. So I'll go very briefly over this. This result is due to Donsker and Varadhan in the 70s.
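For reference, a standard statement of this change-of-measure result reads as follows (a reconstruction; $\phi$ is any measurable function on the hypothesis space for which the left-hand side is finite):

$$\log \mathbb{E}_{h\sim P}\big[e^{\phi(h)}\big] \;=\; \sup_{Q\ll P}\Big\{\mathbb{E}_{h\sim Q}\,\phi(h) \;-\; \mathrm{KL}(Q\,\|\,P)\Big\},$$

with the supremum attained by the Gibbs distribution $\mathrm{d}Q^{*}(h)\propto e^{\phi(h)}\,\mathrm{d}P(h)$.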
322 00:36:33,240 --> 00:36:43,530 And this actual formulation is due to an getting in one of his funding books, really on bank base and the formalisation of that base. 323 00:36:43,530 --> 00:36:48,810 I'll skip the proof. It's very elementary. But I think I think it's quite an elegant profile. 324 00:36:48,810 --> 00:36:57,210 It's like trying to showcase. It's also because it fits on one slice with another case of that many proofs and papers and the trade. 325 00:36:57,210 --> 00:37:03,060 But just to comment on the last line, so you get at the top the statements of that lemer, 326 00:37:03,060 --> 00:37:09,810 if you pick that five function to be some temperature, time is the empirical risk. 327 00:37:09,810 --> 00:37:12,780 And you write actually, you rewrite actually that formula. 328 00:37:12,780 --> 00:37:19,320 You get that this distribution cue lambda, which is the expansion of minus lambda at times ten times the prior. 329 00:37:19,320 --> 00:37:29,790 So basically a GIPS potential gets distribution. Right, so that this distribution is the end of this problem, this constraint problem. 330 00:37:29,790 --> 00:37:31,860 And this seems very reasonable. Right. 331 00:37:31,860 --> 00:37:38,400 Because what you're doing here is minimising the expectation of the empirical risk with respect to a distribution. 332 00:37:38,400 --> 00:37:43,140 Q Plus a discrepancy between Q and P with some temperature parameter. 333 00:37:43,140 --> 00:37:48,960 But the discrepancy with the cap, this is very much resembles what we're doing in regularisation. 334 00:37:48,960 --> 00:37:53,520 In statistics, for example, LSU is not a different interpretation. 335 00:37:53,520 --> 00:37:58,520 It's exactly the kind of of of criterion that you're optimising. 336 00:37:58,520 --> 00:38:02,160 It fits the data plus a regularisation. 337 00:38:02,160 --> 00:38:14,500 So so again, in a way, this discussed back base into into a series of of well-known techniques, well-known results and literature. 338 00:38:14,500 --> 00:38:19,820 And I think this is an interpretation that is quite nice. 339 00:38:19,820 --> 00:38:26,300 All right, so that closes the first part, that's a small primer on what backbeats is. 340 00:38:26,300 --> 00:38:32,870 Again, the message is that backspace is a way to understand generalisation. 341 00:38:32,870 --> 00:38:39,100 It's it's a way to deal with distribution's of a protector's rather than just protector's. 342 00:38:39,100 --> 00:38:46,850 And in as a discuss source, go learning theories all about this high confidence control of the transition. 343 00:38:46,850 --> 00:38:54,270 And what that allows us to do is really to extend this to a series of frameworks. 344 00:38:54,270 --> 00:39:01,850 There's this really old Tsou of frameworks in which factis has been used and has delivered transition balance, 345 00:39:01,850 --> 00:39:11,330 which often are state of the art transition. It can also lead to new algorithms by minimising right and side of of generalisation bounds. 346 00:39:11,330 --> 00:39:18,290 You might think that this would lead to classical posteriors and classical algorithms is not always the case. 347 00:39:18,290 --> 00:39:26,840 So then there's a Trade-Off between what this actually gives you and how achievable and how computable those objects are. 348 00:39:26,840 --> 00:39:31,620 So again, the Trade-Off between theory and algorithms. Right. 
So this couples, as I was saying, a lot of tools from statistics, optimisation, and probability theory. I've been working on PAC-Bayes for ten years now, since I finished my PhD, and it was much less popular back then than it is now. I think one of the explanations for the surge of interest in PAC-Bayes is that, to my knowledge, it is one of the very few, if not the only, ways to address generalisation for neural networks, at least for some architectures of networks; hence the very strong spotlight that PAC-Bayes has received.

All right. What's coming next, in the remainder of my talk, is what we've been up to with PAC-Bayes. Maybe now is a good time to take a very short break and ask if there are any questions or comments.

I have a question. Can you comment on... I'm not very familiar with this formalism, but how does having a random procedure for picking your hypothesis relate to practice? Because it seems that, in the framework, well, you know, you have some kind of search procedure or selection procedure for picking one thing; that's exactly what we do in practice.

Absolutely, that is an excellent point, and this is what I was alluding to when I was talking about the gap between theory and practice. A lot of the results actually hold for randomised predictors; as you said, these are randomised bounds, in the sense that you have that posterior, and then you sample from it, and then you have one specific hypothesis for which you want to assess the generalisation ability. It doesn't seem very sensible in practice, and I don't know anyone who really does that to solve a particular problem. So there's also a whole series of bounds holding for the expectation of the posterior; in that sense, the final predictor is not random: it's the expectation of that distribution which is called the posterior. Does that make sense?

Yeah, OK, thank you.

That's also challenging in some settings, though, because this might be a distribution which is not so easy to sample from, or even to use analytically; even computing the mean of the distribution might be a computational challenge. But I'll touch on this briefly in the following slides.

OK, can I have a very quick follow-up, please? Taking a practical example: you have a neural network, and you use random initialisation, and then you train with gradient descent. What is the prior? Is it the push-forward of the initialisation by gradient descent? So what's the prior?

It could be, yeah, it could be that, absolutely.
It could be pretty much anything, really, because the PAC-Bayes bound will hold whatever the prior. So what you want to do, ultimately, is take the prior which would make the bound the tightest.

All right. OK. All right, thanks.

Well, I have a similar question, if you have time.

Yeah; I'm not quite sure how we're doing on time anyway, but I'm happy to take it.

So what's the meaning of this prior distribution in real life? This is the most confusing part of PAC-Bayes for me, since we can choose it like anything, pretty much.

Yeah. So, OK, let me answer a question with another question. If I go back to, for example, one of the very classical PAC-Bayes bounds: this is a probabilistic statement, but look at this term over here in the square root. Let's assume that I'm in a binary classification problem; this is actually the case for this kind of bound. So this is a number between zero and one; the test error on a binary classification problem is also a number between zero and one. And this term might be something that blows up: if you pick a prior which is very far from the posterior, for example in the KL sense, then this will blow up, and then your bound will be "with probability close to one, the generalisation gap is bounded by ten to the thirty", or something. This is a probabilistic statement which is true, but it's absolutely useless; we then say that the bound is vacuous, in some sense. The past few works on PAC-Bayes have very much been in the direction of non-vacuous bounds, trying to ensure that we have something which is ultimately meaningful. And so the role of the prior, in that sense, is to try to make this complexity term as small as possible. It could be informed by actual prior knowledge, as in Bayesian statistics; it can be just the way you initialise, for example, the weights of your network, or the push-forward we just discussed; it could be pretty much anything like that, and the bound will be valid regardless. What's interesting is to try to optimise for that prior when you want to make the bound tighter and tighter. I hope this answers your question.

OK, I think I'm still confused, but it's better. When we translate this into practice (please tell me to stop if you want): suppose we're trying to analyse a situation where somebody does a specific training procedure, in order to get a guarantee for that procedure. Do we try to find the prior that most represents what they do? Because it seems like what you've given here is a method for finding a good predictor, and a guarantee if you derive it in such a way.
But then you're going to use this to talk about other things that people do in the real world, where the algorithms are not necessarily derived in this way. So how do you match the two?

So that's also a very good point, but I should maybe take a step back and highlight that, even though we call this a prior, it doesn't need to be an actual prior in the Bayesian sense. What I'm saying is: if, for example, your classifiers or predictors are, you know, deep networks, and what you want to do is learn the proper weights of that network, then the prior might correspond to prior knowledge. But what if, I don't know, you have a finite collection of predictors: you have one neural network, you have one decision tree, one random forest, whatever. Then the prior is possibly just led by, I don't know, the cost of using any of those experts, right? So it doesn't correspond to specific knowledge about your problem; it's rather driven by, for example, computational constraints; it could be something like that. So I think what is always quite confusing in PAC-Bayes is that we do call this a prior, even though it might be an object which is very different from priors in Bayesian statistics. I don't know if that answers your question.

I'm just wondering how this bears on practical situations. So how do you put this in a form... If you take the standard Rademacher bound, like just now, and we apply it to a neural network, we calculate the Rademacher complexity of the neural network and then we put it in. So I'm saying: we have some algorithm to understand in the real world, and we have to pick the P to correspond to what it is doing, in order for this statement on the screen to give a generalisation guarantee for that procedure?

So the short answer is no, you don't need to do this. But if you want this to be a fair comparison with the complexity-based bounds, then it's probably a good idea. That's my sense.

OK, thank you. Well, thanks.

All right, so I'm going to be touching, actually, very briefly on some of the recent contributions, because I think I'm running out of time, I'm afraid, but I'll try to wrap up quickly. We have been extremely fortunate to work with all those fantastic people over the past few years, and I'll probably just pick one contribution, because I can't cover all of those, obviously. So I'll probably just briefly touch on the third one, actually: this paper is "Dichotomize and Generalize", which is a PAC-Bayesian attempt at understanding binary activated neural networks. But just to briefly also give some context on the other work:
we had contributions from a theoretical perspective, trying to obtain generalisation bounds with different divergences, for example. There's the first paper, with Pierre Alquier, who's with RIKEN in Japan, where we have generalisation bounds holding for data which are either heavy-tailed or dependent, hence relaxing that i.i.d. assumption, which obviously is extremely useful in statistics but not very realistic in the real world. There's also a line of work towards different types of bounds, either with faster rates (this is the paper with Zakaria Mhammedi and Peter Grünwald) or with different measures of the loss. For example, I've been touching on the fact that the risk is mostly the expectation of the loss, but this might not correspond to situations where you want to optimise a different quantity: in medical drug testing, rather than the expectation, you'd be much more interested in the quantiles of that loss, because you want to avoid at all costs the costly errors leading to the death of patients; the expectation, just taking the average of a distribution, might not be a good idea in those respects. So we have that paper, at NeurIPS, on PAC-Bayes for the conditional value at risk in particular: different ways to measure the performance. We also have contributions on structured prediction, echoing what I was saying earlier about the need to incorporate structure and semantic content into learning, and also this PAC-Bayesian contrastive technique, which appeared last July with Kento Nozawa and Pascal Germain; Kento is also based in Japan.

What I'm highlighting here is the range of contributions that are accessible through PAC-Bayes. Again, the only thing you need in order to use the PAC-Bayes machinery is to be able to define a probability distribution, a measure, on the space of hypotheses. If you can do that, then you can adapt the PAC-Bayes framework straightforwardly.

All right, so let me just take the remaining three or four minutes to give a very brief overview of that one paper. A binary activated neural network is just a very classical feedforward network, fully connected; the only thing is that the activation function is the sign function, so at each node you output either one or minus one. This is a challenge, because it's obviously non-differentiable. The way we address this with PAC-Bayes is with a small trick: in PAC-Bayes, in many papers, you're mostly looking at the expectation of the predictor with respect to the distribution over the weights. This turns out to actually have a closed form.
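As a pointer, the closed form about to be described can be sketched for a single linear unit with sign activation and an isotropic Gaussian distribution over its weight vector (a reconstruction in the spirit of Germain et al., 2009; the notation is assumed, not taken from the slides):

$$\mathbb{E}_{v\sim\mathcal{N}(w,\,I)}\big[\operatorname{sign}(v\cdot x)\big] \;=\; \operatorname{erf}\!\left(\frac{w\cdot x}{\sqrt{2}\,\lVert x\rVert}\right).$$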
459 00:51:19,250 --> 00:51:20,840 And that was the starting point.
460 00:51:20,840 --> 00:51:32,360 So this is a fact pointed out in an earlier paper by Germain and co-authors in 2009, and this little fact here is at the core of that contribution,
461 00:51:32,360 --> 00:51:42,340 because it allows us to write a PAC-Bayesian generalisation bound for that specific architecture. We also use a few tricks in the bound.
462 00:51:42,340 --> 00:51:50,270 So, for example, the divergence term has a closed form if you assume that the prior and the posterior are Gaussian with suitable parameters.
463 00:51:50,270 --> 00:51:59,120 So it basically amounts to optimising over those parameters to try and control the discrepancy between the prior and the posterior.
464 00:51:59,120 --> 00:52:06,500 This is the very simple case of just one layer, no hidden layer really, of the neural network.
465 00:52:06,500 --> 00:52:10,970 But this is just a teaser. This really is the building block in that paper.
466 00:52:10,970 --> 00:52:20,090 And ultimately we were able to derive a full strategy to drive the gradient descent training of the network, and the generalisation bound,
467 00:52:20,090 --> 00:52:24,830 which is right there. So that's the generalisation bound.
468 00:52:24,830 --> 00:52:29,360 So it's a very... Sorry, there are questions in the chat.
469 00:52:29,360 --> 00:52:38,560 What is Q_w? So Q_w is the posterior on the weights w, the posterior distribution.
470 00:52:38,560 --> 00:52:51,690 And so the generalisation bound that we have for that specific architecture of neural network is very much inspired by a bound from Langford, and it basically
471 00:52:51,690 --> 00:53:00,930 tells you that, at first approximation, the expected risk of the network differs from the empirical risk by a margin,
472 00:53:00,930 --> 00:53:05,400 which is the KL divided by n in that context.
473 00:53:05,400 --> 00:53:09,180 So this is very much consistent with the rest of the theory.
474 00:53:09,180 --> 00:53:15,840 What was interesting in that work is that we were able to derive the actual value of that bound in a series of settings.
475 00:53:15,840 --> 00:53:19,950 Let me just briefly zoom in, because I want to leave some time for questions.
476 00:53:19,950 --> 00:53:30,510 I realise we're closing in on the hour. So this is a sample of the few numerical experiments that we did with that particular architecture.
477 00:53:30,510 --> 00:53:39,770 And so if you look at this column, for example, let me zoom in, so this is for variants of MNIST as a baseline.
478 00:53:39,770 --> 00:53:46,160 So this is the PAC-Bayes-inspired network with the binary activation function, where what you
479 00:53:46,160 --> 00:53:51,380 get in the first column is the empirical risk on the different instances of the data sets.
480 00:53:51,380 --> 00:53:54,710 This is the test error, and this is the value of the bound.
481 00:53:54,710 --> 00:54:03,560 And I was saying earlier that we very much want generalisation bounds to be the sort of guarantee which translates into working tools for practitioners.
482 00:54:03,560 --> 00:54:07,070 This is very much what we hope to achieve as a practitioner.
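To illustrate how such a numerical risk certificate can be produced, here is a minimal sketch using one standard form of the PAC-Bayes bound, the Langford–Seeger kl-form; the empirical risk, KL value, sample size and confidence level below are made-up inputs, not the paper's numbers.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp_risk, budget):
    """Largest p with kl(emp_risk || p) <= budget, found by bisection."""
    lo, hi = emp_risk, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_risk, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def risk_certificate(emp_risk, kl_term, n, delta=0.01):
    """Langford-Seeger-style certificate: with probability >= 1 - delta,
    true risk <= kl_inverse(emp_risk, (KL(Q||P) + ln(2 sqrt(n)/delta)) / n).
    Note that delta enters only through a logarithm, so asking for 99 or
    even 99.9 percent confidence inflates the bound rather mildly."""
    budget = (kl_term + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse(emp_risk, budget)

# Illustrative inputs only: 1.8% empirical error, an assumed KL of 50 nats,
# 60,000 training points, 99% confidence.
print(risk_certificate(emp_risk=0.018, kl_term=50.0, n=60_000, delta=0.01))
```

Run as is, this returns a certified error of roughly 2.5 percent, which is exactly the flavour of statement described next.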
483 00:54:07,070 --> 00:54:15,920 If you enter the empirical error achieved by some algorithm on some data set, what PAC-Bayes gives you in return, through,
484 00:54:15,920 --> 00:54:19,730 of course, a lot of different steps that I will not elaborate too much on,
485 00:54:19,730 --> 00:54:27,110 is the value of the bound, which is an upper bound holding with a given probability, with a confidence level.
486 00:54:27,110 --> 00:54:36,680 So in this case, it's 99 percent. So with 99 percent confidence on this particular problem, I compute the empirical error, which is one point eight percent,
487 00:54:36,680 --> 00:54:45,320 and PAC-Bayes theory tells you: all right, if you collect any new instances from that data distribution, the error will never be more than three percent.
488 00:54:45,320 --> 00:54:54,080 And this holds with probability 99 percent. And there is also a very recent paper from my good colleagues and friends at UCL,
489 00:54:54,080 --> 00:54:58,910 Omar, John and María, talking about certificates for machine learning.
490 00:54:58,910 --> 00:55:11,080 And this very much is the idea behind those kinds of statements. Judith is asking, what was delta? So, delta is the temperature?
491 00:55:11,080 --> 00:55:15,700 That's what I was guessing you were asking. Oh, no, sorry, you're asking about this delta, right:
492 00:55:15,700 --> 00:55:24,370 so, the confidence parameter. And the confidence parameter is the probability with which the bound holds.
493 00:55:24,370 --> 00:55:31,390 So delta is typically something very close to zero, and the closer to zero it is, the more likely the bound is to blow up, obviously.
494 00:55:31,390 --> 00:55:49,360 So there is that trade-off. All right, I think I'll wrap up here, just to give you an attempt at summarising these different lines of research.
495 00:55:49,360 --> 00:55:56,630 So as you understood, the quest was very much for generalisation guarantees; that's about half of the contributions from my group, with PAC-Bayes.
496 00:55:56,630 --> 00:56:06,010 So this is the tip of the iceberg that you see now. And the few directions that I have been extremely interested in over the past few years are
497 00:56:06,010 --> 00:56:15,910 providing either generic bounds, in the sense that they would hold under a less restrictive set of assumptions than, say, i.i.d. data, or something like that,
498 00:56:15,910 --> 00:56:23,860 and also tighter bounds for specific algorithms. So if you pick a specific algorithm, it's reasonable to hope that you will have tighter rates,
499 00:56:23,860 --> 00:56:28,570 for example, or tighter constants in the bound, for that particular algorithm.
500 00:56:28,570 --> 00:56:34,810 I also briefly mentioned the new measures of performance, such as the conditional value at risk or contrastive losses (a short sketch follows at the end of these remarks).
501 00:56:34,810 --> 00:56:43,330 And again, the interplay and coupling between theory and algorithms is very much at the centre of all of our team's work.
502 00:56:43,330 --> 00:56:48,790 Last but not least, the application that we foresee is very much towards
503 00:56:48,790 --> 00:56:55,540 what I like to call sustainable machine learning, in the sense that if you have a guarantee that, with that many data points,
504 00:56:55,540 --> 00:57:04,690 the best you could achieve as a test error (or an estimate of the test error with probability 99 percent, for example) is this, then
505 00:57:04,690 --> 00:57:11,320 it is, I think, a very valuable guideline for machine learning practitioners and users of algorithms.
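To unpack the conditional value at risk mentioned in this summary, here is a minimal sketch, not from the talk: the empirical CVaR at level alpha is simply the average of the worst (1 - alpha) fraction of observed losses, the tail-sensitive alternative to the plain expectation discussed for medical testing. The loss distribution below is synthetic.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) fraction of losses: the quantity
    that CVaR-style PAC-Bayes bounds control instead of the expectation."""
    losses = np.sort(np.asarray(losses))
    k = int(np.ceil((1 - alpha) * len(losses)))
    return losses[-k:].mean()

rng = np.random.default_rng(0)
losses = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavy right tail

print(losses.mean())           # the usual risk: expectation of the loss
print(empirical_cvar(losses))  # focuses on the worst 5% of outcomes
```

On a heavy-tailed loss like this one, the CVaR is several times larger than the mean, which is precisely why averaging can hide the costly errors one wants to avoid.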
506 00:57:11,320 --> 00:57:15,040 So I think I'll close up here. Thank you very much. Right on time:
507 00:57:15,040 --> 00:57:19,930 it's four thirty. And, yeah, I'd welcome any questions if you have any more.
508 00:57:19,930 --> 00:57:25,100 Thank you very much, Ben, for the very nice talk.
509 00:57:25,100 --> 00:57:30,880 So we have a little time for some more quick questions from the audience.
510 00:57:30,880 --> 00:57:33,040 Go ahead.
511 00:57:33,040 --> 00:57:43,630 So at the beginning, you were explaining, sort of, the generalisation of deep nets as a strange phenomenon, to make a link with the double descent story.
512 00:57:43,630 --> 00:57:49,570 Yeah. Do PAC-Bayes bounds give some insight on the double descent or not?
513 00:57:49,570 --> 00:57:55,270 So this is very much the hope in an ongoing work, which is not yet public, I'm afraid.
514 00:57:55,270 --> 00:58:02,970 But what we're trying to assess is: if you do augment the complexity of the class,
515 00:58:02,970 --> 00:58:11,680 so if you do add layers and layers to your network, do you foresee at some point a decrease in the bound?
516 00:58:11,680 --> 00:58:24,530 This is very much what we're working on. And the answer is yes. But I'm afraid I don't have any more material to share at this point.
517 00:58:24,530 --> 00:58:32,120 I see there is a question about... ah, no, I don't see the question.
518 00:58:32,120 --> 00:58:37,970 You had your hand raised, but it was lowered, I think.
519 00:58:37,970 --> 00:58:42,410 Thank you. OK. I think we have already run out of time.
520 00:58:42,410 --> 00:58:47,540 So let's end this seminar. So thank you, everyone, for coming in.
521 00:58:47,540 --> 00:58:54,770 So let's thank the speaker again. Thanks, everyone.
522 00:58:54,770 --> 00:59:00,020 And I should also mention that the slides will be on my web page very shortly,
523 00:59:00,020 --> 00:59:07,036 in a few minutes actually, and feel absolutely free to reach out if you have questions or comments.