Our speaker today is Professor van de Geer. Sara has previously held positions in Bristol, Utrecht, Toulouse and Leiden, where she received a PhD in 1987. She is an expert on empirical processes, having written the book Empirical Processes in M-Estimation, which I used quite a lot during my master's thesis (thank you for that), and she has worked recently on lots of things relating to high-dimensional data. So we're very excited to hear about what she's going to talk about today.

For questions, if you want to ask a question, there are two ways you can do it. The first is to raise your hand, and I will interrupt Sara at a convenient point so you can ask your question. Or, if you prefer, you can type your question in the chat, so you won't appear on the recording, and I can read the question out to Sara for you. OK, excellent. All right, take it away, Sara.

Thank you. Thank you very much for this nice introduction and thank you very much for inviting me. So let me start screen sharing. This one. It's already full screen, right? Yeah, good.

So in my talk, what I will address is the following. Many algorithms nowadays, machine learning algorithms for instance, report zero training error but still a good test error. I want to contribute somehow to an understanding of this phenomenon, and I will look at classification problems. In classification problems, if you have a small test error, it must mean that the Bayes error is small, because that is the unavoidable error that you always make if you believe in a model. So the test error can only be small if the Bayes error is small, and that is the context I will look at.

Then I will look at algorithms with zero training error, actually just three algorithms, and zero training error means that they interpolate the data. If you interpolate the data, that is only possible if you have enough flexibility to do that. So I will look at the high-dimensional situation, a regression model with very many variables. The main issue will be the AdaBoost algorithm, which you might know, which has been around for a while, from Freund and Schapire, and I will try to explain its success a little bit using a max margin approximation.

OK, so this is the title again, and this is joint work with Geoffrey Chinot, Felix Kuchelmeister and Matthias Löffler. Geoffrey and Matthias are both postdocs at our institute, and Felix is a PhD student. I work together with them and they are a great team.
And so much, or maybe everything, of what I am going to tell you is their hard work.

OK. So here is a little bit of an overview. I will look at algorithms with zero training error, which means that they interpolate the data, and I am going to concentrate on the special case where the final interpolator is the one with the smallest L1 norm, because you can interpolate data in very many ways. You can also look at interpolation with the smallest L2 norm, but that is a completely different situation. So my view will be more like the sparsity view, where you look for sparsity. And if you do that, you of course want to see how it relates to regularisation instead of interpolation. In a high-dimensional model you either interpolate or you do not, and in the latter case you do regularisation: you introduce some kind of penalty or regularisation term. I will look at that first, and it is quite easy for that case to prove some bounds, and they will form the benchmark for the interpolating case. Then I will show you some rates of convergence, which I can then compare with this benchmark.

So here is the setup. It is going to be theoretical work where I have a given design matrix X with n rows and p columns. So n is the number of observations, p is the number of variables, and p, the number of variables, is assumed to be much larger than n; that is the high-dimensional component. Then you can think of a regression model with an unknown vector of coefficients, although we will actually look at classification. The scale is then not identified, so you can assume without loss of generality that the vector of coefficients has L2 norm equal to one; that is just a normalisation assumption. The sparsity assumption is that the vector of coefficients has only a few non-zero entries, and if I count the non-zero entries I get s, the sparsity.

Then there is a noise vector. The noise is normalised in L2, and I call sigma the noise level; that is the quantity we will think of as small. In this setup, which is more or less also like in engineering, the noise is not necessarily independent of X. It can be adversarial noise, or it can even be a deterministic quantity. But as a first step, if you are a statistician, you may think of it as just random noise. And that sigma appears here.
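As a concrete picture of this setup, here is a minimal simulation sketch in Python. The particular values of n, p and s, and the independent Gaussian noise, are my own illustrative assumptions; the talk explicitly allows adversarial or deterministic noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): high-dimensional regime p >> n.
n, p, s = 200, 5000, 5

# Gaussian design: i.i.d. standard normal entries.
X = rng.standard_normal((n, p))

# s-sparse coefficient vector, normalised to have L2 norm one.
beta_star = np.zeros(p)
beta_star[:s] = rng.standard_normal(s)
beta_star /= np.linalg.norm(beta_star)

# Noise level sigma; here independent Gaussian noise (an assumption,
# since the talk also allows adversarial or deterministic noise).
sigma = 0.1
eps = sigma * rng.standard_normal(n)

# In the classification setting we only observe the signs of the noisy regression.
y = np.sign(X @ beta_star + eps)
```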
OK, and then the sparsity is assumed to be small in the usual sense: s, the number of active variables, should be smaller than the number of observations, and then there is a log term in the number of variables p, because you do not know a priori which coefficients are non-zero.

So here is the notation: the infinity norm, with an infinity there; the L2 norm; the L1 norm, which is just the sum of the absolute values; and the "L0 norm", which is not a norm, but counts the number of entries which are not zero.

OK. So these are the two programmes I want to look at. You observe the design matrix, and you have this regression function. So let's say it looks like this, just a regression function plus noise, but you only observe the sign. So you do not observe the regression function itself, and that is like a classification problem: the response variable is binary, plus or minus one.

OK. The interpolator developed by Plan and Vershynin, so I call it the Plan-Vershynin estimator, is the one which says: I am going to interpolate the signs which I observe, and I use the interpolator with the smallest L1 norm. That is like one-bit compressed sensing, except that we are looking at a case where there is noise. Compressed sensing is where you observe X beta star and you want to recover the coefficients; one-bit compressed sensing is where you compress the real-valued variables even further into a binary variable, so it is only one bit. So it has a compression motivation, but you can also think of it as just a classification problem. Now, when you minimise the L1 norm, the scale is of course not identified, so we now have to use some normalisation, and that is the empirical norm of X b: X b is a vector in n-dimensional space, these are the entries of that n-dimensional vector, and we require its empirical norm to be equal to one.

OK, and the max margin interpolator is the same situation: you observe X and the sign of the regression. I look at the margin, which is y_i times x_i transpose b, I look at the minimal value over the observations, and I want that to be maximal over all vectors b with L1 norm less than or equal to one. So that is again an interpolator, because the sign of y_i and the sign of x_i transpose b will certainly be the same: since you want the minimal value to be maximal, you will certainly have that y_i times x_i transpose b is positive.
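To make the max L1-margin programme concrete, here is a minimal linear-programming sketch (my own illustration; scipy is assumed to be available and the function name is hypothetical). It maximises the smallest margin t subject to the L1-ball constraint by splitting b into positive and negative parts. The Plan-Vershynin interpolator is a similar L1 minimisation, with the empirical-norm normalisation described above in place of the margin constraint.

```python
import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    """Solve  max_{b,t} t   s.t.  y_i x_i^T b >= t for all i,  ||b||_1 <= 1."""
    n, p = X.shape
    M = y[:, None] * X                            # rows are y_i * x_i
    # Variables: b_plus (p), b_minus (p), t (1);  b = b_plus - b_minus.
    c = np.concatenate([np.zeros(2 * p), [-1.0]])  # linprog minimises, so use -t
    # Margin constraints:  -y_i x_i^T (b_plus - b_minus) + t <= 0.
    A_margin = np.hstack([-M, M, np.ones((n, 1))])
    # L1 constraint:  sum(b_plus) + sum(b_minus) <= 1.
    A_l1 = np.concatenate([np.ones(2 * p), [0.0]])[None, :]
    res = linprog(c,
                  A_ub=np.vstack([A_margin, A_l1]),
                  b_ub=np.concatenate([np.zeros(n), [1.0]]),
                  bounds=[(0, None)] * (2 * p) + [(None, None)],
                  method="highs")
    z = res.x
    return z[:p] - z[p:2 * p], z[-1]               # (b_hat, achieved margin)

# Example usage with the simulated (X, y) from the sketch above
# (for large p this LP gets big, so a smaller example is advisable):
# b_hat, margin = max_l1_margin(X, y)
```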
So those are the two interpolators, and then we want to obtain bounds in L2. So the true vector of coefficients has L2 norm equal to one because of the normalisation, and my estimator I also normalise to have L2 norm equal to one, and then I want to find out how close they are. OK, and I am really going to look at this compressed sensing type of situation, where the design matrix consists of i.i.d. standard Gaussian random variables, which is maybe a bit of a disappointing assumption, but at least it is a starting point; I do not know how to generalise it yet. And this is the notation, "in probability of order at most", just so that things do not get too complicated. OK.

OK, so I am going to look at theory. There is some theory about one-bit compressed sensing and this Plan-Vershynin estimator; for the max margin interpolator I am not aware of any theory, and it is related to the AdaBoost algorithm. So our aim is to get results for AdaBoost via this max margin interpolator.

So let us look at some basic facts, which I am just showing you because I think they are kind of nice. If you look at vectors with L2 norm equal to one, then the expectation of the product of their signs (x transpose b is just a linear combination of Gaussians, and x transpose b' is also a linear combination of Gaussians) is, up to the factor two over pi, the arcsine of the inner product; this is called Grothendieck's identity. So the probability that the signs are not equal, which you can think of as the probability of making a misclassification, is the arccosine of the inner product between the two vectors, divided by pi. And that is not hard to see if you think about it a little bit. You can think of it as a two-dimensional situation, although we look at higher dimensions: you have two vectors with norm one, so they lie on the sphere, and the geodesic distance is just the distance along the sphere. That geodesic distance, divided by pi, is the probability of making a misclassification error for those two vectors.

OK? So what I am going to do now is just say: for the classification problem you have, just like for a regression problem, a splitting up into the unavoidable error, the sigma squared term, and the error due to the estimation of the unknown coefficients. So let us do that. You have this inner product between two vectors; if they have norm equal to one, it is one minus half the squared distance between the two vectors, which is quite obvious; that is where the one half comes from.
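As a quick numerical sanity check of the arccosine formula just described (a small Monte Carlo added purely for illustration; the dimensions and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_mc = 50, 200_000

# Two fixed unit vectors.
b1 = rng.standard_normal(p); b1 /= np.linalg.norm(b1)
b2 = rng.standard_normal(p); b2 /= np.linalg.norm(b2)

# Gaussian feature vectors.
X = rng.standard_normal((n_mc, p))

# Empirical probability that the two linear classifiers disagree.
disagree = np.mean(np.sign(X @ b1) != np.sign(X @ b2))

# Theoretical value: arccosine of the inner product, divided by pi
# (the geodesic distance on the sphere, rescaled).
theory = np.arccos(b1 @ b2) / np.pi

print(disagree, theory)   # the two numbers should be close
```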
So the inner product is related to one minus the squared distance in this way. So if you look at the probability of a misclassification, it is just this arccosine of the inner product, which is one minus half the squared distance. And then you see that if you do a one-term Taylor expansion, the misclassification probability is more or less the L2 distance (divided by pi) when the two vectors are close together, which is clear from this picture: the geodesic distance is along the curve, and the straight line is not much different.

OK. Now suppose, again just for one observation, say one observation from the test set, you have the sign of the regression and you have the noise with some variance. Then the Bayes error is the probability that the sign of the noisy regression is not the sign of the regression without noise. You use Grothendieck's identity again and you see it is an arccosine of this quantity, which, by doing just a one-term Taylor expansion, is approximately equal to a constant times sigma, the noise level.

OK. So the probability of making a misclassification error if you use b as your classifier is just this arccosine expression, which you can then split up into a term involving the unavoidable Bayes error, the sigma squared, and a term due to the estimation, that is, due to making an error in estimating beta star. So this is exactly like regression: for regression you also have the excess risk, which is this L2 error, and then sigma squared, the unavoidable error.

OK, now I want to show a result for the regularised estimator, and for that I need some identities which are very straightforward. First of all, if you look at the expectation of the inner product between the sign and this linear combination, E of y times x transpose b, it is again the inner product of the true beta star with b, and there is some coefficient in front of it. So if I put in here the difference between the vector b (b is an arbitrary vector here) and the true vector beta star, and use that beta star in the inner product with itself is one, I get this expression, which is again one minus the inner product, which is again half the squared L2 distance. So if you want to estimate the vector of coefficients beta star, this makes a good loss function, or rather risk function, because it is clearly minimised at b equal to beta star.
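For reference, here are the identities used in this part written out. The last display assumes independent N(0, sigma^2) noise, which is a simplification on my part; the talk also allows adversarial noise.

```latex
% For unit vectors b, b' and a standard Gaussian feature vector x:
\[
\mathbb{E}\bigl[\operatorname{sign}(x^\top b)\,\operatorname{sign}(x^\top b')\bigr]
   = \tfrac{2}{\pi}\arcsin\langle b,b'\rangle
   \qquad\text{(Grothendieck's identity)},
\]
\[
\mathbb{P}\bigl(\operatorname{sign}(x^\top b)\neq\operatorname{sign}(x^\top b')\bigr)
   = \tfrac{1}{\pi}\arccos\langle b,b'\rangle ,
\qquad
\langle b,b'\rangle = 1-\tfrac12\|b-b'\|_2^2 ,
\]
% so for nearby unit vectors the misclassification probability is roughly
% \(\|b-b'\|_2/\pi\).  With independent noise \(\varepsilon\sim N(0,\sigma^2)\)
% and \(y=\operatorname{sign}(x^\top\beta^*+\varepsilon)\):
\[
\mathbb{P}\bigl(y\neq\operatorname{sign}(x^\top\beta^*)\bigr)
   = \tfrac{1}{\pi}\arccos\Bigl(\tfrac{1}{\sqrt{1+\sigma^2}}\Bigr)
   \approx \tfrac{\sigma}{\pi}\quad(\sigma\text{ small}),
\qquad
\mathbb{E}\bigl[y\,x^\top b\bigr]
   = \sqrt{\tfrac{2}{\pi}}\,\frac{\langle\beta^*,b\rangle}{\sqrt{1+\sigma^2}} .
\]
```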
So what you do to estimate beta star is just to replace this expectation by an average and optimise that. OK, and then you do a regularisation to make sure that you can deal with this high-dimensional situation. So we do that; there is a minus sign missing here, so instead of a minimisation we do a maximisation, a maximisation of the average margin, minus the regularisation term, the L1 norm of the vector of coefficients, and you still want the L2 norm to be one. So this is just like the lasso; it really is just the lasso. If you observed the regression function you could do the lasso, just least squares with a penalty on the L1 norm; that is the lasso, and this is not much different. OK.

OK, now let us look at this margin term. So, the average margin minus its expectation: this is a vector of length p, so a very long vector, I subtract its expectation and then I take the maximum. So I take the maximum of p random variables. They are independent, so you can more or less show that it behaves like the square root of log p over n, just like a maximum of Gaussians: these entries are more or less Gaussian, a maximum of p of them is logarithmic in p, there is a square root, and there is an n because of the normalisation by one over n.

OK. So if you compute this regularised estimator, beta hat is the regularised estimator, its L2 convergence rate is, more or less, the square root of log p over n, times the L1 norm of beta star, and then the noise plays a role here: a factor one plus sigma squared. And so this is my benchmark for the regularised case.

And just so that you see it is not very deep, let me look at the proof. The estimator is defined by maximising this quantity, so the criterion at the estimator will be larger than the same expression at the true vector beta star, right? Then I just put a minus sign in front so that the inequality switches, and I put the penalty terms together; and there comes somebody visiting me, the cat. OK. So this is just rewriting the fact that you are maximising something. And then I subtract and add expectations; let us see. So this is sort of the theoretical risk, and we just saw that it is equal to this expectation. And then I replace the expectation by an average, and I have to correct for that. And here I can use the dual norm inequality: the inner product between two vectors can be bounded by the infinity norm times the L1 norm. So you get this lambda hat, this infinity norm, and this is the L1 norm of beta hat minus beta star.
And we are left over with the term that was already there, and then we use this arccosine inequality, which we just saw on the previous page, and then we use the triangle inequality. Yeah, so there is nothing deep going on.

OK, yes, Judith, you have a question? (I just wasn't sure what the lambda hat was, but you said it afterwards, so I have just looked at it again.) It is a good point; it goes quickly and I leapt ahead. It is like the maximum of p more or less Gaussian random variables. So that is a random variable for which you easily have some kind of concentration inequality, so with large probability it will be of order square root of log p over n, just like for the lasso, your usual thing. Yeah. So that means, in this inequality here, if you take your tuning parameter not too small but also not too large, it gives you a bound with high probability. Yeah, OK.

So it looks very much like the lasso. You can refine it: if you wish to exploit sparsity, so suppose there are only a few non-zero coefficients in the true vector and the number of them is little s, then the rate of convergence becomes lambda squared times s, plus something with the noise. And lambda squared is, again, of order log p over n, so you get something like the number of active variables divided by the number of observations, a log term, and one plus sigma squared. OK, so the proof is, again, very simple, but I do not think I have that much time; this is the proof, and it is really just being a little bit more careful with the triangle inequality, that is all.

So this is the benchmark: the misclassification error of this regularised estimator consists of two terms, the unavoidable Bayes risk and the term due to estimating beta star, the excess risk. And the excess risk is the square root of log p over n times the L1 norm if the true vector is only L1-sparse, and it can be improved to s log p over n if the true vector is L0-sparse, which means that it has only s non-zero coefficients.

And you see what is happening here if there is no noise: if there is no noise, the error is still not zero, which is unlike the lasso. That is because you only observe the sign, you do not observe the complete regression function, so there is a price to pay: you cannot exactly recover beta star. I am not sure whether there are lower bounds here, but it is quite clear that you cannot recover it exactly, so there is still an error even if you do not have noise.
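As an aside, the regularised benchmark estimator above (maximise the average margin minus an L1 penalty, with the L2 norm constrained to the unit ball; reading the constraint as ||b||_2 <= 1 is my own interpretation of the normalisation) is a convex programme, so it is easy to experiment with. A minimal sketch, assuming cvxpy is available and with a hypothetical function name:

```python
import numpy as np
import cvxpy as cp

def benchmark_estimator(X, y, lam):
    """Maximise (1/n) * sum_i y_i x_i^T b - lam * ||b||_1  s.t.  ||b||_2 <= 1."""
    n, p = X.shape
    b = cp.Variable(p)
    avg_margin = cp.sum(cp.multiply(y, X @ b)) / n
    problem = cp.Problem(cp.Maximize(avg_margin - lam * cp.norm1(b)),
                         [cp.norm(b, 2) <= 1])
    problem.solve()
    return b.value

# Tuning parameter of the order suggested by the concentration bound above:
# lam = 2.0 * np.sqrt(np.log(X.shape[1]) / X.shape[0])
# b_hat = benchmark_estimator(X, y, lam)
```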
Coming back to the benchmark bound: if the noise is small, you have the unavoidable noise here, and this is then all of the same order, because one plus the noise squared is of the same order as just one, so you can sort of cancel this term. So the noise does not really hurt, in terms of the order, at least if it is small, and that is the case we are interested in: small noise, because otherwise we will never have a small test error. OK, so that is my benchmark.

So now let us compare that with interpolation, the Plan-Vershynin interpolator and the max margin interpolator. And to study them we will use results for basis pursuit, which is going to be noisy basis pursuit. So what is noisy basis pursuit? In noisy basis pursuit you do observe the regression, the real-valued response variable, so you just have a regression model, and then you interpolate the data and you take the interpolator with minimal L1 norm. There are theoretical results for this in a paper from 2010 and also in more recent work. What we derive is the same as in that 2010 paper, but our techniques are quite different: that paper comes more from a different community, I would say machine learning or computer science, and our language is more statistical.

So that is noisy basis pursuit, and afterwards we will consider the case where you only observe the sign of the regression function, and we look at the interpolator of Plan and Vershynin and at the max margin interpolator. So that is what is to come. In all three cases I will need a good bound for the L1 norm of the estimator, and that is kind of clear, because once you have a good bound for the L1 norm, you can use results from the theory of empirical processes indexed by coefficients bounded in L1, and that gives you bounds for entropies and so on; then you are in business. I will explain that in more detail later as well.

OK, so let us look first at basis pursuit, where you do observe the regression function with noise, you interpolate it, and you choose the interpolator with the smallest L1 norm. So what is a good bound for the estimator? Maybe I should go back. The point here is that if you have noisy observations, the true beta star is quite clearly not interpolating, because of the noise. So somehow we want to interpolate the noise, to make sure that we interpolate the data.
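Noisy basis pursuit itself is again a standard convex programme. Here is a minimal sketch (hypothetical function name, cvxpy assumed available); note that it needs the real-valued responses, which are exactly what is not observed in the one-bit setting, so this is only the auxiliary tool used to control L1 norms:

```python
import cvxpy as cp

def basis_pursuit(X, r):
    """Noisy basis pursuit: min ||b||_1  s.t.  X b = r  (interpolate the data)."""
    n, p = X.shape
    b = cp.Variable(p)
    problem = cp.Problem(cp.Minimize(cp.norm1(b)), [X @ b == r])
    problem.solve()
    return b.value

# With the real-valued (noisy) responses from the simulation sketch above:
# b_bp = basis_pursuit(X, X @ beta_star + eps)
```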
And this is more or less what the result says: a good bound for the L1 norm of the estimator is the L1 norm of the truth plus a term which is the noise level times the square root of n divided by log p. The reason for this more or less follows from the KKT conditions, or maybe from the dual formulation. If you do that, and I will not give the details, it says that the L1 norm of the estimator is bounded by the L1 norm of the truth plus a term involving the L2 norm of the noise divided by something, and you want to divide by something large so that this term is small. That something is a maximum of Gaussian random variables, well, of linear combinations, a different one each time, so you have a maximum over p Gaussian random variables, and then you minimise again, so it is really the minimum of a maximum.

And there comes the assumption that the sample size should be small compared to the number of variables, because we want this quantity to be large. If you look at extreme value theory, if you have p independent Gaussian random variables you have an extreme value limiting distribution, and I think it is the Gumbel limiting distribution, so we know this maximum behaves like the square root of log p. But it is a little bit more involved because we have, again, an infimum here, so this is a lot of technical work, and we want it to be non-asymptotic. We get that this infimum behaves indeed like the square root of log p if the sample size is small compared to p, for instance if the sample size is smaller than p divided by log p, something like that. So the high dimensionality is really playing a crucial role here; otherwise we will not have the log p. So this term is the noise level times the square root of n over log p.

Now that we have this bound for the L1 norm, we can use empirical process theory to get bounds for quadratic forms, where the L1 norm of b plays a role. You just take the empirical average minus the theoretical average of the squared linear combinations, with b normalised to L2 norm one, and you look at how different these quadratic forms are from their expectation. Then the L1 norm plays a role, and we just showed that this L1 norm behaves like the square root of n divided by log p, up to the noise level, so that cancels very nicely. That gives you that these quadratic forms are close to their expectation, with this sigma times square root of n over log p term occurring there. So that finally gives this result, that the estimation error of the noisy basis pursuit estimator is bounded by
some universal constant times sigma times the square root of n over log p. And of course, if there is no noise, that is zero: so in basis pursuit, in the noiseless case, you have exact recovery. In the noisy variant you now still have some error, but if the noise is small it is still reasonable.

So now let us look at one-bit compressed sensing. Now we only observe the sign of the regression function; we minimise the L1 norm, as in basis pursuit, with the interpolation constraint and with the normalisation. And then the question is, again, what is a good bound for the L1 norm of the estimator? And again you see that the true vector beta star is not interpolating, because of the noise. So, again, we have to interpolate the noise. And if we do that, the idea is: if you take your true vector beta star plus something which interpolates the noise, then you have a candidate, because then you have an interpolator. And the L1 norm of this candidate will be larger than that of the minimiser: if I have a candidate here which I can control, then the minimiser will have L1 norm smaller than this candidate. OK.

OK, so why not interpolate the noise. So I interpolate the noise here, and I take, as we just had, the L1 minimiser which interpolates the noise. So I know, from the previous result for basis pursuit, that I have the L1 norm of this interpolator under control. And moreover, if I add it to beta star, the true vector, I have an interpolator with the right signs. So that is a candidate, and I know that the L1 norm of the estimator is bounded by the L1 norm of this candidate. Oh yes, I have to normalise it, but that does not really hurt very much. So then I am done; then I have a bound for the L1 norm. So that is the idea.

So I have my true vector, I have my noise, and I interpolate the noise with some other vector. Then X beta star plus the noise is the same as X applied to the sum, by definition, because I interpolated the noise. So I have here a vector which gives the correct signs, this one interpolates the signs, so the L1 norm of beta star plus this correction will be larger than that of my Plan-Vershynin estimator, which is the minimiser. So then, again, you have a bound for the L1 norm. I am not going to show all the technical details, but then you can bound the supremum of an empirical process indexed by functions with coefficients whose L1 norm is bounded by some quantity.
And then we get this result. The Plan-Vershynin estimator, which is interpolating, satisfies, in L2: if you have L0 sparsity, the rate is, as before, s log p over n, and now, where the benchmark had a factor one plus sigma, here the noise level sigma is simply added. And if you only have L1 sparsity, it is again exactly like the benchmark estimator, but instead of multiplying by one plus sigma you add the noise level sigma. So it means that if the noise is, say, small but still larger than this estimation term, you still feel it. For the benchmark estimator, if the noise is small you hardly feel it; you already have the Bayes noise, and that is the main part. Here, it can happen that the noise term is larger than the error due to estimation. So it is a less beautiful result, which is of course to be expected, because you do interpolation instead of regularisation, so it is not as good.

OK. So how much time do I have? What do you say, about ten minutes, five minutes? OK. I will skip this.

So now let us go to AdaBoost. We observe, as before, X and the signs, and the AdaBoost algorithm is as follows. You first take a vector of weights, which lies on the simplex. Well, AdaBoost is stated for weak learners, which are defined differently; to mimic that, we normalise our design so that all design entries are between minus one and one. You start with an initialisation, and then you go in a certain direction (do not ask me too much, this is just to see it quickly); you go in a certain direction, and then you take a step size, which here is the one inspired by the exponential loss. Then you update your vector of coefficients with a learning rate eta, and then you update the weights, and you keep on doing this until a stopping time, capital T. OK?

And then there are results which say that if the learning rate goes to zero and the number of iterations goes to infinity, the AdaBoost algorithm converges to the max margin classifier, which maximises the minimal margin. And that is, up to scaling, the minimum L1 norm interpolator: you rescale to make sure that the margin is always at least one, and then you look for a solution with minimal L1 norm. So there are some theoretical results for the relation between AdaBoost and this max margin classifier when the number of observations is of the same order as the number of variables.
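For readers who want to see the kind of iteration being described, here is a schematic AdaBoost-style sketch that uses the coordinates of the rescaled design as weak learners, with a learning rate eta and T iterations. The exact direction and step-size rules in the talk may differ; this is only an illustration of the general recipe, and all names are hypothetical.

```python
import numpy as np

def adaboost_coordinates(X, y, eta=0.1, T=5000):
    """Schematic coordinate-wise AdaBoost with shrinkage (illustration only)."""
    n, p = X.shape
    Xs = X / np.max(np.abs(X))           # rescale so all design entries lie in [-1, 1]
    beta = np.zeros(p)                   # initialisation
    w = np.full(n, 1.0 / n)              # weights on the simplex
    for _ in range(T):
        corr = Xs.T @ (w * y)            # weighted correlations with the labels
        j = np.argmax(np.abs(corr))      # direction: pick the best coordinate
        r = np.clip(corr[j], -0.999, 0.999)
        alpha = 0.5 * np.log((1 + r) / (1 - r))    # exponential-loss step size
        beta[j] += eta * alpha                     # shrunken update of the coefficients
        w *= np.exp(-eta * alpha * y * Xs[:, j])   # reweight the observations
        w /= w.sum()
    return beta

# With a small learning rate eta and many iterations T, the margins produced by
# such iterations are expected to approach those of the max L1-margin solution
# (after rescaling), which is the connection used in the talk.
```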
That regime is not our case, though: we have that the number of variables is much larger than the number of observations, and that is really a crucial assumption in our theory.

So let us look at this max margin interpolator, which is the same as this minimum L1 norm estimator. What is a good bound for the L1 norm of this estimator? Now, if you look at this carefully, you would ask: does it somehow mimic the behaviour of the true vector beta star? So insert the true vector beta star, and think of the case where there is no noise, to simplify. If you insert beta star here, each margin is just the absolute value of a standard Gaussian random variable, and then you take the minimum, the minimum of n standard Gaussian random variables in absolute value. And a minimum of n standard Gaussians in absolute value behaves like one over n, so it is very, very small. But you want it to be bigger than one, so beta star is in no way a good candidate for solving this problem; it is completely out of scope. Of course you can rescale it, but then its L1 norm is going to be very large. So this is a very strange thing: you are doing something where beta star itself is not a good solution of the problem, so why would this give a good estimator?

OK. Um, yeah, maybe if there is time I can write it down. Again, we want a bound for the L1 norm of the estimator. You can again, as for basis pursuit, use the dual norm formulation, and then we get a lower bound; I will not go into details there. But let us look at the upper bound, and let me just briefly explain the idea: what we use is, again, this basis pursuit idea. So let me just show you the picture.

The margin of the true vector beta star is not bigger than one; in fact it is going to be, in absolute value, very close to zero at certain points. So it is not a good candidate for this maximisation problem. But what we can do is say: we allow some error. We take this linear function and we truncate it at some positive value epsilon. Then it stays nicely away from zero, so now the margin is epsilon, which is already something, but we have made a mistake, and this mistake we just treat as an additional error. Do not worry about the noise, because we can handle that as before. So even if there is no noise, we have this additional truncation error, and this additional error we just interpolate, with an L1-minimal interpolator of this additional error. If I take that interpolator and subtract it from beta star, I get a nice function which stays away from zero, so there is a good margin, and it has the right signs.
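The claim above, that beta star itself only achieves a margin of order 1/n, is easy to check numerically; a small illustrative Monte Carlo (my own addition), using the fact that the noiseless margins x_i transpose beta star are standard Gaussians:

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 500
for n in (100, 1000, 10000):
    # min_i |g_i| over n i.i.d. standard Gaussians:
    # the smallest noiseless margin achieved by beta star itself.
    g = rng.standard_normal((reps, n))
    min_abs = np.abs(g).min(axis=1)
    print(n, min_abs.mean(), n * min_abs.mean())   # n * mean stays roughly constant
```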
So that is the idea: once again, interpolating some kind of error helps me to find a bound for the L1 norm of the estimator. And it turns out to look like this, after some work, because you have to renormalise, and what occurs is a funny power, a power two thirds here. But this part we sort of know; it should look familiar: it looks like the benchmark, except that the noise is in a different place, and it also looks very much like the expression we had for the Plan-Vershynin estimator, now with this funny power two thirds.

OK. Anyway, you then have to do some work to look not at the L1 norm itself but at the L1 norm of the normalised vector beta hat, the L1 norm divided by the L2 norm, and for that we use some further empirical process theory. And then we end up with this result, which is exactly the same as for the Plan-Vershynin estimator, except that it is the slow rate. If you have L0 sparsity, we were unable to exploit it, so we do not have the fast rate; we only have the slow rate, with a power one quarter. And it is like the benchmark estimator, but the noise is in a different place. So maybe I should look at the table here.

Here is the benchmark estimator, and the unavoidable Bayes error. And you see that the expressions are similar, except that for the benchmark estimator, if the noise is small, you can forget about it as far as the excess risk is concerned. For the Plan-Vershynin estimator and for the max margin estimator, the noise will play its role. For Plan and Vershynin and for the max margin we get the same result; well, there is a log term there, which I do not know whether it is needed. But we cannot prove fast rates; we were unable to prove fast rates for the max margin estimator.

OK, and then, just to conclude this part: we also have non-asymptotic bounds on how far this max margin estimator is from the AdaBoost output, so this can also be used to conclude rates of convergence for the AdaBoost algorithm.

OK. Then we worked harder and harder: what we started with was a result with a power one twelfth; then we got it to one quarter; and I think we are now in a situation where we get a one third here. So probably this is the final answer, a one third here. And what you can do in these kinds of problems is first prove a bound for the L1 norm and then use other results; there are very many results in the literature on this kind of L1 minimisation problem.
And you can use their results, and there is one paper from which we can conclude this power one third. But so far we do not understand the proof, because it is very geometrical, and the paper is not published, so we really should understand the proof first, but I think it is correct. So that means the rates can still be improved, and we hope to have that on arXiv.

OK, yeah, so, conclusion. The AdaBoost, or max margin type, estimator has the same rate as the slow rate for one-bit compressed sensing with linear programming, the Plan-Vershynin estimator, but the confidence is less. That means you have a very strong concentration result for the Plan-Vershynin estimator, but for the max margin estimator we do not have that, because if you look at the minimum of random variables, it does not concentrate very strongly. And lower bounds: that is an open question which I am not able to answer yet. We did look at some lower bounds: the lower bounds for the L1 norm are the same as the upper bounds, so there we are tight. And indeed, if we do the simulations, we see the power one third which I had here appearing, so it seems to be something you can also see in data. OK, well, that is it. Thank you very much for your attention.

Thank you. Thank you very much, Sara. And do feel free to use your virtual applause to acknowledge an excellent talk. Thank you. Right, so far we do not have any questions. Judith?

Thanks for the talk. I have two questions. One is that I was trying to figure out why letting p grow was so much more interesting. Is it because you have some sort of assumption on the X, with the i.i.d. Gaussian entries, so that you have some kind of smoothing effect that allows you to control everything for very large p?

Well, it is just in the log; it is this phenomenon here. For the maximum of p Gaussian random variables we need a lower bound that behaves like the square root of log p, and if you do not have a high-dimensional problem, then the log will disappear, that is all. But it looks good now: it is kind of nice to have your log p here, and it cancels against the log p there, and that is kind of pleasing. But if you do not mind losing logs in the high dimensionality... of course, you need some high dimensionality, because otherwise you cannot interpolate, clearly. Yeah, but what we see in the simulations is that the theory is really kicking in for high dimensions, and that is still kind of interesting.
Apparently it plays a role in all the other derivations as well, and you see it in the simulations: if we do not let p grow, we just get completely wrong curves. Yeah. So. Thank you.

So, did you have a second question? Yeah, I cannot remember it right now; I will come back to it.

Yes, I was going to ask: can you check these upper bounds empirically? I mean, do they look like they are tight? You have got this one third, which you think is the correct answer now.

Yeah, indeed. Well, we only just finished a simulation study, because there were very many things to consider. If you do AdaBoost, there are some kind of standard default values for the number of iterations and all those things, and if you take those values, you do not see it. But if you do enough iterations, as the theory asks for, with a small enough learning rate, then you see it perfectly; it is really incredible. So you see the one third rate in the L2 convergence, and you see the one third rate in the margin, and you see it really perfectly. But of course, at a certain point, you know, the L2 distance between two vectors which both have norm equal to one can never be bigger than two, so there is some bending off of the curve at certain points. But it is really quite visible. And for the L1 norm of the max margin estimator, we know that the upper bound and the lower bound are the same up to constants, and indeed you see that in the simulations; it is really very nice. Yeah.

Yeah, your question came back to you, Judith?

Yes, so it is a bit related to what was just asked, and maybe I just did not understand what you were saying. But for AdaBoost, I am following the last line of your table at the end: you did not get the rate for the case where you have this L0 and not just L1 sparsity. Are you saying it is a technical problem and you do not have the proof yet, but there is a chance that the rate is better? Or is it a case where you actually cannot do better? What do you imagine?

I do not know. But in the simulations, if there is noise, we see this one third; if there is no noise, we see the one half. So probably we are missing some point there. Probably, you know, I do think that, indeed, if there is no noise, it should be possible. It is just that we cannot exactly bound the L1 norm of the estimator by the L1 norm of the truth; there is a constant in front, which is of order one.
But still, that bothers us. Yeah. So it is a very good question, and we keep on thinking: oh yes, that should be possible. So far we did not succeed. I gave this talk in Oberwolfach once, and someone there just shook his head. But that was the one quarter.

Shook his head at this? You think he thought it was a rubbish bound, or that it was a hopeless business?

I do not know; I wanted to ask him why, but that did not happen. Well, yeah, no, I keep on working on that one, because I do not know about lower bounds either, so I do not know.

Patrick, you had a question? Yeah, thanks. Thank you, Sara, I enjoyed the talk. So I had a question on the connection with AdaBoost as well, just because it was not entirely clear from the slide what you are really doing with it. So I understood that you look at the limit, where at that point you have exactly the max margin solution. Yeah. So do you do something with early stopping?

Yes. At least we have some kind of theoretical result which says: if the learning rate is of a certain order and the number of steps is of a certain order, something like a log of something over the learning rate, then the margin coming out of this algorithm is not much larger or smaller than the margin coming out of the max margin programme. So there are asymptotic results, but we sort of refine them so that you see the non-asymptotic relation; I just did not put it on the slide.

Right. And so, within this picture, are you able to prove that if you do this type of early stopping, you achieve better bounds as a function of...

So, no. The theory says you should not stop early; you should do enough iterations. So you go on, you run your AdaBoost algorithm, at a certain point you start interpolating, and you continue with the algorithm for enough iterations. And then the theory says that you are close to this max margin solution; we can say exactly how close, we can give some bounds on how close you are, and we see it also in the simulations. If you do the AdaBoost algorithm with a small step size, a small learning rate, and enough iterations, you can hardly distinguish the curves. Sorry, these simulations were only just finished, so I did not put them on the slides. But it is kind of strange: if you do the AdaBoost algorithm with a large step size, it gives much better results.
It even still interpolates, with a better, that is smaller, classification error. I do not know why; this is just what the simulations say.

Yeah. So I am familiar with the literature on regression, and I believe there is a paper connecting early stopping with a notion of minimax optimality in that setting. So yeah, I understand that here you are focusing on interpolation, and you want enough iterations for this phenomenon to kick in.

Yeah. So you really go way beyond what is allowed in classical statistics: you do not do early stopping, you just go on with your iterations. Yeah.

Yeah. And that is exactly what is helping? You should be seeing better and better rates, as you do with your benchmark estimator?

Well, there is one thing: the rates could be better, because you can do something similar with regularisation, but then you cannot compute it. That is the point. So theoretically you can sort of minimise some sign-type function, but that is a hopeless, non-convex problem.

So you are thinking of knowing where to stop?

No... well, I do not know. If you knew where to stop, then I would say... it is a good question. If you look at the benchmark, let me just go back. I would say that knowing where to stop, doing early stopping, is like regularisation, so just out of some vague idea I would say it would give you these rates. Yeah, which is still worse than the interpolating rate, which is kind of strange, in a sense. Yeah. So, something to look at. Yeah, yeah.

Thank you. All right, thank you very much, Sara. We will stop there; let us all applaud again. Excellent work. Thank you. Thanks a lot.