Our speaker today is Professor van de Geer. Sara has previously held positions in Bristol, Utrecht, Toulouse and Leiden, where she received a PhD in 1987. She is an expert on empirical processes, having written the book Empirical Processes in M-Estimation, which I used quite a lot during my master's thesis (thank you for that), and she has worked recently on lots of things relating to high-dimensional data. So we're very excited to hear about what she's going to talk about today.

For questions, if you want to ask a question, there are two ways you can do it. The first is to raise your hand, and I will interrupt Sara at a convenient point so you can ask your question. Or, if you prefer, you can type your question in the chat, so you won't appear on the recording, and I can read the question out to Sara for you. OK, excellent. All right, take it away, Sara.

Thank you. Thank you very much for this nice introduction and thank you very much for inviting me. So let me start screen sharing. This one. It's already full screen, right? Yeah, good.

So in my talk, what I will address is the following. Many algorithms nowadays, machine learning algorithms for instance, report zero training error but still a good test error. I want to contribute somehow to an understanding of this phenomenon, and I will look at classification problems. In classification problems, if you have a small test error, it must mean that the Bayes error is small, because that is the unavoidable error that you always make if you believe in a model. So the test error can only be small if the Bayes error is small, and that is the context I will look at.

Then I will look at algorithms with zero training error, actually just three algorithms, and zero training error means that they interpolate the data. If you interpolate the data, that is only possible if you have enough flexibility to do that. So I will look at the high-dimensional situation, a regression model with very many variables. The main issue will be the AdaBoost algorithm, which you might know, which has been around for a while, from Freund and Schapire, and I will try to explain its success a little bit using a max margin approximation.

OK, so this is the title again, and this is joint work with Geoffrey Chinot, Felix Kuchelmeister and Matthias Löffler. Geoffrey and Matthias are both postdocs at our institute, and Felix is a PhD student. I work together with them and they are a great team.
And so much, or maybe everything, of what I am going to tell you is their hard work.

OK. So here is a little bit of an overview. I will look at algorithms with zero training error, which means that they interpolate the data, and I am going to concentrate on the special case where the final interpolator is the one with the smallest L1 norm, because you can interpolate data in very many ways. You can also look at interpolation with the smallest L2 norm, but that is a completely different situation. So my view will be more like the sparsity view, where you look for sparsity. And if you do that, you of course want to see how it relates to regularisation instead of interpolation. In a high-dimensional model you either interpolate or you do not, and in the latter case you do regularisation: you introduce some kind of penalty or regularisation term. I will look at that first, and it is quite easy for that case to prove some bounds, and they will form the benchmark for the interpolating case. Then I will show you some rates of convergence, which I can then compare with this benchmark.

So here is the setup. It is going to be theoretical work where I have a given design matrix X with n rows and p columns. So n is the number of observations, p is the number of variables, and p, the number of variables, is assumed to be much larger than n; that is the high-dimensional component. Then you can think of a regression model with an unknown vector of coefficients, although we will actually look at classification. The scale is then not identified, so you can assume without loss of generality that the vector of coefficients has L2 norm equal to one; that is just a normalisation assumption. The sparsity assumption is that the vector of coefficients has only a few non-zero entries, and if I count the non-zero entries I get s, the sparsity.

Then there is a noise vector. The noise is normalised in L2, and I call sigma the noise level; that is the quantity we will think of as small. In this setup, which is more or less also like in engineering, the noise is not necessarily independent of X. It can be adversarial noise, or it can even be a deterministic quantity. But as a first step, if you are a statistician, you may think of it as just random noise. And that sigma appears here.
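As a concrete picture of this setup, here is a minimal simulation sketch in Python. The particular values of n, p and s, and the independent Gaussian noise, are my own illustrative assumptions; the talk explicitly allows adversarial or deterministic noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): high-dimensional regime p >> n.
n, p, s = 200, 5000, 5

# Gaussian design: i.i.d. standard normal entries.
X = rng.standard_normal((n, p))

# s-sparse coefficient vector, normalised to have L2 norm one.
beta_star = np.zeros(p)
beta_star[:s] = rng.standard_normal(s)
beta_star /= np.linalg.norm(beta_star)

# Noise level sigma; here independent Gaussian noise (an assumption,
# since the talk also allows adversarial or deterministic noise).
sigma = 0.1
eps = sigma * rng.standard_normal(n)

# In the classification setting we only observe the signs of the noisy regression.
y = np.sign(X @ beta_star + eps)
```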
OK, and then the sparsity is assumed to be small in the usual sense: s, the number of active variables, should be smaller than the number of observations, and then there is a log term in the number of variables p, because you do not know a priori which coefficients are non-zero.

So here is the notation: the infinity norm, with an infinity there; the L2 norm; the L1 norm, which is just the sum of the absolute values; and the "L0 norm", which is not a norm, but counts the number of entries which are not zero.

OK. So these are the two programmes I want to look at. You observe the design matrix, and you have this regression function. So let's say it looks like this, just a regression function plus noise, but you only observe the sign. So you do not observe the regression function itself, and that is like a classification problem: the response variable is binary, plus or minus one.

OK. The interpolator developed by Plan and Vershynin, so I call it the Plan-Vershynin estimator, is the one which says: I am going to interpolate the signs which I observe, and I use the interpolator with the smallest L1 norm. That is like one-bit compressed sensing, except that we are looking at a case where there is noise. Compressed sensing is where you observe X beta star and you want to recover the coefficients; one-bit compressed sensing is where you compress the real-valued variables even further into a binary variable, so it is only one bit. So it has a compression motivation, but you can also think of it as just a classification problem. Now, when you minimise the L1 norm, the scale is of course not identified, so we now have to use some normalisation, and that is the empirical norm of X b: X b is a vector in n-dimensional space, these are the entries of that n-dimensional vector, and we require its empirical norm to be equal to one.

OK, and the max margin interpolator is the same situation: you observe X and the sign of the regression. I look at the margin, which is y_i times x_i transpose b, I look at the minimal value over the observations, and I want that to be maximal over all vectors b with L1 norm less than or equal to one. So that is again an interpolator, because the sign of y_i and the sign of x_i transpose b will certainly be the same: since you want the minimal value to be maximal, you will certainly have that y_i times x_i transpose b is positive.
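To make the max L1-margin programme concrete, here is a minimal linear-programming sketch (my own illustration; scipy is assumed to be available and the function name is hypothetical). It maximises the smallest margin t subject to the L1-ball constraint by splitting b into positive and negative parts. The Plan-Vershynin interpolator is a similar L1 minimisation, with the empirical-norm normalisation described above in place of the margin constraint.

```python
import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    """Solve  max_{b,t} t   s.t.  y_i x_i^T b >= t for all i,  ||b||_1 <= 1."""
    n, p = X.shape
    M = y[:, None] * X                            # rows are y_i * x_i
    # Variables: b_plus (p), b_minus (p), t (1);  b = b_plus - b_minus.
    c = np.concatenate([np.zeros(2 * p), [-1.0]])  # linprog minimises, so use -t
    # Margin constraints:  -y_i x_i^T (b_plus - b_minus) + t <= 0.
    A_margin = np.hstack([-M, M, np.ones((n, 1))])
    # L1 constraint:  sum(b_plus) + sum(b_minus) <= 1.
    A_l1 = np.concatenate([np.ones(2 * p), [0.0]])[None, :]
    res = linprog(c,
                  A_ub=np.vstack([A_margin, A_l1]),
                  b_ub=np.concatenate([np.zeros(n), [1.0]]),
                  bounds=[(0, None)] * (2 * p) + [(None, None)],
                  method="highs")
    z = res.x
    return z[:p] - z[p:2 * p], z[-1]               # (b_hat, achieved margin)

# Example usage with the simulated (X, y) from the sketch above
# (for large p this LP gets big, so a smaller example is advisable):
# b_hat, margin = max_l1_margin(X, y)
```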
So those are the two interpolators, and then we want to obtain bounds in L2. So the true vector of coefficients has L2 norm equal to one because of the normalisation, and my estimator I also normalise to have L2 norm equal to one, and then I want to find out how close they are. OK, and I am really going to look at this compressed sensing type of situation, where the design matrix consists of i.i.d. standard Gaussian random variables, which is maybe a bit of a disappointing assumption, but at least it is a starting point; I do not know how to generalise it yet. And this is the notation, "in probability of order at most", just so that things do not get too complicated. OK.

OK, so I am going to look at theory. There is some theory about one-bit compressed sensing and this Plan-Vershynin estimator; for the max margin interpolator I am not aware of any theory, and it is related to the AdaBoost algorithm. So our aim is to get results for AdaBoost via this max margin interpolator.

So let us look at some basic facts, which I am just showing you because I think they are kind of nice. If you look at vectors with L2 norm equal to one, then the expectation of the product of their signs (x transpose b is just a linear combination of Gaussians, and x transpose b' is also a linear combination of Gaussians) is, up to the factor two over pi, the arcsine of the inner product; this is called Grothendieck's identity. So the probability that the signs are not equal, which you can think of as the probability of making a misclassification, is the arccosine of the inner product between the two vectors, divided by pi. And that is not hard to see if you think about it a little bit. You can think of it as a two-dimensional situation, although we look at higher dimensions: you have two vectors with norm one, so they lie on the sphere, and the geodesic distance is just the distance along the sphere. That geodesic distance, divided by pi, is the probability of making a misclassification error for those two vectors.

OK? So what I am going to do now is just say: for the classification problem you have, just like for a regression problem, a splitting up into the unavoidable error, the sigma squared term, and the error due to the estimation of the unknown coefficients. So let us do that. You have this inner product between two vectors; if they have norm equal to one, it is one minus half the squared distance between the two vectors, which is quite obvious; that is where the one half comes from.
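As a quick numerical sanity check of the arccosine formula just described (a small Monte Carlo added purely for illustration; the dimensions and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_mc = 50, 200_000

# Two fixed unit vectors.
b1 = rng.standard_normal(p); b1 /= np.linalg.norm(b1)
b2 = rng.standard_normal(p); b2 /= np.linalg.norm(b2)

# Gaussian feature vectors.
X = rng.standard_normal((n_mc, p))

# Empirical probability that the two linear classifiers disagree.
disagree = np.mean(np.sign(X @ b1) != np.sign(X @ b2))

# Theoretical value: arccosine of the inner product, divided by pi
# (the geodesic distance on the sphere, rescaled).
theory = np.arccos(b1 @ b2) / np.pi

print(disagree, theory)   # the two numbers should be close
```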
So the inner product is related to one minus the squared distance in this way. So if you look at the probability of a misclassification, it is just this arccosine of the inner product, which is one minus half the squared distance. And then you see that if you do a one-term Taylor expansion, the misclassification probability is more or less the L2 distance (divided by pi) when the two vectors are close together, which is clear from this picture: the geodesic distance is along the curve, and the straight line is not much different.

OK. Now suppose, again just for one observation, say one observation from the test set, you have the sign of the regression and you have the noise with some variance. Then the Bayes error is the probability that the sign of the noisy regression is not the sign of the regression without noise. You use Grothendieck's identity again and you see it is an arccosine of this quantity, which, by doing just a one-term Taylor expansion, is approximately equal to a constant times sigma, the noise level.

OK. So the probability of making a misclassification error if you use b as your classifier is just this arccosine expression, which you can then split up into a term involving the unavoidable Bayes error, the sigma squared, and a term due to the estimation, that is, due to making an error in estimating beta star. So this is exactly like regression: for regression you also have the excess risk, which is this L2 error, and then sigma squared, the unavoidable error.

OK, now I want to show a result for the regularised estimator, and for that I need some identities which are very straightforward. First of all, if you look at the expectation of the inner product between the sign and this linear combination, E of y times x transpose b, it is again the inner product of the true beta star with b, and there is some coefficient in front of it. So if I put in here the difference between the vector b (b is an arbitrary vector here) and the true vector beta star, and use that beta star in the inner product with itself is one, I get this expression, which is again one minus the inner product, which is again half the squared L2 distance. So if you want to estimate the vector of coefficients beta star, this makes a good loss function, or rather risk function, because it is clearly minimised at b equal to beta star.
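For reference, here are the identities used in this part written out. The last display assumes independent N(0, sigma^2) noise, which is a simplification on my part; the talk also allows adversarial noise.

```latex
% For unit vectors b, b' and a standard Gaussian feature vector x:
\[
\mathbb{E}\bigl[\operatorname{sign}(x^\top b)\,\operatorname{sign}(x^\top b')\bigr]
   = \tfrac{2}{\pi}\arcsin\langle b,b'\rangle
   \qquad\text{(Grothendieck's identity)},
\]
\[
\mathbb{P}\bigl(\operatorname{sign}(x^\top b)\neq\operatorname{sign}(x^\top b')\bigr)
   = \tfrac{1}{\pi}\arccos\langle b,b'\rangle ,
\qquad
\langle b,b'\rangle = 1-\tfrac12\|b-b'\|_2^2 ,
\]
% so for nearby unit vectors the misclassification probability is roughly
% \(\|b-b'\|_2/\pi\).  With independent noise \(\varepsilon\sim N(0,\sigma^2)\)
% and \(y=\operatorname{sign}(x^\top\beta^*+\varepsilon)\):
\[
\mathbb{P}\bigl(y\neq\operatorname{sign}(x^\top\beta^*)\bigr)
   = \tfrac{1}{\pi}\arccos\Bigl(\tfrac{1}{\sqrt{1+\sigma^2}}\Bigr)
   \approx \tfrac{\sigma}{\pi}\quad(\sigma\text{ small}),
\qquad
\mathbb{E}\bigl[y\,x^\top b\bigr]
   = \sqrt{\tfrac{2}{\pi}}\,\frac{\langle\beta^*,b\rangle}{\sqrt{1+\sigma^2}} .
\]
```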
So what you do to estimate beta star is just to replace this expectation by an average and optimise that. OK, and then you do a regularisation to make sure that you can deal with this high-dimensional situation. So we do that; there is a minus sign missing here, so instead of a minimisation we do a maximisation, a maximisation of the average margin, minus the regularisation term, the L1 norm of the vector of coefficients, and you still want the L2 norm to be one. So this is just like the lasso; it really is just the lasso. If you observed the regression function you could do the lasso, just least squares with a penalty on the L1 norm; that is the lasso, and this is not much different. OK.

OK, now let us look at this margin term. So, the average margin minus its expectation: this is a vector of length p, so a very long vector, I subtract its expectation and then I take the maximum. So I take the maximum of p random variables. They are independent, so you can more or less show that it behaves like the square root of log p over n, just like a maximum of Gaussians: these entries are more or less Gaussian, a maximum of p of them is logarithmic in p, there is a square root, and there is an n because of the normalisation by one over n.

OK. So if you compute this regularised estimator, beta hat is the regularised estimator, its L2 convergence rate is, more or less, the square root of log p over n, times the L1 norm of beta star, and then the noise plays a role here: a factor one plus sigma squared. And so this is my benchmark for the regularised case.

And just so that you see it is not very deep, let me look at the proof. The estimator is defined by maximising this quantity, so the criterion at the estimator will be larger than the same expression at the true vector beta star, right? Then I just put a minus sign in front so that the inequality switches, and I put the penalty terms together; and there comes somebody visiting me, the cat. OK. So this is just rewriting the fact that you are maximising something. And then I subtract and add expectations; let us see. So this is sort of the theoretical risk, and we just saw that it is equal to this expectation. And then I replace the expectation by an average, and I have to correct for that. And here I can use the dual norm inequality: the inner product between two vectors can be bounded by the infinity norm times the L1 norm. So you get this lambda hat, this infinity norm, and this is the L1 norm of beta hat minus beta star.
And we are left over with the term that was already there, and then we use this arccosine inequality, which we just saw on the previous page, and then we use the triangle inequality. Yeah, so there is nothing deep going on.

OK, yes, Judith, you have a question? (I just wasn't sure what the lambda hat was, but you said it afterwards, so I have just looked at it again.) It is a good point; it goes quickly and I leapt ahead. It is like the maximum of p more or less Gaussian random variables. So that is a random variable for which you easily have some kind of concentration inequality, so with large probability it will be of order square root of log p over n, just like for the lasso, your usual thing. Yeah. So that means, in this inequality here, if you take your tuning parameter not too small but also not too large, it gives you a bound with high probability. Yeah, OK.

So it looks very much like the lasso. You can refine it: if you wish to exploit sparsity, so suppose there are only a few non-zero coefficients in the true vector and the number of them is little s, then the rate of convergence becomes lambda squared times s, plus something with the noise. And lambda squared is, again, of order log p over n, so you get something like the number of active variables divided by the number of observations, a log term, and one plus sigma squared. OK, so the proof is, again, very simple, but I do not think I have that much time; this is the proof, and it is really just being a little bit more careful with the triangle inequality, that is all.

So this is the benchmark: the misclassification error of this regularised estimator consists of two terms, the unavoidable Bayes risk and the term due to estimating beta star, the excess risk. And the excess risk is the square root of log p over n times the L1 norm if the true vector is only L1-sparse, and it can be improved to s log p over n if the true vector is L0-sparse, which means that it has only s non-zero coefficients.

And you see what is happening here if there is no noise: if there is no noise, the error is still not zero, which is unlike the lasso. That is because you only observe the sign, you do not observe the complete regression function, so there is a price to pay: you cannot exactly recover beta star. I am not sure whether there are lower bounds here, but it is quite clear that you cannot recover it exactly, so there is still an error even if you do not have noise.
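As an aside, the regularised benchmark estimator above (maximise the average margin minus an L1 penalty, with the L2 norm constrained to the unit ball; reading the constraint as ||b||_2 <= 1 is my own interpretation of the normalisation) is a convex programme, so it is easy to experiment with. A minimal sketch, assuming cvxpy is available and with a hypothetical function name:

```python
import numpy as np
import cvxpy as cp

def benchmark_estimator(X, y, lam):
    """Maximise (1/n) * sum_i y_i x_i^T b - lam * ||b||_1  s.t.  ||b||_2 <= 1."""
    n, p = X.shape
    b = cp.Variable(p)
    avg_margin = cp.sum(cp.multiply(y, X @ b)) / n
    problem = cp.Problem(cp.Maximize(avg_margin - lam * cp.norm1(b)),
                         [cp.norm(b, 2) <= 1])
    problem.solve()
    return b.value

# Tuning parameter of the order suggested by the concentration bound above:
# lam = 2.0 * np.sqrt(np.log(X.shape[1]) / X.shape[0])
# b_hat = benchmark_estimator(X, y, lam)
```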
Coming back to the benchmark bound: if the noise is small, you have the unavoidable noise here, and this is then all of the same order, because one plus the noise squared is of the same order as just one, so you can sort of cancel this term. So the noise does not really hurt, in terms of the order, at least if it is small, and that is the case we are interested in: small noise, because otherwise we will never have a small test error. OK, so that is my benchmark.

So now let us compare that with interpolation, the Plan-Vershynin interpolator and the max margin interpolator. And to study them we will use results for basis pursuit, which is going to be noisy basis pursuit. So what is noisy basis pursuit? In noisy basis pursuit you do observe the regression, the real-valued response variable, so you just have a regression model, and then you interpolate the data and you take the interpolator with minimal L1 norm. There are theoretical results for this in a paper from 2010 and also in more recent work. What we derive is the same as in that 2010 paper, but our techniques are quite different: that paper comes more from a different community, I would say machine learning or computer science, and our language is more statistical.

So that is noisy basis pursuit, and afterwards we will consider the case where you only observe the sign of the regression function, and we look at the interpolator of Plan and Vershynin and at the max margin interpolator. So that is what is to come. In all three cases I will need a good bound for the L1 norm of the estimator, and that is kind of clear, because once you have a good bound for the L1 norm, you can use results from the theory of empirical processes indexed by coefficients bounded in L1, and that gives you bounds for entropies and so on; then you are in business. I will explain that in more detail later as well.

OK, so let us look first at basis pursuit, where you do observe the regression function with noise, you interpolate it, and you choose the interpolator with the smallest L1 norm. So what is a good bound for the estimator? Maybe I should go back. The point here is that if you have noisy observations, the true beta star is quite clearly not interpolating, because of the noise. So somehow we want to interpolate the noise, to make sure that we interpolate the data.
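Noisy basis pursuit itself is again a standard convex programme. Here is a minimal sketch (hypothetical function name, cvxpy assumed available); note that it needs the real-valued responses, which are exactly what is not observed in the one-bit setting, so this is only the auxiliary tool used to control L1 norms:

```python
import cvxpy as cp

def basis_pursuit(X, r):
    """Noisy basis pursuit: min ||b||_1  s.t.  X b = r  (interpolate the data)."""
    n, p = X.shape
    b = cp.Variable(p)
    problem = cp.Problem(cp.Minimize(cp.norm1(b)), [X @ b == r])
    problem.solve()
    return b.value

# With the real-valued (noisy) responses from the simulation sketch above:
# b_bp = basis_pursuit(X, X @ beta_star + eps)
```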
And this is more or less what the result says: a good bound for the L1 norm of the estimator is the L1 norm of the truth plus a term which is the noise level times the square root of n divided by log p. The reason for this more or less follows from the KKT conditions, or maybe from the dual formulation. If you do that, and I will not give the details, it says that the L1 norm of the estimator is bounded by the L1 norm of the truth plus a term involving the L2 norm of the noise divided by something, and you want to divide by something large so that this term is small. That something is a maximum of Gaussian random variables, well, of linear combinations, a different one each time, so you have a maximum over p Gaussian random variables, and then you minimise again, so it is really the minimum of a maximum.

And there comes the assumption that the sample size should be small compared to the number of variables, because we want this quantity to be large. If you look at extreme value theory, if you have p independent Gaussian random variables you have an extreme value limiting distribution, and I think it is the Gumbel limiting distribution, so we know this maximum behaves like the square root of log p. But it is a little bit more involved because we have, again, an infimum here, so this is a lot of technical work, and we want it to be non-asymptotic. We get that this infimum behaves indeed like the square root of log p if the sample size is small compared to p, for instance if the sample size is smaller than p divided by log p, something like that. So the high dimensionality is really playing a crucial role here; otherwise we will not have the log p. So this term is the noise level times the square root of n over log p.

Now that we have this bound for the L1 norm, we can use empirical process theory to get bounds for quadratic forms, where the L1 norm of b plays a role. You just take the empirical average minus the theoretical average of the squared linear combinations, with b normalised to L2 norm one, and you look at how different these quadratic forms are from their expectation. Then the L1 norm plays a role, and we just showed that this L1 norm behaves like the square root of n divided by log p, up to the noise level, so that cancels very nicely. That gives you that these quadratic forms are close to their expectation, with this sigma times square root of n over log p term occurring there. So that finally gives this result, that the estimation error of the noisy basis pursuit estimator is bounded by
some universal constant times sigma times the square root of n over log p. And of course, if there is no noise, that is zero: so in basis pursuit, in the noiseless case, you have exact recovery. In the noisy variant you now still have some error, but if the noise is small it is still reasonable.

So now let us look at one-bit compressed sensing. Now we only observe the sign of the regression function; we minimise the L1 norm, as in basis pursuit, with the interpolation constraint and with the normalisation. And then the question is, again, what is a good bound for the L1 norm of the estimator? And again you see that the true vector beta star is not interpolating, because of the noise. So, again, we have to interpolate the noise. And if we do that, the idea is: if you take your true vector beta star plus something which interpolates the noise, then you have a candidate, because then you have an interpolator. And the L1 norm of this candidate will be larger than that of the minimiser: if I have a candidate here which I can control, then the minimiser will have L1 norm smaller than this candidate. OK.

OK, so why not interpolate the noise. So I interpolate the noise here, and I take, as we just had, the L1 minimiser which interpolates the noise. So I know, from the previous result for basis pursuit, that I have the L1 norm of this interpolator under control. And moreover, if I add it to beta star, the true vector, I have an interpolator with the right signs. So that is a candidate, and I know that the L1 norm of the estimator is bounded by the L1 norm of this candidate. Oh yes, I have to normalise it, but that does not really hurt very much. So then I am done; then I have a bound for the L1 norm. So that is the idea.

So I have my true vector, I have my noise, and I interpolate the noise with some other vector. Then X beta star plus the noise is the same as X applied to the sum, by definition, because I interpolated the noise. So I have here a vector which gives the correct signs, this one interpolates the signs, so the L1 norm of beta star plus this correction will be larger than that of my Plan-Vershynin estimator, which is the minimiser. So then, again, you have a bound for the L1 norm. I am not going to show all the technical details, but then you can bound the supremum of an empirical process indexed by functions with coefficients whose L1 norm is bounded by some quantity.
And then we get this result. The Plan-Vershynin estimator, which is interpolating, satisfies, in L2: if you have L0 sparsity, the rate is, as before, s log p over n, and now, where the benchmark had a factor one plus sigma, here the noise level sigma is simply added. And if you only have L1 sparsity, it is again exactly like the benchmark estimator, but instead of multiplying by one plus sigma you add the noise level sigma. So it means that if the noise is, say, small but still larger than this estimation term, you still feel it. For the benchmark estimator, if the noise is small you hardly feel it; you already have the Bayes noise, and that is the main part. Here, it can happen that the noise term is larger than the error due to estimation. So it is a less beautiful result, which is of course to be expected, because you do interpolation instead of regularisation, so it is not as good.

OK. So how much time do I have? What do you say, about ten minutes, five minutes? OK. I will skip this.

So now let us go to AdaBoost. We observe, as before, X and the signs, and the AdaBoost algorithm is as follows. You first take a vector of weights, which lies on the simplex. Well, AdaBoost is stated for weak learners, which are defined differently; to mimic that, we normalise our design so that all design entries are between minus one and one. You start with an initialisation, and then you go in a certain direction (do not ask me too much, this is just to see it quickly); you go in a certain direction, and then you take a step size, which here is the one inspired by the exponential loss. Then you update your vector of coefficients with a learning rate eta, and then you update the weights, and you keep on doing this until a stopping time, capital T. OK?

And then there are results which say that if the learning rate goes to zero and the number of iterations goes to infinity, the AdaBoost algorithm converges to the max margin classifier, which maximises the minimal margin. And that is, up to scaling, the minimum L1 norm interpolator: you rescale to make sure that the margin is always at least one, and then you look for a solution with minimal L1 norm. So there are some theoretical results for the relation between AdaBoost and this max margin classifier when the number of observations is of the same order as the number of variables.
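For readers who want to see the kind of iteration being described, here is a schematic AdaBoost-style sketch that uses the coordinates of the rescaled design as weak learners, with a learning rate eta and T iterations. The exact direction and step-size rules in the talk may differ; this is only an illustration of the general recipe, and all names are hypothetical.

```python
import numpy as np

def adaboost_coordinates(X, y, eta=0.1, T=5000):
    """Schematic coordinate-wise AdaBoost with shrinkage (illustration only)."""
    n, p = X.shape
    Xs = X / np.max(np.abs(X))           # rescale so all design entries lie in [-1, 1]
    beta = np.zeros(p)                   # initialisation
    w = np.full(n, 1.0 / n)              # weights on the simplex
    for _ in range(T):
        corr = Xs.T @ (w * y)            # weighted correlations with the labels
        j = np.argmax(np.abs(corr))      # direction: pick the best coordinate
        r = np.clip(corr[j], -0.999, 0.999)
        alpha = 0.5 * np.log((1 + r) / (1 - r))    # exponential-loss step size
        beta[j] += eta * alpha                     # shrunken update of the coefficients
        w *= np.exp(-eta * alpha * y * Xs[:, j])   # reweight the observations
        w /= w.sum()
    return beta

# With a small learning rate eta and many iterations T, the margins produced by
# such iterations are expected to approach those of the max L1-margin solution
# (after rescaling), which is the connection used in the talk.
```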
That regime is not our case, though: we have that the number of variables is much larger than the number of observations, and that is really a crucial assumption in our theory.

So let us look at this max margin interpolator, which is the same as this minimum L1 norm estimator. What is a good bound for the L1 norm of this estimator? Now, if you look at this carefully, you would ask: does it somehow mimic the behaviour of the true vector beta star? So insert the true vector beta star, and think of the case where there is no noise, to simplify. If you insert beta star here, each margin is just the absolute value of a standard Gaussian random variable, and then you take the minimum, the minimum of n standard Gaussian random variables in absolute value. And a minimum of n standard Gaussians in absolute value behaves like one over n, so it is very, very small. But you want it to be bigger than one, so beta star is in no way a good candidate for solving this problem; it is completely out of scope. Of course you can rescale it, but then its L1 norm is going to be very large. So this is a very strange thing: you are doing something where beta star itself is not a good solution of the problem, so why would this give a good estimator?

OK. Um, yeah, maybe if there is time I can write it down. Again, we want a bound for the L1 norm of the estimator. You can again, as for basis pursuit, use the dual norm formulation, and then we get a lower bound; I will not go into details there. But let us look at the upper bound, and let me just briefly explain the idea: what we use is, again, this basis pursuit idea. So let me just show you the picture.

The margin of the true vector beta star is not bigger than one; in fact it is going to be, in absolute value, very close to zero at certain points. So it is not a good candidate for this maximisation problem. But what we can do is say: we allow some error. We take this linear function and we truncate it at some positive value epsilon. Then it stays nicely away from zero, so now the margin is epsilon, which is already something, but we have made a mistake, and this mistake we just treat as an additional error. Do not worry about the noise, because we can handle that as before. So even if there is no noise, we have this additional truncation error, and this additional error we just interpolate, with an L1-minimal interpolator of this additional error. If I take that interpolator and subtract it from beta star, I get a nice function which stays away from zero, so there is a good margin, and it has the right signs.
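The claim above, that beta star itself only achieves a margin of order 1/n, is easy to check numerically; a small illustrative Monte Carlo (my own addition), using the fact that the noiseless margins x_i transpose beta star are standard Gaussians:

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 500
for n in (100, 1000, 10000):
    # min_i |g_i| over n i.i.d. standard Gaussians:
    # the smallest noiseless margin achieved by beta star itself.
    g = rng.standard_normal((reps, n))
    min_abs = np.abs(g).min(axis=1)
    print(n, min_abs.mean(), n * min_abs.mean())   # n * mean stays roughly constant
```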
So that is the idea: once again, interpolating some kind of error helps me to find a bound for the L1 norm of the estimator. And it turns out to look like this, after some work, because you have to renormalise, and what occurs is a funny power, a power two thirds here. But this part we sort of know; it should look familiar: it looks like the benchmark, except that the noise is in a different place, and it also looks very much like the expression we had for the Plan-Vershynin estimator, now with this funny power two thirds.

OK. Anyway, you then have to do some work to look not at the L1 norm itself but at the L1 norm of the normalised vector beta hat, the L1 norm divided by the L2 norm, and for that we use some further empirical process theory. And then we end up with this result, which is exactly the same as for the Plan-Vershynin estimator, except that it is the slow rate. If you have L0 sparsity, we were unable to exploit it, so we do not have the fast rate; we only have the slow rate, with a power one quarter. And it is like the benchmark estimator, but the noise is in a different place. So maybe I should look at the table here.

Here is the benchmark estimator, and the unavoidable Bayes error. And you see that the expressions are similar, except that for the benchmark estimator, if the noise is small, you can forget about it as far as the excess risk is concerned. For the Plan-Vershynin estimator and for the max margin estimator, the noise will play its role. For Plan and Vershynin and for the max margin we get the same result; well, there is a log term there, which I do not know whether it is needed. But we cannot prove fast rates; we were unable to prove fast rates for the max margin estimator.

OK, and then, just to conclude this part: we also have non-asymptotic bounds on how far this max margin estimator is from the AdaBoost output, so this can also be used to conclude rates of convergence for the AdaBoost algorithm.

OK. Then we worked harder and harder: what we started with was a result with a power one twelfth; then we got it to one quarter; and I think we are now in a situation where we get a one third here. So probably this is the final answer, a one third here. And what you can do in these kinds of problems is first prove a bound for the L1 norm and then use other results; there are very many results in the literature on this kind of L1 minimisation problem.
And you can use their results, and there is one paper from which we can conclude this power one third. But so far we do not understand the proof, because it is very geometrical, and the paper is not published, so we really should understand the proof first, but I think it is correct. So that means the rates can still be improved, and we hope to have that on arXiv.

OK, yeah, so, conclusion. The AdaBoost, or max margin type, estimator has the same rate as the slow rate for one-bit compressed sensing with linear programming, the Plan-Vershynin estimator, but the confidence is less. That means you have a very strong concentration result for the Plan-Vershynin estimator, but for the max margin estimator we do not have that, because if you look at the minimum of random variables, it does not concentrate very strongly. And lower bounds: that is an open question which I am not able to answer yet. We did look at some lower bounds: the lower bounds for the L1 norm are the same as the upper bounds, so there we are tight. And indeed, if we do the simulations, we see the power one third which I had here appearing, so it seems to be something you can also see in data. OK, well, that is it. Thank you very much for your attention.

Thank you. Thank you very much, Sara. And do feel free to use your virtual applause to acknowledge an excellent talk. Thank you. Right, so far we do not have any questions. Judith?

Thanks for the talk. I have two questions. One is that I was trying to figure out why letting p grow was so much more interesting. Is it because you have some sort of assumption on the X, with the i.i.d. Gaussian entries, so that you have some kind of smoothing effect that allows you to control everything for very large p?

Well, it is just in the log; it is this phenomenon here. For the maximum of p Gaussian random variables we need a lower bound that behaves like the square root of log p, and if you do not have a high-dimensional problem, then the log will disappear, that is all. But it looks good now: it is kind of nice to have your log p here, and it cancels against the log p there, and that is kind of pleasing. But if you do not mind losing logs in the high dimensionality... of course, you need some high dimensionality, because otherwise you cannot interpolate, clearly. Yeah, but what we see in the simulations is that the theory is really kicking in for high dimensions, and that is still kind of interesting.
Apparently it plays a role in all the other derivations as well, and you see it in the simulations: if we do not let p grow, we just get completely wrong curves. Yeah. So. Thank you.

So, did you have a second question? Yeah, I cannot remember it right now; I will come back to it.

Yes, I was going to ask: can you check these upper bounds empirically? I mean, do they look like they are tight? You have got this one third, which you think is the correct answer now.

Yeah, indeed. Well, we only just finished a simulation study, because there were very many things to consider. If you do AdaBoost, there are some kind of standard default values for the number of iterations and all those things, and if you take those values, you do not see it. But if you do enough iterations, as the theory asks for, with a small enough learning rate, then you see it perfectly; it is really incredible. So you see the one third rate in the L2 convergence, and you see the one third rate in the margin, and you see it really perfectly. But of course, at a certain point, you know, the L2 distance between two vectors which both have norm equal to one can never be bigger than two, so there is some bending off of the curve at certain points. But it is really quite visible. And for the L1 norm of the max margin estimator, we know that the upper bound and the lower bound are the same up to constants, and indeed you see that in the simulations; it is really very nice. Yeah.

Yeah, your question came back to you, Judith?

Yes, so it is a bit related to what was just asked, and maybe I just did not understand what you were saying. But for AdaBoost, I am following the last line of your table at the end: you did not get the rate for the case where you have this L0 and not just L1 sparsity. Are you saying it is a technical problem and you do not have the proof yet, but there is a chance that the rate is better? Or is it a case where you actually cannot do better? What do you imagine?

I do not know. But in the simulations, if there is noise, we see this one third; if there is no noise, we see the one half. So probably we are missing some point there. Probably, you know, I do think that, indeed, if there is no noise, it should be possible. It is just that we cannot exactly bound the L1 norm of the estimator by the L1 norm of the truth; there is a constant in front, which is of order one.
But still, that bothers us. Yeah. So it is a very good question, and we keep on thinking: oh yes, that should be possible. So far we did not succeed. I gave this talk in Oberwolfach once, and someone there just shook his head. But that was the one quarter.

Shook his head at this? You think he thought it was a rubbish bound, or that it was a hopeless business?

I do not know; I wanted to ask him why, but that did not happen. Well, yeah, no, I keep on working on that one, because I do not know about lower bounds either, so I do not know.

Patrick, you had a question? Yeah, thanks. Thank you, Sara, I enjoyed the talk. So I had a question on the connection with AdaBoost as well, just because it was not entirely clear from the slide what you are really doing with it. So I understood that you look at the limit, where at that point you have exactly the max margin solution. Yeah. So do you do something with early stopping?

Yes. At least we have some kind of theoretical result which says: if the learning rate is of a certain order and the number of steps is of a certain order, something like a log of something over the learning rate, then the margin coming out of this algorithm is not much larger or smaller than the margin coming out of the max margin programme. So there are asymptotic results, but we sort of refine them so that you see the non-asymptotic relation; I just did not put it on the slide.

Right. And so, within this picture, are you able to prove that if you do this type of early stopping, you achieve better bounds as a function of...

So, no. The theory says you should not stop early; you should do enough iterations. So you go on, you run your AdaBoost algorithm, at a certain point you start interpolating, and you continue with the algorithm for enough iterations. And then the theory says that you are close to this max margin solution; we can say exactly how close, we can give some bounds on how close you are, and we see it also in the simulations. If you do the AdaBoost algorithm with a small step size, a small learning rate, and enough iterations, you can hardly distinguish the curves. Sorry, these simulations were only just finished, so I did not put them on the slides. But it is kind of strange: if you do the AdaBoost algorithm with a large step size, it gives much better results.
It even still interpolates, with a better, that is smaller, classification error. I do not know why; this is just what the simulations say.

Yeah. So I am familiar with the literature on regression, and I believe there is a paper connecting early stopping with a notion of minimax optimality in that setting. So yeah, I understand that here you are focusing on interpolation, and you want enough iterations for this phenomenon to kick in.

Yeah. So you really go way beyond what is allowed in classical statistics: you do not do early stopping, you just go on with your iterations. Yeah.

Yeah. And that is exactly what is helping? You should be seeing better and better rates, as you do with your benchmark estimator?

Well, there is one thing: the rates could be better, because you can do something similar with regularisation, but then you cannot compute it. That is the point. So theoretically you can sort of minimise some sign-type function, but that is a hopeless, non-convex problem.

So you are thinking of knowing where to stop?

No... well, I do not know. If you knew where to stop, then I would say... it is a good question. If you look at the benchmark, let me just go back. I would say that knowing where to stop, doing early stopping, is like regularisation, so just out of some vague idea I would say it would give you these rates. Yeah, which is still worse than the interpolating rate, which is kind of strange, in a sense. Yeah. So, something to look at. Yeah, yeah.

Thank you. All right, thank you very much, Sara. We will stop there; let us all applaud again. Excellent work. Thank you. Thanks a lot.