So hello, everyone, and welcome to the first Yasumoto seminar. This week's speaker is Murat Erdogdu, assistant professor at the University of Toronto in the Departments of Computer Science and Statistical Sciences. He is a faculty member of the Machine Learning Group and the Vector Institute, and a CIFAR Chair in Artificial Intelligence. Before joining Toronto he was at Microsoft Research New England, and he did his Ph.D. in the Department of Statistics at Stanford University, advised by Mohsen Bayati and Andrea Montanari. Today Murat will be talking about some new and exciting results on the convergence of online stochastic gradient descent under infinite noise variance and non-convexity.

Thank you, Jim, for that very kind introduction, and thank you all for coming to the talk today. I'll talk about the convergence of online stochastic gradient descent under some modern setups where you may have infinite noise variance or non-convexity in the problem.

As you mentioned, I did my Ph.D. in a statistics department, and back then pretty much every talk started with the following line, so I'll stick to the tradition and do the same. We consider, for this slide, the following linear model, where y is the response, x is the covariate, beta_0 is the true parameter that we are after, and epsilon is the noise. The noise is assumed to have conditional mean zero. When you are trying to find beta_0, what you do is write down the population problem and phrase it as a minimization problem. The minimizer of this population risk — it is a population problem because the expectation is over the response and the covariates — is going to give you the true parameter beta_0. And since this is a minimization problem, you can write down the stochastic gradient descent updates, given here, where gamma_t is your step size. It turns out that stochastic gradient descent with a proper step-size choice will give you the true parameter beta_0.

Of course, I left a couple of blank spaces here on purpose. The reason is: what happens if you assume the noise variance is infinite? The noise may have heavy tails coming from the model, so that its second moment is infinity. Then this objective function is not defined — it becomes infinity. But nothing stops a practitioner from running SGD in this scenario, and again, it turns out SGD still finds the true parameter beta_0. Some of you may see what's going on here, and some of you may be surprised, just like I was.
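To make this opening example concrete, here is a minimal illustrative sketch of online SGD for the linear model with infinite-variance noise. The Student-t noise with 1.5 degrees of freedom, the step-size schedule, and all constants below are arbitrary choices for the illustration, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
beta0 = rng.normal(size=d)      # true parameter beta_0
beta = np.zeros(d)              # SGD iterate

for t in range(1, 200_001):
    a = rng.normal(size=d)                   # covariates for this step
    eps = rng.standard_t(df=1.5)             # noise with infinite variance (finite mean)
    y = a @ beta0 + eps                      # response from the linear model
    gamma = 0.01 * t ** (-0.75)              # decaying step size (illustrative choice)
    beta -= gamma * (a @ beta - y) * a       # stochastic gradient of the squared loss
print("distance to beta_0:", np.linalg.norm(beta - beta0))
```

Even though the population squared loss is infinite here, the recursion itself is perfectly well defined and, as discussed above, still drives the iterate toward beta_0.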
And I will tell you why this is happening after I give you some convergence results on SGD.

OK, so throughout this talk I am interested in the following unconstrained optimization problem, where we are trying to minimize a function f whose argument is d-dimensional, and we would like to use the stochastic gradient descent algorithm given by the following recursion. Here gamma_t denotes the step size, this is the gradient of the function you are trying to minimize, and this is the noise, which is state dependent; that noisy, state-dependent term is an important component of the algorithm. The two things I would like to study in this talk are: in modern settings, just as I described, with infinite noise variance and non-convexity, how does this algorithm converge, and how does the averaging behave? I am not yet specifying where they converge, how they converge, or the type of convergence; I am just saying these are the two things I would like to study, and we will see different notions of convergence very soon. I will also talk more about Polyak-Ruppert averaging and why it is interesting.

This setup looks a bit different from what we usually run into with machine learning algorithms, but it is actually the same thing. Consider the stochastic optimization problem where you have a function f(x) that can be written as the population risk. The population risk in machine learning is the expected loss over a random variable z — typically your data point, y and x together — and this expectation is over all the data, so you don't have access to it. So you resort to a stochastic optimization algorithm like stochastic gradient descent. The SGD that we run in practice takes a mini-batch of data, say with batch size B, and estimates the gradient of this function f — a population quantity — using those B samples. To convert this back to our formulation, you can simply say that the noise is the difference between this estimated gradient and the true gradient: this is the noisy gradient, in other words, and the difference between the noisy gradient and the true gradient is the noise in my SGD. Good.

I would like to emphasize that this is online SGD, meaning that there is a stream of data coming in and each batch is independent of the others. It is not like you have a fixed batch of data points and you subsample B data points from it; that is not what we are doing.
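To connect the mini-batch picture to the recursion above, here is a small sketch of the gradient-noise decomposition on a synthetic least-squares problem; the batch size, the Gaussian data, and the helper names (minibatch_grad, population_grad) are assumptions made for this illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
d, B = 3, 32
beta0 = np.ones(d)
Sigma = np.eye(d)        # population covariance of the covariates in this synthetic model

def minibatch_grad(x):
    """Gradient estimate from one fresh mini-batch of streaming data (squared loss)."""
    A = rng.normal(size=(B, d))
    y = A @ beta0 + rng.normal(size=B)
    return A.T @ (A @ x - y) / B

def population_grad(x):
    """Exact population gradient, here Sigma @ (x - beta0)."""
    return Sigma @ (x - beta0)

x = np.full(d, 2.0)
noise = [minibatch_grad(x) - population_grad(x) for _ in range(5000)]
print("mean of the gradient noise (close to 0):", np.round(np.mean(noise, axis=0), 3))
```

The difference between the mini-batch gradient and the population gradient is exactly the noise term epsilon in the SGD recursion, and it has (conditional) mean zero.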
There is a stream of data coming in, and that is why the title says online.

I am interested in settings where the noise has infinite variance — the noise is heavy tailed. The first item is motivated by recent observations that SGD can exhibit heavy tails in many modern machine learning models. The second thing I am interested in is when the objective function is non-convex; this is motivated by the fact that many of the objective functions in machine learning are non-convex. And lastly, a practitioner can never take infinitely many iterations of SGD, so in some scenarios I would like to characterize the convergence behavior with a non-asymptotic analysis, so that I can quantify exactly how many iterations I need. These are the three items I am interested in.

Remember I said I am interested in the iterates; I am also interested in Polyak-Ruppert averaging. What is Polyak-Ruppert averaging? It is a very simple adjustment to the output: you run SGD for T iterations and you average the iterates. That is called Polyak-Ruppert averaging, and Polyak and Ruppert introduced it in different papers around the same time. It turns out this algorithm has good generalization properties, observed numerically; also, under classical setups, you can prove that it is optimal — it achieves the Cramér-Rao lower bound under finite noise variance, et cetera. So it enjoys some nice properties. One thing I am particularly interested in is that it admits a central limit theorem in the classical setting, where the noise variance is finite and the function you are minimizing is strongly convex. This is the basis for many inference methods built around the algorithm; pretty much all the inference methods that take the algorithmic part of the methodology into consideration rely on this CLT. But what happens to this CLT if the noise variance is infinite or f is non-convex, and what is the rate of this CLT? These are the two things I would like to answer.

OK, so there are many papers on stochastic gradient descent, because it is the method of choice these days for training models, but many of them can be considered follow-ups to classical results by Krasulina, Polyak and Juditsky, and others, and I will only compare this talk to those classical results — the most important papers that, in a sense, led the field. The first result, due to Krasulina, was the first paper to consider infinite noise variance in SGD.
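For reference, a minimal sketch of Polyak-Ruppert averaging itself — just a running average of the SGD iterates. The quadratic objective, the Gaussian gradient noise, and the step-size schedule below are placeholder choices, not the settings of the results discussed in the talk:

```python
import numpy as np

def sgd_with_averaging(grad_oracle, x0, step, T, rng):
    """Run SGD for T iterations; return the last iterate and the Polyak-Ruppert average."""
    x = np.array(x0, dtype=float)
    avg = x.copy()
    for t in range(1, T + 1):
        x = x - step(t) * grad_oracle(x, rng)
        avg += (x - avg) / t          # running average of the iterates x_1, ..., x_T
    return x, avg

# Placeholder example: strongly convex quadratic with Gaussian gradient noise.
rng = np.random.default_rng(2)
noisy_grad = lambda x, rng: x + rng.normal(size=x.shape)   # gradient of ||x||^2 / 2, plus noise
last, averaged = sgd_with_averaging(noisy_grad, np.full(4, 5.0),
                                    step=lambda t: 0.5 * t ** (-0.6), T=50_000, rng=rng)
print(np.linalg.norm(last), np.linalg.norm(averaged))      # the average is typically much closer to 0
```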
What this paper did was, in one dimension, assume the function is strongly convex — so it looks like this — and the noise variance is infinite; sorry, the noise has a finite alpha-th moment with alpha less than two, and the step size is t^(-rho), a decaying step size, with rho set to one. The convergence rate of SGD in Lp is then given by this expression. It turns out that with this rate you cannot show the convergence of the Polyak-Ruppert average. The reason — I'll get to it — is that the step-size decay is too fast: rho equal to one is too fast, and that decay will not give you, at least as far as the proof is concerned, convergence of the Polyak-Ruppert average.

The seminal work by Polyak and Juditsky considered the strongly convex setup. They assume the noise variance is finite — alpha is two — and they consider a range of step sizes where rho ranges from one half to one; notice that rho equal to one is not included here. The reason is that rho equal to one is too fast, and anything below it is fine. The rate they work with — they relied on previous results, too — is this much. The most important part is that they introduced techniques which led to many interesting results: they showed that the averaged iterate achieves the lower bound, and they proved, I think, the first asymptotic normality result for the Polyak-Ruppert average. Many other papers have used this idea to construct inference methods based on the CLT that they proved.

In this talk, I will first consider a class of functions that are strongly convex, but under an assumption that is stronger than strong convexity. Strongly convex functions have Hessians that live inside this cone, say the one in the figure; I will consider a set of q-positive definite matrices, which is a subset of this cone. That is the orange region you are seeing, and I'll explain later what this red region is. I will assume that the noise variance is infinity, so we are in the heavy-tailed setting, and the exponent of the step size is in this range. Notice that when alpha is two you have finite variance, and this reduces to the same interval that Polyak and Juditsky considered. This is the rate that I obtain, and we show that the Polyak-Ruppert averaging actually does not admit a Gaussian CLT: it admits an alpha-stable limit, which is heavy tailed. It doesn't have a finite variance.
So many of the inference methods that you constructed relying on that asymptotic normality will not hold in this case.

As a follow-up to this, I'll talk about how we can make these asymptotic results non-asymptotic, and prove that this CLT is achieved at a rate of d squared over the square root of T, but under the assumption that the third moment of the noise is finite. So this is a non-asymptotic result based on Stein's method, and the rate is d squared over root T. This is going to tell us that in practice these CLTs, especially in high dimensions, require a lot of iterations.

Then I will relax the strong convexity assumption used in the earlier results. I will assume that the strong convexity is required only in the tails of the function, as in the last figure here, where around the origin you are allowed to have any behavior, and with the assumption that the fourth moment of the noise is finite I will show that asymptotically the Gaussian CLT still holds. So these are the three results I would like to talk about.

Before I give you the results, I'll try to give you some intuition. The intuition is quite simple, and it leads to the assumptions that we make. Assume that you are running SGD and at iteration t you are here: this is the point x_t you are at, and x* is where you want to get to. Once you have strong convexity and you run gradient descent with a small enough step size, you will converge to this point. So let's consider one step of the full gradient descent. The full gradient step takes your x_t and moves it closer to x*: the distance from x* to x-tilde_{t+1}, the output of (non-stochastic) gradient descent, decreases following this rule. There is a constant r, less than one, that multiplies the distance from x_t to x* and gives you the distance from x-tilde_{t+1} to x*. That is the gradient descent algorithm.

In stochastic gradient descent, you have noise. What this means is that you take the gradient descent step plus the step size times some noise. That is the stochasticity part, and that is where your SGD step takes you.
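A tiny numerical illustration of this picture, with an arbitrary quadratic and a Student-t noise draw (not the speaker's example): one step of plain gradient descent contracts the distance to x*, and the SGD step sits within step-size-times-noise of the gradient descent step.

```python
import numpy as np

rng = np.random.default_rng(3)
d, gamma = 4, 0.1
H = np.diag(rng.uniform(1.0, 3.0, size=d))     # Hessian of a strongly convex quadratic, minimum at 0
x_star = np.zeros(d)

x_t = 5.0 * rng.normal(size=d)
eps = rng.standard_t(df=1.5, size=d)           # one heavy-tailed noise sample

gd_next = x_t - gamma * H @ x_t                # full (noiseless) gradient step
sgd_next = gd_next - gamma * eps               # SGD step = gradient step + gamma * noise

r = np.linalg.norm(np.eye(d) - gamma * H, ord=2)   # contraction factor, < 1 here
print("contraction holds:",
      np.linalg.norm(gd_next - x_star) <= r * np.linalg.norm(x_t - x_star) + 1e-12)
print("extra distance from noise:", np.linalg.norm(sgd_next - gd_next),
      "= gamma * ||eps|| =", gamma * np.linalg.norm(eps))
```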
And it turns out that this extra distance is controlled by the step size, because the noise is multiplied by the step size: the smaller the step size, the smaller this noise term will be.

OK. Now I can construct this triangle and write down the triangle inequality: basically, the distance from the next SGD iterate to the optimum is smaller than the sum of these two edges. That is just the triangle inequality. What it tells me is that I obtain this edge from the contraction, and I obtain this edge from the noise, which is multiplied by the step size. It is a very simple argument, and it leads to the assumptions we make, which will hopefully be intuitive.

The gradient descent contraction, as I said, is given by this inequality. In the case where you have finite noise variance, the standard analysis continues by measuring the distance from x_t to x* in L2 — basically the expected value of the squared norm. It turns out the SGD iteration satisfies this inequality, following the standard analysis. So all I need to show is that this quantity, which plays the role of r here, is less than one. Remember, I am going to let the step size go to zero, because I need a decreasing step size in this part of the talk. This means that for a gamma that is small enough, this quantity has to be smaller than one; in other words, for small enough gamma there is a constant such that this inequality is satisfied. Now, this is exactly equivalent to saying that f is strongly convex, or that the Hessian of f is uniformly positive definite — the two statements are equivalent. So that was the standard SGD analysis, and strong convexity gives us nice results.

But in the infinite-noise-variance setting, you see, I cannot use the second moment here. What I need to use is a p-th moment where p is smaller than two. For that reason, I am going to choose my distance — this is not a metric, but still — as the expected value of the p-th power of the p-norm. If this is my new distance, then the recursion I wrote for the second moment in the finite-noise-variance case can be written equivalently like this. With the same strategy, I need to make this first term, which gives me the decay, smaller than one so that I have a contraction. The equivalent expression is this: I need this p-th power of the p-norm to be less than or equal to one minus gamma, up to a constant.
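Written out schematically for a quadratic objective, with constants suppressed (this is a reconstruction of the two recursions being described, so treat the exact form as an assumption rather than the slide's statement), the finite-variance analysis tracks

\[
\mathbb{E}\,\|x_{t+1}-x_*\|_2^2 \;\le\; \big\|I-\gamma_{t+1}\nabla^2 f\big\|_2^2\;\mathbb{E}\,\|x_t-x_*\|_2^2 \;+\; \gamma_{t+1}^2\,\sigma^2 ,
\]

and the contraction requirement \(\|I-\gamma\nabla^2 f\|_2^2 \le 1-c\gamma\) for all small \(\gamma\) is what forces strong convexity; in the infinite-variance setting one instead tracks \(\mathbb{E}\,\|x_t-x_*\|_p^p\) for an exponent p below two and asks for

\[
\big\|(I-\gamma\,\nabla^2 f)\,v\big\|_p^p \;\le\; (1-c\gamma)\,\|v\|_p^p \qquad \text{for all unit vectors } v \text{ and all small } \gamma .
\]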
And it turns out that, just as the previous condition was equivalent to uniform positive definiteness, this condition will also imply some sort of positive definiteness — but one that is stronger than the standard notion of positive definiteness that we are all familiar with. So this is the idea: I need a contraction in some distance. If the distance is the Euclidean distance squared, that gives me strong convexity; if it is the p-norm, it will require something stronger. That is the idea.

So let's recap what we have in the standard, classical setup. As I mentioned, this inequality — this matrix having operator norm squared less than or equal to one minus gamma (times a constant) for all small gamma — is equivalent to f being strongly convex, which is saying that the Hessian is lower bounded. This implies that the set of matrices — the Hessian matrices of the function you are trying to minimize, over this set — is a subset of the positive semi-definite cone, which is represented as the cone in the figure on the left. A symmetric matrix Q is positive definite if, whenever the norm of v is one, this quadratic form is strictly greater than zero for all v on the unit sphere; and Q is positive semi-definite if the inequality is not strict. The set of such matrices defines a cone; this is that cone.

Now let's go to the p-norm. This was the 2-norm here, right? Let's go to the p-norm and see what happens. To get the contraction in the p-norm, I need this inequality for gamma in this range. Before showing you the equivalence, I need a definition: the sign power of a vector. Given a vector like this and some power q greater than or equal to zero, the sign power of the vector is defined by this quantity: take the sign of each entry and multiply it by the absolute value of that entry raised to the power q. So let me give you the definition and then I'll show you the figure and how these sets look. A symmetric matrix Q is p-positive definite if, for all v on the unit sphere, v and its sign power satisfy this strict inequality; it is p-positive semi-definite if the strict inequality becomes non-strict, and the set of all such matrices is defined accordingly.
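Here is a small sketch of the sign power and a Monte-Carlo check of the resulting quadratic-form condition. The exact pairing below — the sign power with exponent p minus one against Qv — is one reading of the definition, chosen so that p = 2 recovers ordinary positive definiteness; treat it as an assumption rather than the paper's exact statement.

```python
import numpy as np

def sign_power(v, q):
    """Entrywise sign power: sign(v_i) * |v_i|**q."""
    return np.sign(v) * np.abs(v) ** q

def looks_p_positive_definite(Q, p, n_samples=200_000, seed=0):
    """Monte-Carlo check of <sign_power(v, p-1), Q v> > 0 over random unit vectors.

    A random search like this only *suggests* p-positive definiteness; it is not a proof."""
    rng = np.random.default_rng(seed)
    d = Q.shape[0]
    v = rng.normal(size=(n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    forms = np.einsum("ij,ij->i", sign_power(v, p - 1), v @ Q.T)   # Q symmetric, so v @ Q.T = Q v row-wise
    return bool(forms.min() > 0)

# p = 2 recovers the usual positive definiteness; p = 1 relates to diagonal dominance.
Q = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.5, 0.4],
              [0.3, 0.4, 1.0]])          # symmetric, diagonally dominant
print(looks_p_positive_definite(Q, p=2), looks_p_positive_definite(Q, p=1.5))
```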
Here is how the picture looks. When p is equal to two, the definition reduces to the standard one, and you get the cone that is the blue region here. When p is equal to one, the definition reduces to the set of all diagonally dominant matrices with non-negative diagonal entries — that is also a cone, the red region here. And for any p in between one and two you still get a cone, which sits in between these two cases, p equal to one and p equal to two. So there is an inclusion relationship between these cones, and it is important that for p between one and two the cone lives between the cone of diagonally dominant matrices and the positive semi-definite cone; the cone of diagonally dominant matrices is the most extreme case. So if you make that assumption — diagonal dominance — then all of our assumptions are already satisfied. I'll talk about this a bit further later.

Let me give you the two assumptions that I need for the convergence of SGD in the heavy-tailed regime. I will assume that the set of matrices — basically these Hessian matrices — is bounded and uniformly p-positive definite. So when you see a p-positive definiteness assumption, it is referring to this, uniformly over the set of all Hessians, and here p is equal to q. That is my first assumption, on the function. And I will assume that the noise admits the following decomposition. The noise itself is state dependent, and there is a state-dependent component and a state-independent component — by state I mean the iterate x_t. The state-independent components are i.i.d. with expectation zero and finite alpha-th moment for some alpha less than or equal to two (two is also fine). The state-dependent component is a martingale difference sequence, and it satisfies this inequality. This means that the variance of the state-dependent component can be large, but for a fixed x, for a given x_t, it is always finite, whereas the heavy-tailed part comes from the i.i.d. component. So that is the noise assumption that I make in the heavy-tailed case.

We have this recent result saying that if you have a step size decaying like t^(-rho), with rho between zero and one, then the error of SGD — the 2-norm to the power q; that is why there is a subscript 2 here — is given by this expression, and that is its decay rate. So this is the convergence rate we achieve in the case of heavy-tailed noise, for the last iterate of SGD to the unique minimizer of the problem.
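As a sketch of how these two assumptions might be instantiated in a simulation — the diagonally dominant Hessian, the Student-t i.i.d. component, the 0.1 scaling of the state-dependent part, and the step-size constants are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
H = np.array([[2.0, 0.3, 0.2],
              [0.3, 1.5, 0.1],
              [0.2, 0.1, 1.0]])      # diagonally dominant Hessian of a quadratic, minimum at 0

def gradient_noise(x, rng):
    """Noise of the assumed form: i.i.d. heavy-tailed part plus a state-dependent
    martingale-difference part whose conditional variance is finite for each fixed x."""
    zeta = rng.standard_t(df=1.5, size=d)                # i.i.d.; moments finite only below order 1.5
    eta = 0.1 * np.linalg.norm(x) * rng.normal(size=d)   # mean zero given x, variance grows with ||x||
    return zeta + eta

x = np.full(d, 10.0)
for t in range(1, 100_001):
    gamma = 0.05 * t ** (-0.7)                           # rho = 0.7, inside (0, 1)
    x = x - gamma * (H @ x + gradient_noise(x, rng))
print("distance to the minimizer:", np.linalg.norm(x))
```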
220 00:28:24,880 --> 00:28:34,040 OK, so what is this rate about? Remember, we have a stat size decaying with teeth minus Joe. 221 00:28:34,040 --> 00:28:44,390 And we have a noise that has finite of us moment for hopefully some of that is smaller than two and we have a HESHBON that is too positive, 222 00:28:44,390 --> 00:28:53,560 definite, and this is a stronger assumption that than a plain positive definiteness, which is equivalent to strong convexity. 223 00:28:53,560 --> 00:29:02,980 Right. So this rate. Is basically giving me when Q is going to Alpha, 224 00:29:02,980 --> 00:29:10,060 so you see the assumption on the function and the assumption on the noise are connected to has to be always less than Alpha. 225 00:29:10,060 --> 00:29:19,660 But being curious, going to Alpha in this regime, you achieved this great. 226 00:29:19,660 --> 00:29:30,200 OK, and when role is going to one which is the fastest dictate that you can have on the step size, that gets you to the fastest convergence rate. 227 00:29:30,200 --> 00:29:33,940 This is the rate of Gasolina. Sixty nine. 228 00:29:33,940 --> 00:29:38,680 But this is too fast for the convergence of polygraphers average. 229 00:29:38,680 --> 00:29:48,510 So this will not this case will not work if our objective is to show the convergence of politics upward averaging. 230 00:29:48,510 --> 00:29:56,280 So when offer is going to two, meaning that your noise is now having finite variance. 231 00:29:56,280 --> 00:30:04,950 Then the rate that you achieve is this much, and in this case now you can actually move to the all the way to the two. 232 00:30:04,950 --> 00:30:15,780 So you have the strongly complex functions. And in this case, you recover the normal rate T to the minus trover to for finite variance. 233 00:30:15,780 --> 00:30:21,540 But this is the finite variance regime and this is the infinite variance regime. 234 00:30:21,540 --> 00:30:29,620 And this rate is not good for us because I want to show the convergence of the political upward averaging. 235 00:30:29,620 --> 00:30:39,050 That's too fast for that. So remember, the gradient noise was given by this edness, this decomposition. 236 00:30:39,050 --> 00:30:50,300 So what I will additionally issue to the previous case is that I'll assume that this average of the IED component admits that of a stable, 237 00:30:50,300 --> 00:30:53,340 symmetric of a stable limit. 238 00:30:53,340 --> 00:31:05,580 And so standard generalised Keltie results actually has certain conditions to be satisfied in one result is due to climate in a Danco. 239 00:31:05,580 --> 00:31:16,140 And if the random variables with power to tails symmetric random variables with power little tails, then you have the following satisfied. 240 00:31:16,140 --> 00:31:27,600 So a stable distribution, this alpha stable limit assumption allows me to characterise the limiting behaviour of the Polya corrupted averaging. 241 00:31:27,600 --> 00:31:32,130 So I'll talk about alpha stable limits in the next slide. 242 00:31:32,130 --> 00:31:43,470 But the result is as follows. If the Roka and Alpha satisfied satisfy this inequality and this additional smoothness assumption. 243 00:31:43,470 --> 00:31:47,910 So you can think of this assumption as the Heshan smoothness. Then, 244 00:31:47,910 --> 00:31:54,300 for a Stepaside sequence satisfying the standard indicate the normalised average 245 00:31:54,300 --> 00:32:01,650 pull Robert averaging Convergys weekly to an alpha stable distribution. 
The first observation here is that the step size — the exponent rho of the step size — has to be in this regime, and in Polyak and Juditsky you also keep it within a regime like this. When alpha goes to two, when you have finite variance, the two match, but when alpha is smaller than two we have a smaller interval to work with.

So, alpha-stable distributions. Here is a figure that I took from Wikipedia and forgot to acknowledge on the slide: this is the PDF of an alpha-stable distribution, and this is the parameter alpha. The variance of this limit is infinite — there is no variance when alpha is smaller than two — so you won't have a Gaussian CLT when alpha is smaller than two. And when alpha is two, this recovers the standard asymptotic result of Polyak and Juditsky, because an alpha-stable distribution with alpha equal to two is Gaussian. As with pretty much all results on Polyak-Ruppert averaging, the proof of this theorem relies on the construction first made by Polyak and Juditsky in their seminal work; it is just that we analyze the algorithm in the heavy-tailed setting. That is the novelty.

One point I would like to make is that when alpha is less than two there is no Gaussian CLT. So the inference methods that are constructed for quantifying the uncertainty of SGD, for example, are not going to be valid, because they are all based on the Gaussian CLT for the Polyak-Ruppert averaging. A lot of those methods have to be rethought in the case where you have heavy-tailed noise in SGD.

Before I move to non-convexity and the final result, here is a result under strong convexity that we obtained in 2019, as a product of the Stein's method workshop in San Jose. Here we characterize how fast you converge to the Gaussian in the case where you have finite variance — actually, we assume that the third moment of the noise exists. Then you can characterize how fast you are converging to the Gaussian with this theorem. What the theorem says is that for any test function that is three times differentiable, the difference between the expectation of the test function evaluated at the normalized Polyak-Ruppert average and at the Gaussian is upper bounded by this quantity. The proof relies on a non-asymptotic martingale CLT using Stein's method. But one thing I would like to emphasize is that this rate is not a good rate: it is d squared divided by the square root of the number of iterations, where d is the dimension.
In high dimensions, to make this small you need to take way too many iterations. It is not a good rate, but it is still consistent with the experiments: a lot of the inference methodology that relies on this CLT in practice requires a small dimension plus a lot of iterations of SGD. So this is consistent with what we observe in practice. As a side note, burn-in helps a lot in this kind of setting, but I don't have any theoretical quantification of that.

So, back to the least-squares example that I started the talk with. We considered this linear model, right? And epsilon, the noise, had infinite variance. Now, without defining the objective function, you can directly define its gradient. What does this mean? It is a bit of a cheat, but I can choose a function of x and, instead of writing this as a minimization problem, write it as finding the zero of a function. If I define it this way, the zero of the function is going to be beta_0, which is basically where the minimum would be anyway.

George has a question: in the rate result, was the absolute value inside the expectation? Good point — that is inside the expectation; I just verified it by looking at the paper. So it should be inside the expectation, but I can double-check.

OK, so in the least-squares setup, beta_0 is the unique minimizer of this multi-dimensional function, and x*, the zero of this function, is going to be beta_0; it is quite straightforward to see this. The SGD updates can then be written as a stochastic approximation algorithm, given by this recursion, and the noise part is just as we did on the second slide: the noise admits the decomposition where the i.i.d. component is given by this quantity and the state-dependent component is given by this quantity. Here the state is x, and you see the i.i.d. component does not have the state in it. One can verify that if you assume the first moment of the noise is finite, then the first moment of the i.i.d. component is finite, and the state-dependent component trivially satisfies this upper bound if we assume the covariates are nice. So this is telling us that the noise assumption will be easily satisfied. And for the function — the gradient of the function — we have this expected value of x x-transpose, the covariance of the features.
So you can assume this matrix is diagonally dominant, and then all the assumptions are satisfied — the p-positive definiteness is satisfied — but p-positive definiteness is a milder assumption than diagonal dominance, so you can actually assume that directly of the covariance matrix. Then the convergence results can be invoked for this case.

OK, so now I will talk about non-convexity, and what happens in the case of non-convexity and non-smoothness — basically assuming that your objective function is non-convex and non-smooth at the same time. The problem is a bit harder to handle. In this case we will consider SGD with a constant step size, so that we can rely on Markov chain convergence results. This methodology — the analysis of constant-step-size SGD via Markov chains — was first carried out by Dieuleveut and co-authors in 2019, but their analysis was under strong convexity. It turns out those assumptions are not needed, and you can relax them to non-convex objectives. The key point here is that I have a step size gamma that is constant — this should just read gamma; it's a typo on the slide.

The first assumption I make, instead of a smoothness assumption, is that the gradient of the objective f grows at most linearly. The second assumption is that, instead of strong convexity, the function satisfies a dissipativity assumption. The dissipativity is given by this inequality, and it is satisfied for all x. What does this assumption tell me, and how does it relate to strong convexity? Strong convexity is a global curvature assumption, whereas dissipativity restricts that behavior to the tails: beyond a certain radius you have quadratic growth of f, and around the origin you are allowed to have non-convexity. That is the assumption, and it is quite common in non-convex optimization analysis.

In this case, the noise sequence is assumed to have a finite fourth moment, and additionally I assume that its density has full support, so that I have a nice Markov chain defined by the SGD recursion. The key point is that the noise is still state dependent — I can write the standard SGD in this form — but it has finite variance given x, given the state. (I used an Oxford comma in the title, so you can tell me whether I used it correctly or not.) So in this case the noise is state dependent with finite variance, and the objective is non-convex.
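To make these two assumptions concrete, here is a toy objective (illustrative, not from the talk) that is non-convex near the origin, has a gradient with at most linear growth, and is dissipative, together with a quick numerical spot-check of the dissipativity inequality:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 2

def grad_f(x):
    """Gradient of f(x) = 0.5*||x||^2 + 2*sum_i cos(x_i).

    f is non-convex near the origin (the Hessian there is I - 2I = -I),
    the gradient grows at most linearly, and f is dissipative in the tails."""
    return x - 2.0 * np.sin(x)

# Dissipativity spot-check: <x, grad_f(x)> >= a*||x||^2 - b with a = 0.5 and b = 2*d,
# which follows from 2*sum_i x_i*sin(x_i) <= 0.5*||x||^2 + 2*d.
a, b = 0.5, 2.0 * d
xs = rng.normal(scale=20.0, size=(100_000, d))
lhs = np.sum(xs * grad_f(xs), axis=1)
rhs = a * np.sum(xs * xs, axis=1) - b
print("dissipativity holds on all sampled points:", bool(np.all(lhs + 1e-9 >= rhs)))
```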
Given this, standard Markov chain results tell us about the ergodicity of SGD: for a small enough step size, the iterates admit a unique stationary distribution, and although we cannot characterize this distribution, we can show its existence and uniqueness. It depends on the step size. Also, for a test function, the iterates satisfy this convergence to the stationary distribution — not to any of the minima, just to the stationary distribution. The convergence is geometric, but it is not directly comparable to the previous convergence results we talked about.

So what happens to the Polyak-Ruppert averaging in the non-convex case? In the heavy-tailed case it did not satisfy a Gaussian CLT, but it turns out that in the non-convex case there is a Gaussian CLT. I will define the average of the SGD iterates evaluated at a test function, and the expectation under the invariant measure is defined this way. Then the Polyak-Ruppert average of the test-function values admits a central limit theorem, given by the expression here. So this is cool — the CLT is still valid under non-convexity — but we have to talk about the bias of this algorithm in order to come up with something useful.

What do I mean by bias? How does this compare to any of the critical points, or the global minima, of this non-convex function? Without any local regularity — if you don't assume anything further than just the dissipativity assumption — this is not possible, and all we can hope for is that the SGD algorithm satisfies this inequality, where this constant is of order one. So by letting the step size become small, you are not getting any improvement here. SGD converges to a ball that contains all the critical points, and this convergence is exponentially fast, but the bias — the bias relative to x* — is not controllable with the step size.

For that reason, I need to enforce a local regularity on top of the dissipativity growth assumption. This is the previous dissipativity, which I assume outside of a ball, and now I assume that inside the ball there is a local regularity defined by this convex function, which is invertible and zero at zero. That is the additional local regularity, and under this additional, localized dissipativity assumption, I can show that the iterates admit a stationary distribution and that the bias of this stationary distribution of SGD is now controllable by the step size.
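On the toy objective above, a small simulation of the ergodicity statement: with a constant step size, the long-run time average of a test function comes out essentially the same from very different initializations. The step size, the Gaussian gradient noise, and the test function h(x) = ||x|| are again illustrative choices.

```python
import numpy as np

def grad_f(x):
    """Same toy non-convex, dissipative objective as above: f(x) = 0.5*||x||^2 + 2*sum_i cos(x_i)."""
    return x - 2.0 * np.sin(x)

def ergodic_average(x0, gamma, T, seed):
    """Constant-step SGD with Gaussian gradient noise; time average of h(x) = ||x||."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    values = np.empty(T)
    for t in range(T):
        x = x - gamma * (grad_f(x) + rng.normal(size=x.shape))
        values[t] = np.linalg.norm(x)
    return values.mean()

# Two very different initializations give approximately the same long-run average,
# consistent with convergence to a unique stationary distribution for this step size.
print(ergodic_average(np.zeros(2), gamma=0.1, T=200_000, seed=8))
print(ergodic_average(np.full(2, 30.0), gamma=0.1, T=200_000, seed=9))
```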
So if I let gamma go to zero, the right-hand side of that bias bound also converges to zero. And for a test function that is Lipschitz, you can take this and show that the expectation under the invariant measure, minus the test function evaluated at the minimum, is upper bounded by this quantity, which goes to zero as gamma goes to zero.

Here is a non-convex function, the Blake-Zisserman penalty, which I found in, I think, the Visual Reconstruction book. This function actually turns out to satisfy the assumptions of these theorems, and you can verify that it admits a CLT. The experiments also bear this out — our results match the numerical studies — and in the numerical studies we use this test function, which is the absolute value. In the paper we also verify rigorously that the assumptions of our theorems are satisfied.

I'm out of time, so I'll conclude with a summary of what I told you today. Basically, this was a convergence analysis of SGD and of the Polyak-Ruppert averaging. In the infinite-noise-variance case, we showed the Lp convergence of SGD and we identified a regime of step-size decay such that the Polyak-Ruppert averaging admits an alpha-stable — that is, a non-Gaussian — limit. Under strong convexity, we talked about a non-asymptotic rate for the CLT obtained via Stein's method. And under non-convexity, we showed that Polyak-Ruppert averaging admits a CLT, and that its bias can be characterized if we have additional local structure. That's it from me today; if you have questions, I would be happy to answer.

OK, so thank you very much for a very nice talk. Is there any question from the audience? You can just unmute yourself and ask.

Yeah, I've got a question. I really enjoyed your talk, and you can probably guess what my question is about. You say you can never run stochastic gradient descent infinitely long, and you have these results with rates for the central limit theorem. But can you use results for alpha-stable distributions with Stein's method in order to get an explicit bound?
Yes — that's a great question, and I think you had a recent paper on alpha-stable limits and Stein's method, so it's definitely doable; it's just a matter of doing it. Yes, that's a very interesting direction: basically, how does this d-squared-over-root-T rate change when you don't have a Gaussian limit but an alpha-stable one? That's a very interesting direction to explore.

Can I ask a question? Actually, it's sort of a double question. So, what's the motivation behind the Polyak-Ruppert averaging?

Yeah, another great question. In the classical setting, the Polyak-Ruppert averaging achieves the Cramér-Rao lower bound, whereas the last iterate does not. And now, another interesting research direction: what does that lower bound on the variance translate to in the heavy-tailed setting, when there is no variance? What kind of optimality can one prove for the Polyak-Ruppert average there?

OK. So does anyone ever consider different types of averaging?

Yes, I think that is also common, and I don't think the analysis is very hard given a specific weighting between the iterates, but we haven't done that in any of our results.

OK, great, thanks. And to go back to my previous question about the absolute value: the reason I'm asking is that if you put the absolute value inside the expectation, then it sort of requires that you build some kind of coupling between the limiting Gaussian and your process, so it usually requires heavier machinery and potentially the rate can be worse. If you take the absolute value outside of the expectation, maybe the rate can improve.

Hmm, that's an interesting point. Usually the absolute value is outside — I think you're right, the absolute value should be outside, because I'm checking the martingale CLT that we proved in the paper, and it appears to be outside of the expectation.

OK, because otherwise you would need some kind of strong central limit theorem.

You're right. Yeah, this relies on basically a similar construction from the literature, so it should be outside, I guess. We always have it outside.

I had a quick question.

Yeah, go ahead.

So, the assumption on the covariance of the covariates: you said that in the case of p equal to one, it is related to diagonal dominance. Is there an intuition for how that constrains the covariance structure in the case of linear regression? Is it going to be more aligned with the identity — do the covariates have to be more independent, or something like that?

So you have the covariates, the expected value of x x-transpose, right? These are your covariates, and this is a kind of Sigma.
If this Sigma satisfies diagonal dominance, then the assumption is satisfied, because then you are in this red region, and that satisfies the assumption for any p. For example, if you have white covariates, then this is the identity, so it is going to be satisfied. Does that answer your question, or did I miss part of it?

So in some sense, is the assumption saying that the covariates have to be more independent? For p equal to two it will always be satisfied, but as it moves towards p equal to one, does it say that the covariance has to be closer to the identity in some sense?

Yeah, I think so. This is the inequality I want, right, so I think it can be interpreted that way.

I just wondered about the cases in between, for p between one and two.

Certainly. This is a weaker assumption than the diagonal dominance — and just to clarify, this is not really diagonal dominance; it is diagonal dominance with non-negative diagonal entries. In this case, that is the assumption you have, and I think you can interpret it as the correlations between the features being weaker compared to the stronger case.

Another question from the audience?

Yeah, I have a question, I guess for the latter part of the talk, the constant step size under non-convexity. If it's mini-batch SGD instead of single samples, can we still prove, maybe with some stronger assumptions, the central limit theorem there?

You can use mini-batch SGD. Let me go back to this slide: this is still valid — you can have gamma_{t+1} equal to a constant gamma, a constant step size, and you can have mini-batches. But this is a stream of data, so it's not that you have a batch of data that you resample; the mini-batches should be coming in as new data points. That's what I meant. If that is the case, you can do it. But everything in this talk is online.

OK. For the SGD iterates, there is a stationary distribution, but do you have any type of convergence result — you sort of mentioned it — convergence to the invariant measure?

Yeah. So convergence — the standard ergodicity-type results are easy to prove using Markov chain machinery, like this one. This is the convergence that I briefly talked about, and it is quite standard. But the important point is that this part is easy.
What is hard is this one: how do you actually characterize anything about this invariant measure? That is the biggest challenge.

Yeah, I see what you mean. But I guess my question is: in practice, when you run algorithms like this, do you ever have convergence to the invariant distribution, or do people just run them on a much shorter time scale, so that you only ever stay close to initialization, or the part of the landscape you explore is heavily influenced by your initialization?

Mm hmm, that's a good point. So what you're saying is — OK, this is not answering your question, because this holds for any initialization; you get the same result. What you're saying is whether the stationary distribution depends on the — yeah, it could depend on the initialization as well.

Well, that's not quite what I'm saying. Obviously this r here, the rate — I'm not sure if you've quantified it — you can compute this r.

I mean, I think we stop at saying r is less than one and greater than zero, but I think you can compute it.

OK, so let me rephrase. I'm not sure how you're proving this, but if you're doing it using drift and Lyapunov functions and things like that, it's very likely that this r will be very, very bad in terms of the dimension dependence.

Yes.

So if you do that in practice, if you plug that in and you say, OK, I've run my algorithm for, let's say, capital T steps, that's how close I am to the invariant distribution. What I'm saying is: this gives you a mixing time. How does that mixing time compare to practical runtimes? Some of these algorithms operate at a completely different time scale, so that you'll never, ever see the invariant distribution.

That's a good point. This proposition is used in this theorem, which is asymptotic, so we actually never required r to be anything nice; as long as it is less than one, we were good. So we didn't explore what happens in the finite-sample regime, because our objective was this CLT, which is asymptotic. But that's an interesting thing to think about, actually.

In your experience, what do you think?

I can show you — I knew I wasn't going to have time, so that's why I cut it. Yeah, OK, I actually had experiments that I took out of the talk, but —
What people observe is that when d is larger than about 20, you have to run for a very long time to see any of these results. That asymptotic result is in line with the non-asymptotic version of the CLT: if the rate is d squared over the square root of T, then when d is large you expect to have to run essentially forever. And what we observed is similar: in the experiments you can verify things when d is around 20 or 50, but anything beyond that is hard.

Right. OK, thanks.

OK, I guess we have run out of time, so let's end here for today. Thank you, everyone, for coming, and let's thank the speaker again. Have a nice weekend.

Thank you for attending.