So hello, everyone, and welcome to the first Yasumoto seminar. This week's speaker is Murat Erdogdu, assistant professor at the University of Toronto in the Departments of Computer Science and Statistical Sciences. He is a faculty member of the Machine Learning Group and the Vector Institute, and a CIFAR Chair in Artificial Intelligence. Before joining Toronto he was at Microsoft Research New England, and he did his Ph.D. in the Department of Statistics at Stanford University, advised by Mohsen Bayati and Andrea Montanari. Today Murat will be talking about some new and exciting results on the convergence of online stochastic gradient descent under infinite noise variance and non-convexity.

Thank you, Jim, for that very kind introduction, and thank you all for coming to the talk today. I'll talk about the convergence of online stochastic gradient descent under some modern setups where you may have infinite noise variance or non-convexity in the problem.

As you mentioned, I did my Ph.D. in a statistics department, and back then pretty much every talk started with the following line, so I'll stick to the tradition and do the same. We consider, for this slide, the following linear model, where y is the response, x is the covariate, beta_0 is the true parameter that we are after, and epsilon is the noise. The noise is assumed to have conditional mean zero. When you are trying to find beta_0, what you do is write down the population problem and phrase it as a minimization problem. The minimizer of this population risk — it is a population problem because the expectation is over the response and the covariates — is going to give you the true parameter beta_0. And since this is a minimization problem, you can write down the stochastic gradient descent updates, given here, where gamma_t is your step size. It turns out that stochastic gradient descent with a proper step-size choice will give you the true parameter beta_0.

Of course, I left a couple of blank spaces here on purpose. The reason is: what happens if you assume the noise variance is infinite? The noise may have heavy tails coming from the model, so that its second moment is infinity. Then this objective function is not defined — it becomes infinity. But nothing stops a practitioner from running SGD in this scenario, and again, it turns out SGD still finds the true parameter beta_0. Some of you may see what's going on here, and some of you may be surprised, just like I was.
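To make this opening example concrete, here is a minimal illustrative sketch of online SGD for the linear model with infinite-variance noise. The Student-t noise with 1.5 degrees of freedom, the step-size schedule, and all constants below are arbitrary choices for the illustration, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
beta0 = rng.normal(size=d)      # true parameter beta_0
beta = np.zeros(d)              # SGD iterate

for t in range(1, 200_001):
    a = rng.normal(size=d)                   # covariates for this step
    eps = rng.standard_t(df=1.5)             # noise with infinite variance (finite mean)
    y = a @ beta0 + eps                      # response from the linear model
    gamma = 0.01 * t ** (-0.75)              # decaying step size (illustrative choice)
    beta -= gamma * (a @ beta - y) * a       # stochastic gradient of the squared loss
print("distance to beta_0:", np.linalg.norm(beta - beta0))
```

Even though the population squared loss is infinite here, the recursion itself is perfectly well defined and, as discussed above, still drives the iterate toward beta_0.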
And I will tell you why this is happening after I give you some convergence results on SGD.

OK, so throughout this talk I am interested in the following unconstrained optimization problem, where we are trying to minimize a function f whose argument is d-dimensional, and we would like to use the stochastic gradient descent algorithm given by the following recursion. Here gamma_t denotes the step size, this is the gradient of the function you are trying to minimize, and this is the noise, which is state dependent; that noisy, state-dependent term is an important component of the algorithm. The two things I would like to study in this talk are: in modern settings, just as I described, with infinite noise variance and non-convexity, how does this algorithm converge, and how does the averaging behave? I am not yet specifying where they converge, how they converge, or the type of convergence; I am just saying these are the two things I would like to study, and we will see different notions of convergence very soon. I will also talk more about Polyak-Ruppert averaging and why it is interesting.

This setup looks a bit different from what we usually run into with machine learning algorithms, but it is actually the same thing. Consider the stochastic optimization problem where you have a function f(x) that can be written as the population risk. The population risk in machine learning is the expected loss over a random variable z — typically your data point, y and x together — and this expectation is over all the data, so you don't have access to it. So you resort to a stochastic optimization algorithm like stochastic gradient descent. The SGD that we run in practice takes a mini-batch of data, say with batch size B, and estimates the gradient of this function f — a population quantity — using those B samples. To convert this back to our formulation, you can simply say that the noise is the difference between this estimated gradient and the true gradient: this is the noisy gradient, in other words, and the difference between the noisy gradient and the true gradient is the noise in my SGD. Good.

I would like to emphasize that this is online SGD, meaning that there is a stream of data coming in and each batch is independent of the others. It is not like you have a fixed batch of data points and you subsample B data points from it; that is not what we are doing.
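To connect the mini-batch picture to the recursion above, here is a small sketch of the gradient-noise decomposition on a synthetic least-squares problem; the batch size, the Gaussian data, and the helper names (minibatch_grad, population_grad) are assumptions made for this illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
d, B = 3, 32
beta0 = np.ones(d)
Sigma = np.eye(d)        # population covariance of the covariates in this synthetic model

def minibatch_grad(x):
    """Gradient estimate from one fresh mini-batch of streaming data (squared loss)."""
    A = rng.normal(size=(B, d))
    y = A @ beta0 + rng.normal(size=B)
    return A.T @ (A @ x - y) / B

def population_grad(x):
    """Exact population gradient, here Sigma @ (x - beta0)."""
    return Sigma @ (x - beta0)

x = np.full(d, 2.0)
noise = [minibatch_grad(x) - population_grad(x) for _ in range(5000)]
print("mean of the gradient noise (close to 0):", np.round(np.mean(noise, axis=0), 3))
```

The difference between the mini-batch gradient and the population gradient is exactly the noise term epsilon in the SGD recursion, and it has (conditional) mean zero.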
There is a stream of data coming in, and that is why the title says online.

I am interested in settings where the noise has infinite variance — the noise is heavy tailed. The first item is motivated by recent observations that SGD can exhibit heavy tails in many modern machine learning models. The second thing I am interested in is when the objective function is non-convex; this is motivated by the fact that many of the objective functions in machine learning are non-convex. And lastly, a practitioner can never take infinitely many iterations of SGD, so in some scenarios I would like to characterize the convergence behavior with a non-asymptotic analysis, so that I can quantify exactly how many iterations I need. These are the three items I am interested in.

Remember I said I am interested in the iterates; I am also interested in Polyak-Ruppert averaging. What is Polyak-Ruppert averaging? It is a very simple adjustment to the output: you run SGD for T iterations and you average the iterates. That is called Polyak-Ruppert averaging, and Polyak and Ruppert introduced it in different papers around the same time. It turns out this algorithm has good generalization properties, observed numerically; also, under classical setups, you can prove that it is optimal — it achieves the Cramér-Rao lower bound under finite noise variance, et cetera. So it enjoys some nice properties. One thing I am particularly interested in is that it admits a central limit theorem in the classical setting, where the noise variance is finite and the function you are minimizing is strongly convex. This is the basis for many inference methods built around the algorithm; pretty much all the inference methods that take the algorithmic part of the methodology into consideration rely on this CLT. But what happens to this CLT if the noise variance is infinite or f is non-convex, and what is the rate of this CLT? These are the two things I would like to answer.

OK, so there are many papers on stochastic gradient descent, because it is the method of choice these days for training models, but many of them can be considered follow-ups to classical results by Krasulina, Polyak and Juditsky, and others, and I will only compare this talk to those classical results — the most important papers that, in a sense, led the field. The first result, due to Krasulina, was the first paper to consider infinite noise variance in SGD.
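For reference, a minimal sketch of Polyak-Ruppert averaging itself — just a running average of the SGD iterates. The quadratic objective, the Gaussian gradient noise, and the step-size schedule below are placeholder choices, not the settings of the results discussed in the talk:

```python
import numpy as np

def sgd_with_averaging(grad_oracle, x0, step, T, rng):
    """Run SGD for T iterations; return the last iterate and the Polyak-Ruppert average."""
    x = np.array(x0, dtype=float)
    avg = x.copy()
    for t in range(1, T + 1):
        x = x - step(t) * grad_oracle(x, rng)
        avg += (x - avg) / t          # running average of the iterates x_1, ..., x_T
    return x, avg

# Placeholder example: strongly convex quadratic with Gaussian gradient noise.
rng = np.random.default_rng(2)
noisy_grad = lambda x, rng: x + rng.normal(size=x.shape)   # gradient of ||x||^2 / 2, plus noise
last, averaged = sgd_with_averaging(noisy_grad, np.full(4, 5.0),
                                    step=lambda t: 0.5 * t ** (-0.6), T=50_000, rng=rng)
print(np.linalg.norm(last), np.linalg.norm(averaged))      # the average is typically much closer to 0
```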
What this paper did was, in one dimension, assume the function is strongly convex — so it looks like this — and the noise variance is infinite; sorry, the noise has a finite alpha-th moment with alpha less than two, and the step size is t^(-rho), a decaying step size, with rho set to one. The convergence rate of SGD in Lp is then given by this expression. It turns out that with this rate you cannot show the convergence of the Polyak-Ruppert average. The reason — I'll get to it — is that the step-size decay is too fast: rho equal to one is too fast, and that decay will not give you, at least as far as the proof is concerned, convergence of the Polyak-Ruppert average.

The seminal work by Polyak and Juditsky considered the strongly convex setup. They assume the noise variance is finite — alpha is two — and they consider a range of step sizes where rho ranges from one half to one; notice that rho equal to one is not included here. The reason is that rho equal to one is too fast, and anything below it is fine. The rate they work with — they relied on previous results, too — is this much. The most important part is that they introduced techniques which led to many interesting results: they showed that the averaged iterate achieves the lower bound, and they proved, I think, the first asymptotic normality result for the Polyak-Ruppert average. Many other papers have used this idea to construct inference methods based on the CLT that they proved.

In this talk, I will first consider a class of functions that are strongly convex, but under an assumption that is stronger than strong convexity. Strongly convex functions have Hessians that live inside this cone, say the one in the figure; I will consider a set of q-positive definite matrices, which is a subset of this cone. That is the orange region you are seeing, and I'll explain later what this red region is. I will assume that the noise variance is infinity, so we are in the heavy-tailed setting, and the exponent of the step size is in this range. Notice that when alpha is two you have finite variance, and this reduces to the same interval that Polyak and Juditsky considered. This is the rate that I obtain, and we show that the Polyak-Ruppert averaging actually does not admit a Gaussian CLT: it admits an alpha-stable limit, which is heavy tailed. It doesn't have a finite variance.
So many of the inference methods that you constructed relying on that asymptotic normality will not hold in this case.

As a follow-up to this, I'll talk about how we can make these asymptotic results non-asymptotic, and prove that this CLT is achieved at a rate of d squared over the square root of T, but under the assumption that the third moment of the noise is finite. So this is a non-asymptotic result based on Stein's method, and the rate is d squared over root T. This is going to tell us that in practice these CLTs, especially in high dimensions, require a lot of iterations.

Then I will relax the strong convexity assumption used in the earlier results. I will assume that the strong convexity is required only in the tails of the function, as in the last figure here, where around the origin you are allowed to have any behavior, and with the assumption that the fourth moment of the noise is finite I will show that asymptotically the Gaussian CLT still holds. So these are the three results I would like to talk about.

Before I give you the results, I'll try to give you some intuition. The intuition is quite simple, and it leads to the assumptions that we make. Assume that you are running SGD and at iteration t you are here: this is the point x_t you are at, and x* is where you want to get to. Once you have strong convexity and you run gradient descent with a small enough step size, you will converge to this point. So let's consider one step of the full gradient descent. The full gradient step takes your x_t and moves it closer to x*: the distance from x* to x-tilde_{t+1}, the output of (non-stochastic) gradient descent, decreases following this rule. There is a constant r, less than one, that multiplies the distance from x_t to x* and gives you the distance from x-tilde_{t+1} to x*. That is the gradient descent algorithm.

In stochastic gradient descent, you have noise. What this means is that you take the gradient descent step plus the step size times some noise. That is the stochasticity part, and that is where your SGD step takes you.
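A tiny numerical illustration of this picture, with an arbitrary quadratic and a Student-t noise draw (not the speaker's example): one step of plain gradient descent contracts the distance to x*, and the SGD step sits within step-size-times-noise of the gradient descent step.

```python
import numpy as np

rng = np.random.default_rng(3)
d, gamma = 4, 0.1
H = np.diag(rng.uniform(1.0, 3.0, size=d))     # Hessian of a strongly convex quadratic, minimum at 0
x_star = np.zeros(d)

x_t = 5.0 * rng.normal(size=d)
eps = rng.standard_t(df=1.5, size=d)           # one heavy-tailed noise sample

gd_next = x_t - gamma * H @ x_t                # full (noiseless) gradient step
sgd_next = gd_next - gamma * eps               # SGD step = gradient step + gamma * noise

r = np.linalg.norm(np.eye(d) - gamma * H, ord=2)   # contraction factor, < 1 here
print("contraction holds:",
      np.linalg.norm(gd_next - x_star) <= r * np.linalg.norm(x_t - x_star) + 1e-12)
print("extra distance from noise:", np.linalg.norm(sgd_next - gd_next),
      "= gamma * ||eps|| =", gamma * np.linalg.norm(eps))
```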
And it turns out that this extra distance is controlled by the step size, because the noise is multiplied by the step size: the smaller the step size, the smaller this noise term will be.

OK. Now I can construct this triangle and write down the triangle inequality: basically, the distance from the next SGD iterate to the optimum is smaller than the sum of these two edges. That is just the triangle inequality. What it tells me is that I obtain this edge from the contraction, and I obtain this edge from the noise, which is multiplied by the step size. It is a very simple argument, and it leads to the assumptions we make, which will hopefully be intuitive.

The gradient descent contraction, as I said, is given by this inequality. In the case where you have finite noise variance, the standard analysis continues by measuring the distance from x_t to x* in L2 — basically the expected value of the squared norm. It turns out the SGD iteration satisfies this inequality, following the standard analysis. So all I need to show is that this quantity, which plays the role of r here, is less than one. Remember, I am going to let the step size go to zero, because I need a decreasing step size in this part of the talk. This means that for a gamma that is small enough, this quantity has to be smaller than one; in other words, for small enough gamma there is a constant such that this inequality is satisfied. Now, this is exactly equivalent to saying that f is strongly convex, or that the Hessian of f is uniformly positive definite — the two statements are equivalent. So that was the standard SGD analysis, and strong convexity gives us nice results.

But in the infinite-noise-variance setting, you see, I cannot use the second moment here. What I need to use is a p-th moment where p is smaller than two. For that reason, I am going to choose my distance — this is not a metric, but still — as the expected value of the p-th power of the p-norm. If this is my new distance, then the recursion I wrote for the second moment in the finite-noise-variance case can be written equivalently like this. With the same strategy, I need to make this first term, which gives me the decay, smaller than one so that I have a contraction. The equivalent expression is this: I need this p-th power of the p-norm to be less than or equal to one minus gamma, up to a constant.
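Written out schematically for a quadratic objective, with constants suppressed (this is a reconstruction of the two recursions being described, so treat the exact form as an assumption rather than the slide's statement), the finite-variance analysis tracks

\[
\mathbb{E}\,\|x_{t+1}-x_*\|_2^2 \;\le\; \big\|I-\gamma_{t+1}\nabla^2 f\big\|_2^2\;\mathbb{E}\,\|x_t-x_*\|_2^2 \;+\; \gamma_{t+1}^2\,\sigma^2 ,
\]

and the contraction requirement \(\|I-\gamma\nabla^2 f\|_2^2 \le 1-c\gamma\) for all small \(\gamma\) is what forces strong convexity; in the infinite-variance setting one instead tracks \(\mathbb{E}\,\|x_t-x_*\|_p^p\) for an exponent p below two and asks for

\[
\big\|(I-\gamma\,\nabla^2 f)\,v\big\|_p^p \;\le\; (1-c\gamma)\,\|v\|_p^p \qquad \text{for all unit vectors } v \text{ and all small } \gamma .
\]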
And it turns out that, just as the previous condition was equivalent to uniform positive definiteness, this condition will also imply some sort of positive definiteness — but one that is stronger than the standard notion of positive definiteness that we are all familiar with. So this is the idea: I need a contraction in some distance. If the distance is the Euclidean distance squared, that gives me strong convexity; if it is the p-norm, it will require something stronger. That is the idea.

So let's recap what we have in the standard, classical setup. As I mentioned, this inequality — this matrix having operator norm squared less than or equal to one minus gamma (times a constant) for all small gamma — is equivalent to f being strongly convex, which is saying that the Hessian is lower bounded. This implies that the set of matrices — the Hessian matrices of the function you are trying to minimize, over this set — is a subset of the positive semi-definite cone, which is represented as the cone in the figure on the left. A symmetric matrix Q is positive definite if, whenever the norm of v is one, this quadratic form is strictly greater than zero for all v on the unit sphere; and Q is positive semi-definite if the inequality is not strict. The set of such matrices defines a cone; this is that cone.

Now let's go to the p-norm. This was the 2-norm here, right? Let's go to the p-norm and see what happens. To get the contraction in the p-norm, I need this inequality for gamma in this range. Before showing you the equivalence, I need a definition: the sign power of a vector. Given a vector like this and some power q greater than or equal to zero, the sign power of the vector is defined by this quantity: take the sign of each entry and multiply it by the absolute value of that entry raised to the power q. So let me give you the definition and then I'll show you the figure and how these sets look. A symmetric matrix Q is p-positive definite if, for all v on the unit sphere, v and its sign power satisfy this strict inequality; it is p-positive semi-definite if the strict inequality becomes non-strict, and the set of all such matrices is defined accordingly.
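Here is a small sketch of the sign power and a Monte-Carlo check of the resulting quadratic-form condition. The exact pairing below — the sign power with exponent p minus one against Qv — is one reading of the definition, chosen so that p = 2 recovers ordinary positive definiteness; treat it as an assumption rather than the paper's exact statement.

```python
import numpy as np

def sign_power(v, q):
    """Entrywise sign power: sign(v_i) * |v_i|**q."""
    return np.sign(v) * np.abs(v) ** q

def looks_p_positive_definite(Q, p, n_samples=200_000, seed=0):
    """Monte-Carlo check of <sign_power(v, p-1), Q v> > 0 over random unit vectors.

    A random search like this only *suggests* p-positive definiteness; it is not a proof."""
    rng = np.random.default_rng(seed)
    d = Q.shape[0]
    v = rng.normal(size=(n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    forms = np.einsum("ij,ij->i", sign_power(v, p - 1), v @ Q.T)   # Q symmetric, so v @ Q.T = Q v row-wise
    return bool(forms.min() > 0)

# p = 2 recovers the usual positive definiteness; p = 1 relates to diagonal dominance.
Q = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.5, 0.4],
              [0.3, 0.4, 1.0]])          # symmetric, diagonally dominant
print(looks_p_positive_definite(Q, p=2), looks_p_positive_definite(Q, p=1.5))
```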
Here is how the picture looks. When p is equal to two, the definition reduces to the standard one, and you get the cone that is the blue region here. When p is equal to one, the definition reduces to the set of all diagonally dominant matrices with non-negative diagonal entries — that is also a cone, the red region here. And for any p in between one and two you still get a cone, which sits in between these two cases, p equal to one and p equal to two. So there is an inclusion relationship between these cones, and it is important that for p between one and two the cone lives between the cone of diagonally dominant matrices and the positive semi-definite cone; the cone of diagonally dominant matrices is the most extreme case. So if you make that assumption — diagonal dominance — then all of our assumptions are already satisfied. I'll talk about this a bit further later.

Let me give you the two assumptions that I need for the convergence of SGD in the heavy-tailed regime. I will assume that the set of matrices — basically these Hessian matrices — is bounded and uniformly p-positive definite. So when you see a p-positive definiteness assumption, it is referring to this, uniformly over the set of all Hessians, and here p is equal to q. That is my first assumption, on the function. And I will assume that the noise admits the following decomposition. The noise itself is state dependent, and there is a state-dependent component and a state-independent component — by state I mean the iterate x_t. The state-independent components are i.i.d. with expectation zero and finite alpha-th moment for some alpha less than or equal to two (two is also fine). The state-dependent component is a martingale difference sequence, and it satisfies this inequality. This means that the variance of the state-dependent component can be large, but for a fixed x, for a given x_t, it is always finite, whereas the heavy-tailed part comes from the i.i.d. component. So that is the noise assumption that I make in the heavy-tailed case.

We have this recent result saying that if you have a step size decaying like t^(-rho), with rho between zero and one, then the error of SGD — the 2-norm to the power q; that is why there is a subscript 2 here — is given by this expression, and that is its decay rate. So this is the convergence rate we achieve in the case of heavy-tailed noise, for the last iterate of SGD to the unique minimizer of the problem.
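As a sketch of how these two assumptions might be instantiated in a simulation — the diagonally dominant Hessian, the Student-t i.i.d. component, the 0.1 scaling of the state-dependent part, and the step-size constants are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
H = np.array([[2.0, 0.3, 0.2],
              [0.3, 1.5, 0.1],
              [0.2, 0.1, 1.0]])      # diagonally dominant Hessian of a quadratic, minimum at 0

def gradient_noise(x, rng):
    """Noise of the assumed form: i.i.d. heavy-tailed part plus a state-dependent
    martingale-difference part whose conditional variance is finite for each fixed x."""
    zeta = rng.standard_t(df=1.5, size=d)                # i.i.d.; moments finite only below order 1.5
    eta = 0.1 * np.linalg.norm(x) * rng.normal(size=d)   # mean zero given x, variance grows with ||x||
    return zeta + eta

x = np.full(d, 10.0)
for t in range(1, 100_001):
    gamma = 0.05 * t ** (-0.7)                           # rho = 0.7, inside (0, 1)
    x = x - gamma * (H @ x + gradient_noise(x, rng))
print("distance to the minimizer:", np.linalg.norm(x))
```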
220 00:28:24,880 --> 00:28:34,040 OK, so what is this rate about? Remember, we have a stat size decaying with teeth minus Joe. 221 00:28:34,040 --> 00:28:44,390 And we have a noise that has finite of us moment for hopefully some of that is smaller than two and we have a HESHBON that is too positive, 222 00:28:44,390 --> 00:28:53,560 definite, and this is a stronger assumption that than a plain positive definiteness, which is equivalent to strong convexity. 223 00:28:53,560 --> 00:29:02,980 Right. So this rate. Is basically giving me when Q is going to Alpha, 224 00:29:02,980 --> 00:29:10,060 so you see the assumption on the function and the assumption on the noise are connected to has to be always less than Alpha. 225 00:29:10,060 --> 00:29:19,660 But being curious, going to Alpha in this regime, you achieved this great. 226 00:29:19,660 --> 00:29:30,200 OK, and when role is going to one which is the fastest dictate that you can have on the step size, that gets you to the fastest convergence rate. 227 00:29:30,200 --> 00:29:33,940 This is the rate of Gasolina. Sixty nine. 228 00:29:33,940 --> 00:29:38,680 But this is too fast for the convergence of polygraphers average. 229 00:29:38,680 --> 00:29:48,510 So this will not this case will not work if our objective is to show the convergence of politics upward averaging. 230 00:29:48,510 --> 00:29:56,280 So when offer is going to two, meaning that your noise is now having finite variance. 231 00:29:56,280 --> 00:30:04,950 Then the rate that you achieve is this much, and in this case now you can actually move to the all the way to the two. 232 00:30:04,950 --> 00:30:15,780 So you have the strongly complex functions. And in this case, you recover the normal rate T to the minus trover to for finite variance. 233 00:30:15,780 --> 00:30:21,540 But this is the finite variance regime and this is the infinite variance regime. 234 00:30:21,540 --> 00:30:29,620 And this rate is not good for us because I want to show the convergence of the political upward averaging. 235 00:30:29,620 --> 00:30:39,050 That's too fast for that. So remember, the gradient noise was given by this edness, this decomposition. 236 00:30:39,050 --> 00:30:50,300 So what I will additionally issue to the previous case is that I'll assume that this average of the IED component admits that of a stable, 237 00:30:50,300 --> 00:30:53,340 symmetric of a stable limit. 238 00:30:53,340 --> 00:31:05,580 And so standard generalised Keltie results actually has certain conditions to be satisfied in one result is due to climate in a Danco. 239 00:31:05,580 --> 00:31:16,140 And if the random variables with power to tails symmetric random variables with power little tails, then you have the following satisfied. 240 00:31:16,140 --> 00:31:27,600 So a stable distribution, this alpha stable limit assumption allows me to characterise the limiting behaviour of the Polya corrupted averaging. 241 00:31:27,600 --> 00:31:32,130 So I'll talk about alpha stable limits in the next slide. 242 00:31:32,130 --> 00:31:43,470 But the result is as follows. If the Roka and Alpha satisfied satisfy this inequality and this additional smoothness assumption. 243 00:31:43,470 --> 00:31:47,910 So you can think of this assumption as the Heshan smoothness. Then, 244 00:31:47,910 --> 00:31:54,300 for a Stepaside sequence satisfying the standard indicate the normalised average 245 00:31:54,300 --> 00:32:01,650 pull Robert averaging Convergys weekly to an alpha stable distribution. 
The first observation here is that the step size — the exponent rho of the step size — has to be in this regime, and in Polyak and Juditsky you also keep it within a regime like this. When alpha goes to two, when you have finite variance, the two match, but when alpha is smaller than two we have a smaller interval to work with.

So, alpha-stable distributions. Here is a figure that I took from Wikipedia and forgot to acknowledge on the slide: this is the PDF of an alpha-stable distribution, and this is the parameter alpha. The variance of this limit is infinite — there is no variance when alpha is smaller than two — so you won't have a Gaussian CLT when alpha is smaller than two. And when alpha is two, this recovers the standard asymptotic result of Polyak and Juditsky, because an alpha-stable distribution with alpha equal to two is Gaussian. As with pretty much all results on Polyak-Ruppert averaging, the proof of this theorem relies on the construction first made by Polyak and Juditsky in their seminal work; it is just that we analyze the algorithm in the heavy-tailed setting. That is the novelty.

One point I would like to make is that when alpha is less than two there is no Gaussian CLT. So the inference methods that are constructed for quantifying the uncertainty of SGD, for example, are not going to be valid, because they are all based on the Gaussian CLT for the Polyak-Ruppert averaging. A lot of those methods have to be rethought in the case where you have heavy-tailed noise in SGD.

Before I move to non-convexity and the final result, here is a result under strong convexity that we obtained in 2019, as a product of the Stein's method workshop in San Jose. Here we characterize how fast you converge to the Gaussian in the case where you have finite variance — actually, we assume that the third moment of the noise exists. Then you can characterize how fast you are converging to the Gaussian with this theorem. What the theorem says is that for any test function that is three times differentiable, the difference between the expectation of the test function evaluated at the normalized Polyak-Ruppert average and at the Gaussian is upper bounded by this quantity. The proof relies on a non-asymptotic martingale CLT using Stein's method. But one thing I would like to emphasize is that this rate is not a good rate: it is d squared divided by the square root of the number of iterations, where d is the dimension.
In high dimensions, to make this small you need to take way too many iterations. It is not a good rate, but it is still consistent with the experiments: a lot of the inference methodology that relies on this CLT in practice requires a small dimension plus a lot of iterations of SGD. So this is consistent with what we observe in practice. As a side note, burn-in helps a lot in this kind of setting, but I don't have any theoretical quantification of that.

So, back to the least-squares example that I started the talk with. We considered this linear model, right? And epsilon, the noise, had infinite variance. Now, without defining the objective function, you can directly define its gradient. What does this mean? It is a bit of a cheat, but I can choose a function of x and, instead of writing this as a minimization problem, write it as finding the zero of a function. If I define it this way, the zero of the function is going to be beta_0, which is basically where the minimum would be anyway.

George has a question: in the rate result, was the absolute value inside the expectation? Good point — that is inside the expectation; I just verified it by looking at the paper. So it should be inside the expectation, but I can double-check.

OK, so in the least-squares setup, beta_0 is the unique minimizer of this multi-dimensional function, and x*, the zero of this function, is going to be beta_0; it is quite straightforward to see this. The SGD updates can then be written as a stochastic approximation algorithm, given by this recursion, and the noise part is just as we did on the second slide: the noise admits the decomposition where the i.i.d. component is given by this quantity and the state-dependent component is given by this quantity. Here the state is x, and you see the i.i.d. component does not have the state in it. One can verify that if you assume the first moment of the noise is finite, then the first moment of the i.i.d. component is finite, and the state-dependent component trivially satisfies this upper bound if we assume the covariates are nice. So this is telling us that the noise assumption will be easily satisfied. And for the function — the gradient of the function — we have this expected value of x x-transpose, the covariance of the features.
So you can assume this matrix is diagonally dominant, and then all the assumptions are satisfied — the p-positive definiteness is satisfied — but p-positive definiteness is a milder assumption than diagonal dominance, so you can actually assume that directly of the covariance matrix. Then the convergence results can be invoked for this case.

OK, so now I will talk about non-convexity, and what happens in the case of non-convexity and non-smoothness — basically assuming that your objective function is non-convex and non-smooth at the same time. The problem is a bit harder to handle. In this case we will consider SGD with a constant step size, so that we can rely on Markov chain convergence results. This methodology — the analysis of constant-step-size SGD via Markov chains — was first carried out by Dieuleveut and co-authors in 2019, but their analysis was under strong convexity. It turns out those assumptions are not needed, and you can relax them to non-convex objectives. The key point here is that I have a step size gamma that is constant — this should just read gamma; it's a typo on the slide.

The first assumption I make, instead of a smoothness assumption, is that the gradient of the objective f grows at most linearly. The second assumption is that, instead of strong convexity, the function satisfies a dissipativity assumption. The dissipativity is given by this inequality, and it is satisfied for all x. What does this assumption tell me, and how does it relate to strong convexity? Strong convexity is a global curvature assumption, whereas dissipativity restricts that behavior to the tails: beyond a certain radius you have quadratic growth of f, and around the origin you are allowed to have non-convexity. That is the assumption, and it is quite common in non-convex optimization analysis.

In this case, the noise sequence is assumed to have a finite fourth moment, and additionally I assume that its density has full support, so that I have a nice Markov chain defined by the SGD recursion. The key point is that the noise is still state dependent — I can write the standard SGD in this form — but it has finite variance given x, given the state. (I used an Oxford comma in the title, so you can tell me whether I used it correctly or not.) So in this case the noise is state dependent with finite variance, and the objective is non-convex.
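To make these two assumptions concrete, here is a toy objective (illustrative, not from the talk) that is non-convex near the origin, has a gradient with at most linear growth, and is dissipative, together with a quick numerical spot-check of the dissipativity inequality:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 2

def grad_f(x):
    """Gradient of f(x) = 0.5*||x||^2 + 2*sum_i cos(x_i).

    f is non-convex near the origin (the Hessian there is I - 2I = -I),
    the gradient grows at most linearly, and f is dissipative in the tails."""
    return x - 2.0 * np.sin(x)

# Dissipativity spot-check: <x, grad_f(x)> >= a*||x||^2 - b with a = 0.5 and b = 2*d,
# which follows from 2*sum_i x_i*sin(x_i) <= 0.5*||x||^2 + 2*d.
a, b = 0.5, 2.0 * d
xs = rng.normal(scale=20.0, size=(100_000, d))
lhs = np.sum(xs * grad_f(xs), axis=1)
rhs = a * np.sum(xs * xs, axis=1) - b
print("dissipativity holds on all sampled points:", bool(np.all(lhs + 1e-9 >= rhs)))
```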
Given this, standard Markov chain results tell us about the ergodicity of SGD: for a small enough step size, the iterates admit a unique stationary distribution, and although we cannot characterize this distribution, we can show its existence and uniqueness. It depends on the step size. Also, for a test function, the iterates satisfy this convergence to the stationary distribution — not to any of the minima, just to the stationary distribution. The convergence is geometric, but it is not directly comparable to the previous convergence results we talked about.

So what happens to the Polyak-Ruppert averaging in the non-convex case? In the heavy-tailed case it did not satisfy a Gaussian CLT, but it turns out that in the non-convex case there is a Gaussian CLT. I will define the average of the SGD iterates evaluated at a test function, and the expectation under the invariant measure is defined this way. Then the Polyak-Ruppert average of the test-function values admits a central limit theorem, given by the expression here. So this is cool — the CLT is still valid under non-convexity — but we have to talk about the bias of this algorithm in order to come up with something useful.

What do I mean by bias? How does this compare to any of the critical points, or the global minima, of this non-convex function? Without any local regularity — if you don't assume anything further than just the dissipativity assumption — this is not possible, and all we can hope for is that the SGD algorithm satisfies this inequality, where this constant is of order one. So by letting the step size become small, you are not getting any improvement here. SGD converges to a ball that contains all the critical points, and this convergence is exponentially fast, but the bias — the bias relative to x* — is not controllable with the step size.

For that reason, I need to enforce a local regularity on top of the dissipativity growth assumption. This is the previous dissipativity, which I assume outside of a ball, and now I assume that inside the ball there is a local regularity defined by this convex function, which is invertible and zero at zero. That is the additional local regularity, and under this additional, localized dissipativity assumption, I can show that the iterates admit a stationary distribution and that the bias of this stationary distribution of SGD is now controllable by the step size.
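On the toy objective above, a small simulation of the ergodicity statement: with a constant step size, the long-run time average of a test function comes out essentially the same from very different initializations. The step size, the Gaussian gradient noise, and the test function h(x) = ||x|| are again illustrative choices.

```python
import numpy as np

def grad_f(x):
    """Same toy non-convex, dissipative objective as above: f(x) = 0.5*||x||^2 + 2*sum_i cos(x_i)."""
    return x - 2.0 * np.sin(x)

def ergodic_average(x0, gamma, T, seed):
    """Constant-step SGD with Gaussian gradient noise; time average of h(x) = ||x||."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    values = np.empty(T)
    for t in range(T):
        x = x - gamma * (grad_f(x) + rng.normal(size=x.shape))
        values[t] = np.linalg.norm(x)
    return values.mean()

# Two very different initializations give approximately the same long-run average,
# consistent with convergence to a unique stationary distribution for this step size.
print(ergodic_average(np.zeros(2), gamma=0.1, T=200_000, seed=8))
print(ergodic_average(np.full(2, 30.0), gamma=0.1, T=200_000, seed=9))
```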
So if I let gamma go to zero, the right-hand side of that bias bound also converges to zero. And for a test function that is Lipschitz, you can take this and show that the expectation under the invariant measure, minus the test function evaluated at the minimum, is upper bounded by this quantity, which goes to zero as gamma goes to zero.

Here is a non-convex function, the Blake-Zisserman penalty, which I found in, I think, the Visual Reconstruction book. This function actually turns out to satisfy the assumptions of these theorems, and you can verify that it admits a CLT. The experiments also bear this out — our results match the numerical studies — and in the numerical studies we use this test function, which is the absolute value. In the paper we also verify rigorously that the assumptions of our theorems are satisfied.

I'm out of time, so I'll conclude with a summary of what I told you today. Basically, this was a convergence analysis of SGD and of the Polyak-Ruppert averaging. In the infinite-noise-variance case, we showed the Lp convergence of SGD and we identified a regime of step-size decay such that the Polyak-Ruppert averaging admits an alpha-stable — that is, a non-Gaussian — limit. Under strong convexity, we talked about a non-asymptotic rate for the CLT obtained via Stein's method. And under non-convexity, we showed that Polyak-Ruppert averaging admits a CLT, and that its bias can be characterized if we have additional local structure. That's it from me today; if you have questions, I would be happy to answer.

OK, so thank you very much for a very nice talk. Is there any question from the audience? You can just unmute yourself and ask.

Yeah, I've got a question. I really enjoyed your talk, and you can probably guess what my question is about. You say you can never run stochastic gradient descent infinitely long, and you have these results with rates for the central limit theorem. But can you use results for alpha-stable distributions with Stein's method in order to get an explicit bound?
Yes — that's a great question, and I think you had a recent paper on alpha-stable limits and Stein's method, so it's definitely doable; it's just a matter of doing it. Yes, that's a very interesting direction: basically, how does this d-squared-over-root-T rate change when you don't have a Gaussian limit but an alpha-stable one? That's a very interesting direction to explore.

Can I ask a question? Actually, it's sort of a double question. So, what's the motivation behind the Polyak-Ruppert averaging?

Yeah, another great question. In the classical setting, the Polyak-Ruppert averaging achieves the Cramér-Rao lower bound, whereas the last iterate does not. And now, another interesting research direction: what does that lower bound on the variance translate to in the heavy-tailed setting, when there is no variance? What kind of optimality can one prove for the Polyak-Ruppert average there?

OK. So does anyone ever consider different types of averaging?

Yes, I think that is also common, and I don't think the analysis is very hard given a specific weighting between the iterates, but we haven't done that in any of our results.

OK, great, thanks. And to go back to my previous question about the absolute value: the reason I'm asking is that if you put the absolute value inside the expectation, then it sort of requires that you build some kind of coupling between the limiting Gaussian and your process, so it usually requires heavier machinery and potentially the rate can be worse. If you take the absolute value outside of the expectation, maybe the rate can improve.

Hmm, that's an interesting point. Usually the absolute value is outside — I think you're right, the absolute value should be outside, because I'm checking the martingale CLT that we proved in the paper, and it appears to be outside of the expectation.

OK, because otherwise you would need some kind of strong central limit theorem.

You're right. Yeah, this relies on basically a similar construction from the literature, so it should be outside, I guess. We always have it outside.

I had a quick question.

Yeah, go ahead.

So, the assumption on the covariance of the covariates: you said that in the case of p equal to one, it is related to diagonal dominance. Is there an intuition for how that constrains the covariance structure in the case of linear regression? Is it going to be more aligned with the identity — do the covariates have to be more independent, or something like that?

So you have the covariates, the expected value of x x-transpose, right? These are your covariates, and this is a kind of Sigma.
If this Sigma satisfies diagonal dominance, then the assumption is satisfied, because then you are in this red region, and that satisfies the assumption for any p. For example, if you have white covariates, then this is the identity, so it is going to be satisfied. Does that answer your question, or did I miss part of it?

So in some sense, is the assumption saying that the covariates have to be more independent? For p equal to two it will always be satisfied, but as it moves towards p equal to one, does it say that the covariance has to be closer to the identity in some sense?

Yeah, I think so. This is the inequality I want, right, so I think it can be interpreted that way.

I just wondered about the cases in between, for p between one and two.

Certainly. This is a weaker assumption than the diagonal dominance — and just to clarify, this is not really diagonal dominance; it is diagonal dominance with non-negative diagonal entries. In this case, that is the assumption you have, and I think you can interpret it as the correlations between the features being weaker compared to the stronger case.

Another question from the audience?

Yeah, I have a question, I guess for the latter part of the talk, the constant step size under non-convexity. If it's mini-batch SGD instead of single samples, can we still prove, maybe with some stronger assumptions, the central limit theorem there?

You can use mini-batch SGD. Let me go back to this slide: this is still valid — you can have gamma_{t+1} equal to a constant gamma, a constant step size, and you can have mini-batches. But this is a stream of data, so it's not that you have a batch of data that you resample; the mini-batches should be coming in as new data points. That's what I meant. If that is the case, you can do it. But everything in this talk is online.

OK. For the SGD iterates, there is a stationary distribution, but do you have any type of convergence result — you sort of mentioned it — convergence to the invariant measure?

Yeah. So convergence — the standard ergodicity-type results are easy to prove using Markov chain machinery, like this one. This is the convergence that I briefly talked about, and it is quite standard. But the important point is that this part is easy.
What is hard is this one: how do you actually characterize anything about this invariant measure? That is the biggest challenge.

Yeah, I see what you mean. But I guess my question is: in practice, when you run algorithms like this, do you ever have convergence to the invariant distribution, or do people just run them on a much shorter time scale, so that you only ever stay close to initialization, or the part of the landscape you explore is heavily influenced by your initialization?

Mm hmm, that's a good point. So what you're saying is — OK, this is not answering your question, because this holds for any initialization; you get the same result. What you're saying is whether the stationary distribution depends on the — yeah, it could depend on the initialization as well.

Well, that's not quite what I'm saying. Obviously this r here, the rate — I'm not sure if you've quantified it — you can compute this r.

I mean, I think we stop at saying r is less than one and greater than zero, but I think you can compute it.

OK, so let me rephrase. I'm not sure how you're proving this, but if you're doing it using drift and Lyapunov functions and things like that, it's very likely that this r will be very, very bad in terms of the dimension dependence.

Yes.

So if you do that in practice, if you plug that in and you say, OK, I've run my algorithm for, let's say, capital T steps, that's how close I am to the invariant distribution. What I'm saying is: this gives you a mixing time. How does that mixing time compare to practical runtimes? Some of these algorithms operate at a completely different time scale, so that you'll never, ever see the invariant distribution.

That's a good point. This proposition is used in this theorem, which is asymptotic, so we actually never required r to be anything nice; as long as it is less than one, we were good. So we didn't explore what happens in the finite-sample regime, because our objective was this CLT, which is asymptotic. But that's an interesting thing to think about, actually.

In your experience, what do you think?

I can show you — I knew I wasn't going to have time, so that's why I cut it. Yeah, OK, I actually had experiments that I took out of the talk, but —
What people observe is that when d is larger than about 20, you have to run for a very long time to see any of these results. That asymptotic result is in line with the non-asymptotic version of the CLT: if the rate is d squared over the square root of T, then when d is large you expect to have to run essentially forever. And what we observed is similar: in the experiments you can verify things when d is around 20 or 50, but anything beyond that is hard.

Right. OK, thanks.

OK, I guess we have run out of time, so let's end here for today. Thank you, everyone, for coming, and let's thank the speaker again. Have a nice weekend.

Thank you for attending.