[Host:] So today we are going to hear from Karolina. She is a research scientist at Element AI. Before that, she did her PhD at the University of Cambridge under the supervision of Zoubin Ghahramani, and before that she studied mathematics at the University of Warwick and did Part III in mathematics at the University of Cambridge, where she focused on applied mathematics. Karolina has also held visiting positions: she has been a member of the Institute for Advanced Study and is affiliated with a programme on the theory of computing. Karolina has been doing very interesting work on the use of information theory to improve our understanding of the generalisation properties of learning algorithms, particularly in deep learning, and that is what she is going to talk about today. The title of the talk is "Distribution-dependent generalisation bounds for noisy, iterative learning algorithms". And now I will hand over to her.

[Speaker:] Thanks for the introduction, and thanks for the invitation to present my work. I'm going to present work that was done in collaboration with my colleagues at the University of Toronto.

Part of my research is centred around the following question: can we derive meaningful generalisation bounds for non-convex learning? Today I'll be focusing on stochastic gradient Langevin dynamics, or SGLD for short, used for minimising empirical risk. Similarly to stochastic gradient descent, SGLD takes gradient steps iteratively, but at each iteration we also add an additional Gaussian noise term, which makes the analysis quite a bit easier. Raginsky, Rakhlin, and Telgarsky were among the first to model SGLD this way and to prove excess risk bounds for it: assuming the learning rate goes to zero and the so-called inverse temperature parameter beta goes to infinity, SGLD samples from the Gibbs distribution, which is proportional to the exponential of the negative rescaled empirical risk.

Looking at this picture: if we initialise our training at this X here and run, for example, stochastic gradient descent or full-batch gradient descent, it will most likely converge to the local minimum close by. If we run SGLD instead, then, because it is sampling, due to this additive Gaussian noise term, it will actually explore this other minimum quite a bit as well.
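To make the update concrete, here is a minimal sketch of a single SGLD step in Python with NumPy. This is my own illustrative code, not the speaker's: the function name, the argument names, and the exact noise parameterisation (standard deviation sqrt(2*lr/beta)) are assumptions about one common convention.

```python
import numpy as np

def sgld_step(w, grad_fn, minibatch, lr, beta, rng):
    """One SGLD update: a minibatch gradient step plus Gaussian noise.

    w         -- current parameter vector (NumPy array)
    grad_fn   -- callable (w, minibatch) -> gradient of the empirical risk
    lr        -- learning rate (eta)
    beta      -- inverse temperature: larger beta means less noise (closer
                 to plain SGD), smaller beta means more exploration
    """
    g = grad_fn(w, minibatch)
    noise = rng.standard_normal(w.shape) * np.sqrt(2.0 * lr / beta)
    return w - lr * g + noise
```

Iterating this step with a fresh minibatch and fresh noise at every iteration gives the SGLD trajectory discussed in the talk; removing the noise term recovers plain minibatch gradient descent.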
Pensia, Jog, and Loh, building on the work of Xu and Raginsky, established generalisation bounds for SGLD using estimates of the mutual information between the weights at convergence and the dataset, based on the information leaked at each iteration. Unfortunately, those estimates yield vacuous bounds. But of course, these are only upper estimates of the mutual information, so the question of whether the mutual information between the weights at convergence and the dataset explains generalisation still remains. In our work, we show how gradient noise influences the mutual information, leading to non-vacuous bounds.

So let me start by defining the notation I'll be using throughout the talk and introducing the setup. Denote by D the unknown data distribution over inputs X and binary labels. We'll consider a parametric class of predictors h_w, parameterised by some w in a high-dimensional space. The goal of learning is to find some weights such that the probability of error is minimised; we call this probability of error the risk. Let S denote the training set, sampled i.i.d. from this unknown data distribution D. Then the empirical risk is defined as the average error on our training set; we'll denote it R-hat of S. The algorithm A takes the training data S, and maybe some other randomness — for example the minibatch order or a random initialisation — and W will denote the learnt parameters.

It's possible to decompose the risk into the following terms, just by adding and subtracting the empirical risk: risk equals empirical risk plus the difference between risk and empirical risk. That difference between risk and empirical risk is called the generalisation error.

So consider a parametric class of neural networks, for example all neural networks of some particular width and depth and a particular activation function, such as ReLU. Many of the most successful learning algorithms look superficially like approximations to empirical risk minimisation: we take the empirical risk and try to minimise it with respect to the parameters, approximately. Going back to the risk decomposition I introduced on the previous slide, the risk is the empirical risk plus the generalisation error, the difference between risk and empirical risk. We can say that the first term, the empirical risk, is small by design, because we are explicitly minimising it. But what controls the generalisation error, the second term?
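As a small illustration of these definitions — my own toy sketch, not from the talk — the generalisation error is simply the gap between a Monte Carlo estimate of the risk on fresh samples and the empirical risk on the training set, here with the zero-one loss:

```python
import numpy as np

def zero_one_loss(predict, X, y):
    """Average 0-1 error of a predictor on inputs X with labels y."""
    return float(np.mean(predict(X) != y))

def generalisation_error(predict, X_train, y_train, sample_data, n_fresh=100_000):
    """Empirical risk, a Monte Carlo estimate of the risk on fresh draws
    from the data distribution, and their difference (the generalisation error)."""
    empirical_risk = zero_one_loss(predict, X_train, y_train)
    X_fresh, y_fresh = sample_data(n_fresh)   # fresh i.i.d. samples from D
    risk_estimate = zero_one_loss(predict, X_fresh, y_fresh)
    return empirical_risk, risk_estimate, risk_estimate - empirical_risk
```

In practice we never have unlimited fresh draws from D, which is exactly why the talk is about bounding this gap rather than measuring it directly.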
The classical approach to controlling the generalisation error term exploits uniform convergence. The key idea is quite simple. Again, the risk decomposes as the empirical risk plus the generalisation error. We can evaluate the empirical risk, of course: we just compute it on our training set; we don't need to estimate or bound it in any way. But the second term, the generalisation error, can be bounded in the worst case, based on the worst-case generalisation error of any model in a particular class and, perhaps, any data distribution. This approach works really well for small model classes, but not for explaining modern machine learning.

Here's the classical picture of how generalisation is supposed to work. On the x-axis we have the complexity of our hypothesis class, and on the y-axis we have error. The empirical risk, which is this curve in blue, of course decreases as we increase model complexity, because we can fit the training data better and better. The risk, the green curve, initially decreases, and in this classical picture it then starts increasing, because we get too much complexity in our class. The difference between the two is the generalisation error, and of course it starts increasing as well. This is not what happens in modern machine learning with deep networks: as we increase complexity, as measured by, let's say, the number of layers or the number of parameters, the training error, which is in black here, decreases and eventually goes to zero, and the test error doesn't go up — it actually keeps going down even after we fit all the training data. This plot, I think, comes from a Neyshabur et al. paper from 2015.

So let W be the weights learnt by stochastic gradient descent. A uniform convergence bound would do the following for us. We might want to bound the generalisation error, in expectation or with high probability, in terms of some epsilon, which may depend on the hypothesis class we're using, the number of samples we have, the probability of failure, the data distribution, the training set, the algorithm we use, and so on. The fewer terms that epsilon depends on, the stronger the bound. So perhaps it should only depend on the hypothesis class, the number of training data we have, and the probability of failure. An example of such a bound is the VC bound. The VC bound roughly says the following, and the VC dimension of a hypothesis class of neural networks is going to be roughly at least the number of parameters.
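As a reminder of the shape of such a bound — this is the standard textbook form up to constants, not necessarily the exact expression on the slide — a VC-type uniform convergence bound says that, with probability at least 1 − δ over the draw of the training set,

\[ \sup_{w \in \mathcal{H}} \big( R_{\mathcal{D}}(w) - \hat{R}_S(w) \big) \;\le\; c\,\sqrt{\frac{\mathrm{VC}(\mathcal{H}) + \log(1/\delta)}{n}}, \]

where VC(H) is the VC dimension of the hypothesis class, n is the number of training examples, and c is an absolute constant.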
So for neural networks, let's see. Let's take a tiny neural network trained on MNIST, with a single hidden layer of 600 units. It would already have nearly half a million parameters, which is a lot more than the number of training examples. We can see that the bound depends on the ratio between the VC dimension and the number of training data, leading, of course, to a totally vacuous bound, meaning that the bound is much greater than one. Now, one could say: we're in this regime right now, where for the amount of data we have the bound is way above one, but if we get more data we will eventually arrive at a non-vacuous bound, since the bound decreases with the number of training points. But that's not what happens in practice: as we get more training data, we usually increase the model size, the complexity of the model — we increase the size of the neural network, for example.

Of course, the finding that the number of parameters is not really the right notion of complexity was appreciated long ago. Here is a paper by Peter Bartlett from over 20 years ago, saying that the size of the weights is more important than the size of the network. And here is some more recent work by Neyshabur and colleagues focusing on norms of the weights instead of the number of parameters. So instead of the VC-style bound, these papers suggest that maybe we can get a generalisation bound that also depends on the norm of the weights, which would help us limit the hypothesis class we're looking at and perhaps get a tighter bound.

Looking again at a simple network trained on MNIST, we can evaluate these bounds. We'll see that the bound depends on the path norm divided by the margin. The path norm, as training progresses along the x-axis, increases; the margin also increases, which is great, because we care about the ratio of the two. But note that the norm here is plotted on the y-axis on a log scale, and the margin is not. So when you compute the bound, as soon as the network starts doing anything non-trivial, the bound is totally vacuous. Now, one could argue that explicit regularisation might make it useful. Indeed, as you start regularising, the norm term grows more slowly — it doesn't grow as much anymore. But unfortunately the margin also decreases, which can still lead to a vacuous bound. As we increase the regularisation further, the norm decreases and the margin decreases as well, and when you look at the ratio, perhaps we can now get a bound below one. But our predictor — here are the training and test errors and the margin — is getting something like 30 percent error, so it's not actually a useful predictor anymore.
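Schematically — this is the generic shape of such norm/margin bounds, not the exact expression from the slide — these results say something like

\[ R_{\mathcal{D}}(w) - \hat{R}_S(w) \;\lesssim\; \frac{C(w)}{\gamma\,\sqrt{n}}, \]

where C(w) is a norm-based complexity of the learnt weights (for example a path norm, or a product of layer norms), γ is the classification margin achieved on the training data, and n is the number of training examples. The point being made here is that both C(w) and γ change during training and under regularisation, and it is their ratio, together with the quality of the resulting predictor, that determines whether the bound is useful.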
So, summarising the progress towards explaining the generalisation of stochastic gradient descent, I think we can conclude that no existing bounds explain deep learning as practised. Analytically, modern bounds are non-asymptotic, but the data-dependent terms are poorly understood: if the bound includes any data-dependent terms, even something like the norm of the weights measured after training, then without doing experiments we don't really understand how those terms grow or how they scale. Numerically, no bound evaluated for stochastic gradient descent on a real neural network is non-vacuous, and asymptotically the bounds don't even scale correctly, as shown in the Nagarajan and Kolter paper from a year or two ago. It is also hard to evaluate bounds empirically: empirical correlation studies are usually unconvincing upon close inspection. There are papers that show that certain bounds do not explain generalisation and include empirical evidence for that; those are usually pretty convincing. But for papers that introduce a new bound and perform some kind of empirical correlation study, if you look closer, those correlation studies are not too convincing. In practice, the suggestions made by Jiang et al.'s paper, Fantastic Generalization Measures, have absolved authors of any serious consideration of empirical evidence. There they propose a single metric which would allow for comparing different generalisation measures, but I don't think that a single metric can actually capture the failures and successes of these measures.

In our own NeurIPS 2020 paper, In Search of Robust Measures of Generalization, which was a collaboration with researchers at Mila, Element AI (ServiceNow), and the University of Toronto, we argue for a distributional robustness analysis. Now, this analysis is not straightforward: it doesn't produce a single quantity to look at, and it's hard and subtle. But I think that if one wants to understand generalisation measures, we need a hard analysis; we can't conclude things from a single number. One thing we do conclude in the paper is that no bound — no generalisation measure we tested — is actually robust, and that the role of the data is more important than has been appreciated. Explanations must be data dependent, and we don't really have good tools for measuring how hard the data is. And explanations must not argue via uniform convergence over a class containing the learnt predictor — this is really based on our recent work with Jeffrey Negrea and Daniel Roy that appeared last year, and on the Nagarajan and Kolter paper from a couple of years ago.
So what are the barriers to explaining generalisation? Well, there are many, and there are a few that we'll be dealing with today. One barrier is statistical: the bulk of empirical generalisation performance may be due to properties of the unknown data distribution, and we only have samples, not the true data distribution. Another barrier is computational: tight upper bounds on various divergences, like mutual information, are often intractable, and bounds that depend on marginal probabilities or on sampling are also often intractable. So I'll be focusing on non-convex learning with stochastic gradient Langevin dynamics, and I'll now go back to the question I presented on the first slide. Are there any questions at this point? I also forgot to mention that if you have any questions, please feel free to interrupt me.

[Host:] There aren't any questions so far in the chat. Please feel free to ask them in the chat as we go.

[Speaker:] OK, so I'll continue for now, and just interrupt me if you have any questions. Here is the SGLD update again, just as a reminder: it's a standard update based on the gradient of the empirical risk, plus a Gaussian noise term. This additional noise term actually makes SGLD much easier to analyse than SGD. The beta is referred to as the inverse temperature parameter; it trades off exploration versus optimisation. As we increase beta, we add less noise and get closer to standard stochastic gradient descent; as we decrease beta, we add more noise and do more exploration.

There are two useful views of SGLD that I can think of. One is the sampling view: as I mentioned before, we can see SGLD as producing samples from a Gibbs distribution, but unfortunately this only holds under unrealistic assumptions. The second view is that SGLD optimises the empirical risk while the resulting weights don't carry too much information about the data. Does this latter view explain generalisation?

So let S be the training data, just as before, and W the weights learnt by stochastic gradient Langevin dynamics. We'll write the expected generalisation error for the expected difference between the risk and the empirical risk, and assume that the loss is bounded between zero and one; this assumption can be relaxed, but it's just easier to keep in mind for now. Xu and Raginsky's theorem says that the expected generalisation error is bounded in terms of the square root of the mutual information between the weights and the data, divided by the amount of data.
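Written out — in the standard form for a loss bounded in [0, 1], which is 1/2-sub-Gaussian; the constants on the slide may be stated slightly differently — the Xu and Raginsky bound reads

\[ \big|\, \mathbb{E}\big[ R_{\mathcal{D}}(W) - \hat{R}_S(W) \big] \,\big| \;\le\; \sqrt{\frac{I(W; S)}{2n}}, \]

where I(W; S) is the mutual information between the learnt weights and the training set, and n is the number of training examples.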
For those who are more used to thinking in terms of KL divergences rather than mutual information, we can also write the mutual information in terms of a KL divergence. Let P be the marginal distribution of the weights and Q the distribution of the weights given the training set; then the mutual information is the expected KL divergence between this Q and P.

So does this theorem, bounding the expected generalisation error in terms of mutual information, explain the generalisation of SGLD? There are a couple of barriers again. One is statistical: the mutual information between the weights and the data depends on the unknown data distribution. Another barrier is computational: even if the data distribution were known, the marginals are often intractable, meaning that the mutual information between the data and the weights is itself intractable.

So how can we get around these barriers — first, the computational barrier? Let B_t denote the random minibatches of our training set S. Stochastic gradient Langevin dynamics adds, at each step, this Gaussian noise term, which is actually nice for the analysis, and Pensia, Jog, and Loh take advantage of this. They observed that the chain rule of mutual information implies that the mutual information between the weights and the data can be bounded in terms of a sum of mutual informations that measure the information leaked at each step. So here is a conditional mutual information: we condition on the weights of the previous step, and compute the information between the minibatch used for the update and the current weights. The information leaked about the data by the final weights at convergence is bounded above by the sum of the information leaked about each minibatch at each training step.

Now, this kind of simplifies the problem: we can avoid dealing with the weights at convergence and instead think about them one step at a time. But the next hurdle is to compute this stepwise mutual information, which is also unknown and intractable, so again we have statistical and computational barriers.

[Host:] Sorry, there is a question in the chat. It says: when talking about VC dimension bounds, you mentioned that we can't just use more data, because that would also imply a larger model size. Could you elaborate on that? Can we theoretically show that the VC dimension bound doesn't hold when the data size is larger than the model size?

[Speaker:] No — it does hold. It does hold.
So, for example, if we fix a neural network and keep getting more and more data, eventually we'll get a non-vacuous bound. But if you actually look at the numerical quantities, in order to get a non-vacuous bound we would need a very, very large amount of data, which is just not reasonable, not practical. What I was trying to say there is that when we get more data, instead of keeping the same model, looking at how it performs, and getting bounds for it, in practice we move to a larger network: as we get more data, we scale up the model size again. So as we get more data, we don't keep looking at the same small network — that's what I really meant. The bounds are obviously still valid, and eventually they will give you non-vacuous bounds; it's just unreasonable to expect, even for these small networks, that we would have that much data.

OK, so: getting around the computational and statistical barriers for the one-step conditional mutual information. We sample a minibatch from S at each step and we add Gaussian noise, and this Gaussian noise makes the analysis much easier. W_{t+1}, conditioned on where we are at time t, W_t, is actually Gaussian, because of this additive Gaussian noise term. Its mean is centred at the update, W_t minus the gradient of the empirical risk on the minibatch, and the covariance is determined by the scale of the Gaussian term we are adding, so it is always fixed. We always know the covariance; to know the mean, we would have to know the minibatch.

Now, this single-step mutual information is, intuitively, the expected log loss of the best predictor of W_{t+1} based only on W_t. Of course, if we choose any other predictor of W_{t+1}, we get an upper bound on the mutual information. For example, we can predict that W_{t+1} is sampled from a Gaussian centred at our current weights W_t, with the right covariance matrix, which is known. Then we can bound the mutual information just in terms of the expected squared norm of the gradient. Alternatively, we could predict that W_{t+1} is sampled from a Gaussian centred at the current weights minus an update based on the gradient of the true risk — which, of course, depends on the data distribution, so again this is not tractable.
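To fix notation for this step — a sketch using one standard SGLD parameterisation, so the exact noise scaling may differ from the slides by constants — with minibatch B_t and learning rate η, the one-step conditional is Gaussian:

\[ W_{t+1} \mid W_t, B_t \;\sim\; \mathcal{N}\!\Big( W_t - \eta\, \nabla \hat{R}_{B_t}(W_t),\; \tfrac{2\eta}{\beta}\, I \Big), \]

so the covariance is known and fixed, while the mean depends on the minibatch gradient. The chain-rule argument mentioned above then bounds I(W_T; S) by a sum over steps of conditional mutual information terms of the form I(W_{t+1}; B_t | W_t), measuring the information leaked at each update.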
Just as an example, with that choice we would get the mutual information bounded in terms of the expected squared norm of the difference between the empirical gradient and the risk gradient. So, just because of this additive Gaussian noise, we get this nice Gaussian form for the distribution that W_{t+1} is sampled from, and Pensia, Jog, and Loh take advantage of exactly this. They bound the supremum of the gradient of the empirical risk in terms of the Lipschitz constant of the empirical risk; then, plugging this back in and optimising the variances, they get a bound on the one-step mutual information in terms of this Lipschitz constant. Combining this with the chain rule of mutual information, they get the following bound: the expected generalisation error is bounded in terms of the mutual information, by Xu and Raginsky's theorem, but that is now bounded, to make it more tractable, in terms of the Lipschitz constant, and there is a sum over training time that comes from the chain rule.

So, in summary, Pensia, Jog, and Loh bound the mutual information between the weights at convergence and the data in terms of the information leaked at each training step. This information leaked at each training step is unknown, and it is actually still distribution dependent; but they bound it via Lipschitz continuity, so they lose the distribution dependence. This approach therefore yields a distribution-independent bound, and for deep neural networks the Lipschitz constant of the empirical risk is massive. As a rule of thumb, bounds depending on this Lipschitz constant are usually vacuous in the regimes of interest. So can we do better?

Pensia, Jog, and Loh don't know the data distribution, and they are not dealing with it. In some cases — and this is just one step of the update — the gradient can be really spread around depending on the minibatch you get, and at other times it can be more concentrated, depending on the data distribution. We don't know which scenario we are in. So, in our work, we propose to use some of the data to estimate which case we're dealing with. Let J denote a random subset of the training-set indices, so that the training set S is split into a part S_J and a held-out part S_J-bar, selected at random. One can then show that the expected generalisation error is bounded by the conditional mutual information between the weights and the held-out part S_J-bar, conditioned on the part S_J.
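As a small sketch of this splitting — my own illustrative code; the variable names are not from the paper — we pick a random subset of indices that the data-dependent prior is allowed to see, and hold out the rest:

```python
import numpy as np

def split_for_prior(n, n_heldout, rng):
    """Randomly split indices {0, ..., n-1} into J (visible to the
    data-dependent prior) and J_bar (held out from it), with |J_bar| = n_heldout."""
    perm = rng.permutation(n)
    j_bar = perm[:n_heldout]      # indices the prior does NOT see
    j = perm[n_heldout:]          # indices the prior does see
    return j, j_bar

# usage sketch:
# rng = np.random.default_rng(0)
# j, j_bar = split_for_prior(n=len(S), n_heldout=1000, rng=rng)
```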
Of course, we are now dividing by a smaller number: the size of this held-out set S_J-bar, rather than the full training set. Let Q be the distribution of the weights conditioned on the data, and P be the distribution of the weights conditioned only on the subset S_J — this is similar to before, except that instead of P being the marginal, it is the distribution of W conditioned on a subset of the data. I'll sometimes refer to Q as the posterior and to P as the prior. Then this conditional mutual information can again be expressed as an expected KL divergence, not between Q and a data-independent P, but between Q, which depends on S, and P, which depends on the subset of the data S_J. So you can think of P as a data-dependent prior.

We then showed that the expected generalisation error is bounded above by this expectation of the square root of the KL divergence between the posterior and the data-dependent prior. This holds for all choices of P; we may just get a looser bound if we make a poor choice of P. So let W_1 through W_T denote the iterates during training, and define the one-step conditional distributions, similarly to before: Q_t given t-1 is the distribution of W_t given all the previous iterates, and similarly P_t given t-1 is the distribution of W_t given the previous iterates and only the subset of the data that the prior has access to. These one-step distributions are both Gaussian and satisfy the following: they have the same variance, because the variance only depends on the additive Gaussian term, but different means. For Q, the mean is the previous location minus the gradient of the empirical risk computed on all of the training set S; the mean for P is similar, but the gradient of the empirical risk is computed only on the subset S_J that the prior has access to.

From here, I think it's quite clear where we're going. There is again a chain rule for KL: the KL between Q and P at convergence is upper bounded in terms of these expected stepwise KLs. So now we only need the KL divergence between the conditional Q and the conditional P after one step, and since these are both Gaussians, this KL can be computed easily. The one-step KL divergence then works out to be beta times the learning rate, over four, times the squared norm of a vector zeta.
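Here is a minimal sketch of that quantity — my own illustrative code, not the paper's. With both one-step distributions Gaussian with covariance (2η/β)·I and means differing by η times the gradient difference ζ_t, the KL between them is ‖η ζ_t‖² / (2 · 2η/β) = βη‖ζ_t‖²/4:

```python
import numpy as np

def incoherence(grad_fn, w, S, idx_prior):
    """zeta_t: difference between the empirical-risk gradient on the full
    training set S and on the subset the prior sees (indices idx_prior)."""
    g_full = grad_fn(w, S)                               # gradient on all of S
    g_prior = grad_fn(w, [S[i] for i in idx_prior])      # gradient on S_J only
    return g_full - g_prior

def one_step_kl(zeta, lr, beta):
    """KL between the posterior and prior one-step Gaussians: same covariance
    (2*lr/beta) * I, means differing by lr * zeta."""
    return beta * lr * float(np.dot(zeta, zeta)) / 4.0
```

Summing these one-step terms over training, and averaging over the randomness (including which subset the prior sees), is what enters the bound through the chain rule.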
We call zeta the incoherence; it measures how different the gradient on all the data is from the gradient on the subset of the data, and note that we are averaging over which subset we compute it on. Any questions here? I don't see any questions, OK.

So, to summarise all of these bounds: here is the Pensia et al. bound, here is the Mou et al. bound that appeared soon after, and this is our bound. The key differences are the terms highlighted in red. The Pensia et al. bound depends on the Lipschitz constant; the Mou et al. bound depends on the norm of the gradients, which can still be very large during training; and our bound depends on this incoherence term — the norm of the incoherence, the difference between the gradient on the full training set and the gradient on the prior's subset.

In practice, the incoherence term turns out to be quite a bit smaller. Here we plot the incoherence term versus the norm of the full gradient, for different numbers of held-out points — held out from the prior, meaning how many points the prior did not see; it's the size of S_J-bar in the language from before. One should compare lines with the same number of held-out points. Let's see — red and green: our incoherence term is in green, and the gradient-norm term from the Mou et al. bound is in red, and we can see that the latter is orders of magnitude larger. For a different choice of the number of held-out points, we can compare the blue and the orange, and again the blue, which is our incoherence term, is much, much smaller.

Before looking at the bounds themselves — [Host:] there's a question in the chat: could you tell us which dataset and prediction task this experiment used? [Speaker:] Great question. I don't have a note on this one; it's either CIFAR or MNIST. I'm pretty sure this one was MNIST and this one was CIFAR — I'll have to look it up in the paper; I just quickly grabbed a screenshot. Actually, they might both be CIFAR. I should have taken a note, my apologies, but it's one of the two.

So here are the actual bounds. Here we see the Mou et al. bound, and there their bound with a different choice of the inverse temperature parameter.
A large inverse temperature means less exploration in SGLD; a small one means more exploration — and these are otherwise the same choices. Sorry, let me make sure I'm reading this correctly: yes, this was the inverse temperature, as it says on the slide. And here are our bounds, with the same choices of the inverse temperature parameter, and we can see that they are again quite a bit smaller. Here we vary the learning rate instead of the inverse temperature; again, these two are the Mou et al. bounds and these two are our bounds. Here the labels are MNIST, Fashion-MNIST, and CIFAR, and we plot the incoherence term — the norm of the incoherence in red — against the norm of the full gradient in blue, with the axis on a log scale; for all three datasets our incoherence term is orders of magnitude smaller.

[Host:] There's a question: is what you are describing here a bound on the generalisation error, or the generalisation error itself? [Speaker:] In this particular plot we are only computing the incoherence term and the norm term, which appear inside the bounds. Earlier, that was the bound on the expected generalisation error — the difference between the risk and the empirical risk. But this plot is just tracking the actual terms that appear in the bound. OK.

Now, the question that remains is: can we actually learn from past iterates? So far we only looked one step back: we reduced the problem of thinking about the distribution of the weights at convergence to the conditional distribution of the weights at time t given the previous weights, via the chain rule, and we improved upon the worst-case bound on the one-step KL, or mutual information, with data-dependent estimates. That led to the gradient incoherence term at each time step t. But it does not take advantage of the past: it ignores the fact that W_1 through W_{t-1} may reveal information about the data points S_J-bar that are unknown to the prior. The prior does not see S_J-bar, but it can get information about S_J-bar by looking at the past iterates, which it turns out we are allowed to condition on. So we should be able to leverage information from past iterates to make better predictions for W_t. We implement this idea using an improved version of the conditional mutual information framework introduced by Steinke and Zakynthinou just last year.
So consider a so-called supersample. This particular supersample is 2-by-n, where n is the size of our training set; we imagine sampling twice as much data. We then choose our training set S from the supersample by, in each column, randomly choosing either the first or the second entry. Let U_i equal one if S contains the first data point of the i-th column, and two if the training set contains the second entry from the i-th column. So, for example, if U is this vector of ones and twos, it gives us a training set sampled uniformly from the supersample: it chooses the first entry from the first column, the second entry from the second column, again the first entry from the third column, and so on.

Steinke and Zakynthinou then define the conditional mutual information of an algorithm with respect to the data distribution D: the CMI of the algorithm A equals the conditional mutual information between the weights and the indices U used to choose the training set, given the supersample. It might actually be easier to think of it in the following way: it's the mutual information between the weights and the training set, given the supersample. Now, CMI has some really nice properties. They showed that CMI is always bounded — by roughly n times log two, where n is the number of columns of the supersample, which is the size of the training set — whereas, in contrast, the standard mutual information can be infinite. What we show is that CMI is essentially always no greater than the mutual information.

So, going back to the original bound on the expected generalisation error in terms of the mutual information: Steinke and Zakynthinou prove a new upper bound in terms of the conditional mutual information — this is just the CMI; notice it has slightly different constants. In our work, we showed that, for an index J chosen independently and uniformly at random from the training set — say we choose one column — the expected generalisation error is bounded in terms of an individual-sample quantity, which I'll explain on the next slide, and this is tighter than the original CMI bound. So this was the original CMI bound from the Steinke and Zakynthinou paper, and we prove a slightly tighter bound in terms of this individual-sample term.
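A small sketch of the supersample construction — illustrative code with my own array shapes and names; the talk indexes U_i in {1, 2}, while here 0/1 is used so it can double as an array index:

```python
import numpy as np

def make_supersample_and_trainset(sample_data, n, rng):
    """Draw a 2-by-n supersample Z (two i.i.d. candidate points per column)
    and select the training set S by a uniform choice per column."""
    Z = np.stack([sample_data(n), sample_data(n)])  # shape (2, n, ...)
    U = rng.integers(0, 2, size=n)                  # which row is used in each column
    S = Z[U, np.arange(n)]                          # the selected training set
    return Z, U, S
```

Conditioned on the supersample Z, the weights can reveal at most the n bits in U, which is where the n·log 2 upper bound on the CMI comes from.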
Let me give you a little bit of intuition for this individual-sample bound and for what this term, which we call the disintegrated mutual information, even means. It is actually a random variable. Let U_J indicate which of the entries of the supersample in the J-th column was used to train W, just as before. Then the disintegrated mutual information is the mutual information between the weights and the indicator U_J, given the supersample and the index J. In effect, our bound pulls the expectations outside the square root. Here you see the original bound that appeared in the Steinke and Zakynthinou paper, which can be expressed in terms of this expectation of the disintegrated mutual information — you can think of that term as an average over which index J you're using. Our bound roughly pulls these expectations outside the square root — you see the expectation appearing out here — which, by Jensen's inequality, means a tighter bound.

For Langevin dynamics, the individual-sample bound leads to a two-sample incoherence. I won't go through the details — they are somewhat too complicated for this talk — but, arguing similarly to the previous case, one can show that this individual-sample bound leads to the following two-sample incoherence. Whereas before the incoherence measured the difference between the gradient of the empirical risk on the full training set versus on the subset of the training set, the two-sample incoherence measures the difference of the gradient of the empirical risk with one sample from the supersample versus the other sample from the supersample in the J-th column, averaged over J.

So, conditional on all but the J-th entry in the training set S: how much information do the weights during training reveal about which of the two supersample entries, z_1J or z_2J, belongs to the training set? Naively, if we had no further information, we could only assign equal probability to each sample appearing in the training set, and we would end up incurring a penalty that depends on this squared norm of the two-sample incoherence. But don't forget that we have access to all the previous iterates W_1 to W_{t-1}. After observing them, predicting the identity of the J-th entry can be viewed as a binary classification problem, and, informally, the lower the risk of this prediction, the tighter the bound — the smaller the penalty. The past iterates W_1 to W_{t-1} are updated based on training on the full training set S, so at each step they leak a little bit of information about which of the data points appear in it.
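And here is a hedged sketch of the two-sample incoherence — again my own illustrative code: for column J, compare the empirical-risk gradient computed with the first supersample candidate in slot J against the gradient computed with the second candidate, keeping the rest of the training set fixed.

```python
import numpy as np

def two_sample_incoherence(grad_fn, w, S, Z, j):
    """Gradient difference when the j-th training point is swapped between
    the two supersample candidates Z[0, j] and Z[1, j] (all other points fixed)."""
    S1 = list(S); S1[j] = Z[0, j]   # training set with the first candidate in slot j
    S2 = list(S); S2[j] = Z[1, j]   # training set with the second candidate in slot j
    return grad_fn(w, S1) - grad_fn(w, S2)
```

The squared norm of this difference sets the scale of the penalty; the point of the construction is that, by predicting from the past iterates which of the two candidates is actually in the training set, the bound can shrink this penalty as training proceeds.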
And this turns out to be a much better approach. We can see that at the beginning of training — this is MNIST and CIFAR, the x-axis is the number of training steps and the y-axis is the error, and we are plotting the two bounds as training progresses — our old bound, which is in green, and our new bound, which is in blue, do approximately the same, because the new bound's incoherence term is still equally large: we haven't yet learnt much from the early iterates. But as training goes on, we learn more and more about which of the two samples from the supersample actually belongs to the training set, the penalty coming from the norm of the two-sample incoherence term goes to zero, and the bound converges. And this is yet another bound that appeared in 2020, from other authors.

So, to recap and summarise what I talked about today: I discussed the barriers to explaining generalisation in deep learning, and highlighted the need for proper empirical evaluation and for data dependence — and for thinking about what tools we need to measure properties of the data, which I think will be key to getting tighter bounds. I introduced mutual information bounds on the expected generalisation error, due to Xu and Raginsky. I described an application to understanding stochastic gradient Langevin dynamics, and mentioned the work by Pensia et al., which uses the mutual information between the data S and the weights W learnt by SGLD and bounds this mutual information by the information leaked at each iteration instead; they break down the problem of thinking about the weights at convergence into thinking one step at a time. I explained that distribution-independent bounds on this mutual information, which is what Pensia et al. give, yield vacuous bounds in practice. I then introduced distribution-dependent bounds via KL divergences and data-dependent priors, and presented empirical findings showing that the gradient incoherence is much smaller than gradient norms. Even so, we were getting close to non-vacuous bounds only for the first few epochs, and the bound unfortunately kept increasing with training time. One more point: I also talked about how we can use conditional mutual information — the CMI framework of Steinke and Zakynthinou — in order to learn a bit more about the points held out from the prior from the previous iterates, so that the bound actually converges instead of increasing. More work is needed to understand the limits of mutual-information-based explanations of stochastic gradient descent.
All of these are just estimates; all of these are bounds. And I think there is a lot more work to be done in order to tighten them up. With that, thank you very much.

[Host:] Thank you very much. First of all, there is a question in the chat. It says that the incoherence term looks like it could be of order d. Have you conducted experiments to see how the new bounds behave with increasing d — for example, increasing the number of layers, the width, et cetera?

[Speaker:] So, we did not do such experiments in our most recent paper. I'm trying to remember whether we did in our 2019 paper — I think we did run experiments on different architectures there. The incoherence term, of course, only implicitly depends on the architecture: it depends on the norms of the gradients, or rather on the difference between gradients, so I can't really answer this question precisely. I think it's one of those things — as I said initially, we don't really have a good way to think about how these data-dependent quantities scale. So it's an interesting question, but I haven't thought enough about it. I think we may have included some experiments in the 2019 paper on information-theoretic generalisation bounds for iterative algorithms, so it might be worth checking there. Great question, though.

[Host:] I have a couple of questions myself, if there is nothing else in the chat — let's go with them. Forgive the very elementary question about the incoherence term [partly inaudible] — it doesn't seem all that different from quantities we have seen in previous bounds, so I wondered if you could give a bit more intuition about it. That is my first question. And the other one: I read your paper on robust measures of generalisation, and I would like to know how you see these data-dependent bounds fitting into a broader narrative that includes robustness and these distributional ideas — how do they link together?

[Speaker:] OK, so let me start with the first question. You were asking for more intuition about the incoherence term, right? You can really think about it as, in some sense, a kind of variance of the gradient. If you have a data distribution where, every time you draw a sample, your gradient points in a very different direction,
then this incoherence term will be large, because a small sample of the data will not agree on where the minimum is. Whereas if you have a nice data distribution, then every gradient measured on a fairly small batch will roughly agree on where the minimum is, and the incoherence term will be small relative to the norm. So it's really about thinking about this kind of spread, or variance, of the gradients, and not so much about the norms of the gradients or similar quantities that appeared in previous work — which I think is a big step forward. Does that answer the question?

And the second question was about robust measures of generalisation — what was the question again? [Host:] People are starting to build a narrative that our lack of understanding of generalisation is really a lack of understanding of the data distribution. So I wanted to know what programme you see for using these data-dependent bounds alongside that narrative. [Speaker:] Yeah. So, in the framework we propose there, the distributional robustness framework, we really argue that the bounds have to be evaluated in different settings — with what we call different environments. For example, I think it's unlikely that we'll get tight bounds that are distribution dependent, or data dependent, and that hold in all possible scenarios: as you vary the hyperparameters, as you vary the sizes of the architectures, as you vary the datasets. What we argue is that, in order to understand when a bound fails and when it succeeds, you need to define the set of environments where you think the bound should hold — for example, "I think it will hold for small learning rates and for architectures of a particular size" — and then you want to look at its worst-case performance over that set of environments. That's really what the framework is about. And since none of the bounds — none of the measures coming from the bounds — are that great right now, one needs to look further at where they fail. By digging deeper with the distributional robustness framework,
we can identify some of the failures that were pointed out in the Nagarajan and Kolter paper, for example. We noticed that — I'm now forgetting the exact failure, but I think — when you have a small number of training points and you increase the number of training points, the bound was basically changing in the wrong direction. So there are things like that, and especially for all these bounds that involve data-dependent quantities, I think a proper investigation and analysis over different environments is extremely important — looking at the worst case over those environments. Because if you claim that your bound works well across all of these settings but then only look at its average performance over the settings, I don't think that's a fair comparison: it should really do well in each of the settings rather than on average. I'm not sure how clear that was without the actual paper in front of us, but — thanks.

[Host:] I'm not seeing more questions in the chat, so thanks again for this very interesting talk.

[Speaker:] Thank you very much for inviting me. If you have any other questions or want to discuss any related work, please feel free to reach out. Thank you.

[Host:] OK. Thanks a lot. Bye.