Host: Hi everyone, welcome to this week's seminar. Today our speaker is James Martens from DeepMind. James is a research scientist working on deep learning fundamentals, training algorithms, and theory. Before joining DeepMind he did his Ph.D. at the University of Toronto under the supervision of Geoffrey Hinton. So, James, go ahead.

James Martens: Thank you. Today I'll be talking about a recent project on rapid training of deep neural networks without normalization layers or skip connections. You can see my great collaborators down here; all of them are at Alphabet, some at DeepMind and some at Google Brain.

Deep neural networks have become ubiquitous in modern machine learning applications. You see them in reinforcement learning agents, translation and language systems, vision systems, speech recognition systems, search and recommendation; they're pretty much all over the place now. But while practitioners have come up with many heuristic innovations that let neural networks train at greater depths, and which are very useful in practice, theory hasn't had much to say about this. It's been quite slow to catch up, and rarely do you see theoretical insights making an impact on the practical application of neural nets in these contexts.

Currently, deep learning seems to require some combination of the following elements to train fast: normalization layers, such as batch normalization or layer normalization; skip connections, also known as shortcut connections; and specific choices of activation function, such as ReLU and SELU. This comes with various problems. First of all, the mechanism of action of these elements is not particularly well understood. There has been progress in this direction, but I don't think we're anywhere close to a complete picture yet. It's also unclear how to use these elements in new architectures, partly because we don't fully understand how they work. If you ask the average practitioner, "why don't you just put more normalization layers between the blocks of a ResNet, what harm could it do?", it actually does a lot of harm, but it's not at all obvious why. And the very particular recipe used in ResNets is surprisingly effective for reasons that have less to do with the individual elements than with the very specific way they are combined.
Batch norm in particular has caused problems in certain domains, where the information sharing you get across the mini-batch leads to degenerate training in certain kinds of models, for example certain generative models and self-supervised models. Also, skip connections change the inductive bias of the model. That may or may not be desirable depending on your application, but it's annoying that you have to include them just to get your network to train at all. More speculatively, I'd say these techniques might be acting as a crutch, and our reliance on them could be holding us back from pushing the practice and theory of deep learning to the next level. In particular, if we don't understand why they work, there's really no way we can push the state of the art beyond just random exploration.

So in this work we develop a method called Deep Kernel Shaping, or DKS, which is a general automated framework for transforming neural nets so that they have better properties at initialization, and this makes them easier to train. The headline result is that DKS enables rapid training of neural networks that are traditionally considered hard or impossible to train. This includes very deep vanilla convolutional networks, where by vanilla I mean without batch norm or skip connections, and networks with unpopular activation functions such as tanh or softplus. The work also sheds light on why the popular choices are popular. And we'd like to speculate that this approach will be very useful in developing new models, because it removes some of the requirement for architectural features in order to enable fast training. In the paper we also provide a comprehensive explanation for why things like ReLUs, batch norm layers, and skip connections speed up training, and show how DKS makes them at least partially unnecessary; you can see the paper for that.

In general, DKS supports fully connected, convolutional, pooling, weighted-sum, layer norm, and element-wise nonlinear layers, although the nonlinear layers always have to be preceded by a fully connected or convolutional layer; that's a requirement. It supports Gaussian fan-in initializations, the standard ones, and also orthogonal initializations. Strictly speaking, for convolutional layers these have to be of the Delta type, which means you zero out everything but the central part of the filter, although in practice the approach seems to work OK without the Delta initialization.
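(For concreteness, here is a minimal sketch of a Delta-style Gaussian initialization as just described; the function name, filter layout, and variance convention are illustrative assumptions, not the paper's code:)

```python
import numpy as np

def delta_gaussian_conv_init(kh, kw, c_in, c_out, seed=0):
    """All spatial taps are zero except the central one, which gets a Gaussian
    fan-in init (variance 1 / c_in here; the exact scaling is an assumption)."""
    w = np.zeros((kh, kw, c_in, c_out))   # HWIO filter layout, chosen for illustration
    w[kh // 2, kw // 2] = np.random.default_rng(seed).normal(
        0.0, 1.0 / np.sqrt(c_in), size=(c_in, c_out))
    return w

print(delta_gaussian_conv_init(3, 3, 16, 32).shape)   # at init the conv acts like a 1x1 layer
```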
It is a formal requirement for the theory, though. We also assume that biases are initialized to zero. Certain types of weight sharing are supported by the approach, such as the sharing you see in convolutional layers and across the steps of RNNs. This is owing to some recent work showing that a lot of the mathematical tools we use are actually applicable to networks with weight sharing; previously that was not well understood. Actually, the current draft doesn't yet reflect that, so as written it doesn't cover RNNs. The approach also supports arbitrary topologies, such as multiple branches and heads, and it supports networks that do have skip connections.

For this talk we're going to simplify things a bit for the sake of clarity. We'll assume only fully connected layers and element-wise nonlinear layers, with a very simple feed-forward topology, so basically your standard MLP connectivity. And we'll assume that the network inputs are normalized to have norm equal to the square root of the dimension, which is a pretty standard thing to assume.

The mathematical basis of the method comes from the theory of kernel functions for deep networks, or kernel approximations. These are approximations that apply when the network is randomly initialized. In particular, let f be a neural network function that gives a vector output for an input x; here we can take the output to be the one just before the logit layer, so that it's still pretty wide and doesn't depend on the target output dimension. At initialization, it turns out that you can approximate... (Can everyone see my cursor? Yes? OK, good.) You can approximate both the squared norm of the output divided by its dimension, and the inner product between the network's outputs for two different inputs x and x' normalized by their respective norms, using only knowledge of the network's structure and the following scalar quantities: the squared norms, divided by dimension, of the two inputs x and x', and their inner product divided by the product of their norms. That last quantity is just a cosine similarity, if you're familiar with that. So you can do this computation, and we'll call these squared-norm quantities Q values and these cosine similarity quantities C values.
We'll say that they're computed by functions called Q maps and C maps, which take the input Q or C value and give you the output Q or C value, or at least a good approximation of it. There is a hint of a subtlety here which I've swept under the rug, but it turns out that with the data preprocessing we use you can do this. And I should say the approximation gets better as the width of the layers grows. So this is your standard deep kernel approximation boiled down to its essence.

Having stated what these maps approximate, I still haven't described how you actually define them and how you compute them. The Q map for an affine layer is just the identity function, so that's trivial. For a nonlinear layer with activation function phi, the Q map is given by a formula; it's a one-dimensional Gaussian expectation involving phi. The C map is a bit more complicated: it's a two-dimensional Gaussian expectation. It's not that important to actually write out these formulas for this talk; the real takeaway is that we can compute them, if not in closed form, then by numerical integration. And that's pretty efficient, because these are only one- or two-dimensional Gaussian integrals, so you can compute them reasonably fast, up to very high precision, for arbitrary activation functions. It also turns out that you can compute their derivatives as well, which will be important.

Having defined Q and C maps for individual layers, we can define them for entire networks, and we do that by simple composition. In particular, the Q map for a composition of two networks f and h just ends up being the composition of the individual Q maps of f and h, and similarly for C maps. By the way, if any of this is unclear, please let me know now, because the rest of the talk relies very heavily on it; if this isn't clear you're going to have a hard time following the rest, so please let me know if you have any questions.

In general, I should note that Q maps and C maps are only valid descriptions of the network at initialization; that's very important to underline. After training, you can't actually predict the output norm given the input norm; it just doesn't work.

All right. So having defined Q maps and C maps, we can start to examine them for deep networks.
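(As a concrete illustration of the kind of computation being described, here is a minimal numerical sketch, not the DKS library itself, of local Q and C maps via Gauss-Hermite quadrature and their composition over depth; it assumes equal Q values for the two inputs and the initialization conventions above, and the function names are illustrative:)

```python
import numpy as np

# Gauss-Hermite nodes/weights, rescaled so that sum(w * g(z)) ~= E[g(z)] for z ~ N(0, 1).
z, w = np.polynomial.hermite.hermgauss(60)
z, w = np.sqrt(2.0) * z, w / np.sqrt(np.pi)

def local_q_map(phi, q):
    """Q map of a nonlinear layer: E[phi(sqrt(q) * z)^2]."""
    return float(np.sum(w * phi(np.sqrt(q) * z) ** 2))

def local_c_map(phi, c, q=1.0):
    """C map of a nonlinear layer when both inputs have Q value q: cosine similarity
    of the outputs, as a 2-D Gaussian expectation on a tensor-product quadrature grid."""
    u = np.sqrt(q) * z[:, None]
    v = np.sqrt(q) * (c * z[:, None] + np.sqrt(1.0 - c ** 2) * z[None, :])
    ww = w[:, None] * w[None, :]
    return float(np.sum(ww * phi(u) * phi(v))) / local_q_map(phi, q)

def network_q_c(phi, q, c, depth):
    """Q and C maps compose layer by layer; the affine layers are assumed to
    contribute the identity under the initialization conventions described above."""
    for _ in range(depth):
        c = min(local_c_map(phi, c, q), 1.0)   # guard against tiny quadrature overshoot
        q = local_q_map(phi, q)
    return q, c

for depth in (1, 5, 50):
    print(depth, network_q_c(np.tanh, 1.0, 0.5, depth))
```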
Just to recall, the C map essentially determines the angle, because it's a cosine similarity, or equivalently the distance, since you can derive the distance from the angle if you know the norms, between two output vectors of the network, as a function of the angle between the input vectors. And in deep networks we see that C maps become degenerate, so that information about the input angles is essentially obscured; in other words, it's hard to recover. You can see that up here: these are C maps for deep ReLU networks at different depths. For a shallower network it's a pretty reasonable function; you wouldn't have any trouble inverting it to recover the input C value from the output C value. But for the deeper ones, while it's still technically invertible, it will be very hard to invert in practice under any kind of approximation or noise. And of course, because these maps are only approximate descriptions of the network's behaviour, that is a concern. Essentially what's going on is that if I give you the output C value, its dependence on the input C value will be so weak that it'll be swamped by the noise. So you've really lost the information about the input C value; in other words, you've lost information about the input distances in the output space.

Now, it's maybe not obvious that this is going to be a problem, but it turns out that it is: this situation essentially dooms gradient descent learning. In general, a degenerate C map is one that squashes the entire range of input C values around some single output value. There are two basic cases, both of which turn out to be bad for training. In the first case, the value you squash to is significantly less than one, one being the maximum possible value since these are cosine similarities. What that means is that the output vectors basically look random: they're all approximately the same angle from each other, regardless of how close or far apart the corresponding input vectors were. That makes the network look like a random hash of its input. And while you might be able to fit the training data that way, generalization is going to be impossible, because the outputs just don't reflect anything useful about the inputs. This condition also implies that early layers will have huge gradients compared to later layers, which makes optimization tricky.
The other case is that your outputs are squashed so severely that the C values get pushed to a value close to or equal to one. What that means is that all the output vectors are going to look essentially identical, because a cosine similarity of one means the vectors are the same, assuming they have the same norm, which in this case they do. The implication is that gradients in earlier layers will vanish and the loss surface will become ill-conditioned, making optimization basically impossible. This can be formalized using various techniques such as NTK theory, and this is done in the paper. Other people have also looked at this phenomenon and tried to argue why it's bad for training, and those analyses seem to hold up in practice as well.

All right. So the previous solution to this problem, from the paper that first observed the phenomenon, is a method called Edge of Chaos. In that approach, the requirement is that the derivative of the C map for each individual nonlinear layer is equal to one when evaluated at one. It turns out this condition slows the asymptotic convergence of C values to their limiting value of one as depth increases; in particular, the convergence goes from being exponential to being sub-exponential. There's a dynamical-systems-style analysis behind this, looking at the composition of many of these functions, which cares a lot about the slope at the fixed point. Unfortunately, though, given a deep enough network, the C values will still end up pretty close to fully converged. In other words, the network's C map is still going to be highly degenerate. As an example, the deep ReLU networks that we looked at on the previous slide actually already satisfy this condition: ReLUs out of the box, with the standard initialization and biases initialized to zero, satisfy it. And yet we know that a deep enough ReLU network becomes untrainable unless you add things like skip connections and batch norm, which up to this point we're not assuming.
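(A quick numerical illustration of those two claims about ReLU: each layer's C map slope at one is already one, yet the composed map still degenerates with depth. The closed form used here is the standard normalized arc-cosine kernel for a ReLU layer; this is just an illustrative check, not code from the paper:)

```python
import numpy as np

def c_relu(c):
    # Normalized C map of a single ReLU layer (degree-1 arc-cosine kernel).
    return (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / np.pi

# Per-layer slope at c = 1 (one-sided finite difference): approximately 1,
# i.e. ReLU satisfies the Edge of Chaos condition out of the box.
eps = 1e-6
print((c_relu(1.0) - c_relu(1.0 - eps)) / eps)

# ...and yet the *network* C map still degenerates: iterating the layer map over depth
# pushes every input C value toward 1, just sub-exponentially rather than exponentially.
for depth in (1, 10, 100, 1000):
    c = 0.0
    for _ in range(depth):
        c = c_relu(c)
    print(depth, round(c, 4))
```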
I'd say the main contribution of this work is a new way of controlling C map properties: instead of looking at the C map for each individual layer, we're going to look at the C map of the whole network and analyse it from that perspective. There's a way to formalize this, but the intuition rests on the fact that C maps are convex on the interval from zero to one, which is something you can prove. Intuitively, what that means is that we can control the deviation of the network's overall C map from the identity function just by controlling its slope at the maximum value, one, assuming that we know its value at zero and we set that to zero. (It would also work if you set it to some particular value significantly less than one.) You can see this in the picture: if I pin the curve at this point and vary the slope over here, there's essentially a one-to-one correspondence with how much the curve deviates from the identity, and in particular how much it flattens out and becomes degenerate around some output value, in this case zero. So controlling the network C map's derivative at one gives us a way to prevent degeneration: as long as the slope we pick isn't too extreme, the map won't be degenerate.

You can formalize this, and there's a fairly lengthy theorem in the paper that basically says: given the condition that the value at zero is zero, the C map's deviation from the identity is bounded as a function of its derivative at one. (For a rough sense of why, for a convex increasing C map with C(0) = 0 and C(1) = 1, convexity gives C(c) <= c and C(c) >= 1 + C'(1)(c - 1), so the deviation c - C(c) is at most C'(1) - 1 on that interval.) The deviation of the C map's derivative from the derivative of the identity function can be bounded in a similar way, and this holds over the entire input domain, not just zero to one.

OK, so we have a reasonable solution for C maps, it seems. Unfortunately, there are still other ways for the network to fail to be trainable. One of them is networks that are nearly linear. First, observe that linear networks have nice C maps, but their model class is very limited. In particular, linear networks actually have identity C maps, which is sort of the perfect C map: information about input distances is preserved as well as it possibly can be. But a linear network is not going to find interesting solutions, because it's intrinsically limited. You could say, well, let's just ban linear networks from consideration and stick to nonlinear networks. The problem, though, is that you can make a network nearly linear in a certain sense, so that it has a nice C map but is almost as hopeless as a linear network, which is in fact impossible to optimize, at least up to the performance that you want.
One example: you can take a ReLU network and, for each ReLU activation, add a very large constant to its input and subtract the same constant from its output, so you're essentially just transforming the ReLUs. Now they're going to behave basically like the identity function, because all reasonable inputs will be much smaller than this constant; you've essentially just gotten rid of the left part of the ReLU, the negative part. But you can show quite easily that, with a certain choice of weights and biases, you could undo the transformation we just did and recover a standard ReLU network. So the model class hasn't changed here, but obviously gradient descent is going to struggle in this situation; in fact, it will probably never even evaluate the network on inputs that land in the nonlinear region of the ReLUs. So it's basically just going to be like optimizing a linear network, and you're not going to get any of the benefits of using a neural network.

To prevent this problem, which can manifest in any type of network, not just ReLU networks, we require that the derivative of the C map at one for each individual nonlinear layer is maximized, subject to the condition on the overall network's C map that we just discussed. So now there's a tension: we want to make the derivatives large for individual layers, but we want the derivative for the overall network to be smaller than some constant. And there will be a way to compute the right balance in this trade-off.

Yet another failure mode is that the kernel approximations we're basing all of this analysis on might not be accurate at all, and in that situation nothing we're talking about makes any sense. Unfortunately, the error of these approximations can get very high in deep networks unless you make the width extremely large. In the worst case the dependence could be exponential, requiring the width to grow exponentially as a function of the depth. That's not tenable, because networks can get quite deep these days. You can see this issue perhaps most easily when you think about Q values, say, and how Q maps can be vulnerable to errors. If you look at a Q map up to first order, the error at its output is proportional to its derivative times the error at its input.
So Q maps will amplify any errors present at their input, and it turns out this derivative can get very big in a deep network in general. And if you think this problem is confined to Q values, unfortunately that doesn't work either: if the Q values are wrong, the C map computations become essentially meaningless as well. So we need to handle this problem. The solution we use in DKS is to require that the derivative of the Q map is less than or equal to one for the values of Q that we expect to see. It turns out we can actually enforce this condition, or something reasonably close to it, and this controls the compounding of errors of these kernel approximations in deep networks.

OK, so now we've identified these various failure cases and ways of manipulating the Q and C maps in order to prevent them. That leads to the conditions which define DKS, and these we'll discuss now. They apply to every subnetwork; by subnetwork I just mean some component of the network, including the whole network, that has a well-defined input and output. For an MLP you could think of, say, layers three through five of a seven-layer network; more general, arbitrary network structures have more interesting examples of subnetworks. And the network itself is, trivially, a subnetwork.

So the first condition is something we discussed before; it's more of a convention that we go with, which is that input Q values of one get mapped to output Q values of one. What this says is that the network's layers preserve the norms of their inputs, at least once you account for the dimensions of those vectors. In other words, they preserve Q values, which are squared norms divided by dimension, at least when those values equal one; for other values you can't necessarily say anything. It should be noted that for ReLU networks you essentially get this for free, because ReLU layers preserve the scale of their inputs. We go with the value one just because that's a common convention; we could have mapped two to two, or five to five, but we have to standardize to some vector length in order to do everything else that we want to do.
Also, by doing this we prevent the problem where you get exploding or vanishing vector lengths for your activation vectors, which, by the end of a deep network, can mean a very small or very large input to your loss function. That can lead to numerical problems or optimization problems, depending on the type of loss, and a Q value of one is roughly what most standard loss functions expect.

All right, so that's condition (a). The second condition, (b), is what we discussed previously, which is the requirement that the Q map be well behaved in order to control the kernel approximation error. Previously we wanted its derivative to be less than or equal to one for all potential values of Q. In general, though, we only expect to see one kind of Q value in our network, namely one. Of course, due to random fluctuations that will break down a bit, but as long as we're close, the map is continuous and smooth, so enforcing the condition at one will be good enough for values near one as well. And we set the derivative equal to one rather than trying to minimize it; this turns out to work best in practice. You could also have set it less than or equal to one, or tried to minimize it, and that would also control the kernel approximation error, but we find that setting it equal to one just seems to work best, for reasons that are not totally well understood. One is the maximum value; if you set it larger than one you run into problems.

Then we've got conditions (c) and (d). These are the conditions that we hope will prevent C map degeneration: setting the C map's value at zero equal to zero, and restricting its derivative at one to be less than or equal to some constant, often something like 1.5 or some other moderate value. And finally we've got condition (e), which prevents the nearly-linear-networks problem: the derivative of the C map at one for each nonlinear layer is maximized, subject to the other conditions.

For conditions (a), (b), and (c), it turns out that it's sufficient to have them hold for the Q and C maps of the individual nonlinear layers, and you then get them for free for all subnetworks, provided that you "normalize", quote unquote, any weighted sums in the network. I'm not going to describe exactly what that means, but it's a straightforward operation. And then conditions (d) and (e) in combination turn out to be equivalent to enforcing a particular condition on each nonlinear layer.
Specifically, we set the derivative of the C map at one for each nonlinear layer equal to this constant raised to the power one over D, where D is the depth of the network. That's the formula for MLPs; for more arbitrary topologies there's a more complicated formula, but it's important to note that it can still be easily computed in the more general case, so that's not a problem.

Right. So I've talked a lot about conditions that we want to enforce, first on the Q and C maps of the network and its subnetworks, and then, translating those, on individual layers. I think someone is asking a question?

Questioner: Yeah, that was me. On the slide before, you have these different conditions, and I think point (b) was the one that controls the kernel approximation error. Did you say it's not only there for the validity of the analysis, but it also has an impact on performance?

James Martens: Yeah. So we need the kernel approximation error to be low in order for this analysis to make any sense. But you could achieve a low kernel approximation error by requiring this derivative to be less than or equal to one, and you could even try to minimize it, which would in fact minimize the approximation error. So why didn't we minimize it? Why do we set it equal to one, which is the maximum permissible value before you get runaway approximation error? The reason is that it works best in practice, in terms of the overall effectiveness of these networks at the end of the day. We don't have a good explanation for why that's true; it's one of the remaining mysteries.

Questioner: OK. So some of these conditions are, let's say for the sake of argument, necessary to have a performant network. But might there be networks that satisfy the other conditions, that are a good initialization to train from and so on, but that don't look anything like kernels?

James Martens: Maybe. Of course, once the kernel approximations break down, none of these conditions really make any sense; they're no longer describing the network's behaviour. It's certainly possible that a network well outside of the kernel regime could train well, but then it's much harder to talk about it; we just don't have the theoretical tools to really analyse it at that point.

Questioner: OK, thank you.
James Martens: Right. So, having reduced these conditions on the Q and C maps of subnetworks down to conditions on the Q and C maps of individual nonlinear layers, I still haven't said how we actually achieve those conditions for the nonlinear layers; what are our levers of control? For that, we're going to transform the activation functions, in a fairly benign way I would argue. In particular, we're going to introduce non-trainable scalar constants for both the input and the output of each activation function, going from phi(x) to a transformed version where gamma, alpha, beta, and delta are all just fixed non-trainable scalar constants. Because you can always carefully choose the weights and biases in your network to simulate this kind of transformation, it doesn't actually change the model class; at least assuming a perfect optimizer, the space of functions computed by the network is the same. In practice, of course, doing these kinds of transformations could change the inductive bias of the model under a limited optimizer like gradient descent.

A couple of examples of transformed activation functions are plotted here; this is for a vanilla 100-layer MLP. In the case of softplus, we go from the familiar ReLU-like softplus to a softer, more gentle curve, and we see something even more dramatic for tanh. In general that's what this method does: it takes an activation function that is quite nonlinear and tones it down to be closer to a linear function. I should say this is plotted over the typical range of input values you'd expect to see. The inputs will approximately follow a Gaussian distribution with a variance of one, so once you get out to negative ten there's essentially negligible probability that you'd ever see an input of that size; what matters is how steep the activation function is in this central region. If you zoomed this graph out much further you'd see that this is still a tanh, and eventually it saturates, but that doesn't really matter in practice. The same basically holds for the scaled softplus here as well.
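(To make this concrete, here is a rough sketch, with a hypothetical parametrization and a naive solver, of how four such constants could be fitted so that a layer's local maps satisfy conditions of the kind described above; the exact parametrization, conditions, and solving procedure in the paper may differ, so treat this as an assumption-laden illustration rather than the real algorithm:)

```python
# Transformed activation: phi_hat(x) = gamma * phi(alpha * x + beta) + delta, with the
# constants chosen so that the layer's local maps satisfy (roughly) the conditions above:
#   C(0) = 0,  Q(1) = 1,  Q'(1) = 1,  C'(1) = target_slope,
# where the per-layer target slope would come from the network-level condition,
# e.g. zeta ** (1 / depth) for a depth-D MLP as described earlier.
import numpy as np
from scipy.optimize import root

z, w = np.polynomial.hermite.hermgauss(80)
z, w = np.sqrt(2.0) * z, w / np.sqrt(np.pi)           # sum(w * f(z)) ~= E[f(z)], z ~ N(0,1)

phi = np.tanh
dphi = lambda x: 1.0 - np.tanh(x) ** 2                # base activation and its derivative

def residuals(params, target_slope):
    a, b, g, d = params
    f = g * phi(a * z + b) + d                        # phi_hat(z)
    df = g * a * dphi(a * z + b)                      # phi_hat'(z)
    return [
        np.sum(w * f),                                # E[phi_hat] = 0         -> C(0) = 0
        np.sum(w * f ** 2) - 1.0,                     # E[phi_hat^2] = 1       -> Q(1) = 1
        np.sum(w * z * f * df) - 1.0,                 # dQ/dq at q = 1 equals 1
        np.sum(w * df ** 2) - target_slope,           # E[phi_hat'^2] = slope  -> C'(1) = slope
    ]

depth, zeta = 100, 1.5                                # e.g. a 100-layer MLP, network slope 1.5
sol = root(residuals, x0=[1.0, 0.0, 1.0, 0.0], args=(zeta ** (1.0 / depth),))
print(sol.success, sol.x)                             # alpha, beta, gamma, delta (if it converged)
```

(The naive initial guess and root finder here may need adjustment for other activation functions or more extreme target slopes; the point is just to show that each condition reduces to a one-dimensional Gaussian expectation of the transformed activation and its derivative.)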
All right, so having described the approach, how about the experiments? Our basic setup is that we train a ResNet-101 V2 style architecture on ImageNet, with and without batch norm and skip connections. The batch size is 512, and the learning rate schedules were optimized dynamically using a method called FIRE PBT, developed at DeepMind recently, with the particular objective of maximizing optimization speed. In other words, the choice of learning rate schedule is not a confound with regard to optimization.

Why did we examine optimization speed rather than generalization performance? Well, the goal of this work was mostly to close the gap between ResNets and networks that don't have all of those architectural flourishes, and the main gap you see there is actually in optimization speed. In fact, networks without skip connections, if they're made deep enough, basically don't train at all; they just sit at zero performance. We do have some follow-up work that looks more at generalization performance, which was recently accepted at ICLR, but I'm not going to talk about that here. The other optimization hyperparameters were tuned as well, not as extensively as the learning rate, because even at Alphabet we have limited resources, although you would cry if you saw how much computation I used. And there are lots of experiments in the paper, tons of observations and different things we studied in relation to this approach; that makes up the bulk of the paper.

The main result is for vanilla versions of this network, meaning without skip connections or batch norm, trained using DKS and compared to ResNets. A standard ResNet is this curve here, and you can see that the DKS networks with softplus or tanh essentially keep pace with it. Meanwhile, ReLU networks where we've simply stripped out the skip connections, the batch norm, or both, perform far worse. You really do need all of those elements; you can't just rip them out of a ResNet, and that's a very important point to make.

I should say this was with K-FAC, which is an optimizer designed for neural nets; it's a non-diagonal approach, pretty powerful but somewhat expensive. If we go to SGD optimization, the situation isn't as nice. A standard ResNet optimizes at about the same rate as it did with K-FAC, maybe a little slower, but the DKS networks are now trailing behind, although they're still doing much better than the vanilla networks without DKS. So there is now a gap with ResNets.
Trying to drill down into this a little more, we can look at using skip connections together with DKS. If you're using K-FAC, then once you introduce DKS, skip connections don't seem to matter at all. Here we're using skip connections whose residual branch weight is some constant, with the weight on the shortcut branch chosen so that the sum of the squares equals one; that's a requirement of the method. And all of these configurations match the performance of a standard ResNet.

Here is where things get a bit more interesting: with SGD, we can obtain the same performance as a standard ResNet using DKS just by reintroducing skip connections into these networks, at least with the softplus activation; tanh for some reason behaves a bit strangely in this experiment, although I should note that almost all activation functions do well. So it does seem that skip connections plus DKS is good enough to match ResNet performance. Your other option, if you don't want to use skip connections, is to use K-FAC.

Now, we can apply DKS to networks with a whole bunch of different activation functions, and we see that many of them, including a lot of ones that typically work very badly or don't even train at all in such deep networks, work just fine. In fact they all work pretty similarly, except for ReLU, which actually trails behind here. That's somewhat ironic, because ReLU isn't really compatible with DKS: the kinds of transformations we apply to the activation functions have only limited power in the case of ReLUs. So in some ways this isn't really the performance of DKS at all, because the method isn't really working properly for ReLUs.

We can also look at the effect of using different optimizers, because we've seen this strong dependence on the optimizer, at least when we're talking about networks without skip connections. We see that K-FAC and Shampoo, or rather a modified version of Shampoo, are both doing very well and match the performance of ResNets, whereas SGD and Adam perform roughly similarly to each other and do not allow us to replicate the optimization performance of ResNets.

We can also compare to some previous work, for example the Edge of Chaos method, which is the main precursor to DKS. We see that DKS is clearly performing better in the case of these tanh networks.
That was with K-FAC, and the gap persists, and gets even bigger, if my computer will load the graphs, once we move away from K-FAC; in fact Edge of Chaos isn't doing much here compared to Adam in this context. The horizontal lines, by the way, are the baselines: standard ResNets with SGD.

We can also look at Looks Linear, which is a method that initializes the network to be exactly linear at initialization time, using certain weight symmetries and ReLU activation functions. It turns out that it doesn't seem to work well with K-FAC, I think because K-FAC too aggressively breaks the symmetries that you have, and as a result the network enters its very nonlinear behaviour too quickly and things go off the rails. So we have to use Looks Linear with Adam, and in that case there's a clear performance gap, though that's partly just because we're using K-FAC with DKS; the comparison between SGD and Looks Linear is much closer to DKS, I would say.

Right, so this is coming to the end of the talk. Current limitations of the approach: we do not have support for multiplicative units like you see in Transformers, but I think an extension is very possible and quite interesting, actually. Vanilla networks, that is networks without skip connections or batch norm, do seem to generalize somewhat worse, at least in these experiments; I didn't talk about generalization performance, but that is an observation we make. That's largely been addressed in the follow-up work, the ICLR paper: in particular, if you change the way you do the optimization and make some small changes to DKS, you can actually close the gap to standard ResNets almost completely. To match the optimization speed of ResNets using vanilla networks, we had to use K-FAC; otherwise we had to reintroduce the skip connections, and in general training these vanilla networks with standard optimizers seems to require something extra, and perhaps more, depending on your setting. Trying to understand that, I think, is a very interesting question for future work. Maybe with the right tweak to this approach we could actually have it perform just as well as ResNets when using SGD.

Right. So I think the outlook is pretty good for DKS, and it could be a useful tool for unlocking new model classes.
I think that's the primary application here: allowing you to design your models without having to rely on some confluence of tricks to make them optimize faster, for reasons that you hopefully now understand. It should also help existing models that have optimization issues to train better, and we've started to look at that at DeepMind. And if you have models where tricks like batch norm or skip connections are causing problems or can't be used, this method could be very useful in those contexts.

There's a paper on arXiv. It's long, but I'd say it's actually not very dense and it's very self-contained, so hopefully, if you're interested, you'll find it an easy read; a lot of the length is just the experiments. There's also an official implementation which will be on GitHub quite soon. And here is some of the work that inspired this project. I'm happy to take any questions.

Host: Thank you, James, for a wonderful talk. There's one question in the chat window; it's quite long, so I'll just read it. "You showed, remarkably, that DKS plus any activation function seems to perform similarly, irrespective of the choice of activation function. But is that surprising at all, since you have shown that, for example, a tanh or a softplus transformed by DKS end up looking the same in terms of the resulting activation function? Or does this kind of phenomenon rather tell us that a much smoother activation function would actually work better than anything else?"

James Martens: Yeah, OK, so there are a few things there, I would say. It still is somewhat surprising, because it's not obvious that when you train these networks they're going to stay in the regime described by these Q maps and C maps, right? The inputs to the activation functions could get much bigger or much smaller than what's predicted by this theory during the course of training. Now, that won't happen in the infinite-width kernel limit, but not everybody believes that real networks stay in the kernel regime when you train them. So I think it's still not obvious that this would work, despite the fact that, yes, as you pointed out, the activation functions do look similar to each other once they're transformed, at least in the region where the kernel theory says the behaviour should matter. Another thing is that it's not good enough just to make the activation functions smooth.
This approach is shaping them in a very particular and delicate way, and it's very easy to trick yourself into thinking you can just eyeball these plots of the activation functions and know whether they're going to do well. Trust me, that's not true. For example, a softplus looks very similar to a ReLU when you just look at the graph, but in terms of the kernel properties they are wildly different.

Host: Are there any other questions from the audience? You can just unmute yourself. Yes, there's one.

Questioner: Thanks, it's very interesting. I was wondering, just to understand the work better: when you're enforcing the four or five conditions that you define, like the C map being zero at zero, do you work primarily on the activation functions, or do you also work on the initialization of the weights? Because, if I remember correctly, the C map was defined by the variances of the weights in the original paper.

James Martens: So we assume that the variances of the weights are fixed, and all of our manipulations happen on the activation functions. It turns out that, due to the statement I made about these transformations being something you can replicate by manipulating the weights and biases, you could actually transform this approach into one that only adjusts the initial weights and biases rather than the activation functions, although I should point out that it would require the weights and biases to be non-independent, which departs from the older literature, which always assumes they're independent. So you can do that, but it's cleaner, I would say, to think about it in terms of activation transforms and just leave the weight distributions vanilla. It also works better in practice; that's another finding. Because K-FAC is invariant to that kind of reparameterization, if you push the transformation out of the activation function and into the weights and biases, the performance with K-FAC will be quite similar. But for SGD, pushing the transformation out of the activation function and into the weights and biases actually makes the method much worse, because SGD is not invariant to that kind of thing.

Questioner: And why is the parameterization that does the manipulation inside the activation function different from the one that does it outside, in terms of SGD optimization performance?

James Martens: I don't know.
You could probably make an argument in terms of, say, the condition number of the NTK or something, but it's not something we've studied in depth, and that difference might be the key to making this approach work even better than it does with SGD. Because if you could make the resulting optimization landscape even better conditioned without skip connections, that would perhaps enable us to get rid of K-FAC from this equation.

Host: OK, thank you very much. I think we are out of time, so let's thank the speaker again.

James Martens: Thank you.