Host: Hi everyone, welcome to this week's seminar. Today our speaker is James Martens from DeepMind. James is a research scientist working on deep learning fundamentals, training algorithms, and theory. Before joining DeepMind he did his Ph.D. at the University of Toronto under the supervision of Geoffrey Hinton. So, James, go ahead.

James Martens: Thank you. Today I'll be talking about a recent project on rapid training of deep neural networks without normalization layers or skip connections. You can see my great collaborators down here; all of them are at Alphabet, some at DeepMind and some at Google Brain.

Deep neural networks have become ubiquitous in modern machine learning applications. You see them in reinforcement learning agents, translation and language systems, vision systems, speech recognition systems, search and recommendation; they're pretty much all over the place now. But while practitioners have come up with many heuristic innovations that let neural networks train at greater depths, and which are very useful in practice, theory hasn't had much to say about this. It's been quite slow to catch up, and rarely do you see theoretical insights making an impact on the practical application of neural nets in these contexts.

Currently, deep learning seems to require some combination of the following elements to train fast: normalization layers, such as batch normalization or layer normalization; skip connections, also known as shortcut connections; and specific choices of activation function, such as ReLU and SELU. This comes with various problems. First of all, the mechanism of action of these elements is not particularly well understood. There has been progress in this direction, but I don't think we're anywhere close to a complete picture yet. It's also unclear how to use these elements in new architectures, partly because we don't fully understand how they work. If you ask the average practitioner, "why don't you just put more normalization layers between the blocks of a ResNet, what harm could it do?", it actually does a lot of harm, but it's not at all obvious why. And the very particular recipe used in ResNets is surprisingly effective for reasons that have less to do with the individual elements than with the very specific way they are combined.
Batch norm in particular has caused problems in certain domains, where the information sharing you get across the mini-batch leads to degenerate training in certain kinds of models, for example certain generative models and self-supervised models. Also, skip connections change the inductive bias of the model. That may or may not be desirable depending on your application, but it's annoying that you have to include them just to get your network to train at all. More speculatively, I'd say these techniques might be acting as a crutch, and our reliance on them could be holding us back from pushing the practice and theory of deep learning to the next level. In particular, if we don't understand why they work, there's really no way we can push the state of the art beyond just random exploration.

So in this work we develop a method called Deep Kernel Shaping, or DKS, which is a general automated framework for transforming neural nets so that they have better properties at initialization, and this makes them easier to train. The headline result is that DKS enables rapid training of neural networks that are traditionally considered hard or impossible to train. This includes very deep vanilla convolutional networks, where by vanilla I mean without batch norm or skip connections, and networks with unpopular activation functions such as tanh or softplus. The work also sheds light on why the popular choices are popular. And we'd like to speculate that this approach will be very useful in developing new models, because it removes some of the requirement for architectural features in order to enable fast training. In the paper we also provide a comprehensive explanation for why things like ReLUs, batch norm layers, and skip connections speed up training, and show how DKS makes them at least partially unnecessary; you can see the paper for that.

In general, DKS supports fully connected, convolutional, pooling, weighted-sum, layer norm, and element-wise nonlinear layers, although the nonlinear layers always have to be preceded by a fully connected or convolutional layer; that's a requirement. It supports Gaussian fan-in initializations, the standard ones, and also orthogonal initializations. Strictly speaking, for convolutional layers these have to be of the Delta type, which means you zero out everything but the central part of the filter, although in practice the approach seems to work OK without the Delta initialization.
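(For concreteness, here is a minimal sketch of a Delta-style Gaussian initialization as just described; the function name, filter layout, and variance convention are illustrative assumptions, not the paper's code:)

```python
import numpy as np

def delta_gaussian_conv_init(kh, kw, c_in, c_out, seed=0):
    """All spatial taps are zero except the central one, which gets a Gaussian
    fan-in init (variance 1 / c_in here; the exact scaling is an assumption)."""
    w = np.zeros((kh, kw, c_in, c_out))   # HWIO filter layout, chosen for illustration
    w[kh // 2, kw // 2] = np.random.default_rng(seed).normal(
        0.0, 1.0 / np.sqrt(c_in), size=(c_in, c_out))
    return w

print(delta_gaussian_conv_init(3, 3, 16, 32).shape)   # at init the conv acts like a 1x1 layer
```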
It is a formal requirement for the theory, though. We also assume that biases are initialized to zero. Certain types of weight sharing are supported by the approach, such as the sharing you see in convolutional layers and across the steps of RNNs. This is owing to some recent work showing that a lot of the mathematical tools we use are actually applicable to networks with weight sharing; previously that was not well understood. Actually, the current draft doesn't yet reflect that, so as written it doesn't cover RNNs. The approach also supports arbitrary topologies, such as multiple branches and heads, and it supports networks that do have skip connections.

For this talk we're going to simplify things a bit for the sake of clarity. We'll assume only fully connected layers and element-wise nonlinear layers, with a very simple feed-forward topology, so basically your standard MLP connectivity. And we'll assume that the network inputs are normalized to have norm equal to the square root of the dimension, which is a pretty standard thing to assume.

The mathematical basis of the method comes from the theory of kernel functions for deep networks, or kernel approximations. These are approximations that apply when the network is randomly initialized. In particular, let f be a neural network function that gives a vector output for an input x; here we can take the output to be the one just before the logit layer, so that it's still pretty wide and doesn't depend on the target output dimension. At initialization, it turns out that you can approximate... (Can everyone see my cursor? Yes? OK, good.) You can approximate both the squared norm of the output divided by its dimension, and the inner product between the network's outputs for two different inputs x and x' normalized by their respective norms, using only knowledge of the network's structure and the following scalar quantities: the squared norms, divided by dimension, of the two inputs x and x', and their inner product divided by the product of their norms. That last quantity is just a cosine similarity, if you're familiar with that. So you can do this computation, and we'll call these squared-norm quantities Q values and these cosine similarity quantities C values.
We'll say that they're computed by functions called Q maps and C maps, which take the input Q or C value and give you the output Q or C value, or at least a good approximation of it. There is a hint of a subtlety here which I've swept under the rug, but it turns out that with the data preprocessing we use you can do this. And I should say the approximation gets better as the width of the layers grows. So this is your standard deep kernel approximation boiled down to its essence.

Having stated what these maps approximate, I still haven't described how you actually define them and how you compute them. The Q map for an affine layer is just the identity function, so that's trivial. For a nonlinear layer with activation function phi, the Q map is given by a formula; it's a one-dimensional Gaussian expectation involving phi. The C map is a bit more complicated: it's a two-dimensional Gaussian expectation. It's not that important to actually write out these formulas for this talk; the real takeaway is that we can compute them, if not in closed form, then by numerical integration. And that's pretty efficient, because these are only one- or two-dimensional Gaussian integrals, so you can compute them reasonably fast, up to very high precision, for arbitrary activation functions. It also turns out that you can compute their derivatives as well, which will be important.

Having defined Q and C maps for individual layers, we can define them for entire networks, and we do that by simple composition. In particular, the Q map for a composition of two networks f and h just ends up being the composition of the individual Q maps of f and h, and similarly for C maps. By the way, if any of this is unclear, please let me know now, because the rest of the talk relies very heavily on it; if this isn't clear you're going to have a hard time following the rest, so please let me know if you have any questions.

In general, I should note that Q maps and C maps are only valid descriptions of the network at initialization; that's very important to underline. After training, you can't actually predict the output norm given the input norm; it just doesn't work.

All right. So having defined Q maps and C maps, we can start to examine them for deep networks.
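(As a concrete illustration of the kind of computation being described, here is a minimal numerical sketch, not the DKS library itself, of local Q and C maps via Gauss-Hermite quadrature and their composition over depth; it assumes equal Q values for the two inputs and the initialization conventions above, and the function names are illustrative:)

```python
import numpy as np

# Gauss-Hermite nodes/weights, rescaled so that sum(w * g(z)) ~= E[g(z)] for z ~ N(0, 1).
z, w = np.polynomial.hermite.hermgauss(60)
z, w = np.sqrt(2.0) * z, w / np.sqrt(np.pi)

def local_q_map(phi, q):
    """Q map of a nonlinear layer: E[phi(sqrt(q) * z)^2]."""
    return float(np.sum(w * phi(np.sqrt(q) * z) ** 2))

def local_c_map(phi, c, q=1.0):
    """C map of a nonlinear layer when both inputs have Q value q: cosine similarity
    of the outputs, as a 2-D Gaussian expectation on a tensor-product quadrature grid."""
    u = np.sqrt(q) * z[:, None]
    v = np.sqrt(q) * (c * z[:, None] + np.sqrt(1.0 - c ** 2) * z[None, :])
    ww = w[:, None] * w[None, :]
    return float(np.sum(ww * phi(u) * phi(v))) / local_q_map(phi, q)

def network_q_c(phi, q, c, depth):
    """Q and C maps compose layer by layer; the affine layers are assumed to
    contribute the identity under the initialization conventions described above."""
    for _ in range(depth):
        c = min(local_c_map(phi, c, q), 1.0)   # guard against tiny quadrature overshoot
        q = local_q_map(phi, q)
    return q, c

for depth in (1, 5, 50):
    print(depth, network_q_c(np.tanh, 1.0, 0.5, depth))
```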
Just to recall, the C map essentially determines the angle, because it's a cosine similarity, or equivalently the distance, since you can derive the distance from the angle if you know the norms, between two output vectors of the network, as a function of the angle between the input vectors. And in deep networks we see that C maps become degenerate, so that information about the input angles is essentially obscured; in other words, it's hard to recover. You can see that up here: these are C maps for deep ReLU networks at different depths. For a shallower network it's a pretty reasonable function; you wouldn't have any trouble inverting it to recover the input C value from the output C value. But for the deeper ones, while it's still technically invertible, it will be very hard to invert in practice under any kind of approximation or noise. And of course, because these maps are only approximate descriptions of the network's behaviour, that is a concern. Essentially what's going on is that if I give you the output C value, its dependence on the input C value will be so weak that it'll be swamped by the noise. So you've really lost the information about the input C value; in other words, you've lost information about the input distances in the output space.

Now, it's maybe not obvious that this is going to be a problem, but it turns out that it is: this situation essentially dooms gradient descent learning. In general, a degenerate C map is one that squashes the entire range of input C values around some single output value. There are two basic cases, both of which turn out to be bad for training. In the first case, the value you squash to is significantly less than one, one being the maximum possible value since these are cosine similarities. What that means is that the output vectors basically look random: they're all approximately the same angle from each other, regardless of how close or far apart the corresponding input vectors were. That makes the network look like a random hash of its input. And while you might be able to fit the training data that way, generalization is going to be impossible, because the outputs just don't reflect anything useful about the inputs. This condition also implies that early layers will have huge gradients compared to later layers, which makes optimization tricky.
The other case is that your outputs are squashed so severely that the C values get pushed to a value close to or equal to one. What that means is that all the output vectors are going to look essentially identical, because a cosine similarity of one means the vectors are the same, assuming they have the same norm, which in this case they do. The implication is that gradients in earlier layers will vanish and the loss surface will become ill-conditioned, making optimization basically impossible. This can be formalized using various techniques such as NTK theory, and this is done in the paper. Other people have also looked at this phenomenon and tried to argue why it's bad for training, and those analyses seem to hold up in practice as well.

All right. So the previous solution to this problem, from the paper that first observed the phenomenon, is a method called Edge of Chaos. In that approach, the requirement is that the derivative of the C map for each individual nonlinear layer is equal to one when evaluated at one. It turns out this condition slows the asymptotic convergence of C values to their limiting value of one as depth increases; in particular, the convergence goes from being exponential to being sub-exponential. There's a dynamical-systems-style analysis behind this, looking at the composition of many of these functions, which cares a lot about the slope at the fixed point. Unfortunately, though, given a deep enough network, the C values will still end up pretty close to fully converged. In other words, the network's C map is still going to be highly degenerate. As an example, the deep ReLU networks that we looked at on the previous slide actually already satisfy this condition: ReLUs out of the box, with the standard initialization and biases initialized to zero, satisfy it. And yet we know that a deep enough ReLU network becomes untrainable unless you add things like skip connections and batch norm, which up to this point we're not assuming.
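(A quick numerical illustration of those two claims about ReLU: each layer's C map slope at one is already one, yet the composed map still degenerates with depth. The closed form used here is the standard normalized arc-cosine kernel for a ReLU layer; this is just an illustrative check, not code from the paper:)

```python
import numpy as np

def c_relu(c):
    # Normalized C map of a single ReLU layer (degree-1 arc-cosine kernel).
    return (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / np.pi

# Per-layer slope at c = 1 (one-sided finite difference): approximately 1,
# i.e. ReLU satisfies the Edge of Chaos condition out of the box.
eps = 1e-6
print((c_relu(1.0) - c_relu(1.0 - eps)) / eps)

# ...and yet the *network* C map still degenerates: iterating the layer map over depth
# pushes every input C value toward 1, just sub-exponentially rather than exponentially.
for depth in (1, 10, 100, 1000):
    c = 0.0
    for _ in range(depth):
        c = c_relu(c)
    print(depth, round(c, 4))
```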
I'd say the main contribution of this work is a new way of controlling C map properties: instead of looking at the C map for each individual layer, we're going to look at the C map of the whole network and analyse it from that perspective. There's a way to formalize this, but the intuition rests on the fact that C maps are convex on the interval from zero to one, which is something you can prove. Intuitively, what that means is that we can control the deviation of the network's overall C map from the identity function just by controlling its slope at the maximum value, one, assuming that we know its value at zero and we set that to zero. (It would also work if you set it to some particular value significantly less than one.) You can see this in the picture: if I pin the curve at this point and vary the slope over here, there's essentially a one-to-one correspondence with how much the curve deviates from the identity, and in particular how much it flattens out and becomes degenerate around some output value, in this case zero. So controlling the network C map's derivative at one gives us a way to prevent degeneration: as long as the slope we pick isn't too extreme, the map won't be degenerate.

You can formalize this, and there's a fairly lengthy theorem in the paper that basically says: given the condition that the value at zero is zero, the C map's deviation from the identity is bounded as a function of its derivative at one. (For a rough sense of why, for a convex increasing C map with C(0) = 0 and C(1) = 1, convexity gives C(c) <= c and C(c) >= 1 + C'(1)(c - 1), so the deviation c - C(c) is at most C'(1) - 1 on that interval.) The deviation of the C map's derivative from the derivative of the identity function can be bounded in a similar way, and this holds over the entire input domain, not just zero to one.

OK, so we have a reasonable solution for C maps, it seems. Unfortunately, there are still other ways for the network to fail to be trainable. One of them is networks that are nearly linear. First, observe that linear networks have nice C maps, but their model class is very limited. In particular, linear networks actually have identity C maps, which is sort of the perfect C map: information about input distances is preserved as well as it possibly can be. But a linear network is not going to find interesting solutions, because it's intrinsically limited. You could say, well, let's just ban linear networks from consideration and stick to nonlinear networks. The problem, though, is that you can make a network nearly linear in a certain sense, so that it has a nice C map but is almost as hopeless as a linear network, which is in fact impossible to optimize, at least up to the performance that you want.
One example: you can take a ReLU network and, for each ReLU activation, add a very large constant to its input and subtract the same constant from its output, so you're essentially just transforming the ReLUs. Now they're going to behave basically like the identity function, because all reasonable inputs will be much smaller than this constant; you've essentially just gotten rid of the left part of the ReLU, the negative part. But you can show quite easily that, with a certain choice of weights and biases, you could undo the transformation we just did and recover a standard ReLU network. So the model class hasn't changed here, but obviously gradient descent is going to struggle in this situation; in fact, it will probably never even evaluate the network on inputs that land in the nonlinear region of the ReLUs. So it's basically just going to be like optimizing a linear network, and you're not going to get any of the benefits of using a neural network.

To prevent this problem, which can manifest in any type of network, not just ReLU networks, we require that the derivative of the C map at one for each individual nonlinear layer is maximized, subject to the condition on the overall network's C map that we just discussed. So now there's a tension: we want to make the derivatives large for individual layers, but we want the derivative for the overall network to be smaller than some constant. And there will be a way to compute the right balance in this trade-off.

Yet another failure mode is that the kernel approximations we're basing all of this analysis on might not be accurate at all, and in that situation nothing we're talking about makes any sense. Unfortunately, the error of these approximations can get very high in deep networks unless you make the width extremely large. In the worst case the dependence could be exponential, requiring the width to grow exponentially as a function of the depth. That's not tenable, because networks can get quite deep these days. You can see this issue perhaps most easily when you think about Q values, say, and how Q maps can be vulnerable to errors. If you look at a Q map up to first order, the error at its output is proportional to its derivative times the error at its input.
So Q maps will amplify any errors present at their input, and it turns out this derivative can get very big in a deep network in general. And if you think this problem is confined to Q values, unfortunately that doesn't work either: if the Q values are wrong, the C map computations become essentially meaningless as well. So we need to handle this problem. The solution we use in DKS is to require that the derivative of the Q map is less than or equal to one for the values of Q that we expect to see. It turns out we can actually enforce this condition, or something reasonably close to it, and this controls the compounding of errors of these kernel approximations in deep networks.

OK, so now we've identified these various failure cases and ways of manipulating the Q and C maps in order to prevent them. That leads to the conditions which define DKS, and these we'll discuss now. They apply to every subnetwork; by subnetwork I just mean some component of the network, including the whole network, that has a well-defined input and output. For an MLP you could think of, say, layers three through five of a seven-layer network; more general, arbitrary network structures have more interesting examples of subnetworks. And the network itself is, trivially, a subnetwork.

So the first condition is something we discussed before; it's more of a convention that we go with, which is that input Q values of one get mapped to output Q values of one. What this says is that the network's layers preserve the norms of their inputs, at least once you account for the dimensions of those vectors. In other words, they preserve Q values, which are squared norms divided by dimension, at least when those values equal one; for other values you can't necessarily say anything. It should be noted that for ReLU networks you essentially get this for free, because ReLU layers preserve the scale of their inputs. We go with the value one just because that's a common convention; we could have mapped two to two, or five to five, but we have to standardize to some vector length in order to do everything else that we want to do.
Also, by doing this we prevent the problem where you get exploding or vanishing vector lengths for your activation vectors, which, by the end of a deep network, can mean a very small or very large input to your loss function. That can lead to numerical problems or optimization problems, depending on the type of loss, and a Q value of one is roughly what most standard loss functions expect.

All right, so that's condition (a). The second condition, (b), is what we discussed previously, which is the requirement that the Q map be well behaved in order to control the kernel approximation error. Previously we wanted its derivative to be less than or equal to one for all potential values of Q. In general, though, we only expect to see one kind of Q value in our network, namely one. Of course, due to random fluctuations that will break down a bit, but as long as we're close, the map is continuous and smooth, so enforcing the condition at one will be good enough for values near one as well. And we set the derivative equal to one rather than trying to minimize it; this turns out to work best in practice. You could also have set it less than or equal to one, or tried to minimize it, and that would also control the kernel approximation error, but we find that setting it equal to one just seems to work best, for reasons that are not totally well understood. One is the maximum value; if you set it larger than one you run into problems.

Then we've got conditions (c) and (d). These are the conditions that we hope will prevent C map degeneration: setting the C map's value at zero equal to zero, and restricting its derivative at one to be less than or equal to some constant, often something like 1.5 or some other moderate value. And finally we've got condition (e), which prevents the nearly-linear-networks problem: the derivative of the C map at one for each nonlinear layer is maximized, subject to the other conditions.

For conditions (a), (b), and (c), it turns out that it's sufficient to have them hold for the Q and C maps of the individual nonlinear layers, and you then get them for free for all subnetworks, provided that you "normalize", quote unquote, any weighted sums in the network. I'm not going to describe exactly what that means, but it's a straightforward operation. And then conditions (d) and (e) in combination turn out to be equivalent to enforcing a particular condition on each nonlinear layer.
Specifically, we set the derivative of the C map at one for each nonlinear layer equal to this constant raised to the power one over D, where D is the depth of the network. That's the formula for MLPs; for more arbitrary topologies there's a more complicated formula, but it's important to note that it can still be easily computed in the more general case, so that's not a problem.

Right. So I've talked a lot about conditions that we want to enforce, first on the Q and C maps of the network and its subnetworks, and then, translating those, on individual layers. I think someone is asking a question?

Questioner: Yeah, that was me. On the slide before, you have these different conditions, and I think point (b) was the one that controls the kernel approximation error. Did you say it's not only there for the validity of the analysis, but it also has an impact on performance?

James Martens: Yeah. So we need the kernel approximation error to be low in order for this analysis to make any sense. But you could achieve a low kernel approximation error by requiring this derivative to be less than or equal to one, and you could even try to minimize it, which would in fact minimize the approximation error. So why didn't we minimize it? Why do we set it equal to one, which is the maximum permissible value before you get runaway approximation error? The reason is that it works best in practice, in terms of the overall effectiveness of these networks at the end of the day. We don't have a good explanation for why that's true; it's one of the remaining mysteries.

Questioner: OK. So some of these conditions are, let's say for the sake of argument, necessary to have a performant network. But might there be networks that satisfy the other conditions, that are a good initialization to train from and so on, but that don't look anything like kernels?

James Martens: Maybe. Of course, once the kernel approximations break down, none of these conditions really make any sense; they're no longer describing the network's behaviour. It's certainly possible that a network well outside of the kernel regime could train well, but then it's much harder to talk about it; we just don't have the theoretical tools to really analyse it at that point.

Questioner: OK, thank you.
James Martens: Right. So, having reduced these conditions on the Q and C maps of subnetworks down to conditions on the Q and C maps of individual nonlinear layers, I still haven't said how we actually achieve those conditions for the nonlinear layers; what are our levers of control? For that, we're going to transform the activation functions, in a fairly benign way I would argue. In particular, we're going to introduce non-trainable scalar constants for both the input and the output of each activation function, going from phi(x) to a transformed version where gamma, alpha, beta, and delta are all just fixed non-trainable scalar constants. Because you can always carefully choose the weights and biases in your network to simulate this kind of transformation, it doesn't actually change the model class; at least assuming a perfect optimizer, the space of functions computed by the network is the same. In practice, of course, doing these kinds of transformations could change the inductive bias of the model under a limited optimizer like gradient descent.

A couple of examples of transformed activation functions are plotted here; this is for a vanilla 100-layer MLP. In the case of softplus, we go from the familiar ReLU-like softplus to a softer, more gentle curve, and we see something even more dramatic for tanh. In general that's what this method does: it takes an activation function that is quite nonlinear and tones it down to be closer to a linear function. I should say this is plotted over the typical range of input values you'd expect to see. The inputs will approximately follow a Gaussian distribution with a variance of one, so once you get out to negative ten there's essentially negligible probability that you'd ever see an input of that size; what matters is how steep the activation function is in this central region. If you zoomed this graph out much further you'd see that this is still a tanh, and eventually it saturates, but that doesn't really matter in practice. The same basically holds for the scaled softplus here as well.
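(To make this concrete, here is a rough sketch, with a hypothetical parametrization and a naive solver, of how four such constants could be fitted so that a layer's local maps satisfy conditions of the kind described above; the exact parametrization, conditions, and solving procedure in the paper may differ, so treat this as an assumption-laden illustration rather than the real algorithm:)

```python
# Transformed activation: phi_hat(x) = gamma * phi(alpha * x + beta) + delta, with the
# constants chosen so that the layer's local maps satisfy (roughly) the conditions above:
#   C(0) = 0,  Q(1) = 1,  Q'(1) = 1,  C'(1) = target_slope,
# where the per-layer target slope would come from the network-level condition,
# e.g. zeta ** (1 / depth) for a depth-D MLP as described earlier.
import numpy as np
from scipy.optimize import root

z, w = np.polynomial.hermite.hermgauss(80)
z, w = np.sqrt(2.0) * z, w / np.sqrt(np.pi)           # sum(w * f(z)) ~= E[f(z)], z ~ N(0,1)

phi = np.tanh
dphi = lambda x: 1.0 - np.tanh(x) ** 2                # base activation and its derivative

def residuals(params, target_slope):
    a, b, g, d = params
    f = g * phi(a * z + b) + d                        # phi_hat(z)
    df = g * a * dphi(a * z + b)                      # phi_hat'(z)
    return [
        np.sum(w * f),                                # E[phi_hat] = 0         -> C(0) = 0
        np.sum(w * f ** 2) - 1.0,                     # E[phi_hat^2] = 1       -> Q(1) = 1
        np.sum(w * z * f * df) - 1.0,                 # dQ/dq at q = 1 equals 1
        np.sum(w * df ** 2) - target_slope,           # E[phi_hat'^2] = slope  -> C'(1) = slope
    ]

depth, zeta = 100, 1.5                                # e.g. a 100-layer MLP, network slope 1.5
sol = root(residuals, x0=[1.0, 0.0, 1.0, 0.0], args=(zeta ** (1.0 / depth),))
print(sol.success, sol.x)                             # alpha, beta, gamma, delta (if it converged)
```

(The naive initial guess and root finder here may need adjustment for other activation functions or more extreme target slopes; the point is just to show that each condition reduces to a one-dimensional Gaussian expectation of the transformed activation and its derivative.)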
All right, so having described the approach, how about the experiments? Our basic setup is that we train a ResNet-101 V2 style architecture on ImageNet, with and without batch norm and skip connections. The batch size is 512, and the learning rate schedules were optimized dynamically using a method called FIRE PBT, developed at DeepMind recently, with the particular objective of maximizing optimization speed. In other words, the choice of learning rate schedule is not a confound with regard to optimization.

Why did we examine optimization speed rather than generalization performance? Well, the goal of this work was mostly to close the gap between ResNets and networks that don't have all of those architectural flourishes, and the main gap you see there is actually in optimization speed. In fact, networks without skip connections, if they're made deep enough, basically don't train at all; they just sit at zero performance. We do have some follow-up work that looks more at generalization performance, which was recently accepted at ICLR, but I'm not going to talk about that here. The other optimization hyperparameters were tuned as well, not as extensively as the learning rate, because even at Alphabet we have limited resources, although you would cry if you saw how much computation I used. And there are lots of experiments in the paper, tons of observations and different things we studied in relation to this approach; that makes up the bulk of the paper.

The main result is for vanilla versions of this network, meaning without skip connections or batch norm, trained using DKS and compared to ResNets. A standard ResNet is this curve here, and you can see that the DKS networks with softplus or tanh essentially keep pace with it. Meanwhile, ReLU networks where we've simply stripped out the skip connections, the batch norm, or both, perform far worse. You really do need all of those elements; you can't just rip them out of a ResNet, and that's a very important point to make.

I should say this was with K-FAC, which is an optimizer designed for neural nets; it's a non-diagonal approach, pretty powerful but somewhat expensive. If we go to SGD optimization, the situation isn't as nice. A standard ResNet optimizes at about the same rate as it did with K-FAC, maybe a little slower, but the DKS networks are now trailing behind, although they're still doing much better than the vanilla networks without DKS. So there is now a gap with ResNets.
Trying to drill down into this a little more, we can look at using skip connections together with DKS. If you're using K-FAC, then once you introduce DKS, skip connections don't seem to matter at all. Here we're using skip connections whose residual branch weight is some constant, with the weight on the shortcut branch chosen so that the sum of the squares equals one; that's a requirement of the method. And all of these configurations match the performance of a standard ResNet.

Here is where things get a bit more interesting: with SGD, we can obtain the same performance as a standard ResNet using DKS just by reintroducing skip connections into these networks, at least with the softplus activation; tanh for some reason behaves a bit strangely in this experiment, although I should note that almost all activation functions do well. So it does seem that skip connections plus DKS is good enough to match ResNet performance. Your other option, if you don't want to use skip connections, is to use K-FAC.

Now, we can apply DKS to networks with a whole bunch of different activation functions, and we see that many of them, including a lot of ones that typically work very badly or don't even train at all in such deep networks, work just fine. In fact they all work pretty similarly, except for ReLU, which actually trails behind here. That's somewhat ironic, because ReLU isn't really compatible with DKS: the kinds of transformations we apply to the activation functions have only limited power in the case of ReLUs. So in some ways this isn't really the performance of DKS at all, because the method isn't really working properly for ReLUs.

We can also look at the effect of using different optimizers, because we've seen this strong dependence on the optimizer, at least when we're talking about networks without skip connections. We see that K-FAC and Shampoo, or rather a modified version of Shampoo, are both doing very well and match the performance of ResNets, whereas SGD and Adam perform roughly similarly to each other and do not allow us to replicate the optimization performance of ResNets.

We can also compare to some previous work, for example the Edge of Chaos method, which is the main precursor to DKS. We see that DKS is clearly performing better in the case of these tanh networks.
That was with K-FAC, and the gap persists, and gets even bigger, if my computer will load the graphs, once we move away from K-FAC; in fact Edge of Chaos isn't doing much here compared to Adam in this context. The horizontal lines, by the way, are the baselines: standard ResNets with SGD.

We can also look at Looks Linear, which is a method that initializes the network to be exactly linear at initialization time, using certain weight symmetries and ReLU activation functions. It turns out that it doesn't seem to work well with K-FAC, I think because K-FAC too aggressively breaks the symmetries that you have, and as a result the network enters its very nonlinear behaviour too quickly and things go off the rails. So we have to use Looks Linear with Adam, and in that case there's a clear performance gap, though that's partly just because we're using K-FAC with DKS; the comparison between SGD and Looks Linear is much closer to DKS, I would say.

Right, so this is coming to the end of the talk. Current limitations of the approach: we do not have support for multiplicative units like you see in Transformers, but I think an extension is very possible and quite interesting, actually. Vanilla networks, that is networks without skip connections or batch norm, do seem to generalize somewhat worse, at least in these experiments; I didn't talk about generalization performance, but that is an observation we make. That's largely been addressed in the follow-up work, the ICLR paper: in particular, if you change the way you do the optimization and make some small changes to DKS, you can actually close the gap to standard ResNets almost completely. To match the optimization speed of ResNets using vanilla networks, we had to use K-FAC; otherwise we had to reintroduce the skip connections, and in general training these vanilla networks with standard optimizers seems to require something extra, and perhaps more, depending on your setting. Trying to understand that, I think, is a very interesting question for future work. Maybe with the right tweak to this approach we could actually have it perform just as well as ResNets when using SGD.

Right. So I think the outlook is pretty good for DKS, and it could be a useful tool for unlocking new model classes.
I think that's the primary application here: allowing you to design your models without having to rely on some confluence of tricks to make them optimize faster, for reasons that you hopefully now understand. It should also help existing models that have optimization issues to train better, and we've started to look at that at DeepMind. And if you have models where tricks like batch norm or skip connections are causing problems or can't be used, this method could be very useful in those contexts.

There's a paper on arXiv. It's long, but I'd say it's actually not very dense and it's very self-contained, so hopefully, if you're interested, you'll find it an easy read; a lot of the length is just the experiments. There's also an official implementation which will be on GitHub quite soon. And here is some of the work that inspired this project. I'm happy to take any questions.

Host: Thank you, James, for a wonderful talk. There's one question in the chat window; it's quite long, so I'll just read it. "You showed, remarkably, that DKS plus any activation function seems to perform similarly, irrespective of the choice of activation function. But is that surprising at all, since you have shown that, for example, a tanh or a softplus transformed by DKS end up looking the same in terms of the resulting activation function? Or does this kind of phenomenon rather tell us that a much smoother activation function would actually work better than anything else?"

James Martens: Yeah, OK, so there are a few things there, I would say. It still is somewhat surprising, because it's not obvious that when you train these networks they're going to stay in the regime described by these Q maps and C maps, right? The inputs to the activation functions could get much bigger or much smaller than what's predicted by this theory during the course of training. Now, that won't happen in the infinite-width kernel limit, but not everybody believes that real networks stay in the kernel regime when you train them. So I think it's still not obvious that this would work, despite the fact that, yes, as you pointed out, the activation functions do look similar to each other once they're transformed, at least in the region where the kernel theory says the behaviour should matter. Another thing is that it's not good enough just to make the activation functions smooth.
This approach is shaping them in a very particular and delicate way, and it's very easy to trick yourself into thinking you can just eyeball these plots of the activation functions and know whether they're going to do well. Trust me, that's not true. For example, a softplus looks very similar to a ReLU when you just look at the graph, but in terms of the kernel properties they are wildly different.

Host: Are there any other questions from the audience? You can just unmute yourself. Yes, there's one.

Questioner: Thanks, it's very interesting. I was wondering, just to understand the work better: when you're enforcing the four or five conditions that you define, like the C map being zero at zero, do you work primarily on the activation functions, or do you also work on the initialization of the weights? Because, if I remember correctly, the C map was defined by the variances of the weights in the original paper.

James Martens: So we assume that the variances of the weights are fixed, and all of our manipulations happen on the activation functions. It turns out that, due to the statement I made about these transformations being something you can replicate by manipulating the weights and biases, you could actually transform this approach into one that only adjusts the initial weights and biases rather than the activation functions, although I should point out that it would require the weights and biases to be non-independent, which departs from the older literature, which always assumes they're independent. So you can do that, but it's cleaner, I would say, to think about it in terms of activation transforms and just leave the weight distributions vanilla. It also works better in practice; that's another finding. Because K-FAC is invariant to that kind of reparameterization, if you push the transformation out of the activation function and into the weights and biases, the performance with K-FAC will be quite similar. But for SGD, pushing the transformation out of the activation function and into the weights and biases actually makes the method much worse, because SGD is not invariant to that kind of thing.

Questioner: And why is the parameterization that does the manipulation inside the activation function different from the one that does it outside, in terms of SGD optimization performance?

James Martens: I don't know.
You could probably make an argument in terms of, say, the condition number of the NTK or something, but it's not something we've studied in depth, and that difference might be the key to making this approach work even better than it does with SGD. Because if you could make the resulting optimization landscape even better conditioned without skip connections, that would perhaps enable us to get rid of K-FAC from this equation.

Host: OK, thank you very much. I think we are out of time, so let's thank the speaker again.

James Martens: Thank you.