And now to... So, yeah, we have already started recording, so hello everyone and welcome to our seminar of the Computational Statistics and Machine Learning Group, a group at Oxford. Today we are very happy to have Qiang Liu. He is an assistant professor of computer science at the University of Texas at Austin. He received his PhD from the University of California, Irvine, and has been a postdoctoral fellow at the Computer Science and Artificial Intelligence Lab at MIT. His work lies at the intersection of machine learning and statistics, with interests spreading over the pipeline of data collection, learning, inference and decision making, and various applications using probabilistic modelling. He is one of the leaders in the development of machine learning and statistical methodology built on Stein's method, a topic that is of great interest for many people in the room, so that is why we are very happy to have him here today. Qiang said that he is also happy to take questions as you have them. I have muted all of you, but if you have questions you can type them in the chat, or just unmute yourself and say them. Anyway, you can start now.

OK, thank you so much. It's great to be here and talk about Stein's method and machine learning. This covers a few years of research on the topic, so I will not talk about anything that is particularly recent, but focus on the basics of the framework.

So, machine learning and statistics. The motivation that I had for this talk, for this line of work, was to develop computable discrepancy measures between data and model, because essentially statistics and machine learning at a high level are really about doing a very simple thing, which is matching data and model: using the data to understand models, or using models to understand data. Because of this, lots of problems in statistics and machine learning can be framed as either evaluating, computing, or optimising some notion of discrepancy.

Let's say we are doing parameter estimation. In that problem you are given a set of data points, which you can view as an empirical measure, and you try to find a model that fits the data. That can be viewed as minimising a discrepancy, typically the KL divergence, which gives you maximum likelihood estimation. From the other angle, sometimes we are given a probabilistic model and we want to understand the model.
This happens especially in Bayesian inference. In that case we typically draw samples from that distribution, and from the classical Monte Carlo view this can also be seen as an optimisation problem: you are finding a set of points that fits your model, so that you can use the points to understand the model. Again, a discrepancy optimisation problem, just from a different angle.

And then you have model evaluation, which, if you formulate it as a goodness-of-fit test, says that we are given both a probabilistic model and a set of samples, and we want to decide whether the sample is actually drawn from the model. That can be viewed as evaluating whether the discrepancy equals zero or not.

So I think that is a summary of what statistics is doing. What is additional in machine learning is that we care about very large models: we have highly structured, high-dimensional data, and we have to match it with really complicated models, sometimes the neural network models that are popular these days. The problem is that the emphasis is a bit different, because in statistics we are often interested in finding the statistically most powerful estimators or tests, while in machine learning we often cannot achieve that; we can only hope to find whatever is computationally available to us, and we have to prioritise computational over statistical efficiency.

An example of the intractable models that are widely used in machine learning, and that this talk is mostly about, are unnormalised distribution models. What happens here is that the probability distribution is specified by an unnormalised probability density function, and what is intractable is to evaluate the integral, the normalisation constant represented here. This happens in Bayesian statistics and in graphical models, and lots of deep learning models also have this problem. For example, assume your unnormalised density is the exponential of a neural network; in that case, people use such energy-based models as one way to generate images and all kinds of things.
The traditional way to solve this problem is Markov chain Monte Carlo, which is known to be slow in many cases, but is theoretically rigorous if it converges. On the other hand, in machine learning lots of people use what is called variational inference. This is the idea that you can transform the inference problem, just as I mentioned, into an optimisation problem over the KL divergence, so that you can approximate complicated distributions using simple parametric families such as Gaussians. But in this way you have to specify what family you use, and if you do not do that properly, you may end up with biases.

So today I will focus on Stein's method as a new foundation for solving these kinds of discrepancy problems, covering in principle all three of the problems I mentioned earlier. Whenever you want to evaluate the discrepancy between data and model, it turns out, especially for these unnormalised distributions, that Stein's method is a fundamental approach that allows us to avoid the computational difficulty of traditional methods based on maximum likelihood and the KL divergence.

OK, so Stein's method. It is a theoretical tool that was developed by Charles Stein as a technique to bound the difference between probability distributions. It is a very elegant and clever technique that was found to be remarkably powerful in the theoretical probability community and has been used to do lots of things. It was originally proposed as a way to prove central limit theorems, but then people realised you can extend it in many different ways and prove all kinds of probabilistic bounds, even concentration inequalities, and whenever it has been applied it has been found to be really successful.

There is a paper with a title along the lines of "Stein's magic method", which I think is a very good description of it, but the method was not well known in the machine learning community, simply because it was a purely theoretical tool, just for proving asymptotic results; if you are not interested in proving theorems, it was probably not that useful for you. But it turns out that is not true. The key idea behind Stein's method is actually extremely powerful even as a computational tool, and the fundamental reason is that most of the statistical machine learning computation we do is essentially about providing bounds for, or measuring, differences between distributions, and that is exactly what Stein's method is doing.
OK, so now I am going to dive in. This is a very quick review of Stein's method, or in fact only the part of it that we will use, because we will only use the part of Stein's method that is essential to us; the other, more technical parts we will not talk about, because at least right now we are not able to use them for computational purposes.

The part we will use is the essential idea. Let's say p is the distribution, the intractable unnormalised distribution that is given to you. The whole idea of Stein's method is that you can construct something called a Stein operator, which is a differential operator acting on a function space, such that if you apply the operator to an arbitrary function satisfying some mild boundary conditions, you get zero expectation under p: you get a zero-mean function. So the operator is essentially doing some sort of centring operation. And it is constructed such that two distributions p and q are equal if and only if, when you apply the Stein operator associated with p, you always get zero expectation under q, and this happens for arbitrary functions inside the function space.

There are different ways to define Stein operators, but the particular one we will use is something like this: the inner product between the function you are interested in and the gradient of log p, plus the divergence operator, which is in fact the sum of the diagonal of the Jacobian. Here phi is actually a vector-valued function, mapping features to a vector of the same dimension.

A way to think about this is the following. The trivial way to achieve the zero mean is simply to take phi minus its expectation, and let's say this expectation is under p. That is an operator that can be applied to phi and allows us to centre everything, so you achieve the zero-mean property. But the problem is that you cannot directly calculate the expectation under p.
What is magic about this method is that if you replace that centring with this special operator, just taking the inner product between phi and the gradient of log p and adding the divergence term, then you achieve exactly the same thing as if you were centring using the mean under p. And if you want, you can actually convert back and forth between the two, which involves solving a differential equation. But the essential idea is that you can centre everything under the distribution with just this operator, without directly calculating the integral.

OK, so now I need to clean up my screen; something is wrong. OK, yes. So now, what makes this idea especially tractable is that if you look at the Stein operator, everything in it is computable, even if the distribution is unnormalised. The reason is that the whole Stein operator depends on the distribution p only through the score function, and the score function is the gradient of log p, which equals the gradient of p divided by p. When you take that ratio, the dependency on the normalisation constant cancels. So you can directly calculate the score function without calculating the normalisation constant, and that is the key. If you give me a distribution, I can just code up the Stein operator using Python or something; it is completely computable.

OK, so why does the Stein operator have that strange equivalence? I am going to give you some simple intuition; the best way to look at it is using integration by parts. Let's look at one direction, which says that if p equals q, then the whole thing equals zero, and that is actually equivalent to something called Stein's identity. This is more well known in statistics in general; it is more widely used than Stein's method itself. Essentially it says that this whole expectation equals zero under p, and you can prove it just by expanding the expectation: you have p(x) multiplied by the whole Stein operator, the log p cancels, and what remains is exactly an integration by parts, where the integral equals the value of p times phi on the boundary, assuming it is one-dimensional. If you assume that the product p times phi has zero value on the boundary, or decays sufficiently fast, then you get the identity.
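To make the "you can just code it up" point concrete, here is a minimal sketch (my own toy illustration, not code from the talk): a one-dimensional Stein operator for an unnormalised density, together with a Monte Carlo check of Stein's identity. Only the gradient of the unnormalised log-density ever appears, so the normalisation constant is never needed; the density, test function and sampler below are all assumptions made just for this example.

```python
import numpy as np

# Langevin-Stein operator in 1-D:  A_p phi(x) = phi(x) * d/dx log p(x) + phi'(x).
# The model is p(x) proportional to exp(-x**4/4 - x**2/2); only the score
# (gradient of the *unnormalised* log density) is ever used.

def log_p_tilde(x):
    return -x**4 / 4 - x**2 / 2

def score(x):                    # d/dx log p(x); no normalising constant needed
    return -x**3 - x

def stein_op(x):                 # A_p phi with the test function phi(x) = sin(x)
    return np.sin(x) * score(x) + np.cos(x)

# Stein's identity says E_p[A_p phi] = 0.  Sanity-check it with samples from p
# drawn by a crude random-walk Metropolis chain (any sampler would do here).
rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(200_000):
    prop = x + rng.normal()
    if np.log(rng.uniform()) < log_p_tilde(prop) - log_p_tilde(x):
        x = prop
    samples.append(x)
print(np.mean(stein_op(np.array(samples[20_000:]))))   # close to 0
```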
So what this says is that we do need some boundary condition, but it is a very mild one, because it only requires p times phi to decay. If your p decays towards the boundary, then you do not need to worry about phi; if it does not decay, then you have to choose phi to decay. Either way, it is easy to achieve in practice.

Stein's identity in particular has been widely used; it is a really powerful tool. The reason it is powerful, if you think about it, is that it is a kind of magic idea: suddenly, for any given distribution p, you get an infinite number of identities that you can actually calculate, even though the distribution itself is intractable. This is remarkable, and you can use it to do lots of things. For example, if you treat the identities as moment equations, you can use them as a way to estimate parameters; there are many methods related to this, including the score matching method for energy-based models. There are many other things you can do. For example, you can use the identity to construct control variates, and that allows you to reduce the variance. Again, some magic happens here: it turns out that under certain conditions you can reduce the variance to zero, meaning that the typical Monte Carlo convergence rate no longer limits you, and you can actually get a faster rate than the usual scaling. So it is a very remarkable tool, but I think most people actually know more about Stein's identity than about Stein's method.

What Stein's method does is something that I think is deeper than Stein's identity, but less well known. That is the other direction of the equivalence, which says that if p does not equal q, then I must be able to find some phi that violates the equation. In other words, for any two distributions p and q that are different, I can always find some sort of discriminator phi that gets a non-zero expectation of the Stein operator. A simple way to see this is by a simple derivation: if you look at the expectation of the Stein operator of p under q, you can subtract another term, the Stein operator of q under q, which is zero by Stein's identity, and then you can combine the two Stein operators, and the divergence terms cancel.
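As a tiny illustration of the control-variate use of Stein's identity mentioned above, here is a hedged toy sketch (my own example, not from the slides): because E_p[A_p phi] = 0, any multiple of the Stein term can be subtracted from a Monte Carlo estimator without changing its mean, and for a well-chosen phi the variance can collapse dramatically.

```python
import numpy as np

# Toy control variate via Stein's identity.  Target: E_p[x^2] with p = N(0, 1),
# so score(x) = -x.  Choosing phi(x) = x gives the Stein term
#   A_p phi(x) = phi(x) * score(x) + phi'(x) = 1 - x**2,
# which is zero-mean under p and (here) perfectly anti-correlated with x**2.

rng = np.random.default_rng(0)
x = rng.normal(size=5_000)

h = x**2                       # integrand whose mean we want (true value 1)
stein = 1.0 - x**2             # zero-mean Stein control variate

# Standard control-variate coefficient fitted by least squares.
cov = np.cov(h, stein)
c = cov[0, 1] / cov[1, 1]
plain = h.mean()
controlled = (h - c * stein).mean()
print(plain, controlled)       # both near 1; the controlled estimate is exactly 1 here
```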
Then you get the difference of the score functions in an inner product with phi. Essentially, what this says is that the whole expectation is calculating some sort of inner product between phi and the difference of the score functions. So if the score functions of p and q are not equal, then in principle you can find a phi that violates the zero condition, just by taking phi to be the difference of the score functions. In this way you can show that the equivalence holds; it is a very simple intuition.

Now there is another way to prove it, which is less well known, but it is the way I really like and it motivated a lot of my methods. It says that this whole thing actually relates to the KL divergence in a very interesting way. Assume you have a random variable x that is drawn from q, and remember that phi is actually a vector field. What you can do is take phi as a vector field, multiply it by some small step size epsilon, and you get an updated variable x' = x + epsilon * phi(x). Now, if x is drawn from q, then the distribution of x', call it q_[epsilon*phi], depends on both q and phi. Then you can take the KL divergence between q_[epsilon*phi] and p and take the derivative with respect to epsilon, and it turns out that this derivative at epsilon equal to zero is exactly the negative of the expectation of the Stein operator.

So what is happening here is that as you apply this transform to the random variable and increase the step size from zero to some small value, you can measure the rate at which the KL divergence to p decreases, and that rate is exactly minus this expectation of the Stein operator. In this view, you can see the whole thing as some sort of gradient of the KL divergence. If p equals q, then obviously you are at zero divergence and you can no longer decrease it; that is why you get zero. But if you have two distributions that are different, then you should be able to find a direction that decreases the KL divergence, and that direction is exactly a phi with a non-zero decreasing rate of the divergence.

Any questions so far? I do not see any questions, so, OK, sounds great.
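Here is a small numerical check of the identity just described (a toy of my own, in one dimension, with everything chosen so that the KL divergence has a closed form): the derivative of KL(q_[x + eps*phi(x)] || p) at eps = 0 should equal minus the expected Stein operator under q.

```python
import numpy as np

# Take q = N(0, 1), p = N(mu, 1), and a constant perturbation phi(x) = c, so
# that x + eps*phi(x) is distributed as N(eps*c, 1) and
#   KL(q_eps || p) = 0.5 * (eps*c - mu)**2.
mu, c = 2.0, 0.7

kl = lambda eps: 0.5 * (eps * c - mu) ** 2
eps = 1e-5
lhs = (kl(eps) - kl(-eps)) / (2 * eps)     # d/d eps KL at eps = 0 (finite difference)

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)             # samples from q
stein_term = c * (mu - x)                  # phi(x)*score_p(x) + phi'(x), with phi' = 0
rhs = -stein_term.mean()                   # minus E_q[A_p phi]

print(lhs, rhs)                            # both close to -c*mu = -1.4
```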
And then you can essentially summarise Stein's method by defining the Stein discrepancy. The idea is that, given two distributions p and q, we take the maximum of this expectation of the Stein operator over some function family, some function set. If the function set is sufficiently large, then this whole thing actually discriminates between p and q: it equals zero if and only if p equals q. The choice of this function class is very, very important. In the original Stein's method, which was developed for theoretical purposes, you really want the function space to be large, because you do not actually care about computing the quantity numerically; you just want to make sure it is large enough to dominate other metrics, such as the Wasserstein distance or the total variation distance. Basically, the way it works is that you can use the Stein discrepancy to bound, let's say, the Wasserstein distance, and then by showing the Stein discrepancy is small you show the other distance is also small; that is how you prove bounds.

But for practical purposes, you cannot choose an arbitrary function class, because we actually want to numerically calculate this Stein discrepancy. So we have to choose a function space that is both sufficiently large and computationally tractable, so that we sacrifice some statistical power but gain computational efficiency; that is the essential trade-off here. The function class that we use is a reproducing kernel Hilbert space.

Here is a very brief introduction. We have some positive definite kernel, and the reproducing kernel Hilbert space it induces is defined essentially as the linear span of the kernel: you take arbitrary reference points in the space, you can have an infinite number of these reference points combined together, you define the norm accordingly, and if you take the closure you get the whole space. If you choose the kernel to be strictly positive definite in a certain sense, then this space can approximate the space of continuous functions arbitrarily well on a bounded domain.

So now, suppose I am optimising this whole thing, but now over the reproducing kernel Hilbert space, and here I add the constraint that the norm has to be smaller than one to avoid scaling issues. Then you can actually solve this optimisation in closed form and write down the optimal solution.
The optimal solution is this phi-star here: you take the kernel, which is a function of two variables, apply the Stein operator to one of the variables, and integrate that variable out under q, which leaves you with another function. You can then show that the value of this kernelised Stein discrepancy, the maximum value, is the expectation of a new kernel function, and the new kernel is very interesting: you take the original kernel, which is a two-variable function, and apply the Stein operator twice, the first time treating it as a function of x and the second time as a function of x', and that gives you a new positive definite kernel that has the Stein operator built in. You can then show that this is very similar to the kernel maximum mean discrepancy, but now with a special kernel that is defined through the Stein operator. You can do the derivation yourself; it is actually simple. Basically, the reason we can solve the whole thing in closed form is that the Stein operator is a linear operator, and optimising a linear functional over the unit ball of a Hilbert space always gives you a closed form.

Because of this nice closed form, you can actually evaluate it, and this is really getting to our point: if the distribution q is unknown and is only observed through a set of i.i.d. samples x_i, then you can approximate the Stein discrepancy between q and p using an empirical version. There are different ways to do it. Here I am writing the biased, V-statistic version: the true discrepancy is an expectation over pairs, and you replace the expectation with the empirical average over pairs of samples. If you remove the diagonal terms, you get an unbiased estimator, which is called a U-statistic, and you can show nice asymptotic properties for it.

You can then use this to construct a very powerful goodness-of-fit test, saying that if the discrepancy between the empirical data and p is larger than some threshold, you reject the hypothesis that p equals q. This is one way to do goodness-of-fit testing, and what is interesting about this method is that you can now run these tests for unnormalised distributions, very complicated ones, say graphical models and high-dimensional structured models, which was not possible using traditional methods.
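For concreteness, here is a hedged sketch of the kernelised Stein discrepancy with an RBF kernel, following the closed form just described (my own minimal implementation, not reference code from the papers). The model score below is a standard Gaussian purely as a stand-in assumption; you would substitute the score of your own unnormalised model.

```python
import numpy as np

def score(x):                         # grad_x log p(x); here p = N(0, I) as a stand-in
    return -x

def rbf_terms(x, y, h):
    r = x - y
    k = np.exp(-(r @ r) / (2 * h**2))
    gx = -r / h**2 * k                # grad_x k(x, y)
    gy = r / h**2 * k                 # grad_y k(x, y)
    tr = k * (len(r) / h**2 - (r @ r) / h**4)   # trace of grad_x grad_y k
    return k, gx, gy, tr

def stein_kernel(x, y, h):
    # kappa_p(x, y) = s(x).s(y) k + s(x).grad_y k + s(y).grad_x k + tr(grad_x grad_y k)
    k, gx, gy, tr = rbf_terms(x, y, h)
    return score(x) @ score(y) * k + score(x) @ gy + score(y) @ gx + tr

def ksd_ustat(X, h=1.0):
    # Unbiased U-statistic: average kappa_p(x_i, x_j) over all pairs i != j.
    n = len(X)
    total = sum(stein_kernel(X[i], X[j], h)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

rng = np.random.default_rng(0)
print(ksd_ustat(rng.normal(size=(200, 2))))              # sample from p: close to 0
print(ksd_ustat(rng.normal(1.0, 1.0, size=(200, 2))))    # shifted sample: clearly > 0
```

In a goodness-of-fit test, this U-statistic would be compared against a threshold obtained, for example, by bootstrapping, as discussed next.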
The threshold here can be decided either by bootstrap, or you can derive a concentration inequality for the Stein discrepancy and use that as the threshold as well.

Then another idea comes from the different view we talked about. Goodness-of-fit testing is like evaluating the discrepancy; but suppose we are doing a sampling problem, say posterior approximation. That is like: I give you a model, and you want to find a set of points that could fool the goodness-of-fit test. If they can fool the goodness-of-fit test, that means the sample is a good approximation of the distribution. So you can pose it as a minimisation problem: given the distribution p, I want to find a set of points that minimises the Stein discrepancy, and by doing that you hopefully find points that approximate the distribution well. This is indeed a very powerful idea and has been exploited in several different ways.

The way that I will explore it is a bit different. I am not going to directly minimise over the point locations, because somehow that is difficult; it gives you a complex optimisation. Instead, I can solve an easier problem. Assume we already have a set of points x_i that were generated arbitrarily. What we want is to find a set of weights associated with those points, such that the weighted empirical measure of the points approximates the distribution. That can be framed as minimising this weighted quadratic function subject to normalisation constraints on the weights.

This is actually quite powerful, because here we keep the set of points fixed: the points are given to you, and they are arbitrary points, you do not need to know where they came from. For example, you could run an MCMC procedure and get an approximation, but you are not sure whether the approximation is good enough. Then what you can do is find a set of weights on the points that corrects the bias of your original MCMC procedure. You do not have to know the distribution of the x_i, and they can even be generated deterministically. Using this method, you can still get a set of weights that corrects the bias, and we can show that this works quite nicely: it does not require us to know the proposal distribution of the x_i, and it actually gives you a better estimate.
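Here is a hedged sketch of that reweighting idea (my own one-dimensional toy, not the authors' code): given arbitrary points, choose simplex weights w that minimise the weighted Stein discrepancy w^T K_p w, where K_p is the Stein-kernel Gram matrix. The target, kernel, bandwidth and the simple exponentiated-gradient solver are all assumptions made for the example; any quadratic-programming solver would do.

```python
import numpy as np

# 1-D Stein kernel kappa_p for an RBF kernel, with p = N(0, 1) so score(x) = -x.
def kappa_p(x, y, h=1.0, score=lambda t: -t):
    k = np.exp(-(x - y) ** 2 / (2 * h**2))
    dkx = -(x - y) / h**2 * k                      # d k / d x
    dky = (x - y) / h**2 * k                       # d k / d y
    dkxy = k * (1.0 / h**2 - (x - y) ** 2 / h**4)  # d^2 k / dx dy
    return score(x) * score(y) * k + score(x) * dky + score(y) * dkx + dkxy

def stein_weights(Kp, n_steps=2000, lr=0.1):
    # Minimise w^T Kp w over the probability simplex by exponentiated gradient.
    w = np.full(Kp.shape[0], 1.0 / Kp.shape[0])
    for _ in range(n_steps):
        w = w * np.exp(-lr * 2 * Kp @ w)
        w /= w.sum()
    return w

rng = np.random.default_rng(0)
X = rng.normal(0.8, 1.0, size=150)      # points from a *biased* proposal, mean 0.8
Kp = kappa_p(X[:, None], X[None, :])    # Stein-kernel Gram matrix
w = stein_weights(Kp)
print(X.mean(), w @ X)                  # plain mean vs Stein-corrected mean (closer to 0)
```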
If the function h that you want to approximate is a smooth function, you also get some benefit of variance reduction, so that you can actually improve the approximation rate. So this is one kind of approach where we exploit the Stein discrepancy to improve numerical approximation.

But there is another method that I think is particularly interesting, which is: how can we directly find a set of points to approximate a distribution? This is really the sampling problem. (For some reason I always have difficulty at this slide... yeah, OK.) So the idea here is that we are given a distribution p and we want to find a set of points to approximate it; essentially it is the sampling problem. And instead of minimising the Stein discrepancy directly, what we can do is directly minimise the KL divergence by optimally changing the variable, transporting the variable in some sense, so it is very similar to optimal transport.

The idea is that at every step we transport the particles using the map x' = x + epsilon * phi(x), and we choose the velocity field phi such that it always decreases the KL divergence as fast as possible. That can be framed as maximising the negative of the derivative of the KL divergence, its decreasing rate, and essentially this defines some notion of functional gradient descent on the space of distributions. As I mentioned earlier, it turns out this decreasing rate of the KL divergence is exactly the expected Stein operator. That is why this optimisation reduces exactly to the optimisation we had for the Stein discrepancy, and therefore the optimal phi that we obtained earlier is exactly the phi that decreases the KL divergence as fast as possible, so you can use this phi to transport your particles. It also turns out that the Stein discrepancy is exactly the maximum decreasing rate, so it quantifies how much you can decrease the divergence when moving from q towards p.

Using that, you can derive what is called Stein variational gradient descent (SVGD). Basically, you maintain a set of particles, at every step q is the empirical measure of the particles, and you just apply this transform iteratively. It is very similar to gradient descent, but it is a particle system, because you have a set of particles.
Each of them is a point that is updated iteratively. And this is an interacting particle system, because the update of each particle depends on the others through the empirical measure; this is sometimes called the mean field of the particles. That is why it is also related to mean-field interacting particle systems, which is a large area in applied mathematics.

Here is the intuition of what is happening. The first term here is a gradient term that drives the particles to increase the probability. The second term is a repulsive force term that, practically speaking, forces the different particles to stay away from each other, and in the end you get a nice approximation of the distribution. If you do not have the second term, the particles collapse and you can only find a mode, like typical optimisation does; the repulsive force plays a critical role here. This is what happens when you have lots of particles: you can then approximate the density function. You can almost view this as a kind of limit: when you have infinitely many particles, essentially the whole process evolves according to a partial differential equation, and that is exactly what we can analyse. Here is another demo.

One particular practical advantage is that this algorithm reduces exactly to gradient ascent for MAP estimation when you only have one particle. This is very nice, because with typical Monte Carlo methods, if you approximate the whole distribution with one single point, that point is going to be essentially random and it is not going to do well in any sense, except that it is an unbiased estimate. But here, if we use SVGD with only one single particle, you already get the mode, and the mode is already very powerful, as we see in machine learning. So you can build up from the MAP estimate and then gradually increase the power by adding particles.

It turns out there is rich theory associated with this type of algorithm. In the limit when you have, say, an infinite number of particles, and the step size decreases to zero, this whole particle evolution can be associated with a partial differential equation, and you can show that this equation decreases the KL divergence monotonically, unsurprisingly, with a rate equal to the Stein discrepancy.
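Here is a hedged sketch of the SVGD update just described (a minimal one-dimensional toy of my own, not the authors' reference implementation). Each particle is moved along the empirical version of the optimal phi: a kernel-weighted average of the particles' scores, the driving term, plus the kernel gradients, the repulsive term. The target, kernel bandwidth and step size are assumptions for the example.

```python
import numpy as np

def svgd(score, x, n_iter=500, step=0.05, h=0.5):
    # phi*(x_i) = mean_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    for _ in range(n_iter):
        diff = x[:, None] - x[None, :]           # x_i - x_j
        k = np.exp(-diff**2 / (2 * h**2))        # RBF kernel matrix k(x_j, x_i)
        grad_k = diff / h**2 * k                 # grad_{x_j} k(x_j, x_i): repulsion
        phi = (k @ score(x) + grad_k.sum(axis=1)) / len(x)
        x = x + step * phi
    return x

# Toy target p = N(3, 1), so score(x) = 3 - x; start the particles far from the mode.
rng = np.random.default_rng(0)
particles = svgd(lambda x: 3.0 - x, rng.normal(size=50))
print(particles.mean(), particles.std())         # mean near 3, spread on the order of 1
```

Note that with a single particle the repulsive term vanishes (the kernel gradient at zero distance is zero), so the update collapses to plain gradient ascent on log p, which is the MAP-reduction point made above.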
And then you can also show that, formally, you can interpret the whole process as a gradient flow of the KL divergence under a certain metric on the space of distributions, which makes a very close connexion to optimal transport. It turns out you can define a sort of optimal transport distance from q to p as the minimum transport cost of moving the mass from q to p, but here we use a very special way to define the transport cost, based on the RKHS norm of the velocity field. If you use the typical L2 norm, you get the usual optimal transport metric; here it is a kind of kernelised optimal transport. If you define that metric and take the gradient flow of the KL divergence under it, you get SVGD.

Here is a comparison between SVGD and Langevin dynamics, which is very similar and closely related. If you run Langevin dynamics, you have particles and at every step you add random noise; here in SVGD, we have a set of particles that interact through a deterministic function. Both of them can be characterised by partial differential equations, and in fact both of them have gradient flow interpretations, except that Langevin dynamics is the gradient flow under the typical L2 optimal transport metric, whereas SVGD uses this special kernelised optimal transport metric.

There is also another, very different way to view SVGD, quite different from the gradient flow view, which is that as you evolve these particles, they are trying to do something very similar to quadrature methods in numerical integration. If you remember from numerical methods textbooks, say Gaussian or Gauss-Hermite quadrature, these methods are based on the idea that you want to find a set of points such that when you integrate polynomials, for example, you get exactly the right answer; the hope is then that the actual function you integrate is close to a polynomial, so that you get a good approximation. It turns out SVGD is doing something very similar: you can find a set of functions on which SVGD, at any fixed point, matches the expectations exactly, and that set of functions is determined by the Stein operator as well as the kernel. If you choose them properly, you can recover the polynomial family.
But now we are more general, so we can match a richer class of functions; that is essentially what is happening here. Basically, you can show that if you are approximating a Gaussian distribution and choose the kernel appropriately, you essentially recover the classical quadrature picture, and if you use a polynomial kernel over a Gaussian distribution, you actually recover the polynomial families. You can apply this method to more general distributions, and using this view you can show some bounds. I think this opens some very interesting directions and angles that have not really been explored.

OK, so I think I am out of time, so very quickly: there are several variants that I cannot cover properly. One extension that I think is particularly interesting is that this whole thing does not have to depend on the gradient; it turns out you can derive a gradient-free version of SVGD. It is an idea that is very similar to importance sampling, but different in important ways. Basically, assume you need the gradient of log p and assume it is very difficult to calculate. Then what you can do is pick an arbitrary positive function rho and replace the gradient of log p with the gradient of log rho. Obviously this gives you the wrong direction, but then you can correct the bias using the importance ratio, the ratio between rho and p, and in that way you still converge to the correct distribution. This can be very useful if your distribution is one for which it is very difficult to calculate the gradient.

Another algorithm that I think is interesting, and more or less less understood, is amortised SVGD. The idea here is that instead of finding a set of particles to approximate the distribution, I can do something very similar to a GAN, which is to find a neural network such that when you inject random inputs into the network, the network outputs random outputs that approximately follow the distribution you want. This can be done easily by some sort of imitation idea: every time, you update the neural network so that the particles it produces follow the SVGD direction. Here is the iterative algorithm; let me explain very quickly. Every time, you have a neural network and the network outputs a set of particles; these are the green dots. Then you update the particles using SVGD.
So the particles move closer to the target distribution; these are the purple dots. Then you go back to the network and modify its weights such that the next time, the network outputs the purple dots directly. Based on that, you again find points that are even closer to the distribution, and you update the network weights so that it outputs those. By iterating this, you can actually train your network to draw samples from the distribution.

So that is essentially what I wanted to talk about. This area of Stein's method in machine learning has attracted a lot of recent interest. I think it is an area with lots of very interesting theoretical problems that are still open; for example, for SVGD we do not know exactly the rate of convergence, and we do not know the best choice of kernel, which is always a problem for kernel methods. There is a lot of room for improving and extending SVGD as well as the kernelised Stein discrepancy, which I do not think has been fully explored, and lots of applications as well. In fact, these ideas have been used in many applications, such as reinforcement learning and uncertainty quantification, so I think there is also lots of room on the application side. So maybe I will stop here. Thank you.

Yeah, thank you. I don't know if anyone has questions; otherwise I do have some questions. So first: at the end, you mentioned this importance-sampling-like method, but you emphasised that it is not quite importance sampling, so I wanted to know why it is not quite the same.

It is not the same because here we are not doing Monte Carlo sampling. And if you look at it, it is a strange method, because it is almost like importance sampling with the target playing the role of the proposal: the target p is used in the denominator of the ratio, whereas in typical importance sampling the target distribution is in the numerator.

Yeah, well, it is similar in that both of them involve a density ratio, but they are different because it is a completely different setting; we are not doing any Monte Carlo estimate here.

Can I follow up on that? Yeah, so I wasn't quite sure about this, because you said the big advantage is that we don't need the normalising constant for p, right? But if you go through this ratio with rho, some other distribution,
then you can't actually calculate the ratio unless we have the normalising constant, right?

Yes, but it really just becomes part of the step size. Let's say p has a normalisation constant Z; you can push the normalising constant into the step size, and if you choose the step size to be small, you don't need to worry about it. Does that make sense?

So then you don't know the right epsilon, but you still choose it.

Yeah, yeah. You have to divide by Z here, but then you can push the Z over here, so the epsilon becomes epsilon over Z, right? But that is the step size, which you can choose, and the step size goes to zero anyway. So it will affect how you choose the step size, but other than that it does not matter.

OK, thank you. Another question: at some point you showed a convergence result, and you said that in order to approximate an integral, this Stein-based approach achieves a convergence rate that is strictly better than Monte Carlo. That was surprising to me, so could you comment on that?

Is it this one? Yeah. Yes, it is actually something very interesting, although it is perhaps not as surprising as it seems; let me explain what is happening. The reason is that here you are designing the weights to explicitly minimise the Stein discrepancy. Now, the Stein discrepancy can be written as the supremum of the difference between the empirical mean and the actual mean over a kind of special function space, and that space is spanned by taking the original RKHS and applying the Stein operator to it; over that space you get a new space. It turns out that those functions are approximated particularly well by the Stein weights, because they are exactly the kind of functions that are bounded by the Stein discrepancy. So for that family of functions you get a really good approximation error. But it does not mean we are getting a free lunch, because there could be functions outside the family that perform worse than Monte Carlo.
So what I think these Stein-based methods do is somehow prioritise certain functions, and this happens for SVGD as well. As I mentioned, you can find a set of functions on which the SVGD algorithm calculates the integral exactly, with essentially no error beyond numerical precision. But other functions may not be approximated as well, so it is more like a prioritised space of functions. This is different from Monte Carlo methods, where you get the same approximation rate across all functions.

I should say someone is asking: do you have a good rule of thumb for choosing the kernel?

We don't really have one; that is an open question. We do have lots of insights that haven't really been put together into an automatic procedure. What happened was that in the beginning we didn't really know what kernel to use. We know, for example, that if you use a kernel that is universal, then theoretically it's OK, so it must work; so we were happy using the RBF kernel in most of the applications, and it works reasonably well, and obviously other researchers have proposed different kernel choices. But one thing that I wanted to explore, and we haven't, is this kernel quadrature view, which I think really gives a lot of insight into the choice of kernel. What happens is that the kernel actually defines the space of functions on which SVGD will exactly match expectations. Just like Gaussian quadrature chooses to match the polynomial functions, SVGD chooses to match a special family of functions that is defined by the kernel. But the mapping from the kernel to the functions that we exactly match is a complicated map. If we could somehow understand that map, and in fact numerically solve it, then it would be very powerful. Because, let's say we are interested in calculating the variance rather than the mean; then we could hopefully design the kernel such that the quadratic function is inside, or at least close to, the function space that we are matching, and if that happens we would get a really good approximation.

In fact, you can see this already happening. For example, if the distribution p is a Gaussian distribution and we choose the kernel to be a linear kernel, that is k(x, x') = x^T x' + 1, then you can show that the kernel's function space is exactly the set of first-order polynomials, and after applying the Stein operator you can show that SVGD exactly calculates the mean and the variance of a Gaussian with this linear kernel. So that actually explains why SVGD is sometimes really good at calculating the means of Gaussian-like distributions; I think that's the reason.
But the typical choice, the RBF kernel, is not actually the right kernel for Gaussian distributions. That also explains why, for example, people often find that SVGD tends to underestimate the variance: I think it is because the Gaussian RBF kernel is not the right kernel for Gaussian-like distributions.

Yeah. I don't know if there are more questions; otherwise it is probably a good time to finish. Thank you a lot, this has been very interesting. Thank you. If you've got a minute...