And now to... So, yeah, we have already started recording, so hello everyone and welcome to our seminar of the Computational Statistics and Machine Learning Group, a group at Oxford. Today we are very happy to have Qiang Liu. He is an assistant professor of computer science at the University of Texas at Austin. He received his PhD from the University of California, Irvine, and has been a postdoctoral fellow at the Computer Science and Artificial Intelligence Lab at MIT. His work lies at the intersection of machine learning and statistics, with interests spreading over the pipeline of data collection, learning, inference and decision making, and various applications using probabilistic modelling. He is one of the leaders in the development of machine learning and statistical methodology built on Stein's method, a topic that is of great interest for many people in the room, so that is why we are very happy to have him here today. Qiang said that he is also happy to take questions as you have them. I have muted all of you, but if you have questions you can type them in the chat, or just unmute yourself and say them. Anyway, you can start now.

OK, thank you so much. It's great to be here and talk about Stein's method and machine learning. This covers a few years of research on the topic, so I will not talk about anything that is particularly recent, but focus on the basics of the framework.

So, machine learning and statistics. The motivation that I had for this talk, for this line of work, was to develop computable discrepancy measures between data and model, because essentially statistics and machine learning at a high level are really about doing a very simple thing, which is matching data and model: using the data to understand models, or using models to understand data. Because of this, lots of problems in statistics and machine learning can be framed as either evaluating, computing, or optimising some notion of discrepancy.

Let's say we are doing parameter estimation. In that problem you are given a set of data points, which you can view as an empirical measure, and you try to find a model that fits the data. That can be viewed as minimising a discrepancy, typically the KL divergence, which gives you maximum likelihood estimation. From the other angle, sometimes we are given a probabilistic model and we want to understand the model.
This happens especially in Bayesian inference. In that case we typically draw samples from that distribution, and from the classical Monte Carlo view this can also be seen as an optimisation problem: you are finding a set of points that fits your model, so that you can use the points to understand the model. Again, a discrepancy optimisation problem, just from a different angle.

And then you have model evaluation, which, if you formulate it as a goodness-of-fit test, says that we are given both a probabilistic model and a set of samples, and we want to decide whether the sample is actually drawn from the model. That can be viewed as evaluating whether the discrepancy equals zero or not.

So I think that is a summary of what statistics is doing. What is additional in machine learning is that we care about very large models: we have highly structured, high-dimensional data, and we have to match it with really complicated models, sometimes the neural network models that are popular these days. The problem is that the emphasis is a bit different, because in statistics we are often interested in finding the statistically most powerful estimators or tests, while in machine learning we often cannot achieve that; we can only hope to find whatever is computationally available to us, and we have to prioritise computational over statistical efficiency.

An example of the intractable models that are widely used in machine learning, and that this talk is mostly about, are unnormalised distribution models. What happens here is that the probability distribution is specified by an unnormalised probability density function, and what is intractable is to evaluate the integral, the normalisation constant represented here. This happens in Bayesian statistics and in graphical models, and lots of deep learning models also have this problem. For example, assume your unnormalised density is the exponential of a neural network; in that case, people use such energy-based models as one way to generate images and all kinds of things.
The traditional way to solve this problem is Markov chain Monte Carlo, which is known to be slow in many cases, but is theoretically rigorous if it converges. On the other hand, in machine learning lots of people use what is called variational inference. This is the idea that you can transform the inference problem, just as I mentioned, into an optimisation problem over the KL divergence, so that you can approximate complicated distributions using simple parametric families such as Gaussians. But in this way you have to specify what family you use, and if you do not do that properly, you may end up with biases.

So today I will focus on Stein's method as a new foundation for solving these kinds of discrepancy problems, covering in principle all three of the problems I mentioned earlier. Whenever you want to evaluate the discrepancy between data and model, it turns out, especially for these unnormalised distributions, that Stein's method is a fundamental approach that allows us to avoid the computational difficulty of traditional methods based on maximum likelihood and the KL divergence.

OK, so Stein's method. It is a theoretical tool that was developed by Charles Stein as a technique to bound the difference between probability distributions. It is a very elegant and clever technique that was found to be remarkably powerful in the theoretical probability community and has been used to do lots of things. It was originally proposed as a way to prove central limit theorems, but then people realised you can extend it in many different ways and prove all kinds of probabilistic bounds, even concentration inequalities, and whenever it has been applied it has been found to be really successful.

There is a paper with a title along the lines of "Stein's magic method", which I think is a very good description of it, but the method was not well known in the machine learning community, simply because it was a purely theoretical tool, just for proving asymptotic results; if you are not interested in proving theorems, it was probably not that useful for you. But it turns out that is not true. The key idea behind Stein's method is actually extremely powerful even as a computational tool, and the fundamental reason is that most of the statistical machine learning computation we do is essentially about providing bounds for, or measuring, differences between distributions, and that is exactly what Stein's method is doing.
OK, so now I am going to dive in. This is a very quick review of Stein's method, or in fact only the part of it that we will use, because we will only use the part of Stein's method that is essential to us; the other, more technical parts we will not talk about, because at least right now we are not able to use them for computational purposes.

The part we will use is the essential idea. Let's say p is the distribution, the intractable unnormalised distribution that is given to you. The whole idea of Stein's method is that you can construct something called a Stein operator, which is a differential operator acting on a function space, such that if you apply the operator to an arbitrary function satisfying some mild boundary conditions, you get zero expectation under p: you get a zero-mean function. So the operator is essentially doing some sort of centring operation. And it is constructed such that two distributions p and q are equal if and only if, when you apply the Stein operator associated with p, you always get zero expectation under q, and this happens for arbitrary functions inside the function space.

There are different ways to define Stein operators, but the particular one we will use is something like this: the inner product between the function you are interested in and the gradient of log p, plus the divergence operator, which is in fact the sum of the diagonal of the Jacobian. Here phi is actually a vector-valued function, mapping features to a vector of the same dimension.

A way to think about this is the following. The trivial way to achieve the zero mean is simply to take phi minus its expectation, and let's say this expectation is under p. That is an operator that can be applied to phi and allows us to centre everything, so you achieve the zero-mean property. But the problem is that you cannot directly calculate the expectation under p.
What is magic about this method is that if you replace that centring with this special operator, just taking the inner product between phi and the gradient of log p and adding the divergence term, then you achieve exactly the same thing as if you were centring using the mean under p. And if you want, you can actually convert back and forth between the two, which involves solving a differential equation. But the essential idea is that you can centre everything under the distribution with just this operator, without directly calculating the integral.

OK, so now I need to clean up my screen; something is wrong. OK, yes. So now, what makes this idea especially tractable is that if you look at the Stein operator, everything in it is computable, even if the distribution is unnormalised. The reason is that the whole Stein operator depends on the distribution p only through the score function, and the score function is the gradient of log p, which equals the gradient of p divided by p. When you take that ratio, the dependency on the normalisation constant cancels. So you can directly calculate the score function without calculating the normalisation constant, and that is the key. If you give me a distribution, I can just code up the Stein operator using Python or something; it is completely computable.

OK, so why does the Stein operator have that strange equivalence? I am going to give you some simple intuition; the best way to look at it is using integration by parts. Let's look at one direction, which says that if p equals q, then the whole thing equals zero, and that is actually equivalent to something called Stein's identity. This is more well known in statistics in general; it is more widely used than Stein's method itself. Essentially it says that this whole expectation equals zero under p, and you can prove it just by expanding the expectation: you have p(x) multiplied by the whole Stein operator, the log p cancels, and what remains is exactly an integration by parts, where the integral equals the value of p times phi on the boundary, assuming it is one-dimensional. If you assume that the product p times phi has zero value on the boundary, or decays sufficiently fast, then you get the identity.
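To make the "you can just code it up" point concrete, here is a minimal sketch (my own toy illustration, not code from the talk): a one-dimensional Stein operator for an unnormalised density, together with a Monte Carlo check of Stein's identity. Only the gradient of the unnormalised log-density ever appears, so the normalisation constant is never needed; the density, test function and sampler below are all assumptions made just for this example.

```python
import numpy as np

# Langevin-Stein operator in 1-D:  A_p phi(x) = phi(x) * d/dx log p(x) + phi'(x).
# The model is p(x) proportional to exp(-x**4/4 - x**2/2); only the score
# (gradient of the *unnormalised* log density) is ever used.

def log_p_tilde(x):
    return -x**4 / 4 - x**2 / 2

def score(x):                    # d/dx log p(x); no normalising constant needed
    return -x**3 - x

def stein_op(x):                 # A_p phi with the test function phi(x) = sin(x)
    return np.sin(x) * score(x) + np.cos(x)

# Stein's identity says E_p[A_p phi] = 0.  Sanity-check it with samples from p
# drawn by a crude random-walk Metropolis chain (any sampler would do here).
rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(200_000):
    prop = x + rng.normal()
    if np.log(rng.uniform()) < log_p_tilde(prop) - log_p_tilde(x):
        x = prop
    samples.append(x)
print(np.mean(stein_op(np.array(samples[20_000:]))))   # close to 0
```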
So what this says is that we do need some boundary condition, but it is a very mild one, because it only requires p times phi to decay. If your p decays towards the boundary, then you do not need to worry about phi; if it does not decay, then you have to choose phi to decay. Either way, it is easy to achieve in practice.

Stein's identity in particular has been widely used; it is a really powerful tool. The reason it is powerful, if you think about it, is that it is a kind of magic idea: suddenly, for any given distribution p, you get an infinite number of identities that you can actually calculate, even though the distribution itself is intractable. This is remarkable, and you can use it to do lots of things. For example, if you treat the identities as moment equations, you can use them as a way to estimate parameters; there are many methods related to this, including the score matching method for energy-based models. There are many other things you can do. For example, you can use the identity to construct control variates, and that allows you to reduce the variance. Again, some magic happens here: it turns out that under certain conditions you can reduce the variance to zero, meaning that the typical Monte Carlo convergence rate no longer limits you, and you can actually get a faster rate than the usual scaling. So it is a very remarkable tool, but I think most people actually know more about Stein's identity than about Stein's method.

What Stein's method does is something that I think is deeper than Stein's identity, but less well known. That is the other direction of the equivalence, which says that if p does not equal q, then I must be able to find some phi that violates the equation. In other words, for any two distributions p and q that are different, I can always find some sort of discriminator phi that gets a non-zero expectation of the Stein operator. A simple way to see this is by a simple derivation: if you look at the expectation of the Stein operator of p under q, you can subtract another term, the Stein operator of q under q, which is zero by Stein's identity, and then you can combine the two Stein operators, and the divergence terms cancel.
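As a tiny illustration of the control-variate use of Stein's identity mentioned above, here is a hedged toy sketch (my own example, not from the slides): because E_p[A_p phi] = 0, any multiple of the Stein term can be subtracted from a Monte Carlo estimator without changing its mean, and for a well-chosen phi the variance can collapse dramatically.

```python
import numpy as np

# Toy control variate via Stein's identity.  Target: E_p[x^2] with p = N(0, 1),
# so score(x) = -x.  Choosing phi(x) = x gives the Stein term
#   A_p phi(x) = phi(x) * score(x) + phi'(x) = 1 - x**2,
# which is zero-mean under p and (here) perfectly anti-correlated with x**2.

rng = np.random.default_rng(0)
x = rng.normal(size=5_000)

h = x**2                       # integrand whose mean we want (true value 1)
stein = 1.0 - x**2             # zero-mean Stein control variate

# Standard control-variate coefficient fitted by least squares.
cov = np.cov(h, stein)
c = cov[0, 1] / cov[1, 1]
plain = h.mean()
controlled = (h - c * stein).mean()
print(plain, controlled)       # both near 1; the controlled estimate is exactly 1 here
```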
Then you get the difference of the score functions in an inner product with phi. Essentially, what this says is that the whole expectation is calculating some sort of inner product between phi and the difference of the score functions. So if the score functions of p and q are not equal, then in principle you can find a phi that violates the zero condition, just by taking phi to be the difference of the score functions. In this way you can show that the equivalence holds; it is a very simple intuition.

Now there is another way to prove it, which is less well known, but it is the way I really like and it motivated a lot of my methods. It says that this whole thing actually relates to the KL divergence in a very interesting way. Assume you have a random variable x that is drawn from q, and remember that phi is actually a vector field. What you can do is take phi as a vector field, multiply it by some small step size epsilon, and you get an updated variable x' = x + epsilon * phi(x). Now, if x is drawn from q, then the distribution of x', call it q_[epsilon*phi], depends on both q and phi. Then you can take the KL divergence between q_[epsilon*phi] and p and take the derivative with respect to epsilon, and it turns out that this derivative at epsilon equal to zero is exactly the negative of the expectation of the Stein operator.

So what is happening here is that as you apply this transform to the random variable and increase the step size from zero to some small value, you can measure the rate at which the KL divergence to p decreases, and that rate is exactly minus this expectation of the Stein operator. In this view, you can see the whole thing as some sort of gradient of the KL divergence. If p equals q, then obviously you are at zero divergence and you can no longer decrease it; that is why you get zero. But if you have two distributions that are different, then you should be able to find a direction that decreases the KL divergence, and that direction is exactly a phi with a non-zero decreasing rate of the divergence.

Any questions so far? I do not see any questions, so, OK, sounds great.
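Here is a small numerical check of the identity just described (a toy of my own, in one dimension, with everything chosen so that the KL divergence has a closed form): the derivative of KL(q_[x + eps*phi(x)] || p) at eps = 0 should equal minus the expected Stein operator under q.

```python
import numpy as np

# Take q = N(0, 1), p = N(mu, 1), and a constant perturbation phi(x) = c, so
# that x + eps*phi(x) is distributed as N(eps*c, 1) and
#   KL(q_eps || p) = 0.5 * (eps*c - mu)**2.
mu, c = 2.0, 0.7

kl = lambda eps: 0.5 * (eps * c - mu) ** 2
eps = 1e-5
lhs = (kl(eps) - kl(-eps)) / (2 * eps)     # d/d eps KL at eps = 0 (finite difference)

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)             # samples from q
stein_term = c * (mu - x)                  # phi(x)*score_p(x) + phi'(x), with phi' = 0
rhs = -stein_term.mean()                   # minus E_q[A_p phi]

print(lhs, rhs)                            # both close to -c*mu = -1.4
```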
And then you can essentially summarise Stein's method by defining the Stein discrepancy. The idea is that, given two distributions p and q, we take the maximum of this expectation of the Stein operator over some function family, some function set. If the function set is sufficiently large, then this whole thing actually discriminates between p and q: it equals zero if and only if p equals q. The choice of this function class is very, very important. In the original Stein's method, which was developed for theoretical purposes, you really want the function space to be large, because you do not actually care about computing the quantity numerically; you just want to make sure it is large enough to dominate other metrics, such as the Wasserstein distance or the total variation distance. Basically, the way it works is that you can use the Stein discrepancy to bound, let's say, the Wasserstein distance, and then by showing the Stein discrepancy is small you show the other distance is also small; that is how you prove bounds.

But for practical purposes, you cannot choose an arbitrary function class, because we actually want to numerically calculate this Stein discrepancy. So we have to choose a function space that is both sufficiently large and computationally tractable, so that we sacrifice some statistical power but gain computational efficiency; that is the essential trade-off here. The function class that we use is a reproducing kernel Hilbert space.

Here is a very brief introduction. We have some positive definite kernel, and the reproducing kernel Hilbert space it induces is defined essentially as the linear span of the kernel: you take arbitrary reference points in the space, you can have an infinite number of these reference points combined together, you define the norm accordingly, and if you take the closure you get the whole space. If you choose the kernel to be strictly positive definite in a certain sense, then this space can approximate the space of continuous functions arbitrarily well on a bounded domain.

So now, suppose I am optimising this whole thing, but now over the reproducing kernel Hilbert space, and here I add the constraint that the norm has to be smaller than one to avoid scaling issues. Then you can actually solve this optimisation in closed form and write down the optimal solution.
The optimal solution is this phi-star here: you take the kernel, which is a function of two variables, apply the Stein operator to one of the variables, and integrate that variable out under q, which leaves you with another function. You can then show that the value of this kernelised Stein discrepancy, the maximum value, is the expectation of a new kernel function, and the new kernel is very interesting: you take the original kernel, which is a two-variable function, and apply the Stein operator twice, the first time treating it as a function of x and the second time as a function of x', and that gives you a new positive definite kernel that has the Stein operator built in. You can then show that this is very similar to the kernel maximum mean discrepancy, but now with a special kernel that is defined through the Stein operator. You can do the derivation yourself; it is actually simple. Basically, the reason we can solve the whole thing in closed form is that the Stein operator is a linear operator, and optimising a linear functional over the unit ball of a Hilbert space always gives you a closed form.

Because of this nice closed form, you can actually evaluate it, and this is really getting to our point: if the distribution q is unknown and is only observed through a set of i.i.d. samples x_i, then you can approximate the Stein discrepancy between q and p using an empirical version. There are different ways to do it. Here I am writing the biased, V-statistic version: the true discrepancy is an expectation over pairs, and you replace the expectation with the empirical average over pairs of samples. If you remove the diagonal terms, you get an unbiased estimator, which is called a U-statistic, and you can show nice asymptotic properties for it.

You can then use this to construct a very powerful goodness-of-fit test, saying that if the discrepancy between the empirical data and p is larger than some threshold, you reject the hypothesis that p equals q. This is one way to do goodness-of-fit testing, and what is interesting about this method is that you can now run these tests for unnormalised distributions, very complicated ones, say graphical models and high-dimensional structured models, which was not possible using traditional methods.
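For concreteness, here is a hedged sketch of the kernelised Stein discrepancy with an RBF kernel, following the closed form just described (my own minimal implementation, not reference code from the papers). The model score below is a standard Gaussian purely as a stand-in assumption; you would substitute the score of your own unnormalised model.

```python
import numpy as np

def score(x):                         # grad_x log p(x); here p = N(0, I) as a stand-in
    return -x

def rbf_terms(x, y, h):
    r = x - y
    k = np.exp(-(r @ r) / (2 * h**2))
    gx = -r / h**2 * k                # grad_x k(x, y)
    gy = r / h**2 * k                 # grad_y k(x, y)
    tr = k * (len(r) / h**2 - (r @ r) / h**4)   # trace of grad_x grad_y k
    return k, gx, gy, tr

def stein_kernel(x, y, h):
    # kappa_p(x, y) = s(x).s(y) k + s(x).grad_y k + s(y).grad_x k + tr(grad_x grad_y k)
    k, gx, gy, tr = rbf_terms(x, y, h)
    return score(x) @ score(y) * k + score(x) @ gy + score(y) @ gx + tr

def ksd_ustat(X, h=1.0):
    # Unbiased U-statistic: average kappa_p(x_i, x_j) over all pairs i != j.
    n = len(X)
    total = sum(stein_kernel(X[i], X[j], h)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

rng = np.random.default_rng(0)
print(ksd_ustat(rng.normal(size=(200, 2))))              # sample from p: close to 0
print(ksd_ustat(rng.normal(1.0, 1.0, size=(200, 2))))    # shifted sample: clearly > 0
```

In a goodness-of-fit test, this U-statistic would be compared against a threshold obtained, for example, by bootstrapping, as discussed next.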
The threshold here can be decided either by bootstrap, or you can derive a concentration inequality for the Stein discrepancy and use that as the threshold as well.

Then another idea comes from the different view we talked about. Goodness-of-fit testing is like evaluating the discrepancy; but suppose we are doing a sampling problem, say posterior approximation. That is like: I give you a model, and you want to find a set of points that could fool the goodness-of-fit test. If they can fool the goodness-of-fit test, that means the sample is a good approximation of the distribution. So you can pose it as a minimisation problem: given the distribution p, I want to find a set of points that minimises the Stein discrepancy, and by doing that you hopefully find points that approximate the distribution well. This is indeed a very powerful idea and has been exploited in several different ways.

The way that I will explore it is a bit different. I am not going to directly minimise over the point locations, because somehow that is difficult; it gives you a complex optimisation. Instead, I can solve an easier problem. Assume we already have a set of points x_i that were generated arbitrarily. What we want is to find a set of weights associated with those points, such that the weighted empirical measure of the points approximates the distribution. That can be framed as minimising this weighted quadratic function subject to normalisation constraints on the weights.

This is actually quite powerful, because here we keep the set of points fixed: the points are given to you, and they are arbitrary points, you do not need to know where they came from. For example, you could run an MCMC procedure and get an approximation, but you are not sure whether the approximation is good enough. Then what you can do is find a set of weights on the points that corrects the bias of your original MCMC procedure. You do not have to know the distribution of the x_i, and they can even be generated deterministically. Using this method, you can still get a set of weights that corrects the bias, and we can show that this works quite nicely: it does not require us to know the proposal distribution of the x_i, and it actually gives you a better estimate.
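Here is a hedged sketch of that reweighting idea (my own one-dimensional toy, not the authors' code): given arbitrary points, choose simplex weights w that minimise the weighted Stein discrepancy w^T K_p w, where K_p is the Stein-kernel Gram matrix. The target, kernel, bandwidth and the simple exponentiated-gradient solver are all assumptions made for the example; any quadratic-programming solver would do.

```python
import numpy as np

# 1-D Stein kernel kappa_p for an RBF kernel, with p = N(0, 1) so score(x) = -x.
def kappa_p(x, y, h=1.0, score=lambda t: -t):
    k = np.exp(-(x - y) ** 2 / (2 * h**2))
    dkx = -(x - y) / h**2 * k                      # d k / d x
    dky = (x - y) / h**2 * k                       # d k / d y
    dkxy = k * (1.0 / h**2 - (x - y) ** 2 / h**4)  # d^2 k / dx dy
    return score(x) * score(y) * k + score(x) * dky + score(y) * dkx + dkxy

def stein_weights(Kp, n_steps=2000, lr=0.1):
    # Minimise w^T Kp w over the probability simplex by exponentiated gradient.
    w = np.full(Kp.shape[0], 1.0 / Kp.shape[0])
    for _ in range(n_steps):
        w = w * np.exp(-lr * 2 * Kp @ w)
        w /= w.sum()
    return w

rng = np.random.default_rng(0)
X = rng.normal(0.8, 1.0, size=150)      # points from a *biased* proposal, mean 0.8
Kp = kappa_p(X[:, None], X[None, :])    # Stein-kernel Gram matrix
w = stein_weights(Kp)
print(X.mean(), w @ X)                  # plain mean vs Stein-corrected mean (closer to 0)
```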
If the function h that you want to approximate is a smooth function, you also get some benefit of variance reduction, so that you can actually improve the approximation rate. So this is one kind of approach where we exploit the Stein discrepancy to improve numerical approximation.

But there is another method that I think is particularly interesting, which is: how can we directly find a set of points to approximate a distribution? This is really the sampling problem. (For some reason I always have difficulty at this slide... yeah, OK.) So the idea here is that we are given a distribution p and we want to find a set of points to approximate it; essentially it is the sampling problem. And instead of minimising the Stein discrepancy directly, what we can do is directly minimise the KL divergence by optimally changing the variable, transporting the variable in some sense, so it is very similar to optimal transport.

The idea is that at every step we transport the particles using the map x' = x + epsilon * phi(x), and we choose the velocity field phi such that it always decreases the KL divergence as fast as possible. That can be framed as maximising the negative of the derivative of the KL divergence, its decreasing rate, and essentially this defines some notion of functional gradient descent on the space of distributions. As I mentioned earlier, it turns out this decreasing rate of the KL divergence is exactly the expected Stein operator. That is why this optimisation reduces exactly to the optimisation we had for the Stein discrepancy, and therefore the optimal phi that we obtained earlier is exactly the phi that decreases the KL divergence as fast as possible, so you can use this phi to transport your particles. It also turns out that the Stein discrepancy is exactly the maximum decreasing rate, so it quantifies how much you can decrease the divergence when moving from q towards p.

Using that, you can derive what is called Stein variational gradient descent (SVGD). Basically, you maintain a set of particles, at every step q is the empirical measure of the particles, and you just apply this transform iteratively. It is very similar to gradient descent, but it is a particle system, because you have a set of particles.
Each of them is a point that is updated iteratively. And this is an interacting particle system, because the update of each particle depends on the others through the empirical measure; this is sometimes called the mean field of the particles. That is why it is also related to mean-field interacting particle systems, which is a large area in applied mathematics.

Here is the intuition of what is happening. The first term here is a gradient term that drives the particles to increase the probability. The second term is a repulsive force term that, practically speaking, forces the different particles to stay away from each other, and in the end you get a nice approximation of the distribution. If you do not have the second term, the particles collapse and you can only find a mode, like typical optimisation does; the repulsive force plays a critical role here. This is what happens when you have lots of particles: you can then approximate the density function. You can almost view this as a kind of limit: when you have infinitely many particles, essentially the whole process evolves according to a partial differential equation, and that is exactly what we can analyse. Here is another demo.

One particular practical advantage is that this algorithm reduces exactly to gradient ascent for MAP estimation when you only have one particle. This is very nice, because with typical Monte Carlo methods, if you approximate the whole distribution with one single point, that point is going to be essentially random and it is not going to do well in any sense, except that it is an unbiased estimate. But here, if we use SVGD with only one single particle, you already get the mode, and the mode is already very powerful, as we see in machine learning. So you can build up from the MAP estimate and then gradually increase the power by adding particles.

It turns out there is rich theory associated with this type of algorithm. In the limit when you have, say, an infinite number of particles, and the step size decreases to zero, this whole particle evolution can be associated with a partial differential equation, and you can show that this equation decreases the KL divergence monotonically, unsurprisingly, with a rate equal to the Stein discrepancy.
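Here is a hedged sketch of the SVGD update just described (a minimal one-dimensional toy of my own, not the authors' reference implementation). Each particle is moved along the empirical version of the optimal phi: a kernel-weighted average of the particles' scores, the driving term, plus the kernel gradients, the repulsive term. The target, kernel bandwidth and step size are assumptions for the example.

```python
import numpy as np

def svgd(score, x, n_iter=500, step=0.05, h=0.5):
    # phi*(x_i) = mean_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    for _ in range(n_iter):
        diff = x[:, None] - x[None, :]           # x_i - x_j
        k = np.exp(-diff**2 / (2 * h**2))        # RBF kernel matrix k(x_j, x_i)
        grad_k = diff / h**2 * k                 # grad_{x_j} k(x_j, x_i): repulsion
        phi = (k @ score(x) + grad_k.sum(axis=1)) / len(x)
        x = x + step * phi
    return x

# Toy target p = N(3, 1), so score(x) = 3 - x; start the particles far from the mode.
rng = np.random.default_rng(0)
particles = svgd(lambda x: 3.0 - x, rng.normal(size=50))
print(particles.mean(), particles.std())         # mean near 3, spread on the order of 1
```

Note that with a single particle the repulsive term vanishes (the kernel gradient at zero distance is zero), so the update collapses to plain gradient ascent on log p, which is the MAP-reduction point made above.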
And then you can also show that, formally, you can interpret the whole process as a gradient flow of the KL divergence under a certain metric on the space of distributions, which makes a very close connexion to optimal transport. It turns out you can define a sort of optimal transport distance from q to p as the minimum transport cost of moving the mass from q to p, but here we use a very special way to define the transport cost, based on the RKHS norm of the velocity field. If you use the typical L2 norm, you get the usual optimal transport metric; here it is a kind of kernelised optimal transport. If you define that metric and take the gradient flow of the KL divergence under it, you get SVGD.

Here is a comparison between SVGD and Langevin dynamics, which is very similar and closely related. If you run Langevin dynamics, you have particles and at every step you add random noise; here in SVGD, we have a set of particles that interact through a deterministic function. Both of them can be characterised by partial differential equations, and in fact both of them have gradient flow interpretations, except that Langevin dynamics is the gradient flow under the typical L2 optimal transport metric, whereas SVGD uses this special kernelised optimal transport metric.

There is also another, very different way to view SVGD, quite different from the gradient flow view, which is that as you evolve these particles, they are trying to do something very similar to quadrature methods in numerical integration. If you remember from numerical methods textbooks, say Gaussian or Gauss-Hermite quadrature, these methods are based on the idea that you want to find a set of points such that when you integrate polynomials, for example, you get exactly the right answer; the hope is then that the actual function you integrate is close to a polynomial, so that you get a good approximation. It turns out SVGD is doing something very similar: you can find a set of functions on which SVGD, at any fixed point, matches the expectations exactly, and that set of functions is determined by the Stein operator as well as the kernel. If you choose them properly, you can recover the polynomial family.
But now we are more general, so we can match a richer class of functions; that is essentially what is happening here. Basically, you can show that if you are approximating a Gaussian distribution and choose the kernel appropriately, you essentially recover the classical quadrature picture, and if you use a polynomial kernel over a Gaussian distribution, you actually recover the polynomial families. You can apply this method to more general distributions, and using this view you can show some bounds. I think this opens some very interesting directions and angles that have not really been explored.

OK, so I think I am out of time, so very quickly: there are several variants that I cannot cover properly. One extension that I think is particularly interesting is that this whole thing does not have to depend on the gradient; it turns out you can derive a gradient-free version of SVGD. It is an idea that is very similar to importance sampling, but different in important ways. Basically, assume you need the gradient of log p and assume it is very difficult to calculate. Then what you can do is pick an arbitrary positive function rho and replace the gradient of log p with the gradient of log rho. Obviously this gives you the wrong direction, but then you can correct the bias using the importance ratio, the ratio between rho and p, and in that way you still converge to the correct distribution. This can be very useful if your distribution is one for which it is very difficult to calculate the gradient.

Another algorithm that I think is interesting, and more or less less understood, is amortised SVGD. The idea here is that instead of finding a set of particles to approximate the distribution, I can do something very similar to a GAN, which is to find a neural network such that when you inject random inputs into the network, the network outputs random outputs that approximately follow the distribution you want. This can be done easily by some sort of imitation idea: every time, you update the neural network so that the particles it produces follow the SVGD direction. Here is the iterative algorithm; let me explain very quickly. Every time, you have a neural network and the network outputs a set of particles; these are the green dots. Then you update the particles using SVGD.
So the particles move closer to the target distribution; these are the purple dots. Then you go back to the network and modify its weights such that the next time, the network outputs the purple dots directly. Based on that, you again find points that are even closer to the distribution, and you update the network weights so that it outputs those. By iterating this, you can actually train your network to draw samples from the distribution.

So that is essentially what I wanted to talk about. This area of Stein's method in machine learning has attracted a lot of recent interest. I think it is an area with lots of very interesting theoretical problems that are still open; for example, for SVGD we do not know exactly the rate of convergence, and we do not know the best choice of kernel, which is always a problem for kernel methods. There is a lot of room for improving and extending SVGD as well as the kernelised Stein discrepancy, which I do not think has been fully explored, and lots of applications as well. In fact, these ideas have been used in many applications, such as reinforcement learning and uncertainty quantification, so I think there is also lots of room on the application side. So maybe I will stop here. Thank you.

Yeah, thank you. I don't know if anyone has questions; otherwise I do have some questions. So first: at the end, you mentioned this importance-sampling-like method, but you emphasised that it is not quite importance sampling, so I wanted to know why it is not quite the same.

It is not the same because here we are not doing Monte Carlo sampling. And if you look at it, it is a strange method, because it is almost like importance sampling with the target playing the role of the proposal: the target p is used in the denominator of the ratio, whereas in typical importance sampling the target distribution is in the numerator.

Yeah, well, it is similar in that both of them involve a density ratio, but they are different because it is a completely different setting; we are not doing any Monte Carlo estimate here.

Can I follow up on that? Yeah, so I wasn't quite sure about this, because you said the big advantage is that we don't need the normalising constant for p, right? But if you go through this ratio with rho, some other distribution,
then you can't actually calculate the ratio unless we have the normalising constant, right?

Yes, but it really just becomes part of the step size. Let's say p has a normalisation constant Z; you can push the normalising constant into the step size, and if you choose the step size to be small, you don't need to worry about it. Does that make sense?

So then you don't know the right epsilon, but you still choose it.

Yeah, yeah. You have to divide by Z here, but then you can push the Z over here, so the epsilon becomes epsilon over Z, right? But that is the step size, which you can choose, and the step size goes to zero anyway. So it will affect how you choose the step size, but other than that it does not matter.

OK, thank you. Another question: at some point you showed a convergence result, and you said that in order to approximate an integral, this Stein-based approach achieves a convergence rate that is strictly better than Monte Carlo. That was surprising to me, so could you comment on that?

Is it this one? Yeah. Yes, it is actually something very interesting, although it is perhaps not as surprising as it seems; let me explain what is happening. The reason is that here you are designing the weights to explicitly minimise the Stein discrepancy. Now, the Stein discrepancy can be written as the supremum of the difference between the empirical mean and the actual mean over a kind of special function space, and that space is spanned by taking the original RKHS and applying the Stein operator to it; over that space you get a new space. It turns out that those functions are approximated particularly well by the Stein weights, because they are exactly the kind of functions that are bounded by the Stein discrepancy. So for that family of functions you get a really good approximation error. But it does not mean we are getting a free lunch, because there could be functions outside the family that perform worse than Monte Carlo.
So what I think these Stein-based methods do is somehow prioritise certain functions, and this happens for SVGD as well. As I mentioned, you can find a set of functions on which the SVGD algorithm calculates the integral exactly, with essentially no error beyond numerical precision. But other functions may not be approximated as well, so it is more like a prioritised space of functions. This is different from Monte Carlo methods, where you get the same approximation rate across all functions.

I should say someone is asking: do you have a good rule of thumb for choosing the kernel?

We don't really have one; that is an open question. We do have lots of insights that haven't really been put together into an automatic procedure. What happened was that in the beginning we didn't really know what kernel to use. We know, for example, that if you use a kernel that is universal, then theoretically it's OK, so it must work; so we were happy using the RBF kernel in most of the applications, and it works reasonably well, and obviously other researchers have proposed different kernel choices. But one thing that I wanted to explore, and we haven't, is this kernel quadrature view, which I think really gives a lot of insight into the choice of kernel. What happens is that the kernel actually defines the space of functions on which SVGD will exactly match expectations. Just like Gaussian quadrature chooses to match the polynomial functions, SVGD chooses to match a special family of functions that is defined by the kernel. But the mapping from the kernel to the functions that we exactly match is a complicated map. If we could somehow understand that map, and in fact numerically solve it, then it would be very powerful. Because, let's say we are interested in calculating the variance rather than the mean; then we could hopefully design the kernel such that the quadratic function is inside, or at least close to, the function space that we are matching, and if that happens we would get a really good approximation.

In fact, you can see this already happening. For example, if the distribution p is a Gaussian distribution and we choose the kernel to be a linear kernel, that is k(x, x') = x^T x' + 1, then you can show that the kernel's function space is exactly the set of first-order polynomials, and after applying the Stein operator you can show that SVGD exactly calculates the mean and the variance of a Gaussian with this linear kernel. So that actually explains why SVGD is sometimes really good at calculating the means of Gaussian-like distributions; I think that's the reason.
But the typical choice, the RBF kernel, is not actually the right kernel for Gaussian distributions. That also explains why, for example, people often find that SVGD tends to underestimate the variance: I think it is because the Gaussian RBF kernel is not the right kernel for Gaussian-like distributions.

Yeah. I don't know if there are more questions; otherwise it is probably a good time to finish. Thank you a lot, this has been very interesting. Thank you. If you've got a minute...