So the motivation for this work comes from large-scale posterior inference, and to illustrate what I mean by that, I want you to consider this very simple example of logistic regression. You have a dataset with L data points. Each data point has a feature vector, I'll call it v_l, a d-dimensional feature vector in R^d, and a binary class label. The feature vector is related to the class label through an unknown parameter vector that governs the mean of the class label; we'll call that parameter vector beta. Now, if you place a prior on that parameter vector, this becomes a Bayesian logistic regression model. One thing I want to highlight about this generative model is that it's pretty simple to express, but even in this simple case, the posterior distribution over the unknown parameters is complex: we don't know the normalisation constant, and exact integration is intractable.

So what does one normally do? Well, one will often reach for Markov chain Monte Carlo (MCMC) to eventually draw samples from the posterior distribution. The benefit of this is that you can then approximate the intractable posterior expectations written in red, these integrals, with asymptotically exact sample estimates, which are just averages over your sample points, in blue. But there's also a drawback, and one drawback is that every new MCMC sample point requires iterating through your entire observed dataset, and this can be pretty prohibitive when your dataset is very large. So motivating this work is the question: how do we scale Markov chain Monte Carlo posterior inference to massive datasets?

Over the past decade, a template solution has been emerging, which is approximate MCMC with subset posteriors. The idea here is that instead of running your exact Markov chain, you'll approximate it in a way that only uses a subset of your data to draw every new sample point. The benefit is reduced computational overhead, which means you have faster sampling and you can reduce your Monte Carlo variance quickly. But there's also a drawback, which is that it introduces a bias that's asymptotic: it doesn't go away. The reason is that your target distribution is no longer the stationary distribution of your chain, so even if you sampled forever, you would always be sampling from the wrong distribution. But the hope is that, for the fixed amount of sampling time that you have in practice, the reduction in variance that you've introduced will outweigh the bias, and this will lead to better inference.
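In symbols, the red-versus-blue contrast described above is the usual Monte Carlo approximation (notation assumed here rather than read off the slides): an intractable posterior expectation is replaced by an average over the sampled parameter values,

    \[
      \mathbb{E}_{P}[h(\beta)] \;=\; \int h(\beta)\, p(\beta \mid \text{data})\, \mathrm{d}\beta
      \;\approx\; \frac{1}{n} \sum_{i=1}^{n} h(\beta_i),
    \]

where beta_1 through beta_n are the (possibly approximate) MCMC sample points.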
So I think this is a promising approach, but it introduces a number of new challenges for inference. For instance, how do we compare and evaluate the samples that are being produced by these various approximate MCMC procedures? How do we know how good they are? How do we select an approximate MCMC sampler, and how do we tune its hyperparameters? Because there are always hyperparameters that you need to tune. And then, how do we quantify this bias-variance trade-off that I just mentioned?

There are standard procedures for MCMC assessment, like effective sample size, trace plots and diagnostics, but they all assume eventual convergence to the target distribution, and they don't account for the additional asymptotic bias that's inherent to these approximate MCMC methods. So in this talk, I'll introduce to you some new quality measures that are suitable for comparing the quality of these approximate MCMC samples. And a bit more generally, we'd like to develop a measure that's suitable for comparing the quality of any two samples that are approximating a common target distribution.

So here's our setup for the talk. We'll be focusing on continuous target distributions P with support on all of R^d, but you can relax that to any convex subset. We'll say that our targets have a density, little p, that's known up to normalisation, but that integration under the distribution is intractable. To approximate expectations under P, we're going to make use of a sample. We have sample points x_1 through x_n, and together these define a discrete distribution, which is just the empirical distribution that places 1/n mass at each point. We'll call that distribution Q_n, and what we'll do with Q_n is approximate integrals under P, expectations under P, with sample averages under Q_n, in blue.

I should say at this point that I'm not going to make any particular assumption about the provenance of these sample points: they could be samples from your favourite approximate Markov chain Monte Carlo algorithm, but they could also be deterministic points, for instance produced by a quadrature rule, or they could be i.i.d. points from some auxiliary distribution.

OK, so given the sample, our goal is to quantify how well sample expectations approximate target expectations, and we have a few criteria for doing this. First, we want our quality measure, our way of quantifying, to detect when our sample sequence actually is converging to the target. We also want it to detect when the sample sequence is not converging to the target. And we want to do all of this in a way that's computationally feasible, so that you can use this quality measure to assess your Markov chains.
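As a concrete running example of a target whose density is known only up to normalisation, here is a minimal sketch of the unnormalised log posterior, and its gradient, for the Bayesian logistic regression model from the opening. The names V (an L-by-d feature matrix), y (binary labels in {0, 1}) and sigma2 (the prior variance) are illustrative assumptions, not code from the talk; the gradient, the score function, is the only piece of the target that the later constructions will need.

    import numpy as np

    def log_posterior_unnorm(beta, V, y, sigma2=1.0):
        """Unnormalised log posterior log p(beta | data) + const for Bayesian
        logistic regression with an isotropic Gaussian prior of variance sigma2."""
        logits = V @ beta                          # <v_l, beta> for each data point l
        log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
        log_prior = -0.5 * (beta @ beta) / sigma2
        return log_lik + log_prior

    def grad_log_posterior(beta, V, y, sigma2=1.0):
        """Score function: gradient of the unnormalised log posterior in beta."""
        probs = 1.0 / (1.0 + np.exp(-(V @ beta)))  # predicted class-1 probabilities
        return V.T @ (y - probs) - beta / sigma2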
OK, well, given these goals, I think it's natural to consider the family of integral probability metrics, because these are measures that quantify the maximum discrepancy between sample and target expectations over a class of real-valued test functions; call that class H. And when this class is sufficiently large, convergence of your integral probability metric, I'll call it IPM for short, to zero implies that your sample sequence is converging weakly to P. That's good, because it satisfies our second requirement, detection of non-convergence. There are lots of examples that you may be familiar with that fall into this family: the L1 Wasserstein distance, the Dudley metric, even total variation.

But for the Wasserstein distances and the Dudley metric there's a problem, a problem for our criteria. The problem is that the definition of the integral probability metric involves integration under P, and that's exactly what we don't know how to do. So most of these IPMs that you're familiar with can't actually be computed in practice.

So how can we get around this? Well, one idea is that we could consider only test functions h that we know a priori are mean zero under P. If we could do that, then we'd get rid of this integration under P, and the IPM computation would only depend on our sample. That's potentially something tractable, something we can compute. That's an interesting idea, but how do we find test functions that are mean zero under P? And once we find them, how do we know that the resulting IPM will actually track sample sequence convergence in the way we want it to? And then finally, even once we've gotten rid of the expectation under P, we're still stuck with an infinite-dimensional optimisation problem over test functions, so how do we solve that in practice? I'll come back to the third question in a bit, but first I'd like to show you how we can use ideas from Charles Stein's method to answer these first two questions.
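For reference, the integral probability metric and the mean-zero modification just described can be written as follows (standard notation, assumed rather than copied from the slides):

    \[
      d_{\mathcal{H}}(Q_n, P) \;=\; \sup_{h \in \mathcal{H}} \Big| \mathbb{E}_{Q_n}[h(X)] - \mathbb{E}_{P}[h(X)] \Big| ,
    \]

and if every h in the class satisfies \(\mathbb{E}_{P}[h(X)] = 0\), the second term vanishes and the supremum depends only on the sample averages \(\mathbb{E}_{Q_n}[h(X)]\).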
OK, Stein's method. I'm sure many of you are already familiar with this, but Charles Stein, in the 1960s and 70s, produced this recipe for controlling convergence in distribution, and he did it to prove a central limit theorem and to provide rates of convergence for the central limit theorem. I like to think of it as a three-step approach. In step one, you want to identify an operator, we'll call it T, and a set of functions G, with the following property: when you pass functions from your set through your operator, the result is mean zero under your target. So this operator generates mean-zero functions under P, which is great; that's exactly what we want for our improved IPM.

OK, so say we've picked this operator, and we'll pick the set G to be the domain of the operator. Together, T and G define what we'll call a Stein discrepancy, which is an IPM-type measure that has no explicit integration under P, because by design all the test functions are mean zero under P. We'll call this Stein discrepancy S, and it's a function of your sample, the operator you picked and the set you picked. Now, I'll come back to how we actually find such an operator in a moment, but first I want to show you what one would normally do with it in Stein's method.

The second step in Stein's method is typically to lower bound this Stein discrepancy by some reference IPM that you know and trust, like a Wasserstein distance. The purpose there is to show that if your Stein discrepancy goes to zero, then your faithful IPM is also going to zero, which would mean that if your Stein discrepancy goes to zero, you know that your sample sequence is converging to P. That's great: it solves our second requirement, detection of non-convergence. Now, typically this requires analysis, but it can be performed once, in advance, for large classes of target distributions, and then you can go off and happily use your Stein discrepancy knowing that it controls convergence in this way.

The third step in Stein's method is typically to upper bound the Stein discrepancy, by any means necessary, to demonstrate convergence to zero. This would satisfy our first requirement. Now, this for us is probably the least interesting step, because in practice what we're going to do is actually compute the Stein discrepancy and see how big it is. But it would be nice to know, a priori, that if your sample is converging to P, your Stein discrepancy will indeed go to zero, so I'll give you some results of that flavour as well.

Now, the standard use for this recipe is as an analytical tool to prove convergence in distribution, to prove central limit theorems and establish rates for central limit theorems. But our goal here is to develop it into a practical quality measure that you can use to assess, for instance, approximate MCMC. To do that, I'm going to have to give you a very specific instantiation of this otherwise abstract recipe. So let's do that.
Let's first pick one of these operators; we'll call it a Stein operator T. Remember, we want to find a T that generates mean-zero functions under our target distribution P. To do this, we're going to appeal to a beautiful idea due to Andrew Barbour called the generator method. The idea is that if you can find a Markov process that has P as its stationary distribution, then under very mild conditions the infinitesimal generator of that process generates mean-zero functions under P. Now, I've given you the definition of an infinitesimal generator here, but we don't actually need that definition; we're only going to focus on one specific instance, which I've written in the green box at the bottom. If we pick a particular Markov process, the Langevin diffusion targeting P, then I've written down the form of its generator, and from the generator we're going to produce an operator. Here's the Stein operator we're going to use, called T_P, the Langevin Stein operator. The main thing I want to highlight about this operator is that it depends on the target P only through the gradient of its log density. So this is an operator that can be computed even if p can't be normalised, even if you don't know that normalisation constant. Those familiar with Stein's method might also see this as a multivariate generalisation of the density method operator due to Stein.

OK, so that's going to be our working Stein operator for most of the talk. We need to know whether it produces mean-zero functions, and for that we need to actually choose which input functions g to pass through it. It turns out that one suitable set is the set I've called the classical Stein set, which is just a set of functions that are bounded, that have bounded gradients and that have Lipschitz gradients. If you pass any of these functions through your operator, the result will be mean zero, which is exactly what we want. So now we've picked our operator T and we've picked our Stein set G.

Can I just ask, what does the star mean on top of the norm? Oh yes, sorry. This set is parametrised by some norm, which could be whatever norm you want, your favourite norm, and then this norm-star just represents the dual norm of that norm. For most of the talk you should think of the norm as the Euclidean norm, and then the norm-star is also the Euclidean norm, so you can basically ignore the star; but more generally it's the dual norm of the norm you've chosen. Okay, thank you.

OK. So together, this choice of operator T and Stein set G produces a Stein discrepancy.
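Concretely, in my own notation (which may differ slightly from the slides), the Langevin Stein operator acts on a vector-valued function g as below, and the Stein discrepancy is the worst-case sample expectation over the chosen Stein set:

    \[
      (\mathcal{T}_P g)(x) \;=\; \langle \nabla \log p(x),\, g(x) \rangle + \langle \nabla,\, g(x) \rangle
      \;=\; \frac{1}{p(x)} \langle \nabla,\, p(x)\, g(x) \rangle ,
      \qquad
      S(Q_n, \mathcal{T}_P, \mathcal{G}) \;=\; \sup_{g \in \mathcal{G}} \big| \mathbb{E}_{Q_n}[(\mathcal{T}_P g)(X)] \big| .
    \]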
For the purposes of the talk, I'll call that the classical Stein discrepancy. And now we need to know: does this Stein discrepancy actually detect convergence and non-convergence in the way that we want? What does that mean? It means: is it the case that this Stein discrepancy goes to zero if and only if your sample sequence converges to P?

Now, in the univariate case, this is actually a well-known result for many targets: it's known that for many targets this Stein discrepancy goes to zero only if the Wasserstein distance goes to zero, and here are some references that provide results of that flavour. But in the multivariate case, relatively few targets have been analysed. There's been substantial work on multivariate Gaussians, there's been some work on the Dirichlet distribution and a few other targets, but many targets haven't been assessed in the same way.

So to extend the breadth of the literature, my collaborators and I proved the following result. It says that if the Langevin diffusion underlying the operator we picked, the Langevin diffusion of your target, couples at an integrable rate, which basically means that it mixes quickly, and the gradient of your log density is Lipschitz, then we have exactly what we want: your Stein discrepancy goes to zero if and only if the Wasserstein distance goes to zero. So what are examples of distributions that satisfy these properties? Well, any strongly log-concave distribution falls into this family, including the Bayesian logistic regression with Gaussian priors that I mentioned at the beginning of the talk. But it also covers non-strongly-log-concave distributions, like robust Student's t regression with Gaussian priors, or even Gaussian mixtures, which are multi-modal. Now, I should say that this class is one class for which we can provide such a result, but these conditions are definitely not necessary. We hope that this sort of result will provide a template for bounding Stein discrepancies for other classes as well, so that we can expand the reach of this Stein discrepancy approach.

OK, so let me give you a simple example of using this Stein discrepancy in practice, the simplest possible example. Here your target is just going to be a standard normal distribution, and you're going to observe samples either from the standard normal or from a Student's t distribution with matching mean and variance. So if you're assessing these two samples, what would you like to see?
What you'd like to see is that as my sample size grows and I'm getting a sample from the correct distribution, my Stein discrepancy goes to zero. That's what you're seeing in this red line: as a function of your sample size, you're seeing the Stein discrepancy value that was computed, and indeed this is going to zero at a one-over-root-n rate. But now, if you're receiving draws from the wrong distribution, you'd like your Stein discrepancy to be bounded away from zero, and that's what you're seeing in this other line. As I observe draws from my Student's t, at first the Stein discrepancy decreases, but eventually it bottoms out, because you're not getting a better approximation to the normal.

OK, so that's something: we get the value back from the Stein discrepancy. But it turns out that on top of recovering the value, you can also recover the function g that optimises the Stein discrepancy, the worst-case function, and I'm showing you those worst-case functions here on the right. I think even more interpretable than a worst-case function g is the function after you've passed it through your Stein operator, because that becomes the test function that should have been mean zero under the target, the Gaussian, but isn't mean zero under your sample. It's the function with the worst-case violation, the mean that's farthest from zero. Now, I don't want to over-interpret any of these examples; I'll just give you a little indication of what you can read off from such a function h. In the bottom right corner here, you see the worst-case test function for our Student's t sample, and what you see is that there's a lot of mass, a lot of large function values, in the tails. This is because the Student's t is oversampling the tails relative to a Gaussian, and so that's where you can most exploit the non-Gaussianity of your sample.

OK. Here's a more interesting example. This is a posterior inference example, so our target now is going to be a posterior density formed as the product of a prior and a number of likelihood terms from our L data points. In this case, we're going to be assessing a particular algorithm called stochastic gradient Langevin dynamics, which is an approximate MCMC algorithm. It does a random walk through the space, and the way it does this is by starting at a certain point and taking a step in the direction of the gradient of your log density.
But if your dataset is large, computing the gradient of the log density is expensive, so it approximates that using a mini-batch: it randomly subsamples a set of your points and uses a stochastic gradient instead of the full gradient. Then it adds some Gaussian noise, and that becomes the next step in your Markov chain.

Now, you might see this as an approximation to a few other algorithms with which you might be familiar. One is called the Metropolis-adjusted Langevin algorithm, and this approximates that in two ways. One, instead of using the actual gradient of your log density, it uses this stochastic gradient. And two, it throws away the Metropolis-Hastings correction. The correction is usually employed to ensure that your Markov chain is indeed targeting your target distribution, that it has the target as its stationary distribution. But computing the Metropolis-Hastings correction is usually very slow for large datasets, so this algorithm just throws away the Metropolis-Hastings correction and hopes for the best. This is what makes it approximate MCMC.
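To make the update concrete, here is a minimal sketch of one stochastic gradient Langevin dynamics step, assuming user-supplied grad_log_prior and per-datum grad_log_lik functions; the names, signatures and the exact step-size convention are illustrative assumptions rather than the talk's own code.

    import numpy as np

    def sgld_step(theta, data, grad_log_prior, grad_log_lik, eps, batch_size, rng):
        """One SGLD step of size eps using a random mini-batch of the data."""
        N = len(data)
        idx = rng.choice(N, size=batch_size, replace=False)
        # Stochastic gradient of the log posterior: prior term plus a rescaled
        # mini-batch estimate of the sum of likelihood gradients.
        grad = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik(theta, data[i]) for i in idx)
        # Gradient step plus injected Gaussian noise; note that there is no
        # Metropolis-Hastings correction.
        return theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=theta.shape)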
Now, when we're doing approximate MCMC, a number of things matter a lot, and one of those things is the step size, epsilon: how large a step you take in your stochastic gradient direction. If your step size is very small, then after a fixed amount of sampling time you'll barely have moved from your starting point. That's slow mixing, so that's bad: it's going to give you a bad approximation to the target, and I've drawn the target here, in terms of its equi-density contours, for reference. On the other hand, if your step size is too large, then you'll have fast mixing, but you'll be mixing very quickly to a very wrong distribution; here you see that the sample is greatly over-dispersed relative to the target. So this is a problem: you'd like to pick something in between. How do you do that?

Well, you could try to use a standard selection criterion like effective sample size. This is commonly used to assess pilot chains for MCMC and to pick hyperparameters. But in this case, effective sample size is really just a measure of variance; it's a measure of autocorrelation amongst your sample points. Here I've assessed the effective sample size for different values of my step size parameter epsilon, and it basically just says: pick the biggest epsilon you can, because the variance looks the best. So it says, pick this one. And the reason why it picks this one is that it has no way of accounting for bias. It doesn't know that this is targeting the wrong distribution asymptotically, so it can't account for the fact that this is over-dispersed.

An alternative would be to use the Stein discrepancy that I just mentioned. Here I'm showing you the Stein discrepancy values computed for each step size, where lower is better. The Stein discrepancy says: a very small epsilon is giving you a bad approximation, a very large epsilon is giving you a bad approximation, but something in between is giving you the best, so pick this value. And the value that it picks is this one in the middle, which you can see from these plots is certainly the best approximation to our target.

OK, so one thing I didn't mention about the Stein discrepancy that I just evaluated for you was how I computed it. It turns out the way I computed it is by solving a linear programme; there's a particular programme you can solve that gives you back the value of the Stein discrepancy. That's great, but maybe you don't want to solve a linear programme every time you have to evaluate your sample; you might find that cumbersome or too slow. So now what I'm going to do is identify an alternative Stein set that is a bit more user-friendly, just easier to use than the one I just described. To do this, I'm going to appeal to reproducing kernels, and I want to build upon an idea of Chris Oates, Girolami and Chopin, who used mean-zero reproducing kernels to improve Monte Carlo integration.

So, just a quick background on reproducing kernels. A kernel k: you can think of it as a function with two arguments. It's symmetric in those two arguments, and it's positive semi-definite, meaning that if I took any points from my sample space and formed the matrix of pairwise evaluations of that function, then that matrix is always a positive semi-definite matrix. What are some examples? Probably the most common example is what's called the Gaussian kernel, which just looks like the density of a Gaussian distribution, and that will be appearing in our talk in a second. Just to give you a second example, here's an inverse multiquadric kernel, which looks a lot like the Gaussian kernel except that its decay is polynomial instead of squared-exponential.
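In one common parameterisation (conventions vary, and this is assumed rather than copied from the slides), the two kernels are

    \[
      k_{\mathrm{Gauss}}(x, y) = \exp\!\Big( -\tfrac{\|x - y\|_2^2}{2\sigma^2} \Big),
      \qquad
      k_{\mathrm{IMQ}}(x, y) = \big( c^2 + \|x - y\|_2^2 \big)^{\beta}, \quad c > 0, \ \beta < 0 ,
    \]

and the convergence result discussed later uses the IMQ kernel with the exponent beta strictly between minus one and zero.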
What we actually want from such a kernel is the space that it generates: every kernel generates what's called a reproducing kernel Hilbert space, and we're going to use that to choose a different Stein set to pass through our operator. In particular, we're going to use this Stein set, which some call the kernel Stein set. The functions we've been using throughout the talk are actually vector-valued; they have d components. So this is a set of vector-valued functions such that every component belongs to your RKHS, your reproducing kernel Hilbert space, and the norms of those components are jointly bounded by one. So every function component belongs to the RKHS, and then you put all the norms together and they're bounded by one. That's the Stein set.

What's good about this set? Well, the very nice property that this set has is that the Stein discrepancy it induces, when you plug this set into your Stein discrepancy, has a closed-form solution. To form that solution, all you have to do is take your input kernel k and apply the Stein operator to each argument of k. That gives you back a new kernel, which is what we call a Stein kernel. Then you sum up those Stein kernel evaluations across every pair of sample points. So the resulting kernel Stein discrepancy is just a sum of pairwise evaluations of this new Stein kernel that's formed by applying your Stein operator to your base kernel. That's pretty cool: it's a closed form. It does take quadratic time in your sample size, but it's also very parallelisable, so you can carry out all of the pairwise evaluations in parallel on a cluster or across cores.
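Here is a minimal NumPy sketch of this closed form for the IMQ base kernel above combined with the Langevin Stein operator; the score function grad_log_p, the parameter defaults and the variable names are illustrative assumptions, not the talk's code.

    import numpy as np

    def imq_ksd(X, grad_log_p, c=1.0, beta=-0.5):
        """Kernel Stein discrepancy of the sample X (n x d) against the target
        whose score function (gradient of the log density) is grad_log_p."""
        n, d = X.shape
        S = grad_log_p(X)                                 # (n, d) scores at the sample points
        diff = X[:, None, :] - X[None, :, :]              # (n, n, d) pairwise x_i - x_j
        r2 = np.sum(diff ** 2, axis=-1)                   # squared pairwise distances
        base = c ** 2 + r2
        k = base ** beta                                  # IMQ kernel values k(x_i, x_j)
        grad_x_k = 2.0 * beta * base[..., None] ** (beta - 1.0) * diff  # gradient in the first argument
        # Trace of the mixed second derivative grad_x . grad_y k(x, y).
        trace_term = (-4.0 * beta * (beta - 1.0) * base ** (beta - 2.0) * r2
                      - 2.0 * beta * d * base ** (beta - 1.0))
        # Stein kernel k_0(x_i, x_j): the base kernel with the Langevin Stein
        # operator applied to each argument.
        k0 = (trace_term
              + np.einsum('ijl,jl->ij', grad_x_k, S)      # <grad_x k, score(x_j)>
              - np.einsum('ijl,il->ij', grad_x_k, S)      # <grad_y k, score(x_i)>
              + k * (S @ S.T))                            # k(x_i, x_j) <score(x_i), score(x_j)>
        return np.sqrt(np.sum(k0)) / n                    # sqrt of the double sum over n^2 pairs

As a quick sanity check, with X drawn i.i.d. from a standard normal and grad_log_p = lambda X: -X, the returned value should shrink towards zero as the sample size grows.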
So now we have another Stein discrepancy, and we have the same questions. Does it detect non-convergence, for instance? Is it the case that your sample sequence converges to P whenever your Stein discrepancy converges to zero?

Well, it turns out that in the one-dimensional case this is true in pretty broad generality. If you pick any of your favourite kernels, like the Gaussian kernel or the inverse multiquadric, and your target distribution belongs to basically the same set of distributions that we considered before (I've changed the name, but it's basically the same set as before), the script P, which is the set of targets with Lipschitz gradient of the log density that satisfy distant strong log-concavity. That's a property that's kind of like being strongly log-concave, but it's only a requirement on the tails, so it doesn't have to hold locally; even Gaussian mixtures that are multi-modal satisfy this property. If P belongs to this set of target distributions and you pick your favourite, let's say universal, kernel, then your sample sequence converges to P whenever your Stein discrepancy goes to zero. So in terms of targets covered, it's all the same targets from before: Bayesian logistic regression, Student's t regression with Gaussian priors, Gaussian mixtures with common variances, they all fall into this class. And this also justifies the use of the KSD with popular kernels like the Gaussian kernel, another popular one called the Matérn kernel, and even the inverse multiquadric that I mentioned before.

But this result is a little bit limited, because it's only for one-dimensional targets, and often when we're dealing with MCMC we're working with multidimensional targets. So what can we say in that case? Well, it turns out that in higher dimensions many KSDs break down: they actually fail to detect non-convergence, even when you're working with a very easy target like a Gaussian. Here's the result. It says that if you're in dimension three or greater and the kernel you pick has a very rapidly decaying tail, so a Gaussian kernel, a Matérn kernel which decays exponentially, or one of these inverse multiquadric kernels that decays polynomially but where the polynomial decay is too quick, then I can give you a sample sequence that will send your Stein discrepancy to zero but does not converge to P. That's actually really bad, because it means you can't trust your Stein discrepancy. In fact, the sample sequence that I construct won't converge to anything: it just places a bunch of mass on the surface of a sphere and then shoots the radius of the sphere off to infinity. So your sample is not converging to anything, it's just shooting all the mass off to infinity, and yet your Stein discrepancy is still going to zero, so you shouldn't trust it in this case.

So when you say not converging, this is convergence in distribution, right? Yeah, that's right, convergence in distribution, to anything. There's also convergence on compact sets, of course. Yes; I mean, the real issue here is that you're sending all of your mass off to infinity, all of it, and so it's essentially converging to the zero measure. So in that sense, I suppose, it does converge, because on any compact set what you're left with is nothing. Yes, but not to any probability distribution; it won't converge to any probability distribution. That's true, yes, and that's, I think, the most important part.
But yes, it is converging to the zero measure, in that sense of convergence. And that is really the whole problem, actually. For the Stein discrepancy to work in the way you want it to, you have to be able to distinguish the zero measure from P, because the zero measure also makes these test functions mean zero, so to speak; you have to be able to distinguish those two things. And the problem is that if your kernel is decaying too quickly, these very light tails ignore the excess mass that you're sending into the tails, so the discrepancy doesn't notice that you're shooting all the mass off to infinity. Essentially, you're not controlling tightness.

So that's a problem. What can we do? Well, I just said the problem seems to be these light tails that are ignoring mass at infinity, so what if we use heavier tails? It turns out that that is enough to solve the problem. If you use one of these inverse multiquadric kernels, these polynomially decaying kernels, but decaying slowly enough, if you choose the decay exponent to be between minus one and zero, then everything works. If your distribution belongs to our standard class and you use this IMQ kernel, then indeed, if your Stein discrepancy goes to zero, your sample sequence is converging weakly to P, in any dimension.

OK, so we have a solution. Now, what about the other side? What about detecting convergence? We want to show that if my sample sequence is converging to P, then my Stein discrepancy is also going to zero. Well, it turns out that this is true in pretty broad generality. If your kernel is bounded and twice continuously differentiable, and your grad log p is Lipschitz and square-integrable under P, which are pretty mild conditions, then your Stein discrepancy will go to zero whenever the Wasserstein distance goes to zero. So if your Q_n is converging to P in Wasserstein distance, your Stein discrepancy will also converge to zero. And this covers all of your favourite kernels, in any dimension.

OK, so let me give you an example of using this kernel Stein discrepancy. Now we're going to consider another one of these approximate MCMC procedures that was designed for scalability. This one is called stochastic gradient Fisher scoring, or SGFS for short, and in the initial paper there are two different variants of this algorithm proposed: one that is more of an exact variant, which involves inverting a d-by-d matrix, and another that tries to reduce computational effort by inverting a diagonal matrix instead.
We're going to operate on a fifty-one-dimensional problem in this case, a posterior distribution over the parameters of a Bayesian logistic regression, using MNIST handwritten digit images.

OK, so here's the result. Basically, we're comparing two different versions of this approximate MCMC algorithm, in teal and red. If you apply the Stein discrepancy, you see that across sample sizes the full version, the f version, just seems to be better: it has a lower Stein discrepancy value. This would suggest that you should always use that version, unless the speed-up from using the diagonal version is very large. But it turns out that the speed-up from using the diagonal version is quite mild; the difference is something like 0.017 seconds versus 0.009 seconds, not much of a gain. So this would suggest that the speed-up you're getting from the diagonal version is not worth the quality being lost.

But that's just the output of this discrepancy measure; you might wonder, what does that have to do with reality? So to give you an external check on this assessment, what I've done is run a very long and trustworthy Markov chain to generate a ground-truth sample from the posterior. In this case, I believe it's either Hamiltonian Monte Carlo or the Metropolis-adjusted Langevin algorithm; so this is a Markov chain that actually is targeting the distribution, run for a very long time, and I've used that as a surrogate ground truth for the true posterior. Then I'm comparing the means and covariances produced by our approximate MCMC algorithms. In the left column here, I'm showing you the best mean and variance approximation across pairwise marginals, and in the right column the worst. On the top we have the diagonal version of the algorithm, and at the bottom we have the full version. What you basically see is that even in the best case, the diagonal version is doing a poor job of approximating pairwise marginal variances, and even in the worst case, the full version of the algorithm is doing a great job of approximating those variances. So this is just an external check on the quality of these two approximate MCMC algorithms, and indeed, just by looking at pairwise marginals, you can see that the full version is doing much better than the diagonal version.
The benefit, though, of using this Stein discrepancy over what I've just done on the right is that you don't need this extra auxiliary surrogate ground-truth chain to perform your assessment.

OK. So far I've focused wholly on sample quality comparison, and I'd like to highlight a few other applications of these techniques. As an example, a few groups have proposed using kernel Stein discrepancies like the one I just described to perform goodness-of-fit testing. In particular, Chwialkowski et al. used the KSD with a Gaussian kernel to carry out a variety of goodness-of-fit testing experiments. The authors of this work, right, Arthur is part of that too, Arthur Gretton; it's a great idea. One thing they found was that if you use a default Gaussian kernel, you could experience a considerable loss of power as your dimension d increases. And I think this is in part related to the problem that we highlighted before, with a Gaussian kernel being unable to distinguish the target from the zero measure in higher dimensions. So what we're going to do is just repeat their experiment, plugging in the IMQ kernel for the Gaussian kernel.

OK, so here's the setup. Your target P is a multivariate standard Gaussian, you're seeing a sample that's not Gaussian, and you want to test whether it is Gaussian or not. Specifically, the sample is formed by spiking one of the components with a uniform random variable, and we want to compute the power of this test for detecting this departure from normality. They also compare with the standard normality test of Baringhaus and Henze. So what do we see? Well, we see that in low dimensions all of these tests, the Gaussian KSD and the Baringhaus-Henze test, do quite well: they have full power at detecting these departures. But as the dimension increases, you see that the power of the Gaussian test decays very rapidly; by the time you're in dimension twenty-five, you've lost all of your power. But if you just substitute this IMQ kernel with polynomial decay, the one we described that controls convergence, for the Gaussian kernel, you actually maintain full power across the range of dimensions.
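As an illustration only, here is a hedged sketch of a simple Monte Carlo goodness-of-fit test built on the imq_ksd sketch above, for the case where the null target can be sampled directly, as in this normality experiment. The published tests are instead calibrated with a wild bootstrap of a U-statistic, which this sketch does not reproduce; sample_from_p and the other names are illustrative assumptions.

    import numpy as np

    def ksd_gof_pvalue(X, grad_log_p, sample_from_p, n_null=200, seed=0):
        """Monte Carlo p-value for H0: the rows of X were drawn from the target P."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        observed = imq_ksd(X, grad_log_p)
        # Reference distribution of the statistic under H0, using fresh draws from P.
        null_stats = np.array([imq_ksd(sample_from_p(n, rng), grad_log_p)
                               for _ in range(n_null)])
        return (1 + np.sum(null_stats >= observed)) / (n_null + 1)

For the experiment described here, sample_from_p would return n standard normal vectors and grad_log_p would be the map x to minus x.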
OK, so another thought: we've looked at assessing sample quality, but what if we could also improve the quality of our sample? You've generated a sample somehow, and you want to produce a better approximation to your target. Well, one way of doing this might be to assign a weight to each of your sample points and then optimise those weights to minimise your kernel Stein discrepancy. This is the idea in a paper by Liu and Lee in 2016, and they again do this with a Gaussian kernel; we're going to show some benefits from doing this with an IMQ kernel instead. What I'm showing you here is the mean squared error between the true means and the estimated means, using either a weighted sample or the original equally weighted sample. Your initial sample is just an i.i.d. draw from your target. Then we learn a reweighting using a Gaussian KSD, a Gaussian kernel Stein discrepancy. If you do that, you see uniform improvement in approximation quality across dimensions, from dimension two up to one hundred. And then if you use an IMQ kernel instead, you see yet further improvements in approximation quality.

OK. So in this last example we were improving sample quality by reweighting points. You might also imagine that you could improve sample quality by shifting the locations of your points: you start with an initial sample and then you move those points around to improve the quality. This is the idea of Stein variational gradient descent, due to Liu and Wang. Essentially, it uses kernel Stein discrepancy directions, like the worst-case test functions from the kernel Stein discrepancy problem, to update the locations of the points. It turns out that you can also view this update as an approximate gradient step on the KL divergence between your sample approximation and your target distribution. Now, the convergence of the algorithm is not completely clear, but it's very simple to implement and seems to work well. One other possible downside is that the update costs n-squared time if you're updating all of your points on every step.
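Here is a minimal sketch of one Stein variational gradient descent update with a Gaussian kernel and a fixed bandwidth, to make the form of the update and its n-squared cost visible; the score function grad_log_p, the step size and the bandwidth handling (a median heuristic is common in practice) are illustrative assumptions.

    import numpy as np

    def svgd_step(X, grad_log_p, step=0.1, sigma=1.0):
        """X: (n, d) particle locations; returns the updated particles."""
        n, d = X.shape
        S = grad_log_p(X)                               # scores at the particles
        diff = X[:, None, :] - X[None, :, :]            # pairwise x_i - x_j
        K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
        # grad_{x_j} k(x_j, x_i) = K[i, j] * (x_i - x_j) / sigma^2 for the Gaussian kernel
        grad_K = K[..., None] * diff / sigma ** 2
        # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
        phi = (K @ S + np.sum(grad_K, axis=1)) / n
        return X + step * phi

Every call touches every pair of particles, which is the n-squared cost per update mentioned above.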
Another, related approach uses your Stein discrepancy to greedily select the next points. You start with no approximation, pick the best single-point approximation to your target, the one that minimises the Stein discrepancy, and then repeatedly add in additional points to improve that approximation. That's another way of selecting de novo points to approximate your target distribution, different from just sampling or using MCMC. And we can show that this actually sends your KSD to zero at a particular rate, a root-log-n-over-n rate, so you can prove convergence of this algorithm. One possible downside is that to pick each new point you have to solve an optimisation problem: you have to optimise over all of R^d to pick the next point x_n, which could be expensive or difficult. So we proposed a modification which replaces that expensive optimisation with sampling from a short Markov chain. It turns out that if you have a Markov chain that's mixing quickly anyway, then instead of optimising over all of the space, you can just generate a short Markov chain, pick the best point from the iterates of that chain, and use that as the next point in your representation; and it maintains the same rate of convergence.

So here's an example of what those sample points can look like. If you're doing MCMC, typically the lack of approximation quality comes from clumping, clumping that comes from randomness: you'll see places where you're oversampling, and also large gaps, just due to randomness. Whereas the Stein point MCMC algorithm typically spaces out your points to minimise redundancy, and that gives you a more compact representation.

So I'm showing you some examples of this algorithm against some alternatives. This one is from a kinetic model of enzymatic control: a ten-dimensional posterior distribution over the model parameters, where the likelihood involves solving an ODE, so you'd like to minimise the number of those likelihood evaluations. What I'm showing you is, as a function of the number of likelihood evaluations, how good my approximation to the target distribution is using a variety of different particle generation procedures: comparing Markov chain Monte Carlo algorithms, to SVGD, which appears in red, to the Stein point MCMC algorithms, which are the solid purple and blue lines at the bottom. The dashed lines are MCMC, red is SVGD. Essentially, what you're seeing happen with SVGD is that because every update takes n-squared time, it progresses slowly; eventually it's going to converge to a good solution, but it just takes much longer than these other procedures. MCMC does relatively well, but you can actually get to better quality more quickly using these Stein point MCMC procedures. Here's another example from a financial time series; I'll just skip over that.
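To make the greedy selection described above concrete, here is a hedged sketch that picks each new point from a finite candidate set by minimising the kernel Stein discrepancy of the augmented sample, reusing the imq_ksd sketch from earlier; drawing the candidates for each step from the iterates of a short Markov chain gives the flavour of the Stein point MCMC variant. All names are illustrative, and a real implementation would update the Stein kernel sums incrementally rather than recomputing the KSD from scratch.

    import numpy as np

    def greedy_stein_points(candidates, grad_log_p, m):
        """Greedily pick m points (repeats allowed) from the rows of `candidates`
        so that the kernel Stein discrepancy of the growing sample is minimised."""
        chosen = []
        for _ in range(m):
            best_ksd, best_x = np.inf, None
            for x in candidates:
                trial = np.vstack(chosen + [x]) if chosen else x[None, :]
                ksd = imq_ksd(trial, grad_log_p)
                if ksd < best_ksd:
                    best_ksd, best_x = ksd, x
            chosen.append(best_x)
        return np.vstack(chosen)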
OK, so I've highlighted a few applications and some of the technology that's useful, and I wanted to end with some opportunities for future development, areas where I feel like more work is still needed.

First, improving the scalability of Stein discrepancies while maintaining convergence control. The Stein discrepancy itself involves the gradient of your log density; when that gradient is too expensive, can you still monitor convergence in a way that's both computationally feasible and theoretically sound? My collaborators and I have started doing some investigation into this. We have some new work this year called stochastic Stein discrepancies, which essentially replaces the gradient of the log density with a random stochastic gradient, and we show that if you do this, you can still control convergence with probability one. So that's pretty exciting, but I think there's still much more to be explored in that space.

If you're using one of these kernel Stein discrepancies, I mentioned that the expense is quadratic in your sample size, and that might be prohibitive if your sample size is quite large. There's been a good bit of work now on trying to improve that. Arthur and collaborators have work on what are called finite set Stein discrepancies, which essentially use low-rank kernels and provide linear runtime, but there are still questions around the convergence control of these discrepancy measures. With Jonathan Huggins, I've worked on a different type of Stein discrepancy, some of which are actually kernel Stein discrepancies and some of which are related but use different function classes; they're called random feature Stein discrepancies. You can think of these as providing essentially stochastic low-rank kernels that have near-linear runtime but also provide high-probability convergence control. But there are still some caveats to establishing that convergence control: we needed to assume something about the moments of the approximating samples, that they be uniformly bounded, and it would be nice to get away from that requirement and just have something that works all the time.

Now, I also proposed to you this particular operator that came from the Langevin diffusion, called the Langevin operator, but actually an infinite number of operators characterise the target distribution. So how do we know which one we should pick? What is the best operator, how does that impact your discrepancy, and what's practical? These are all open questions, but I can give you an example of why this matters.
446 00:44:41,730 --> 00:44:46,050 I gave you a class of distributions for which the Stein discrepancy definitely does control convergence, 447 00:44:46,050 --> 00:44:51,990 but here's an example of distributions for which it does not. Suppose you're dealing with, say, a Student's t distribution, 448 00:44:51,990 --> 00:44:59,340 so the gradient of your log density is bounded, and you're using any C0 kernel, like the Gaussian kernel or the 449 00:44:59,340 --> 00:45:06,870 IMQ kernel, anything standard that vanishes at infinity. Then your Stein discrepancy will not detect non-convergence, will not control convergence. 450 00:45:06,870 --> 00:45:15,440 In particular, I can always produce a sample that will send your Stein discrepancy to zero, even though the sample is not converging to P. 451 00:45:15,440 --> 00:45:21,350 That's bad. It's the same sort of problem that we saw in low dimensions for certain kernels. 452 00:45:21,350 --> 00:45:29,150 But now this is happening for all of these kernels, all your favourite kernels, in every dimension. 453 00:45:29,150 --> 00:45:35,430 So what can we do about this? Well, some of my collaborators and I, like Andrew Duncan, Sebastian Vollmer and Jackson 454 00:45:35,430 --> 00:45:40,150 Gorham, have been investigating alternative operators called diffusion Stein operators. 455 00:45:40,150 --> 00:45:43,430 The idea here is that instead of using the Langevin diffusion, 456 00:45:43,430 --> 00:45:52,540 you use a different diffusion, and you can pick the diffusion in a way that counteracts the decay, or lack of growth, 457 00:45:52,540 --> 00:45:58,440 of the gradient of the log density, so that you can still detect convergence even if you're dealing with a heavy-tailed distribution. 458 00:45:58,440 --> 00:46:06,880 So that might be one solution, but I think there are still lots of open questions there. 459 00:46:06,880 --> 00:46:15,700 Some groups have been using these Stein discrepancy ideas to train generative models like generative adversarial networks and 460 00:46:15,700 --> 00:46:26,400 variational autoencoders, so that you can generate new instances from different data distributions. 461 00:46:26,400 --> 00:46:32,580 So in this case, it's generating new celebrity faces. 462 00:46:32,580 --> 00:46:40,800 You can also use these ideas to develop algorithms with provable finite-sample guarantees for optimisation, 463 00:46:40,800 --> 00:46:47,490 even non-convex optimisation. And the basic idea here is that if you want to minimise a particular function 464 00:46:47,490 --> 00:46:53,010 f, you're going to do this by trying to sample from what's called a Gibbs distribution. 465 00:46:53,010 --> 00:47:00,100 You're going to take a target distribution and make it e to the minus f, or e to the minus f times some scalar. 466 00:47:00,100 --> 00:47:02,730 And you ask, can I sample from that distribution? 467 00:47:02,730 --> 00:47:08,220 And if you can sample from that distribution, and you can increase the size of that scalar at the right rate, 468 00:47:08,220 --> 00:47:15,410 then you can eventually end up minimising that function. And we found that even for a class of functions 469 00:47:15,410 --> 00:47:19,610 that are not convex but satisfy this other property called distant dissipativity, 470 00:47:19,610 --> 00:47:33,330 we can actually design Markov chains that will minimise the function with an explicit rate of convergence.
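Here is a minimal sketch of that sampling-for-optimisation recipe: run Langevin dynamics targeting the Gibbs distribution proportional to exp(-beta * f) while slowly increasing the inverse temperature beta, so that the distribution concentrates on the minimisers of f. The function name, geometric schedule, and step size below are arbitrary illustrations; the guarantees mentioned in the talk rely on carefully chosen rates and, for heavy-tailed cases, a modified diffusion rather than this plain Langevin update.

```python
import numpy as np

def gibbs_annealed_minimise(grad_f, x0, n_iters=5000, step=1e-3,
                            beta0=1.0, growth=1.002, seed=0):
    """Approximately minimise f by sampling from p_beta(x) ∝ exp(-beta * f(x))
    with unadjusted Langevin dynamics while growing beta.  The score of
    p_beta is -beta * grad_f(x); we use step size step/beta, so the noise
    shrinks as beta grows and the chain settles near the minimisers of f."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    beta = beta0
    for _ in range(n_iters):
        # Langevin update for exp(-beta * f) with step size step/beta.
        x = x - step * grad_f(x) + np.sqrt(2.0 * step / beta) * rng.standard_normal(x.size)
        beta *= growth  # increase the inverse temperature at some chosen rate
    return x

# Toy usage: f(x) = ||x||^2 has gradient 2x and minimiser at the origin.
x_hat = gibbs_annealed_minimise(grad_f=lambda x: 2.0 * x, x0=np.full(2, 5.0))
```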
471 00:47:33,330 --> 00:47:39,100 And here's one very simple example of such a function. Here's a function that is not convex, 472 00:47:39,100 --> 00:47:45,250 but really just because, instead of growing convexly in the tails, it has concave growth away from the origin. 473 00:47:45,250 --> 00:47:50,410 And so if you use an algorithm like gradient descent, it has very little signal about where the optimum is, 474 00:47:50,410 --> 00:47:55,570 even though it seems clear from this picture that the optimum is just at zero. Eventually you'll get there, 475 00:47:55,570 --> 00:48:00,910 but if you run gradient descent, you're going to proceed very slowly. And this black line is showing that it 476 00:48:00,910 --> 00:48:05,080 takes gradient descent about 10,000 iterations to get close to the optimum here. 477 00:48:05,080 --> 00:48:08,290 If you use the Langevin algorithm to try to sample from the Gibbs 478 00:48:08,290 --> 00:48:12,460 distribution, the Langevin algorithm will actually go off in the wrong direction, because it has 479 00:48:12,460 --> 00:48:17,380 poor mixing for distributions with heavy tails like this. 480 00:48:17,380 --> 00:48:25,030 But if you take our designed diffusion, which is based on rescaling the gradient of the log density 481 00:48:25,030 --> 00:48:35,440 in a particular way, then you converge to the optimum in something like fifteen iterations. 482 00:48:35,440 --> 00:48:40,700 OK, and there are a number of other tasks which people have been exploring using these tools: 483 00:48:40,700 --> 00:48:44,630 parameter estimation in models where maximum likelihood is intractable, 484 00:48:44,630 --> 00:48:49,730 here's a reference for that, and thinning the output of your Markov chain to get a 485 00:48:49,730 --> 00:48:53,600 more compressed representation of a posterior distribution, here's a reference for that, 486 00:48:53,600 --> 00:49:00,210 too. And I'll leave you with a slide of future directions and open questions, so that we have something to talk about. 487 00:49:00,210 --> 00:49:02,537 So thank you for your time.