1
00:00:00,990 --> 00:00:05,670
So the motivation for this work comes from large-scale posterior inference.
2
00:00:05,670 --> 00:00:10,950
And to illustrate what I mean by that, I want you to consider this very simple example of logistic regression.
3
00:00:10,950 --> 00:00:15,940
So you have a data set with L data points. Each data point has a feature vector.
4
00:00:15,940 --> 00:00:21,660
I'll call it v_l. It's a d-dimensional real-valued feature vector, and it has a binary class label.
5
00:00:21,660 --> 00:00:30,120
And the feature vector is related to the class label through an unknown parameter vector that governs the mean of the class label.
6
00:00:30,120 --> 00:00:40,410
We'll call that parameter vector beta. Now, if you place a prior on that parameter vector, this becomes a Bayesian logistic regression model.
7
00:00:40,410 --> 00:00:45,810
And one thing I want to highlight about this generative model is that it's pretty simple to express.
8
00:00:45,810 --> 00:00:52,080
But even in this simple case, the posterior distribution over the unknown parameters is complex.
9
00:00:52,080 --> 00:00:57,170
We don't know the normalisation constant, and exact integration is intractable.
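To make the setup concrete, here is a minimal sketch of the unnormalised log posterior for Bayesian logistic regression with a Gaussian prior (my own illustration, not code from the talk; the feature matrix V, labels y and prior variance are assumed placeholders):

```python
import numpy as np

def log_posterior(beta, V, y, prior_var=1.0):
    """Unnormalised log posterior: Gaussian prior + Bernoulli likelihood.

    V: L x d feature matrix, y: binary labels in {0, 1}, beta: parameter vector.
    The normalising constant is unknown, which is what makes exact integration hard.
    """
    logits = V @ beta
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))  # Bernoulli log likelihood
    log_prior = -0.5 * np.sum(beta ** 2) / prior_var          # Gaussian prior, up to a constant
    return log_lik + log_prior
```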
10
00:00:57,170 --> 00:01:06,840
So what does one normally do? Well, one will often reach for Markov chain Monte Carlo, MCMC, to eventually draw samples from the posterior distribution.
11
00:01:06,840 --> 00:01:12,720
And the benefit of this is that you can then approximate these intractable posterior expectations that are written in red.
12
00:01:12,720 --> 00:01:20,870
These integrals with asymptotically exact sample estimates, which are just averages over your sample points, in blue.
13
00:01:20,870 --> 00:01:31,600
But there's also a drawback. And one drawback is that every new MCMC sample point requires iterating through your entire observed dataset.
14
00:01:31,600 --> 00:01:36,870
And this can be pretty prohibitive when your dataset is too large.
15
00:01:36,870 --> 00:01:45,630
So motivating this work is: how do we scale Markov chain Monte Carlo posterior inference to massive datasets?
16
00:01:45,630 --> 00:01:53,670
And over the past decade, a template solution has been emerging, which is approximate MCMC with subsampled posteriors.
17
00:01:53,670 --> 00:01:56,670
The idea here is that instead of running your exact Markov chain,
18
00:01:56,670 --> 00:02:02,570
you'll approximate it in a way that only uses a subset of your data to draw every new sample point.
19
00:02:02,570 --> 00:02:05,690
The benefit is that you have reduced computational overhead.
20
00:02:05,690 --> 00:02:10,640
So it means you have faster sampling and you can reduce your Monte Carlo variance quickly.
21
00:02:10,640 --> 00:02:15,640
But there's also a drawback, which is that it introduces a bias that's asymptotic.
22
00:02:15,640 --> 00:02:22,190
It doesn't go away. And the reason why is that your target distribution is no longer the stationary distribution of your chain.
23
00:02:22,190 --> 00:02:27,230
So even if you sample forever, you would always be sampling from the wrong distribution.
24
00:02:27,230 --> 00:02:31,340
But the hope is that for the fixed amount of sampling time that you have in practice,
25
00:02:31,340 --> 00:02:38,490
the reduction in variance that you've introduced will outweigh the bias. And this will lead to better inference.
26
00:02:38,490 --> 00:02:45,780
So I think this is a promising approach, but it introduces a number of new challenges for inference.
27
00:02:45,780 --> 00:02:51,840
For instance, how do we compare and evaluate the samples that are being produced by these various approximate MCMC procedures?
28
00:02:51,840 --> 00:02:59,350
How do we know how good they are? How do we select an approximate MCMC sampler, and how do we tune its hyperparameters?
29
00:02:59,350 --> 00:03:08,560
Because there are always hyperparameters that you need to tune. And then how do we quantify this bias-variance trade-off that I just mentioned?
30
00:03:08,560 --> 00:03:16,270
So there are standard procedures for MCMC assessment, like effective sample size, trace plots,
31
00:03:16,270 --> 00:03:20,920
and diagnostics. But they all assume eventual convergence to the target distribution.
32
00:03:20,920 --> 00:03:28,570
And they don't account for this additional asymptotic bias that's inherent to these approximate MCMC methods.
33
00:03:28,570 --> 00:03:39,750
So in this talk, I'll introduce to you some new quality measures that are suitable for comparing the quality of these approximate MCMC samples.
34
00:03:39,750 --> 00:03:42,450
And a bit more generally.
35
00:03:42,450 --> 00:03:50,330
We'd like to develop a measure that's suitable for comparing the quality of any two samples that are approximating a common target distribution.
36
00:03:50,330 --> 00:03:54,880
So here's our setup for the talk. We'll be focussing on continuous targets,
37
00:03:54,880 --> 00:03:59,870
target distributions P with support on all of R^d.
38
00:03:59,870 --> 00:04:07,220
But you can relax that to any convex subset. And we'll say that our targets have a density little p that's known up to normalisation,
39
00:04:07,220 --> 00:04:19,060
but that integration under the distribution is intractable. So to approximate expectations under P, we're going to make use of a sample.
40
00:04:19,060 --> 00:04:23,140
So we have sample points x1 through xn.
41
00:04:23,140 --> 00:04:29,470
Together, these define a discrete distribution, which is just the empirical distribution that places 1/n mass at each point.
42
00:04:29,470 --> 00:04:34,560
So we'll call that distribution Qn. And what we'll do with Qn is approximate integrals under P,
43
00:04:34,560 --> 00:04:41,030
expectations under P, with sample averages under Qn, in blue.
44
00:04:41,030 --> 00:04:44,280
And I should say at this point that I'm not going to make any particular assumption about the
45
00:04:44,280 --> 00:04:49,890
provenance of these sample points so they could be samples from your favourite approximate Markov chain,
46
00:04:49,890 --> 00:04:54,030
Monte Carlo algorithm, but they could also be deterministic points, for instance,
47
00:04:54,030 --> 00:05:00,890
produced by a quadrature rule, or they could be i.i.d. points from some auxiliary distribution.
48
00:05:00,890 --> 00:05:09,560
OK, so given the sample, our goal is to quantify how well sample expectations approximate target expectations.
49
00:05:09,560 --> 00:05:16,070
And we have a few criteria for doing this. First, we want our quality measure,
50
00:05:16,070 --> 00:05:22,340
our way of quantifying quality, to detect when our sample sequence actually is converging to the target.
51
00:05:22,340 --> 00:05:27,040
And we also want it to detect when the sample sequence is not converging to the target.
52
00:05:27,040 --> 00:05:37,160
And we want to do all of this in a way that's computationally feasible so that you can use this quality measure to assess your Markov chains.
53
00:05:37,160 --> 00:05:44,690
OK, well, given these goals, I think it's natural to consider the family of integral probability metrics because these
54
00:05:44,690 --> 00:05:51,000
are measures that quantify the maximum discrepancy between sample and target expectations.
55
00:05:51,000 --> 00:05:55,580
over a class of real-valued test functions; call that H.
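Written out, the integral probability metric over a test class H is (in notation I'm assuming, since the slide isn't reproduced here):

```latex
d_{\mathcal{H}}(Q_n, P) \;=\; \sup_{h \in \mathcal{H}} \Bigl| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} h(x_i) \;-\; \mathbb{E}_{P}[h(Z)] \Bigr|
```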
56
00:05:55,580 --> 00:06:01,670
And when this class is sufficiently large, convergence of your integral probability metric, I'll call it IPM for short —
57
00:06:01,670 --> 00:06:08,140
so convergence of your IPM to zero implies that your sample sequence is converging weakly to P,
58
00:06:08,140 --> 00:06:14,040
and that's good because that satisfies our second requirement, detection of non-convergence.
59
00:06:14,040 --> 00:06:17,130
And there are lots of examples that you may be familiar with that fall into this,
60
00:06:17,130 --> 00:06:25,130
this family: the L1 Wasserstein distance, the Dudley metric, even total variation.
61
00:06:25,130 --> 00:06:31,820
But for Wasserstein distances and the Dudley metric, there's a problem for our criteria.
62
00:06:31,820 --> 00:06:39,230
And the problem is that the definition of the integral probability metric involves integration under P.
63
00:06:39,230 --> 00:06:49,030
And that's exactly what we don't know how to do. So most of these IPMs that you're familiar with can't actually be computed in practice.
64
00:06:49,030 --> 00:06:51,790
So how can we get around this?
65
00:06:51,790 --> 00:07:01,810
Well, one idea is that we could only consider test functions h that we know a priori are mean zero under P.
66
00:07:01,810 --> 00:07:08,410
If we could do that, then we'd get rid of this integration under P. And then the IPM computation only depends on our sample.
67
00:07:08,410 --> 00:07:15,400
And that's potentially something that's tractable, something that we can compute. That's an interesting idea.
68
00:07:15,400 --> 00:07:20,310
But how do we find test functions that are mean zero under P?
69
00:07:20,310 --> 00:07:30,090
And once we find them, how do we know that the resulting IPM will actually track sample sequence convergence in the way we want it to?
70
00:07:30,090 --> 00:07:34,800
And then finally, even once we've gotten rid of the expectation under P,
71
00:07:34,800 --> 00:07:38,460
we're still stuck with this infinite-dimensional optimisation problem over test functions.
72
00:07:38,460 --> 00:07:44,680
So how do we solve that in practice? I'll come back to the third question in a bit.
73
00:07:44,680 --> 00:07:54,020
But first, I'd like to show you how we can use ideas from Charles Stein's method to answer these first two questions.
74
00:07:54,020 --> 00:07:59,160
OK. Stein's method, I'm sure many of you are already familiar with this, but you know, Stein,
75
00:07:59,160 --> 00:08:06,140
Charles Stein, in the 1960s and 70s, produced this recipe for controlling convergence in distribution.
76
00:08:06,140 --> 00:08:12,590
And he did it to prove a central limit theorem and to provide rates of convergence for the central limit theorem.
77
00:08:12,590 --> 00:08:16,970
And I like to think of it as a three step approach.
78
00:08:16,970 --> 00:08:26,390
So in step one, you want to identify an operator, we'll call it T, and a set of functions G that have the following property.
79
00:08:26,390 --> 00:08:31,760
When you pass functions from your set through your operator, the result is mean zero under your target.
80
00:08:31,760 --> 00:08:37,220
So this operator generates mean zero functions under P, which is great.
81
00:08:37,220 --> 00:08:43,550
That's exactly what we want for our improved IPM.
82
00:08:43,550 --> 00:08:49,620
OK, so say we've picked this operator and we've picked the set G, which is going to be the domain of the operator. Together,
83
00:08:49,620 --> 00:08:53,690
T and G will define what we'll call a Stein discrepancy,
84
00:08:53,690 --> 00:09:03,260
which is an IPM-type measure that has no explicit integration under P, because by design, all the test functions are mean zero under P.
85
00:09:03,260 --> 00:09:07,110
So we'll call it a Stein discrepancy S. And it's a function of your sample,
86
00:09:07,110 --> 00:09:12,440
the operator you picked and the set you picked. Now, I'll come back to how we actually find such an operator in a moment,
87
00:09:12,440 --> 00:09:16,910
but first I want to show you what one would normally do with it in
88
00:09:16,910 --> 00:09:21,260
Stein's method. So the second step in Stein's method is typically to lower bound
89
00:09:21,260 --> 00:09:28,370
the Stein discrepancy by some reference IPM that you know and trust, like a Wasserstein distance.
90
00:09:28,370 --> 00:09:32,660
And the purpose there is to show that if your Stein discrepancy goes to zero,
91
00:09:32,660 --> 00:09:38,870
then your reference IPM is also going to zero, which would mean that if your Stein discrepancy goes to zero,
92
00:09:38,870 --> 00:09:42,560
then you know that your sample sequence is converging to P, and that's great.
93
00:09:42,560 --> 00:09:48,480
That satisfies our second requirement, detection of non-convergence.
94
00:09:48,480 --> 00:09:56,670
Now, typically, this requires analysis, but it can be performed once, in advance, for large classes of target distributions.
95
00:09:56,670 --> 00:10:04,000
And then you can go off and happily use your Stein discrepancy knowing that it controls convergence in this way.
96
00:10:04,000 --> 00:10:10,660
The third step in Stein's method is typically to upper bound the Stein discrepancy, by any means necessary, to demonstrate convergence to zero.
97
00:10:10,660 --> 00:10:18,920
This would satisfy our first requirement. Now, this for us is probably the least interesting step, because in practice,
98
00:10:18,920 --> 00:10:23,930
what we're going to do is actually compute the Stein discrepancy and see how big it is. But it would be nice to know,
99
00:10:23,930 --> 00:10:29,480
nice to know, a priori, that if your sample were converging to P, your Stein discrepancy would go to zero.
100
00:10:29,480 --> 00:10:33,940
So I'll give you some results of this flavour as well.
101
00:10:33,940 --> 00:10:40,920
Now, the standard use of this recipe is as an analytical tool to prove convergence in distribution, to prove central limit theorems,
102
00:10:40,920 --> 00:10:43,020
and to establish rates for central limit theorems.
103
00:10:43,020 --> 00:10:51,470
But our goal here is to develop it into a practical quality measure that you can use to assess, for instance, approximate MCMC.
104
00:10:51,470 --> 00:10:59,270
So to do that, I'm going to have to give you a very specific instantiation of this otherwise abstract recipe.
105
00:10:59,270 --> 00:11:04,880
So let's do that. Let's first pick one of these operators, we'll call it a Stein operator T.
106
00:11:04,880 --> 00:11:11,020
Remember, we want to find a T that generates mean zero functions under our target distribution P.
107
00:11:11,020 --> 00:11:16,900
Well, to do this, we're going to appeal to this beautiful idea due to Andrew Barbour called the generator method.
108
00:11:16,900 --> 00:11:25,210
And the idea is that if you can find a Markov process that has P as its stationary distribution, then under very mild conditions,
109
00:11:25,210 --> 00:11:30,460
the infinitesimal generator of that process generates mean zero functions under P.
110
00:11:30,460 --> 00:11:35,110
Now, I've given you the definition of an infinitesimal generator here, but we don't actually need that definition.
111
00:11:35,110 --> 00:11:41,380
We're only going to focus on one specific instance, which I've written in the green box at the bottom.
112
00:11:41,380 --> 00:11:46,940
So if we pick a particular Markov process, the Langevin diffusion targeting P,
113
00:11:46,940 --> 00:11:52,300
then I've written down the form of its generator, and from the generator we're going to produce an operator.
114
00:11:52,300 --> 00:11:58,930
And here's the Stein operator we're going to use, called T_P, the Langevin Stein operator.
115
00:11:58,930 --> 00:12:03,490
And the main thing I want to highlight about this operator is that it depends on the target P
116
00:12:03,490 --> 00:12:06,280
only through the gradient of its log density.
117
00:12:06,280 --> 00:12:14,450
So this means that this is an operator that could be computable, even if P can't be normalised, even if you don't know that normalisation constant.
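For reference, the Langevin Stein operator has the following form (my transcription of the standard definition, since the slide itself isn't shown here):

```latex
(\mathcal{T}_P\, g)(x) \;=\; \langle \nabla \log p(x),\, g(x) \rangle \;+\; \langle \nabla,\, g(x) \rangle,
\qquad \mathbb{E}_P\bigl[(\mathcal{T}_P\, g)(Z)\bigr] = 0 \ \text{ for suitable } g,
```

and only the score, the gradient of log p, appears, so the normalisation constant of p is never needed.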
118
00:12:14,450 --> 00:12:25,240
Those familiar with Stein's method might also see this as a multivariate generalisation of the density method operator due to Stein.
119
00:12:25,240 --> 00:12:32,830
OK. So that's going to be our working Stein operator for most of the talk. We need to know whether it produces mean zero functions.
120
00:12:32,830 --> 00:12:36,760
For that, we need to actually choose which input functions G to pass through it.
121
00:12:36,760 --> 00:12:40,930
And it turns out that one suitable set is the set I've called the classical Stein set,
122
00:12:40,930 --> 00:12:48,130
which is just a set of functions that are bounded, that have bounded gradients and that have Lipschitz gradients.
123
00:12:48,130 --> 00:12:53,930
If you pass any of these functions to your operator, the result will be mean zero, which is exactly what we want.
124
00:12:53,930 --> 00:13:00,310
So now we've picked our operator T, and we've picked our Stein set G.
125
00:13:00,310 --> 00:13:05,950
Can I just ask, what does the star mean on top of the norm?
126
00:13:05,950 --> 00:13:13,120
Oh, yeah, so sorry. So this is, you know, the set is parametrised by some norm, which could be whatever norm you want, your favourite norm.
127
00:13:13,120 --> 00:13:17,470
And then this norm star just represents the dual norm for that norm.
128
00:13:17,470 --> 00:13:25,880
But for most of the talk, you should think of the norm as the Euclidean norm, and then the norm star is also the Euclidean norm.
129
00:13:25,880 --> 00:13:30,450
So you can basically ignore the star, but it's more generally the dual norm of the norm chosen.
130
00:13:30,450 --> 00:13:37,270
Okay. Thank you. OK.
131
00:13:37,270 --> 00:13:44,680
So together, this choice of operator T and Stein set G produce a Stein discrepancy.
132
00:13:44,680 --> 00:13:48,370
For the purposes of the talk, I'll call that a classical Stein discrepancy.
133
00:13:48,370 --> 00:13:54,480
And now you need to know, does this Stein discrepancy actually detect convergence and non-convergence in the way that we want?
134
00:13:54,480 --> 00:14:00,390
So what does that mean? That means: is it the case that this Stein discrepancy goes to zero
135
00:14:00,390 --> 00:14:08,800
if and only if your sample sequence converges to P? Now, in the univariate case, this is actually a well-known result for many targets.
136
00:14:08,800 --> 00:14:12,400
It's known that for many targets this Stein discrepancy goes to zero if and
137
00:14:12,400 --> 00:14:20,550
only if the Wasserstein distance goes to zero. Here are some references that provide results of that flavour.
138
00:14:20,550 --> 00:14:25,440
But in the multivariate case, relatively few targets have been analysed.
139
00:14:25,440 --> 00:14:31,500
There's been substantial work on multivariate Gaussians. There's been some work on the Dirichlet distribution and a few other targets.
140
00:14:31,500 --> 00:14:35,100
But many targets haven't been assessed in the same way.
141
00:14:35,100 --> 00:14:41,070
So to extend the breadth of the literature, my collaborators and I proved the following result.
142
00:14:41,070 --> 00:14:49,560
And it says that if the Langevin diffusion underlying the operator that we picked, the Langevin diffusion of your target, couples at an integrable rate,
143
00:14:49,560 --> 00:14:55,020
that means basically that it mixes quickly, and the gradient of your log density is Lipschitz,
144
00:14:55,020 --> 00:14:58,710
then we have exactly what we want: your Stein discrepancy goes to zero
145
00:14:58,710 --> 00:15:05,270
if and only if your Wasserstein distance goes to zero. And so what are examples of distributions that satisfy these properties?
146
00:15:05,270 --> 00:15:08,750
Well, any strongly log-concave distribution falls into this family,
147
00:15:08,750 --> 00:15:13,580
including the Bayesian logistic regression with Gaussian priors that I mentioned at the beginning of the talk.
148
00:15:13,580 --> 00:15:17,540
But it also covers non-strongly log-concave distributions like robust Student's t
149
00:15:17,540 --> 00:15:23,080
regression with Gaussian priors or even Gaussian mixtures, which are multi-modal.
150
00:15:23,080 --> 00:15:27,520
Now, I should say, though, that this class is one class for which we can provide such a result.
151
00:15:27,520 --> 00:15:29,830
But these conditions are definitely not necessary.
152
00:15:29,830 --> 00:15:35,500
But we hope that this sort of result will provide a template for bounding Stein discrepancies for other classes as well,
153
00:15:35,500 --> 00:15:43,820
so that we can expand the reach of this Stein discrepancy.
154
00:15:43,820 --> 00:15:50,480
OK, so let me give you a simple example of using this Stein discrepancy in practice, the simplest possible example.
155
00:15:50,480 --> 00:15:56,450
So here your target is just going to be a standard normal distribution and you're going to observe samples
156
00:15:56,450 --> 00:16:03,580
either from the standard normal or from a student's t distribution with matching mean and variance.
157
00:16:03,580 --> 00:16:06,880
So if you're assessing these two samples, what would you like to see?
158
00:16:06,880 --> 00:16:14,130
What you'd like to see is that as my sample size grows and I'm getting a sample from the correct distribution,
159
00:16:14,130 --> 00:16:19,270
then my Stein discrepancy goes to zero. And that's what you're seeing in this red line as a function of your sample size.
160
00:16:19,270 --> 00:16:25,900
You're seeing the Stein discrepancy value that was computed. And indeed, this is going to zero at a one-over-root-n rate.
161
00:16:25,900 --> 00:16:29,320
But now, if you're receiving draws from the wrong distribution,
162
00:16:29,320 --> 00:16:34,190
you'd like your Stein discrepancy to be bounded away from zero, and that's what you're seeing in this other line as I observe draws
163
00:16:34,190 --> 00:16:44,210
from my Student's t: at first the Stein discrepancy decreases, but eventually it bottoms out, because you're not getting a better approximation to the normal.
164
00:16:44,210 --> 00:16:50,660
OK. So that's something. I mean, we get the value back from the Stein discrepancy, but it turns out that on top of recovering the value,
165
00:16:50,660 --> 00:16:57,320
you can also recover the function g that optimises this Stein discrepancy, that worst case function.
166
00:16:57,320 --> 00:17:00,380
And I'm showing you those worst case functions here on the right.
167
00:17:00,380 --> 00:17:06,530
I think even more interpretable than a worst case function G is the function after you've passed it through your Stein operator,
168
00:17:06,530 --> 00:17:12,740
because that becomes the test function that should have mean zero under the target, the Gaussian,
169
00:17:12,740 --> 00:17:15,320
But isn't mean zero under your sample.
170
00:17:15,320 --> 00:17:23,030
And it's the function that has the worst case violation, the mean that's farthest from zero.
171
00:17:23,030 --> 00:17:26,750
So I don't want to over-interpret any of these examples.
172
00:17:26,750 --> 00:17:31,610
I'll just give you a little indication of what you can read off from such a function h.
173
00:17:31,610 --> 00:17:37,190
So on the bottom right corner here, you see the worst case test function for our students t sample.
174
00:17:37,190 --> 00:17:43,130
And what you see is that there's a lot of mass, a lot of large function values in the tails.
175
00:17:43,130 --> 00:17:47,540
And this is because the students t is oversampling the tails relative to a Gaussian.
176
00:17:47,540 --> 00:18:00,660
And so that's where you can exploit the difference between your sample and the target; that's where we can most exploit the non-Gaussianity of your sample.
177
00:18:00,660 --> 00:18:04,350
OK. Here's a more interesting example. This is a posterior inference example,
178
00:18:04,350 --> 00:18:08,970
so our target now is going to be a posterior density formed by a product of
179
00:18:08,970 --> 00:18:16,490
a prior and a number of likelihood terms from our L data points.
180
00:18:16,490 --> 00:18:25,520
In this case, we're going to be assessing a particular algorithm called stochastic gradient Langevin dynamics, which is an approximate MCMC algorithm.
181
00:18:25,520 --> 00:18:29,090
It does a random walk through the parameter space.
182
00:18:29,090 --> 00:18:36,260
The way it does this is by starting at a certain point and taking a step in the direction of the gradient of your log density.
183
00:18:36,260 --> 00:18:40,400
But if your data set is large, computing the gradient of your log density is expensive.
184
00:18:40,400 --> 00:18:50,120
And so it approximates that using a mini-batch, a randomly subsampled set of your points, and a stochastic gradient instead of a full gradient.
185
00:18:50,120 --> 00:18:55,740
Then it adds some Gaussian noise, and that becomes the next step in your Markov chain.
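As a rough sketch of the update just described (my own illustration, with hypothetical helpers grad_log_prior and grad_log_lik; it is not the speaker's implementation):

```python
import numpy as np

def sgld_step(theta, data, eps, batch_size, grad_log_prior, grad_log_lik, rng):
    """One stochastic gradient Langevin dynamics step with step size eps."""
    N = len(data)
    batch = data[rng.choice(N, size=batch_size, replace=False)]   # mini-batch of the data
    # Stochastic gradient of the log posterior: prior term plus rescaled mini-batch term.
    grad = grad_log_prior(theta) + (N / batch_size) * sum(grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(scale=np.sqrt(eps), size=theta.shape)      # injected Gaussian noise
    return theta + 0.5 * eps * grad + noise                       # no Metropolis-Hastings correction
```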
186
00:18:55,740 --> 00:19:00,360
Now, you might see this as an approximation to a few other algorithms with which you might be familiar.
187
00:19:00,360 --> 00:19:05,160
One is called the Metropolis-adjusted Langevin algorithm. So this approximates that in two ways.
188
00:19:05,160 --> 00:19:09,930
One, instead of using the actual gradient of your log density, it uses this stochastic gradient.
189
00:19:09,930 --> 00:19:14,630
And two, it throws away the Metropolis Hastings correction.
190
00:19:14,630 --> 00:19:20,790
So the correction is usually employed to ensure that your Markov chain is indeed targeting your target distribution,
191
00:19:20,790 --> 00:19:26,940
that it has the target as a stationary distribution. But computing the Metropolis Hastings correction is usually very slow for large datasets.
192
00:19:26,940 --> 00:19:31,530
So this algorithm just throws away the Metropolis Hastings correction and hopes for the best.
193
00:19:31,530 --> 00:19:35,440
This is what makes it approximate MCMC. OK.
194
00:19:35,440 --> 00:19:39,220
Well, when you're doing approximate MCMC, a number of things matter a lot.
195
00:19:39,220 --> 00:19:43,900
And one of those things is the step size. This is how large of a step you take
196
00:19:43,900 --> 00:19:46,780
in your stochastic gradient direction; call it epsilon.
197
00:19:46,780 --> 00:19:53,560
If your step size is very small, then after a fixed amount of sampling time, you'll barely have moved from your starting point.
198
00:19:53,560 --> 00:19:59,980
So that's slow mixing. So that's bad. That's going to give you a bad approximation to the target, and the target I've drawn here in
199
00:19:59,980 --> 00:20:06,170
terms of its equidensity contours; these are the equidensity contours of the target distribution, drawn here for reference.
200
00:20:06,170 --> 00:20:10,640
On the other hand, if your step size is too large, then you'll have fast mixing.
201
00:20:10,640 --> 00:20:20,430
But you'll be mixing very quickly to a very wrong distribution. Here you see that the sample is greatly overdispersed relative to the target.
202
00:20:20,430 --> 00:20:26,140
So this is a problem. You'd like to pick something in between. So how do you do that?
203
00:20:26,140 --> 00:20:30,850
Well, we'd like to pick this step size well.
204
00:20:30,850 --> 00:20:36,550
So how do we do that? You could try to use a standard selection criterion like effective sample size.
205
00:20:36,550 --> 00:20:43,630
This is commonly used to assess and compare chains for MCMC and to pick hyperparameters.
206
00:20:43,630 --> 00:20:48,040
But in this case, effective sample size is really just a measure of variance.
207
00:20:48,040 --> 00:20:50,920
It's a measure of auto correlation amongst your sample points.
208
00:20:50,920 --> 00:20:57,130
So it will, in this case, always pick the sample that looks best by that measure. Here's my measure, effective sample size:
209
00:20:57,130 --> 00:21:03,400
I've assessed the effective sample size for different values of my step size parameter epsilon.
210
00:21:03,400 --> 00:21:08,020
And it basically just says, pick the biggest epsilon you can because the variance looks the best.
211
00:21:08,020 --> 00:21:11,860
So it says, pick this one. And the reason why it picks this one is that
212
00:21:11,860 --> 00:21:17,800
It has no way of accounting for bias. It doesn't know that this is targeting the wrong distribution asymptotically.
213
00:21:17,800 --> 00:21:25,970
So it can't account for the fact that this is overdispersed. An alternative would be to use the Stein discrepancy that I just mentioned.
214
00:21:25,970 --> 00:21:31,490
And here I'm showing you the Stein discrepancy values computed for each step size; here, lower is better.
215
00:21:31,490 --> 00:21:35,780
So the Stein discrepancy says, OK, a very small epsilon is giving you a bad approximation.
216
00:21:35,780 --> 00:21:40,460
A very large epsilon is giving you a bad approximation, but something in between is giving you the best.
217
00:21:40,460 --> 00:21:45,860
So pick this value and the value that it picks is this one in the middle, which you can see here,
218
00:21:45,860 --> 00:21:57,020
certainly from these plots, is like the best approximation to our target.
219
00:21:57,020 --> 00:22:04,250
OK, so one thing I didn't mention about the Stein discrepancy that I just evaluated for you was how I computed it.
220
00:22:04,250 --> 00:22:07,280
And it turns out the way I computed it is by solving a linear programme;
221
00:22:07,280 --> 00:22:12,380
there's a particular programme that you solve that gives you back the value of the Stein discrepancy. And that's great.
222
00:22:12,380 --> 00:22:17,900
But maybe you don't want to do that. Maybe you don't want to solve a linear programme every time you have to evaluate your sample.
223
00:22:17,900 --> 00:22:23,930
You might find that cumbersome or too slow. And so now what I'm going to do is identify an alternative Stein set
224
00:22:23,930 --> 00:22:29,460
that is a bit more user friendly, just easier to use than the one I just described.
225
00:22:29,460 --> 00:22:33,660
And to do this, I'm going to appeal to reproducing kernels.
226
00:22:33,660 --> 00:22:46,500
And I want to build upon an idea of Chris Oates, Mark Girolami and Nicolas Chopin, who used mean zero reproducing kernels to improve Monte Carlo integration.
227
00:22:46,500 --> 00:22:53,070
So just a quick background on reproducing kernels. A kernel, you can think of it as a function with two arguments.
228
00:22:53,070 --> 00:22:59,690
It's symmetric in those two arguments and it's positive semi definite meaning that if I took any
229
00:22:59,690 --> 00:23:06,230
points from my sample space and I formed the matrix of pairwise evaluations of that function,
230
00:23:06,230 --> 00:23:11,910
then that matrix is always a positive semi-definite matrix. So what are some examples?
231
00:23:11,910 --> 00:23:17,670
Probably the most common example is what's called the Gaussian kernel, which just looks like the density of a Gaussian distribution.
232
00:23:17,670 --> 00:23:20,600
And so that will be appearing in our talk in a second.
233
00:23:20,600 --> 00:23:27,320
Just to give you a second example, here's an inverse multiquadric kernel, which looks a lot like the Gaussian kernel, but its decay is polynomial
234
00:23:27,320 --> 00:23:32,290
instead of squared exponential. OK.
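In symbols, the two kernels just mentioned are (with bandwidth sigma, offset c and exponent beta as assumed parameter names):

```latex
k_{\mathrm{Gauss}}(x, y) = \exp\!\Bigl(-\tfrac{1}{2\sigma^{2}}\,\lVert x - y \rVert_2^{2}\Bigr),
\qquad
k_{\mathrm{IMQ}}(x, y) = \bigl(c^{2} + \lVert x - y \rVert_2^{2}\bigr)^{\beta}, \quad \beta < 0 .
```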
235
00:23:32,290 --> 00:23:40,840
What we actually want from this kernel is the space that it generates; every kernel generates what's called a reproducing kernel Hilbert space.
236
00:23:40,840 --> 00:23:46,310
And we're going to use that to choose a different Stein set to pass through our operator.
237
00:23:46,310 --> 00:23:52,610
In particular, we're going to use this Stein set, which some call the kernel Stein set, and this is just a set of functions.
238
00:23:52,610 --> 00:23:57,530
So the functions we've been using throughout the talk are actually vector-valued. They have d components.
239
00:23:57,530 --> 00:24:02,910
So this is a set of vector-valued functions such that every component belongs to your RKHS,
240
00:24:02,910 --> 00:24:09,190
your reproducing kernel Hilbert space, and the norms of those functions are jointly bounded by one.
241
00:24:09,190 --> 00:24:13,280
So every function component belongs to the RKHS, and then you put all the norms together.
242
00:24:13,280 --> 00:24:19,900
They're bounded by one. That's the Stein set. What's good about this set?
243
00:24:19,900 --> 00:24:28,780
Well, the very nice property that this set has is that the Stein discrepancy it induces, that is, when you plug this set into your Stein discrepancy,
244
00:24:28,780 --> 00:24:32,590
it has a closed form solution. And to form that solution,
245
00:24:32,590 --> 00:24:40,230
all you have to do is take your input kernel k and apply the Stein operator to each argument of k.
246
00:24:40,230 --> 00:24:49,910
That gives you back a new kernel, which is what we call a Stein kernel. And then you sum up those Stein kernel evaluations across every pair of sample points.
247
00:24:49,910 --> 00:24:55,140
So, again, the resulting kernel Stein discrepancy is just a
248
00:24:55,140 --> 00:25:00,470
sum of pairwise evaluations of this new Stein kernel that's formed by applying the Stein operator to your base
249
00:25:00,470 --> 00:25:05,880
kernel. So that's pretty cool.
250
00:25:05,880 --> 00:25:10,080
The closed form does take quadratic time in your sample size.
251
00:25:10,080 --> 00:25:19,800
But it's also very parallelisable, so you can compute all of your pairwise evaluations in parallel on a cluster or across cores.
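Here is a minimal sketch of that computation (my own illustration, assuming the Langevin Stein operator, an IMQ base kernel and a user-supplied score function; it is not the speaker's code): the Stein kernel k0 is formed by applying the operator to both arguments of the base kernel, and the kernel Stein discrepancy is the root of the average of its pairwise evaluations.

```python
import numpy as np

def imq_ksd(X, score, c=1.0, beta=-0.5):
    """Kernel Stein discrepancy of sample X (n x d) against the target whose
    score function (gradient of the log density) is `score`, using the IMQ kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta with beta in (-1, 0)."""
    n, d = X.shape
    S = score(X)                                   # n x d matrix of scores at the sample points
    diff = X[:, None, :] - X[None, :, :]           # pairwise differences x_i - x_j
    r2 = np.sum(diff ** 2, axis=-1)                # squared pairwise distances
    u = c ** 2 + r2
    k = u ** beta                                  # base kernel values
    gradx_k = 2 * beta * u[..., None] ** (beta - 1) * diff         # grad of k in its first argument
    trace = -2 * beta * (d * u ** (beta - 1) + 2 * (beta - 1) * u ** (beta - 2) * r2)
    # Langevin Stein kernel k0(x_i, x_j): apply the Stein operator to each argument of k.
    k0 = (trace
          + np.einsum('ijk,jk->ij', gradx_k, S)    # <grad_x k, score(x_j)>
          - np.einsum('ijk,ik->ij', gradx_k, S)    # <grad_y k, score(x_i)> = -<grad_x k, score(x_i)>
          + k * (S @ S.T))                          # k(x_i, x_j) <score(x_i), score(x_j)>
    return np.sqrt(k0.mean())                       # quadratic in n, but embarrassingly parallel

# Usage sketch: a standard normal target has score(x) = -x.
# X = np.random.randn(500, 2); print(imq_ksd(X, lambda X: -X))
```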
252
00:25:19,800 --> 00:25:24,780
OK, so now we have another Stein discrepancy, and then we have the same questions.
253
00:25:24,780 --> 00:25:31,320
Does it detect non-convergence, for instance? Is it the case that your sample sequence converges to P
254
00:25:31,320 --> 00:25:39,480
whenever your Stein discrepancy converges to zero? Well, it turns out that in the one-dimensional case,
255
00:25:39,480 --> 00:25:41,820
this is true in pretty broad generality.
256
00:25:41,820 --> 00:25:48,660
If you pick any of your favourite kernels, like that Gaussian kernel or the inverse multiquadric, then
257
00:25:48,660 --> 00:25:53,530
if your target distribution belongs to basically the same set of distributions that we considered before.
258
00:25:53,530 --> 00:25:58,020
Here I've changed the name, but it's basically the same set as before.
259
00:25:58,020 --> 00:26:00,510
If your target belongs to this set, script P,
260
00:26:00,510 --> 00:26:08,730
which is the set of targets with Lipschitz gradient of the log density and that satisfy distant strong log-concavity.
261
00:26:08,730 --> 00:26:15,210
This is a property that's kind of like being strongly log-concave, but it's only a measurement of the tails, so it doesn't have to hold locally.
262
00:26:15,210 --> 00:26:18,930
So even Gaussian mixtures that are multi-modal satisfy this property.
263
00:26:18,930 --> 00:26:24,440
If P belongs to this set of target distributions and you pick your favourite,
264
00:26:24,440 --> 00:26:28,800
we'll say universal, kernel, then your sample sequence converges to P
265
00:26:28,800 --> 00:26:32,070
whenever your Stein discrepancy goes to zero.
266
00:26:32,070 --> 00:26:38,010
So in terms of targets covered, it's all the same targets as before: Bayesian logistic regression, Student's t regression with Gaussian priors,
267
00:26:38,010 --> 00:26:42,150
Gaussian mixtures with common covariance, they all fall into this class.
268
00:26:42,150 --> 00:26:47,850
And this also justifies the use of the KSD with popular kernels like the Gaussian kernel.
269
00:26:47,850 --> 00:26:52,870
Another popular one is called the Matérn kernel, and even the inverse multiquadric that I mentioned before.
270
00:26:52,870 --> 00:26:57,160
But this result is a little bit limited because this is only for one dimensional targets.
271
00:26:57,160 --> 00:27:03,930
And often when we're dealing with MCMC, we're working with multidimensional targets. So what can we say in that case?
272
00:27:03,930 --> 00:27:09,180
Well, it turns out that in higher dimensions, many kernel Stein discrepancies break down;
273
00:27:09,180 --> 00:27:16,400
they actually fail to detect non-convergence even when you're working with a very easy target like a Gaussian.
274
00:27:16,400 --> 00:27:26,300
So here's the result. It says if you're in dimension three or greater and the kernel you pick has a very rapidly decaying tail,
275
00:27:26,300 --> 00:27:31,360
so it's a Gaussian kernel, or some Matérn kernel which decays exponentially,
276
00:27:31,360 --> 00:27:36,910
or it's one of these inverse multiquadric kernels that decays polynomially, but the polynomial decay is too quick,
277
00:27:36,910 --> 00:27:44,140
then I can give you a sample sequence that will send your Stein discrepancy to zero but not converge to P.
278
00:27:44,140 --> 00:27:48,830
Which is actually really bad, because that means you can't trust your Stein discrepancy.
279
00:27:48,830 --> 00:27:54,660
And in fact, the sample sequence doesn't converge; the sample sequence that I can construct won't converge to anything.
280
00:27:54,660 --> 00:28:00,900
It really just places a bunch of mass on the surface of a sphere and then shoots the radius of the sphere off to infinity.
281
00:28:00,900 --> 00:28:09,140
So your sample is not converging to anything. It's just shooting all the mass off to infinity. And yet your Stein discrepancy is still going to zero.
282
00:28:09,140 --> 00:28:16,340
So you can't trust it in this case. So when you say not converging, this is convergence in distribution, right?
283
00:28:16,340 --> 00:28:23,380
Yeah, that's right. It doesn't converge in distribution to anything. That's right. There's also convergence on compact sets, of course.
284
00:28:23,380 --> 00:28:30,110
Yeah. So, I mean, the real issue here is that you're sending all of your mass to infinity.
285
00:28:30,110 --> 00:28:34,410
All of it. And so it's essentially converging to the zero measure. Yeah.
286
00:28:34,410 --> 00:28:42,630
So in that sense, I suppose, it does converge, because on any compact set what you're left with is nothing.
287
00:28:42,630 --> 00:28:46,650
Yes, but not to any probability distribution. It won't converge to any probability distribution. That's true. Yes.
288
00:28:46,650 --> 00:28:52,410
And that's, I think, the important part. But yes, it is converging to the zero measure, in that sense of convergence.
289
00:28:52,410 --> 00:28:56,340
But that is, I think, the problem. That is the whole problem, actually.
290
00:28:56,340 --> 00:29:00,930
That is the whole problem. For the Stein discrepancy to work in the way you want it to,
291
00:29:00,930 --> 00:29:06,450
you have to be able to distinguish the zero measure from P, because the zero measure also gives mean zero to these functions,
292
00:29:06,450 --> 00:29:14,930
so to speak. You need to be able to distinguish those two things. And the problem is that if your kernel is decaying too quickly,
293
00:29:14,930 --> 00:29:19,800
these very light tails are ignoring that excess mass that you're sending into the tails.
294
00:29:19,800 --> 00:29:28,250
And so it doesn't notice that you're shooting all the mass off to infinity. Essentially, you're not controlling tightness.
295
00:29:28,250 --> 00:29:32,780
So that's a problem. So what can we do?
296
00:29:32,780 --> 00:29:37,000
Well, I just said, well, the problem seems to be these light tails; they're ignoring mass at infinity.
297
00:29:37,000 --> 00:29:41,630
What if we use heavier tails? And it turns out that that is enough to solve the problem.
298
00:29:41,630 --> 00:29:49,700
If you use one of these inverse multiquadric kernels, these polynomially decaying kernels, but have them decay not too quickly,
299
00:29:49,700 --> 00:29:55,730
if you choose the decay parameter beta to be between minus one and zero, then everything works.
300
00:29:55,730 --> 00:29:59,330
If your distribution belongs to our standard class and you use this
301
00:29:59,330 --> 00:30:06,200
IMQ kernel, then indeed if your Stein discrepancy goes to zero, your sample sequence is converging weakly to P,
302
00:30:06,200 --> 00:30:13,630
in any dimension. OK, so let's go on.
303
00:30:13,630 --> 00:30:17,190
We have a solution. Now, what about the other side? What about detecting convergence?
304
00:30:17,190 --> 00:30:24,140
We want to show that if my sample sequence is converging to P, then my Stein discrepancy is also going to zero.
305
00:30:24,140 --> 00:30:27,290
Well, it turns out that this is true in pretty broad generality.
306
00:30:27,290 --> 00:30:36,680
If your kernel is bounded and twice continuously differentiable, and the gradient of your log density is Lipschitz and square-integrable under P,
307
00:30:36,680 --> 00:30:42,580
so those are pretty mild conditions, then your Stein discrepancy will go to zero whenever the Wasserstein distance goes to zero.
308
00:30:42,580 --> 00:30:47,780
So if your Qn is converging to P in Wasserstein distance, your Stein discrepancy will also converge to zero.
309
00:30:47,780 --> 00:30:59,170
And this covers all of your favourite kernels, in any dimension.
310
00:30:59,170 --> 00:31:01,960
OK, so let me give you an example of using this.
311
00:31:01,960 --> 00:31:06,730
this kernel Stein discrepancy. Now we're going to consider another one of these approximate MCMC procedures.
312
00:31:06,730 --> 00:31:13,060
This was designed for scalability. This one is called stochastic gradient Fisher scoring, or SGFS for short.
313
00:31:13,060 --> 00:31:17,170
And in the initial paper, there are two different variants of this algorithm proposed.
314
00:31:17,170 --> 00:31:21,850
One that is more of an exact variant that involves inverting a d-by-d matrix.
315
00:31:21,850 --> 00:31:28,220
Another that tries to reduce computational effort by inverting a diagonal matrix instead.
316
00:31:28,220 --> 00:31:32,450
We're going to operate on a fifty one dimensional problem in this case,
317
00:31:32,450 --> 00:31:43,680
using a posterior distribution over the parameters of a Bayesian logistic regression, using MNIST handwritten digit images.
318
00:31:43,680 --> 00:31:48,120
OK, so here's the result. Basically, we're comparing two different —
319
00:31:48,120 --> 00:31:55,440
as I said, we're comparing two different versions of this approximate MCMC algorithm, in teal and red.
320
00:31:55,440 --> 00:31:59,970
If you apply the Stein discrepancy, you see that, across sample sizes,
321
00:31:59,970 --> 00:32:06,180
the full version, the F version, just seems to be better: it has a lower Stein discrepancy value.
322
00:32:06,180 --> 00:32:12,690
So this would suggest that you should always use that, unless the speed-up from using the diagonal version is very large.
323
00:32:12,690 --> 00:32:16,360
But it turns out that the speed up from using the diagonal version is quite mild.
324
00:32:16,360 --> 00:32:21,180
The difference is like 0.017 seconds versus 0.019 seconds. Not much of a gain.
325
00:32:21,180 --> 00:32:26,060
So this would suggest that the quality that you're losing by using the diagonal version is not worth it;
326
00:32:26,060 --> 00:32:33,120
the speed-up you're getting is not worth the quality being lost. But that's just the output of this discrepancy measure.
327
00:32:33,120 --> 00:32:42,030
You might wonder, what does that have to do with reality? So to give you an external check on this assessment,
328
00:32:42,030 --> 00:32:51,310
what I've done is I've run a very long and trustworthy Markov chain to generate a ground truth sample from the posterior.
329
00:32:51,310 --> 00:32:57,320
In this case, I believe it's either Hamiltonian Monte Carlo or the Metropolis-adjusted Langevin algorithm.
330
00:32:57,320 --> 00:33:01,490
So this is a Markov chain that actually targets the distribution, run for a very long time.
331
00:33:01,490 --> 00:33:07,010
And I've used that as a surrogate ground truth for the true posterior.
332
00:33:07,010 --> 00:33:14,240
And then I'm comparing the means and covariances produced by our approximate MCMC algorithms.
333
00:33:14,240 --> 00:33:16,190
So in the left column here,
334
00:33:16,190 --> 00:33:23,570
I'm showing you the best mean and covariance approximation across pairwise marginals, and in the right column I'm showing you the worst.
335
00:33:23,570 --> 00:33:30,330
So on the top, we have the diagonal version of the algorithm and at the bottom we have the full version of the algorithm.
336
00:33:30,330 --> 00:33:38,280
What you basically see is that even in the best case, the diagonal version is doing a poor job of approximating pairwise marginal variances.
337
00:33:38,280 --> 00:33:44,460
And even in the worst case, the full version of the algorithm is doing a great job of approximating covariances.
338
00:33:44,460 --> 00:33:48,680
So it's just an external check on the quality of these two approximate
339
00:33:48,680 --> 00:33:55,140
MCMC algorithms. And indeed, just by looking at pairwise marginals, you can see that the full version is doing much better than the diagonal version.
340
00:33:55,140 --> 00:33:59,970
The benefit, though, of using the Stein discrepancy over what I've just done on the right is that you don't
341
00:33:59,970 --> 00:34:11,060
need this extra auxiliary surrogate ground truth chain to perform your assessment.
342
00:34:11,060 --> 00:34:16,480
OK. So far, I've focussed wholly on sample quality comparison.
343
00:34:16,480 --> 00:34:22,440
And I'd like to highlight a few other applications of these techniques.
344
00:34:22,440 --> 00:34:32,420
So as an example, a few groups have proposed using kernel Stein discrepancies like the one I just described to perform goodness-of-fit testing;
345
00:34:32,420 --> 00:34:41,440
in particular, Chwialkowski et al. used the KSD with a Gaussian kernel to carry out a variety of goodness-of-fit testing experiments.
346
00:34:41,440 --> 00:34:47,180
The authors of this call. All right, Tom. Hey, Arthur.
347
00:34:47,180 --> 00:34:51,960
Arthur is part of that, too, Arthur Gretton. It's a great idea.
348
00:34:51,960 --> 00:34:56,340
And one thing they found was that if you use a default Gaussian kernel,
349
00:34:56,340 --> 00:35:00,360
you could experience a considerable loss of power as your dimension d increases.
350
00:35:00,360 --> 00:35:08,580
And, you know, I think this is in part related to the problem that we highlighted before with a Gaussian kernel being unable to detect,
351
00:35:08,580 --> 00:35:12,030
unable to distinguish the target from the zero measure in higher dimensions.
352
00:35:12,030 --> 00:35:18,420
So what we're going to do is we're just going to repeat their experiment, plugging in the IMQ kernel for the Gaussian kernel.
353
00:35:18,420 --> 00:35:27,480
OK. So here's the setup. Your target P is a multivariate standard Gaussian.
354
00:35:27,480 --> 00:35:31,770
You're seeing a sample that's not Gaussian. And you want to test whether it is Gaussian or not.
355
00:35:31,770 --> 00:35:38,070
And specifically, the sample is formed by spiking one of the components with a uniform random variable.
356
00:35:38,070 --> 00:35:46,260
And so you want to test this; we want to compute the power of this test for detecting this departure from normality.
357
00:35:46,260 --> 00:35:51,840
And they also compare with the standard normality test of Baringhaus and Henze. So what do we see?
358
00:35:51,840 --> 00:35:57,890
Well, we see that in low dimensions, all these tests, the Gaussian KSD, the Baringhaus-Henze test,
359
00:35:57,890 --> 00:36:00,870
They all do quite well. They have full power at detecting these departures.
360
00:36:00,870 --> 00:36:05,280
But as the dimension increases, you see that the power of the Gaussian test decays very rapidly.
361
00:36:05,280 --> 00:36:11,340
By the time you're in dimension twenty-five, you've lost all of your power. But if you just substitute this IMQ
362
00:36:11,340 --> 00:36:16,980
kernel with polynomial decay, the one that we described that controls convergence, for the Gaussian kernel,
363
00:36:16,980 --> 00:36:25,520
you actually maintain full power across the range of dimensions. OK.
364
00:36:25,520 --> 00:36:29,690
So another thought is that, OK, we've looked at assessing sample quality.
365
00:36:29,690 --> 00:36:34,350
Could we also improve the quality of our sample? You've generated a sample somehow;
366
00:36:34,350 --> 00:36:42,770
I want to produce a better approximation to my target. Well, one way of doing this might be to assign a weight to each of your sample points
367
00:36:42,770 --> 00:36:47,220
and then optimise those weights to minimise your kernel Stein discrepancy. This is the idea
368
00:36:47,220 --> 00:36:53,720
in a paper by Liu and Lee in twenty sixteen. And they again do this with a Gaussian kernel.
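A rough sketch of that reweighting idea (my own illustration under assumed notation, not the authors' code): given the n x n Stein kernel matrix K0 of your sample, choose simplex weights minimising the quadratic form w' K0 w, which is the squared kernel Stein discrepancy of the weighted sample.

```python
import numpy as np
from scipy.optimize import minimize

def ksd_weights(K0):
    """Simplex weights minimising w @ K0 @ w for a precomputed Stein kernel matrix K0."""
    n = K0.shape[0]
    res = minimize(lambda w: w @ K0 @ w,
                   x0=np.full(n, 1.0 / n),                                   # start from equal weights
                   bounds=[(0.0, 1.0)] * n,                                  # weights stay non-negative
                   constraints=({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},))
    return res.x
```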
369
00:36:53,720 --> 00:36:59,490
And we're going to show you some benefits from doing this with an IMQ kernel instead.
370
00:36:59,490 --> 00:37:05,880
So what I'm showing you here is the mean squared error between the true means and the estimated means,
371
00:37:05,880 --> 00:37:10,530
using a weighted sample or the original equal-weight sample.
372
00:37:10,530 --> 00:37:17,790
So your initial sample is just an i.i.d. draw from your target. Then we're going to learn a reweighting using a Gaussian —
373
00:37:17,790 --> 00:37:21,840
sorry, a Gaussian KSD, a Gaussian kernel Stein discrepancy.
374
00:37:21,840 --> 00:37:29,420
If you do that, you see uniform improvement in approximation quality across dimensions, from dimension two up to one hundred.
375
00:37:29,420 --> 00:37:41,740
And then if you use an IMQ kernel instead, you see yet further improvements in approximation quality.
376
00:37:41,740 --> 00:37:49,960
OK. So in this last example, we were improving sample quality by reweighting points.
377
00:37:49,960 --> 00:37:54,250
You might also imagine that you could improve sample quality by shifting locations of your points.
378
00:37:54,250 --> 00:37:58,450
You start with the initial sample and then you move those points around to improve the quality.
379
00:37:58,450 --> 00:38:04,900
And this is the idea of Stein variational gradient descent, due to Liu and Wang.
380
00:38:04,900 --> 00:38:05,260
Essentially,
381
00:38:05,260 --> 00:38:14,470
it uses kernel Stein discrepancy directions, like the worst case test functions from the kernel Stein discrepancy problem, to update the locations of points.
382
00:38:14,470 --> 00:38:20,060
And it turns out that you can also view this update as an approximate gradient step
383
00:38:20,060 --> 00:38:25,950
in the KL divergence between your sample approximation and your target distribution.
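A minimal sketch of one such update (my own illustration with an RBF kernel and a user-supplied score function; not the authors' implementation):

```python
import numpy as np

def svgd_step(X, score, step=0.1, bandwidth=1.0):
    """One Stein variational gradient descent update for particles X (n x d)."""
    n, d = X.shape
    S = score(X)                                        # grad log p at each particle
    diff = X[:, None, :] - X[None, :, :]                # pairwise differences x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * bandwidth ** 2))   # RBF kernel matrix
    grad_term = np.einsum('ij,ijd->id', K, diff) / bandwidth ** 2    # sum_j grad_{x_j} k(x_j, x_i)
    phi = (K @ S + grad_term) / n                       # descent direction driven by the KSD
    return X + step * phi                               # each update costs O(n^2 d)
```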
384
00:38:25,950 --> 00:38:32,550
Now, the convergence of the algorithm is not completely clear, but it's very simple to implement and seems to work well.
385
00:38:32,550 --> 00:38:36,510
One other possible downside is that the update also costs n-squared time:
386
00:38:36,510 --> 00:38:41,340
to update your set of sample points, you're updating all of your points on every step,
387
00:38:41,340 --> 00:38:50,250
and that takes n-squared time per update. Another, related approach is to
388
00:38:50,250 --> 00:38:53,790
use your Stein discrepancy to greedily select the next points.
389
00:38:53,790 --> 00:39:01,050
So start with no approximation. Pick the best single-point approximation to your target under the Stein discrepancy.
390
00:39:01,050 --> 00:39:05,700
And then repeatedly add in additional points to improve that approximation.
391
00:39:05,700 --> 00:39:10,920
That's another way of selecting de novo points for approximating
392
00:39:10,920 --> 00:39:16,320
your target distribution, different from just sampling or using MCMC.
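A rough sketch of that greedy selection over a finite candidate set (my own illustration; `stein_kernel` is an assumed function, for instance the IMQ Stein kernel from the earlier sketch):

```python
import numpy as np

def greedy_stein_points(candidates, stein_kernel, m):
    """Greedily pick m points from `candidates` to minimise the kernel Stein discrepancy."""
    chosen = []
    for _ in range(m):
        best, best_val = None, np.inf
        for x in candidates:
            pts = chosen + [x]
            # Squared KSD of the candidate set: average of pairwise Stein kernel evaluations.
            val = np.mean([[stein_kernel(a, b) for b in pts] for a in pts])
            if val < best_val:
                best, best_val = x, val
        chosen.append(best)
    return np.array(chosen)
```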
393
00:39:16,320 --> 00:39:23,370
And, you know, we can show that this actually sends your KSD to zero at a particular rate, a root-log-n-over-n rate.
394
00:39:23,370 --> 00:39:32,620
So you can prove convergence of this algorithm. One possible downside is that to pick each new point, you have to solve this optimisation problem.
395
00:39:32,620 --> 00:39:38,500
You have to solve this optimisation problem over the whole space to pick the next point,
396
00:39:38,500 --> 00:39:41,080
Which could be expensive or difficult.
397
00:39:41,080 --> 00:39:48,940
So we proposed a modification which replaces that expensive optimisation with just sampling from a short Markov chain.
398
00:39:48,940 --> 00:39:51,790
And it turns out if you have a Markov chain that's mixing quickly anyway,
399
00:39:51,790 --> 00:39:56,920
instead of optimising over all the space, you can just generate a short Markov chain.
400
00:39:56,920 --> 00:40:01,750
Pick the best point from the iterates of that chain and use that as your next point,
401
00:40:01,750 --> 00:40:10,290
the next point in your representation. And then it maintains the same rate of convergence.
402
00:40:10,290 --> 00:40:14,580
So here's an example of what those sample points could look like. If you're doing MCMC,
403
00:40:14,580 --> 00:40:20,670
typically the problem, the lack of approximation quality, comes from clumping,
404
00:40:20,670 --> 00:40:22,470
clumping that comes from randomness.
405
00:40:22,470 --> 00:40:29,370
So you'll see places where you're oversampling and also large gaps, just due to randomness.
406
00:40:29,370 --> 00:40:44,210
Whereas the Stein points MCMC algorithm typically spaces out your points to minimise redundancy and gives you a more compact representation.
407
00:40:44,210 --> 00:40:50,390
So I'm showing you some examples of this algorithm against some alternatives.
408
00:40:50,390 --> 00:41:01,400
This one is from a kinetic model. So this is a 10-dimensional posterior distribution over the parameters.
409
00:41:01,400 --> 00:41:08,510
The likelihood here involves solving an ODE, and so you'd like to minimise the number of those likelihood evaluations.
410
00:41:08,510 --> 00:41:13,610
And so what I'm showing you is, as a function of the number of likelihood evaluations,
411
00:41:13,610 --> 00:41:21,180
how good is my approximation to the target distribution using a variety of different particle generation procedures?
412
00:41:21,180 --> 00:41:27,280
So comparing Markov chain Monte Carlo algorithms, to SVGD appearing in red, to this Stein points
413
00:41:27,280 --> 00:41:33,680
MCMC algorithm, which are these solid purple and blue lines at the bottom.
414
00:41:33,680 --> 00:41:39,080
So the dashed lines are MCMC.
415
00:41:39,080 --> 00:41:43,190
Yes, these dashed lines are MCMC. Red is SVGD.
416
00:41:43,190 --> 00:41:46,920
Essentially, what you're seeing happen with SVGD is that because every update takes n-squared
417
00:41:46,920 --> 00:41:54,510
time, it progresses slowly. Eventually, it's going to converge to a good solution, but it just takes much longer than these other procedures.
418
00:41:54,510 --> 00:42:05,540
MCMC does relatively well, but you can actually get to better quality more quickly using these Stein point MCMC procedures.
419
00:42:05,540 --> 00:42:12,990
Here is another example from a financial time series. I'll just skip over that.
420
00:42:12,990 --> 00:42:19,020
OK, so I've highlighted a few applications and some of the technology that's useful.
421
00:42:19,020 --> 00:42:24,150
I wanted to end with some opportunities for future development areas where I feel like more work is still needed.
422
00:42:24,150 --> 00:42:30,150
So first, improving scalability of Stein discrepancies while maintaining convergence control.
423
00:42:30,150 --> 00:42:37,290
So the Stein discrepancy itself involves this gradient of your log density; when the gradient of your log density is too expensive,
424
00:42:37,290 --> 00:42:46,120
Can you still monitor convergence in a way that's both computationally feasible and theoretically sound?
425
00:42:46,120 --> 00:42:53,100
My collaborators and I have started doing some investigation into this. We have some new work this year called stochastic Stein discrepancies,
426
00:42:53,100 --> 00:42:59,760
which essentially just replaces the gradient of the log density with a random stochastic gradient.
427
00:42:59,760 --> 00:43:04,930
And we show that if you do this, you can still control convergence with probability one.
428
00:43:04,930 --> 00:43:12,180
So that's pretty exciting. But I think there's still much more to be explored in that space.
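[Editor's note: as a rough illustration of the idea, not the exact estimator from the Stochastic Stein Discrepancies work, the expensive full-data gradient of the log posterior can be replaced by an unbiased minibatch estimate. The helper names grad_log_prior and grad_log_lik below are hypothetical.]

    import numpy as np

    def stochastic_score(theta, grad_log_prior, grad_log_lik, data, batch_size, rng):
        # Unbiased minibatch estimate of grad log p(theta | data):
        #   grad log p0(theta) + (N / m) * sum over a random batch of grad log p(y_i | theta)
        N = len(data)
        idx = rng.choice(N, size=batch_size, replace=False)
        batch_grad = sum(grad_log_lik(theta, data[i]) for i in idx)
        return grad_log_prior(theta) + (N / batch_size) * batch_grad

[Plugging such a stochastic score into a kernel Stein discrepancy gives a randomised discrepancy; the result mentioned in the talk is that, under suitable conditions, convergence can still be controlled with probability one.]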
429
00:43:12,180 --> 00:43:14,100
If you're using one of these kernel Stein discrepancies,
430
00:43:14,100 --> 00:43:21,120
I mentioned that the expense is quadratic in your sample size, and that might be prohibitive if
431
00:43:21,120 --> 00:43:28,380
your sample size is quite large. There's been a good bit of work now on trying to improve that.
432
00:43:28,380 --> 00:43:32,580
So Arthur and collaborators have worked on what are called finite set Stein discrepancies,
433
00:43:32,580 --> 00:43:36,450
which essentially use low-rank kernels and provide linear runtime.
434
00:43:36,450 --> 00:43:41,590
But there are still questions around the convergence control of these discrepancy measures.
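[Editor's note: to indicate how a low-rank construction reaches linear runtime, here is a rough plug-in sketch in the spirit of the finite set Stein discrepancy: the Stein-transformed kernel is only evaluated against a small number J of fixed test locations, so the cost is O(nJ) rather than O(n^2). The Gaussian base kernel, the test locations, and the normalisation are assumptions; the original paper's exact statistic differs.]

    import numpy as np

    def fssd_like_discrepancy(samples, score, test_locs, bandwidth=1.0):
        # Plug-in, O(n * J) discrepancy estimate: average the Stein feature
        # xi(x, v) = score(x) * k(x, v) + grad_x k(x, v) over the sample,
        # then average its squared norm over the J test locations v.
        n, d = samples.shape
        J = test_locs.shape[0]
        total = 0.0
        for v in test_locs:
            mean_xi = np.zeros(d)
            for x in samples:
                diff = x - v
                k = np.exp(-(diff @ diff) / (2.0 * bandwidth ** 2))  # Gaussian base kernel
                grad_k = -diff / bandwidth ** 2 * k                   # grad_x k(x, v)
                mean_xi += score(x) * k + grad_k
            mean_xi /= n
            total += mean_xi @ mean_xi
        return total / (d * J)

[With J fixed and small, the cost grows linearly in the sample size n, which is the scalability gain; the convergence-control questions mentioned above still apply.]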
435
00:43:41,590 --> 00:43:46,870
With Jonathan Huggins, I've worked on a different type of Stein discrepancy, some of which are actually
436
00:43:46,870 --> 00:43:50,260
kernel Stein discrepancies, and some are related but use different function classes.
437
00:43:50,260 --> 00:43:52,570
They're called random feature Stein discrepancies.
438
00:43:52,570 --> 00:43:57,520
And you can think of these as providing essentially stochastic low-rank kernels that have near-linear runtime,
439
00:43:57,520 --> 00:44:04,180
but also provide high-probability convergence control. But there are still some caveats to establish that convergence control.
440
00:44:04,180 --> 00:44:11,020
We needed to assume that the moments of your approximating samples are uniformly bounded.
441
00:44:11,020 --> 00:44:18,390
And so it'd be nice to get away from that requirement and just have something that works all the time.
442
00:44:18,390 --> 00:44:24,120
Now, I also proposed this particular operator to you that came from the Langevin diffusion, called the Langevin operator,
443
00:44:24,120 --> 00:44:30,480
but actually an infinite number of operators characterise the target distribution. So how do we know which one we should pick?
444
00:44:30,480 --> 00:44:34,480
What is the best operator and how does that impact your discrepancy?
445
00:44:34,480 --> 00:44:41,730
And what's practical? These are all open questions. But I can give you an example of why this matters.
446
00:44:41,730 --> 00:44:46,050
I've given you a class of distributions for which the Stein discrepancy definitely does control convergence,
447
00:44:46,050 --> 00:44:51,990
but here's an example of distributions for which it does not. Say you're dealing with a Student's t distribution.
448
00:44:51,990 --> 00:44:59,340
So the gradient of your log density is bounded, and you're using any C0 kernel, like the Gaussian kernel or the IMQ
449
00:44:59,340 --> 00:45:06,870
kernel, anything standard that decays to zero. Then your Stein discrepancy will not detect non-convergence, will not control convergence.
450
00:45:06,870 --> 00:45:15,440
In particular, I can always produce a sample whose Stein discrepancy converges to zero, even though the sample is not converging to P.
451
00:45:15,440 --> 00:45:21,350
That's bad. It's the same sort of problem that we saw in low dimensions for certain kernels.
452
00:45:21,350 --> 00:45:29,150
But now this is happening for all these kernels, all your favourite kernels, in every dimension.
453
00:45:29,150 --> 00:45:35,430
So what can we do about this? Well, some of my collaborators and I, like Andrew Duncan, Sebastian Vollmer, and Jackson
454
00:45:35,430 --> 00:45:40,150
Gorham, have been investigating alternative operators called diffusion Stein operators.
455
00:45:40,150 --> 00:45:43,430
The idea here is essentially that instead of using the Langevin diffusion,
456
00:45:43,430 --> 00:45:52,540
you use a different diffusion, and you can pick that diffusion in a way that counteracts the decay or lack of growth
457
00:45:52,540 --> 00:45:58,440
of the gradient of the log density, so that you can still control convergence even if you're dealing with a heavy-tailed distribution.
458
00:45:58,440 --> 00:46:06,880
So that might be one solution, but I think there's still lots of open questions there.
459
00:46:06,880 --> 00:46:15,700
Some groups have been using these Stein discrepancy ideas to train generative models like generative adversarial networks and
460
00:46:15,700 --> 00:46:26,400
variational autoencoders, so that you can generate new instances from different classes of a data distribution.
461
00:46:26,400 --> 00:46:32,580
So in this case, it's generating new celebrity faces.
462
00:46:32,580 --> 00:46:40,800
You can also use these ideas to develop algorithms with provable finite-sample guarantees for optimisation.
463
00:46:40,800 --> 00:46:47,490
Even non-convex optimisation. And the basic idea here is that if you want to minimise a particular function
464
00:46:47,490 --> 00:46:53,010
f, you're going to do this by trying to sample from what's called a Gibbs distribution.
465
00:46:53,010 --> 00:47:00,100
You're going to take a target distribution and make it e to the minus f, or e to the minus f times some scalar.
466
00:47:00,100 --> 00:47:02,730
And you ask, can I sample from that distribution?
467
00:47:02,730 --> 00:47:08,220
And if you can sample from that distribution and you can increase the size of that scalar at the right rate,
468
00:47:08,220 --> 00:47:15,410
then you can eventually end up minimising that function. And we found that even for a class of functions
469
00:47:15,410 --> 00:47:19,610
that are not convex but satisfy this other property called distant dissipativity,
470
00:47:19,610 --> 00:47:33,330
we can actually design Markov chains that will minimise the function with an explicit rate of convergence.
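[Editor's note: a minimal sketch of the "sample to optimise" idea, assuming a plain unadjusted Langevin update targeting the Gibbs distribution proportional to exp(-gamma * f) and a geometric schedule for the scalar gamma. The designed chains described in the talk use a modified diffusion with explicit guarantees, which this simple sketch does not reproduce; step size and schedule here are arbitrary choices.]

    import numpy as np

    def annealed_langevin_minimise(grad_f, x0, n_iters=2000, step=1e-2,
                                   gamma0=1.0, growth=1.005, seed=0):
        # Minimise f by approximately sampling from exp(-gamma * f)
        # while slowly increasing gamma (sharpening the Gibbs distribution).
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        gamma = gamma0
        for _ in range(n_iters):
            # Unadjusted Langevin step: grad log pi(x) = -gamma * grad_f(x)
            x = x - step * gamma * grad_f(x) \
                + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
            gamma *= growth  # anneal; in practice the step size is adapted too
        return x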
471
00:47:33,330 --> 00:47:39,100
And here's an example of one very simple case of such a function, like here's a function that is not convex,
472
00:47:39,100 --> 00:47:45,250
but that's really just because instead of growing convexly in the tails, it actually has concave growth in the tails.
473
00:47:45,250 --> 00:47:50,410
And so if you use an algorithm like gradient descent, it has very little signal about where the optimum is.
474
00:47:50,410 --> 00:47:55,570
Even though it seems clear from this picture that the optimum is just at zero. Eventually you'll get there.
475
00:47:55,570 --> 00:48:00,910
But if you do gradient descent, you're going to proceed actually very slowly. And this black line is showing that:
476
00:48:00,910 --> 00:48:05,080
it takes gradient descent about 10,000 iterations to get close to the optimum here.
477
00:48:05,080 --> 00:48:08,290
If you use the Langevin algorithm to try to sample from the Gibbs
478
00:48:08,290 --> 00:48:12,460
distribution, it will actually go off in the wrong direction because it has
479
00:48:12,460 --> 00:48:17,380
poor mixing for distributions like this, for distributions with heavy tails.
480
00:48:17,380 --> 00:48:25,030
But if you take our designed diffusion, which is based on changing the gradient of the log density,
481
00:48:25,030 --> 00:48:35,440
scaling up the gradient of the log density in a particular way, then you converge to the optimum in something like fifteen iterations.
482
00:48:35,440 --> 00:48:40,700
OK, and there are a number of other tasks which people have been exploring using these tools:
483
00:48:40,700 --> 00:48:44,630
parameter estimation in models where maximum likelihood is intractable.
484
00:48:44,630 --> 00:48:49,730
Here's a reference for that. Thinning the output of your Markov chain to get a
485
00:48:49,730 --> 00:48:53,600
more compressed representation of a posterior distribution; here's a reference for that,
486
00:48:53,600 --> 00:49:00,210
too. And I'll leave you with a slide of future directions and open questions so that we have something to talk about.
487
00:49:00,210 --> 00:49:02,537
So thank you for your time.