1
00:00:00,940 --> 00:00:09,220
OK, so welcome everybody to this first virtual edition of the CSIRO talks for this term. I'm very happy to have Radford Neal,
2
00:00:09,220 --> 00:00:19,270
who was my advisor at U of T, presenting remotely, and he's going to talk about some of the recent MCMC work
3
00:00:19,270 --> 00:00:28,950
he has been doing. So again, thank you, Radford, for being able to join us virtually, and I'll hand over to you.
4
00:00:28,950 --> 00:00:34,650
I'd like to talk about a new idea for doing MCMC — in particular, for
5
00:00:34,650 --> 00:00:45,060
the acceptance of proposals in MCMC — and in particular its application to Langevin methods, for which it seems to be of most potential use.
6
00:00:45,060 --> 00:00:56,490
And how this new MCMC method might help with sampling from the posterior distribution in hierarchical Bayesian models.
7
00:00:56,490 --> 00:01:01,030
So, let's see here.
8
00:01:01,030 --> 00:01:04,630
OK, let's see whether the slides are OK here.
9
00:01:04,630 --> 00:01:11,920
OK, now they're working. So I'll start off with some of the motivation here, assuming you know a bit about MCMC. But if you don't,
10
00:01:11,920 --> 00:01:21,490
I'll do a quick review of MCMC after that, and then get to the new MCMC methods.
11
00:01:21,490 --> 00:01:32,450
And then briefly discuss how they actually get applied to the motivating models that I discussed at the beginning.
12
00:01:32,450 --> 00:01:40,790
So hierarchical Bayesian models arise because, for realistic problems,
13
00:01:40,790 --> 00:01:49,100
it's usually the case that you don't want to try to specify the prior directly in terms of the actual model parameters,
14
00:01:49,100 --> 00:01:54,560
but instead it's more convenient to specify it hierarchically, with higher-level hyperparameters
15
00:01:54,560 --> 00:01:59,900
in addition to the lower-level parameters that actually control the distribution of the data.
16
00:01:59,900 --> 00:02:12,170
So, for example, for a simple multivariate regression problem with K covariates, we'll have a model for the response in terms of the covariates,
17
00:02:12,170 --> 00:02:20,180
the covariate vector x_i here — and we can phrase that as a dot product with regression coefficients and an intercept,
18
00:02:20,180 --> 00:02:24,440
and we have a standard deviation of the residuals sigma.
19
00:02:24,440 --> 00:02:33,650
Those are model parameters, and we could try to directly give priors to alpha, the betas, and sigma.
20
00:02:33,650 --> 00:02:40,580
But it's more convenient, when we try to express realistic prior information,
21
00:02:40,580 --> 00:02:44,570
that we instead say that the betas, given a higher-level hyperparameter
22
00:02:44,570 --> 00:02:55,100
tau, are independent, with normal distributions with variance tau squared, and that tau itself is given a prior distribution.
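For concreteness, the hierarchical regression just described can be sketched as an unnormalised log posterior in plain Python. This is my own illustrative code, not from the talk; to keep the sketch short, the priors on alpha, sigma and tau are simply left flat here.

```python
import math

def log_posterior(alpha, beta, sigma, tau, x, y):
    """Unnormalised log posterior for the hierarchical regression sketch:
    y_i ~ N(alpha + beta . x_i, sigma^2), with beta_j ~ N(0, tau^2) given
    the hyperparameter tau.  Priors on alpha, sigma, tau are placeholder
    (flat) priors for illustration."""
    if sigma <= 0 or tau <= 0:
        return -math.inf
    lp = 0.0
    # Likelihood of each observation.
    for xi, yi in zip(x, y):
        mean = alpha + sum(b * v for b, v in zip(beta, xi))
        lp += -0.5 * ((yi - mean) / sigma) ** 2 - math.log(sigma)
    # Hierarchical prior: each beta_j is N(0, tau^2) given tau.
    for b in beta:
        lp += -0.5 * (b / tau) ** 2 - math.log(tau)
    return lp
```

The point of the hierarchy is visible in the last loop: the prior scale of the betas is not fixed but comes from tau, which itself gets a prior.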
23
00:02:55,100 --> 00:03:00,980
The reason this is better is that if we tried to do a direct prior, in terms of saying that
24
00:03:00,980 --> 00:03:07,160
the beta_j's are all independent with normal distributions, mean zero and standard deviation seven,
25
00:03:07,160 --> 00:03:11,090
in practice, we usually don't actually know that.
26
00:03:11,090 --> 00:03:15,410
We don't know that the standard deviation of the betas should be seven. It could be,
27
00:03:15,410 --> 00:03:22,940
before we look at the data, that maybe the covariates aren't actually very useful for predicting y,
28
00:03:22,940 --> 00:03:27,990
so all the betas should be close to zero. But we probably hope that's not the case, if we're doing this at all.
29
00:03:27,990 --> 00:03:34,370
So we think maybe the betas should be bigger, but we're not sure how big; we don't know how predictable y is.
30
00:03:34,370 --> 00:03:39,110
And so we can't plug in any specific standard deviation for the betas.
31
00:03:39,110 --> 00:03:48,800
Instead, we express this in terms of the betas having a standard deviation tau, and then we give some prior distribution to tau.
32
00:03:48,800 --> 00:03:55,370
Of course, we could integrate over tau and come up with some direct distribution for the betas.
33
00:03:55,370 --> 00:04:01,070
But in complex hierarchical models, doing that integration might be difficult,
34
00:04:01,070 --> 00:04:05,300
and in any case it produces a direct prior that's very hard to understand.
35
00:04:05,300 --> 00:04:13,400
And so it's preferable to do things this way.
36
00:04:13,400 --> 00:04:21,720
Right. Now, that's a very simple hierarchical model. For more complex models, we might have more than one group of related parameters,
37
00:04:21,720 --> 00:04:29,220
each of those groups having an unknown variance that we then give a prior to via a higher-level hyperparameter.
38
00:04:29,220 --> 00:04:40,290
So in neural network models in particular, we have perhaps a number of different hidden layers that connect to the inputs and to the outputs.
39
00:04:40,290 --> 00:04:47,340
And so there are groups of weights that are, say, connecting the inputs to this hidden layer, or this hidden layer to that hidden layer.
40
00:04:47,340 --> 00:04:55,650
And each of those groups might be given some prior distribution whose variance depends on a higher-level hyperparameter.
41
00:04:55,650 --> 00:05:05,550
And these higher-level hyperparameters can represent higher-level information about what the problem is like,
42
00:05:05,550 --> 00:05:12,270
such as which covariates are actually relevant, or whether the model should
43
00:05:12,270 --> 00:05:18,300
be an additive model or one that allows for interactions and things like that.
44
00:05:18,300 --> 00:05:25,650
You also get what is mathematically the same setup when you have models with latent variables or random effects,
45
00:05:25,650 --> 00:05:29,430
whatever you want to call them, because in a Bayesian context,
46
00:05:29,430 --> 00:05:38,250
this is basically the same sort of thing as the hierarchical priors for what one might instead call parameters rather than latent variables.
47
00:05:38,250 --> 00:05:44,220
So here, for instance, is a one-way random effects model, in which we have some random effects u_i,
48
00:05:44,220 --> 00:05:56,940
whose distribution is controlled by a higher-level parameter tau, which in turn is given some prior distribution.
49
00:05:56,940 --> 00:06:07,620
Now, if we're doing Bayesian inference for this, we usually end up doing MCMC, and that's OK for simple models.
50
00:06:07,620 --> 00:06:12,090
In fact, Gibbs sampling works OK for such simple hierarchical models.
51
00:06:12,090 --> 00:06:24,480
These were amongst the early examples in statisticians' use of MCMC — in the Gelfand and Smith paper, for instance.
52
00:06:24,480 --> 00:06:31,380
For more complex models, though, we may have to do something more general, because we don't know how to do Gibbs sampling, or it's very slow.
53
00:06:31,380 --> 00:06:37,380
We might use some general random walk Metropolis method, or Hamiltonian Monte Carlo,
54
00:06:37,380 --> 00:06:44,730
but there's a problem with this: if the posterior distribution for a variance hyperparameter is broad,
55
00:06:44,730 --> 00:06:49,050
meaning we don't have so much data that it gets pinned down to exactly what it should be.
56
00:06:49,050 --> 00:06:54,780
Then the Markov chain we use for sampling from the posterior distribution needs to
57
00:06:54,780 --> 00:06:59,940
sample both regions where there are tight constraints on the low-level parameters —
58
00:06:59,940 --> 00:07:06,900
the beta_j, for instance, in that simple regression model; we'll have tight constraints on them when tau is small,
59
00:07:06,900 --> 00:07:14,520
because then the prior will constrain them to be close to zero. But if the distribution for tau is broad,
60
00:07:14,520 --> 00:07:22,350
then we'll also have high-probability regions where tau is large and the beta_j's are much less constrained.
61
00:07:22,350 --> 00:07:28,530
Now, when we try to do MCMC for this with random walk Metropolis or with HMC,
62
00:07:28,530 --> 00:07:34,560
the tuning parameters for these methods that work best are different in these different regions —
63
00:07:34,560 --> 00:07:41,160
either the proposal width for random walk Metropolis, or the step size for HMC —
64
00:07:41,160 --> 00:07:47,730
and if we then use a proposal that's only suitable for the less constrained region,
65
00:07:47,730 --> 00:07:55,830
we can end up with a Markov chain that almost never visits the more constrained region.
66
00:07:55,830 --> 00:08:02,130
And on the other hand, if we tune for the more constrained region, it might be extremely slow in the less constrained region.
67
00:08:02,130 --> 00:08:09,120
So you can see the sometimes horrible consequences of this in my slice sampling paper,
68
00:08:09,120 --> 00:08:21,070
where there's a funnel example — a funnel distribution that is a model of this sort of situation.
69
00:08:21,070 --> 00:08:26,260
So there are various possible solutions to this that we might try.
70
00:08:26,260 --> 00:08:28,000
One is that we could just say, well,
71
00:08:28,000 --> 00:08:36,760
we need different step sizes or proposal widths in different regions, so we can just randomly pick a step size at each step,
72
00:08:36,760 --> 00:08:43,000
from a distribution broad enough that we sometimes get ones appropriate for each region.
73
00:08:43,000 --> 00:08:50,200
The problem with this is that we then end up wasting a lot of time on unsuitable values for the tuning parameters.
74
00:08:50,200 --> 00:08:54,610
Once in a while, we get
75
00:08:54,610 --> 00:09:00,730
tuning parameters that are suitable for the region we're in at the time, but we may waste lots of time.
76
00:09:00,730 --> 00:09:07,360
Now, there might be a way to get around this, discussed in one of my papers, on what I call the short-cut method, which
77
00:09:07,360 --> 00:09:13,390
tries to cut short the computation time you need for the unsuitable values.
78
00:09:13,390 --> 00:09:17,230
But that's not the topic of this talk.
79
00:09:17,230 --> 00:09:26,340
We might decide that instead we should reparameterize the model, saying, for instance, that the beta_j regression coefficients are equal to tau times
80
00:09:26,340 --> 00:09:36,790
zeta_j, and zeta_j is just given a normal(0, 1) distribution, so that we end up with beta_j having standard deviation tau.
81
00:09:36,790 --> 00:09:43,330
And now the prior constraint on zeta_j doesn't vary, so we don't have the problem that,
82
00:09:43,330 --> 00:09:48,010
when tau is small, the zeta_j's are more tightly confined.
83
00:09:48,010 --> 00:09:53,740
However, we have the opposite problem: now, depending on what tau is,
84
00:09:53,740 --> 00:10:01,540
the way that the data constrains zeta_j is going to vary, which might cause similar sorts of problems.
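The two parameterisations can be sketched side by side; this is my own illustration (the function names are mine, not from the talk). Both produce exactly the same prior for beta_j — what differs is which variable the sampler actually works on, and hence whose prior scale depends on tau.

```python
import random

def sample_beta_centered(tau, rng=random):
    # Centred parameterisation: the sampler sees beta_j ~ N(0, tau^2),
    # so the prior scale of the sampled variable changes with tau.
    return rng.gauss(0.0, tau)

def sample_beta_noncentered(tau, rng=random):
    # Reparameterised form: beta_j = tau * zeta_j with zeta_j ~ N(0, 1),
    # so the sampler sees zeta_j, whose prior scale is fixed at 1.
    zeta = rng.gauss(0.0, 1.0)
    return tau * zeta
```

As the talk notes, the reparameterisation moves the tau-dependence from the prior into the likelihood: the data now constrain zeta_j differently depending on tau.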
85
00:10:01,540 --> 00:10:09,610
So another possibility is that we just alternate iterations that update only the low-level parameters,
86
00:10:09,610 --> 00:10:18,480
like the beta_j's, with iterations that update only the high-level hyperparameters, like tau.
87
00:10:18,480 --> 00:10:22,440
If we do that, then when we do the low-level parameter updates,
88
00:10:22,440 --> 00:10:30,600
it's valid to choose the proposal width or the HMC step size based on the current values of
89
00:10:30,600 --> 00:10:35,640
the higher-level hyperparameters, because they're not changing during those updates.
90
00:10:35,640 --> 00:10:40,140
That doesn't undermine the correctness of the methods.
91
00:10:40,140 --> 00:10:41,970
Of course, you have to have some way of doing that,
92
00:10:41,970 --> 00:10:52,620
some scheme for how you change the tuning for the lower-level parameters based on the current values of the higher-level ones — but that sometimes may be doable.
93
00:10:52,620 --> 00:11:02,730
Also, the higher-level updates may be pretty simple. Perhaps a disadvantage, though, is if the lower-level updates are lengthy.
94
00:11:02,730 --> 00:11:09,570
So, for instance, HMC typically spends a lot of computation time computing a long trajectory.
95
00:11:09,570 --> 00:11:21,720
And that means that we do the higher-level updates only infrequently, relative to the amount of movement at the lower level that that complex
96
00:11:21,720 --> 00:11:31,500
update does, and that may be disadvantageous, in that we don't get as frequent interaction between the low-level and high-level parameters,
97
00:11:31,500 --> 00:11:39,570
which is necessary in order to fully explore the posterior distribution.
98
00:11:39,570 --> 00:11:46,050
So that problem is one motivation for this work. Another motivation is handling discrete variables,
99
00:11:46,050 --> 00:11:55,380
which arise in many statistical models — like mixture models, which have component indicators saying which component each data item came from,
100
00:11:55,380 --> 00:11:59,910
or if you're handling missing categorical data and so forth.
101
00:11:59,910 --> 00:12:05,490
Now, these discrete variables aren't a particular problem if you're doing Gibbs sampling or random walk Metropolis:
102
00:12:05,490 --> 00:12:10,890
you just have a proposal distribution that also proposes new values for these discrete parameters,
103
00:12:10,890 --> 00:12:13,500
or that samples from their conditional distribution.
104
00:12:13,500 --> 00:12:22,320
But they can't be directly handled in gradient-based methods such as HMC, simply because they're discrete and don't have derivatives.
105
00:12:22,320 --> 00:12:31,860
So one approach would again be to alternate HMC updates for the continuous variables with other updates that are used only for the discrete variables.
106
00:12:31,860 --> 00:12:39,960
But again, we have the issue that if the HMC updates are complex updates taking lots of computation time,
107
00:12:39,960 --> 00:12:49,110
then we're doing a lot of work at the low level — for the continuous variables, in this case —
108
00:12:49,110 --> 00:12:58,560
before we change the discrete variables. And assuming there are dependencies between these, doing a lot of work
109
00:12:58,560 --> 00:13:04,050
on continuous variables without changing the discrete ones is maybe not too useful because
110
00:13:04,050 --> 00:13:09,450
you really have to look at changing both sets of variables in order to move around the space.
111
00:13:09,450 --> 00:13:21,050
So the question is: can we find some method with the advantages of HMC that avoids this problem?
112
00:13:21,050 --> 00:13:29,330
So before getting to that question, I'll do a bit of background on MCMC.
113
00:13:29,330 --> 00:13:38,390
So this is going to be pretty quick if you've heard of MCMC before, but just to orient people here:
114
00:13:38,390 --> 00:13:46,850
the MCMC approach is that, first of all, we specify our model, including both the likelihood function and the prior.
115
00:13:46,850 --> 00:13:52,190
And we assume these are computable up to some unknown normalising constant.
116
00:13:52,190 --> 00:13:57,710
So the posterior density is also computable, up to an unknown normalising constant.
117
00:13:57,710 --> 00:14:03,650
We then define a Markov chain that will converge to this posterior distribution from any starting point.
118
00:14:03,650 --> 00:14:09,800
And we run this chain, perhaps from many starting points, for long enough for it to approximately converge, and then
119
00:14:09,800 --> 00:14:15,800
use the points that we get after that to estimate expectations with respect to the posterior.
120
00:14:15,800 --> 00:14:16,340
For instance,
121
00:14:16,340 --> 00:14:28,890
posterior means of the parameters, or predictions for new data points, which involve an integration over the posterior distribution of the parameters.
122
00:14:28,890 --> 00:14:36,120
Now, the basis for doing this is that the Markov chain has to converge to our desired distribution,
123
00:14:36,120 --> 00:14:42,570
and for that to be true, it's necessary that it leaves the desired distribution invariant.
124
00:14:42,570 --> 00:14:51,690
So if we want to sample from a distribution with density pi, we need to have it that the integral of pi(x) times
125
00:14:51,690 --> 00:14:59,700
the transition probability for the Markov chain to go from x to x' is equal to pi(x').
126
00:14:59,700 --> 00:15:06,030
So that basically says if ever we actually got to the right distribution, then after we did another transition,
127
00:15:06,030 --> 00:15:13,920
we'd still be in the right distribution, which is a requirement if we're going to ever converge to the right distribution and stay there.
128
00:15:13,920 --> 00:15:24,060
This isn't a sufficient condition because we have to also make sure we don't get trapped in some subset of the state space.
129
00:15:24,060 --> 00:15:33,990
It's typical, for many MCMC methods, that invariance is proved by first proving reversibility of the transitions.
130
00:15:33,990 --> 00:15:45,600
So reversibility for transitions of a Markov chain is that the probability of starting in a state x, if you start according to the distribution
131
00:15:45,600 --> 00:15:55,680
pi, and then moving to x' from x, is the same as the probability of starting in x', if you start from pi, and then moving from x' to x.
132
00:15:55,680 --> 00:16:07,380
So basically, this says that once you reach convergence, the transitions from x to x' happen as often as those from x' to x.
133
00:16:07,380 --> 00:16:15,390
And if this is true, then invariance will also be true, which is easily shown by just integrating both sides.
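In symbols (writing T(x' | x) for the transition probability density — my notation, not from the slides), the reversibility condition and the integrate-both-sides step just described are:

```latex
% Reversibility (detailed balance) with respect to the target density \pi:
\pi(x)\,T(x' \mid x) \;=\; \pi(x')\,T(x \mid x') \qquad \text{for all } x,\, x'.
% Integrating both sides over x gives invariance, since T(\cdot \mid x') integrates to 1:
\int \pi(x)\,T(x' \mid x)\,dx \;=\; \pi(x') \int T(x \mid x')\,dx \;=\; \pi(x').
```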
134
00:16:15,390 --> 00:16:21,930
So this is one way of proving invariance, but it's not a necessary condition for invariance.
135
00:16:21,930 --> 00:16:27,960
There are non-reversible Markov chains, for which this isn't true, that nevertheless leave the
136
00:16:27,960 --> 00:16:37,690
distribution pi invariant, and these non-reversible methods can often be very useful.
137
00:16:37,690 --> 00:16:37,960
Now,
138
00:16:37,960 --> 00:16:49,090
the oldest MCMC method is what's called the Metropolis algorithm, and it's a very general way of defining a transition that's reversible with respect to pi,
139
00:16:49,090 --> 00:16:54,100
involving first proposing a state x* based on the current state
140
00:16:54,100 --> 00:17:02,410
we're in, x, and then either accepting or rejecting that proposal based on the ratio of densities for x* and the current state.
141
00:17:02,410 --> 00:17:06,160
If we reject, the new state is the same as the old state.
142
00:17:06,160 --> 00:17:10,270
Otherwise, the new state is the proposed x*.
143
00:17:10,270 --> 00:17:17,710
The proposals are done according to some distribution for x* given x, and this has to be symmetrical.
144
00:17:17,710 --> 00:17:26,560
And the acceptance probability for this proposal is the minimum of one and the ratio of pi(x*) over pi(x).
145
00:17:26,560 --> 00:17:34,480
Now, it's straightforward to show that this transition is reversible, and therefore it leaves pi invariant.
146
00:17:34,480 --> 00:17:40,180
And since all we need is this ratio, we don't need the normalising constant for pi.
147
00:17:40,180 --> 00:17:51,650
And so this is a very widely applicable method, both for Bayesian inference and for statistical physics, for that matter.
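As a minimal sketch of the algorithm just described (my own code, not from the talk; it assumes a symmetric Gaussian proposal and works with log densities for numerical stability):

```python
import math
import random

def metropolis_step(x, log_pi, proposal_sd, rng=random):
    """One random-walk Metropolis update targeting the density pi,
    supplied via log_pi, which may omit the normalising constant."""
    # Symmetric proposal: add independent Gaussian noise to each coordinate.
    x_star = [xi + rng.gauss(0.0, proposal_sd) for xi in x]
    # Accept with probability min(1, pi(x*) / pi(x)).
    if math.log(rng.random()) < log_pi(x_star) - log_pi(x):
        return x_star   # accepted: move to the proposal
    return x            # rejected: stay at the current state
```

Calling this in a loop gives one Markov chain transition per call — for example, targeting a standard normal with `log_pi = lambda x: -0.5 * x[0] ** 2`.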
148
00:17:51,650 --> 00:18:00,410
So here's an illustration of the Metropolis method applied to a bivariate Gaussian distribution with high correlation, 0.9,
149
00:18:00,410 --> 00:18:10,890
and with the proposal distribution being a normal distribution with mean equal to the current state and with some standard deviation — here,
150
00:18:10,890 --> 00:18:18,860
0.3 — with the proposals being independent for the two coordinates.
151
00:18:18,860 --> 00:18:24,890
You'd do much better if you made proposals that actually obeyed the high correlation.
152
00:18:24,890 --> 00:18:35,360
This example, though, is meant to illustrate more practical problems, in which there may well be these high correlations,
153
00:18:35,360 --> 00:18:37,230
but we don't know what they are to start with,
154
00:18:37,230 --> 00:18:45,250
so we assume we can't carefully tune our proposal distribution to do best with that particular correlation.
155
00:18:45,250 --> 00:18:51,160
So the green points here are an i.i.d. sample from pi, just so you can see what the distribution is like.
156
00:18:51,160 --> 00:18:58,420
And the black points show 250 transitions of the Markov chain, starting from over here.
157
00:18:58,420 --> 00:19:03,310
And we see that initially the starting point has very low probability under this distribution,
158
00:19:03,310 --> 00:19:10,270
and we progress reasonably systematically towards a high-probability region.
159
00:19:10,270 --> 00:19:19,460
But after we get there, we move around by a random walk, which is a rather inefficient way of exploring this distribution.
160
00:19:19,460 --> 00:19:28,700
The red lines show the rejected proposals, where we stayed at the current point rather than moving.
161
00:19:28,700 --> 00:19:32,690
So here's an extension of this example,
162
00:19:32,690 --> 00:19:41,720
where we actually have 20 copies of this bivariate normal distribution, with the copies being independent of each other,
163
00:19:41,720 --> 00:19:46,580
and we'll again assume that we don't know anything about this distribution, so our proposals
164
00:19:46,580 --> 00:19:51,710
are just from a spherical Gaussian, now with a standard deviation of 0.07.
165
00:19:51,710 --> 00:20:01,340
And we now see that things are much, much slower because we had to use a standard deviation of 0.07 rather than 0.3,
166
00:20:01,340 --> 00:20:09,380
because we're proposing changes to all of those dimensions at the same time.
167
00:20:09,380 --> 00:20:17,390
And to get a good acceptance rate while changing all those variables at once, we have to make smaller changes.
168
00:20:17,390 --> 00:20:21,170
And so we see we now take smaller steps to get towards the distribution,
169
00:20:21,170 --> 00:20:26,660
and we still proceed systematically pretty much towards the high probability regions.
170
00:20:26,660 --> 00:20:32,450
But once we get here, we are again doing a random walk, but now with a much smaller step size.
171
00:20:32,450 --> 00:20:45,120
So this now actually explores the distribution 18 times more slowly than when we had only one copy of the bivariate normal.
172
00:20:45,120 --> 00:20:52,860
So this is an illustration of the slowness of exploring things with random walks,
173
00:20:52,860 --> 00:20:57,960
because the random walks have no tendency to keep going in the same direction.
174
00:20:57,960 --> 00:21:06,630
And as a simple example, suppose we did a random walk in which x_{t+1} was just x_t plus some
175
00:21:06,630 --> 00:21:14,190
draw from a standard normal distribution, independently for each time step.
176
00:21:14,190 --> 00:21:26,850
Then you can easily add up the variances of these increments and see that x_{t+k} is likely to be only about the square root of k away from x_t.
177
00:21:26,850 --> 00:21:33,790
The variance of the difference is k, from adding up those k independent normals.
178
00:21:33,790 --> 00:21:39,420
But that means the standard deviation is only root k, and we're only likely to be about that far away.
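This square-root behaviour is easy to check numerically (my own illustration): for standard normal increments, the displacement x_{t+k} − x_t is N(0, k), so the typical distance grows like √k, not k.

```python
import random

def mean_abs_displacement(k, n_walks=4000, rng=random):
    """Average |x_{t+k} - x_t| over many random walks with N(0, 1)
    increments.  Theory: the displacement is N(0, k), so its mean
    absolute value is sqrt(2 * k / pi), i.e. on the order of sqrt(k)."""
    total = 0.0
    for _ in range(n_walks):
        pos = 0.0
        for _ in range(k):
            pos += rng.gauss(0.0, 1.0)   # one random-walk step
        total += abs(pos)
    return total / n_walks
```

After 100 steps, for instance, the walk has typically moved only about 10 units, despite a total path length of roughly 80.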
179
00:21:39,420 --> 00:21:51,210
So we get enormously faster exploration if we can somehow not do this random walk.
180
00:21:51,210 --> 00:22:00,540
Well, there are actually two approaches here. We could try to make it so that the changes we make are big rather than small. If
181
00:22:00,540 --> 00:22:05,040
they're so big that they sort of move you from one side of the distribution to the other,
182
00:22:05,040 --> 00:22:16,680
then we don't care about the random walk anymore. Or, if we still have to make small changes, we want to have them not do a random walk.
183
00:22:16,680 --> 00:22:22,530
And that will imply that the transitions have to be non-reversible, because if the transitions are reversible,
184
00:22:22,530 --> 00:22:29,010
the chance of going from x_t to x_{t+1} has to be the same as the chance of going in the other direction.
185
00:22:29,010 --> 00:22:37,450
So it's inevitable that there's a random walk aspect to things.
186
00:22:37,450 --> 00:22:43,690
Now, just a technical point here: if we have several Markov chain transitions,
187
00:22:43,690 --> 00:22:54,760
all of which leave the distribution pi invariant, then the operation of just applying them in sequence will also leave pi invariant.
188
00:22:54,760 --> 00:23:00,640
Hopefully that's fairly intuitive — that if each of these leaves pi invariant, the combination does too.
189
00:23:00,640 --> 00:23:07,390
But it's not the case that if they're all reversible, then the combination will also be reversible.
190
00:23:07,390 --> 00:23:11,800
That's not generally true, but it's OK that they're not reversible.
191
00:23:11,800 --> 00:23:22,870
That's not a requirement for MCMC. A simple example is the frequently used Gibbs sampling procedure, where we have a
192
00:23:22,870 --> 00:23:28,930
multivariate state and we update each component of it in turn by sampling from its conditional distribution.
193
00:23:28,930 --> 00:23:39,310
And if we're doing it systematically over the K components, then that's a sequential application of K transitions.
194
00:23:39,310 --> 00:23:44,260
And although it leaves the right distribution invariant, it's not reversible.
195
00:23:44,260 --> 00:23:50,980
This non-reversibility for Gibbs sampling seems to have no particularly amazing consequences.
196
00:23:50,980 --> 00:24:02,850
But in other situations, arranging for your transitions to be non reversible can be extremely advantageous.
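A concrete instance of such a systematic scan (my own sketch, not code from the talk): Gibbs sampling for a bivariate normal with correlation rho, zero means and unit variances, where each coordinate's conditional distribution is available in closed form.

```python
import math
import random

def gibbs_sweep(x, rho, rng=random):
    """One systematic-scan Gibbs sweep for a bivariate normal with zero
    means, unit variances and correlation rho.  Each coordinate is drawn
    from its exact conditional: x1 | x2 ~ N(rho * x2, 1 - rho^2), and
    symmetrically for x2.  The sweep leaves the target invariant, but as
    a composition of the two coordinate updates it is not reversible."""
    cond_sd = math.sqrt(1.0 - rho * rho)
    x = list(x)
    x[0] = rng.gauss(rho * x[1], cond_sd)  # update x1 given x2
    x[1] = rng.gauss(rho * x[0], cond_sd)  # update x2 given the new x1
    return x
```

Iterating `gibbs_sweep` gives a chain whose draws converge to the bivariate normal target.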
197
00:24:02,850 --> 00:24:06,030
So I'll now get on to Hamiltonian Monte Carlo,
198
00:24:06,030 --> 00:24:14,970
which is a very popular method of sampling from posterior distributions, now used, for instance, in the Stan package.
199
00:24:14,970 --> 00:24:21,180
The reason it's popular is that it's much more efficient than simple Metropolis methods where,
200
00:24:21,180 --> 00:24:29,220
for instance, you propose a new state by just adding random noise to the current state.
201
00:24:29,220 --> 00:24:36,690
The way HMC works is we augment the variable of interest, which I'll call x, with a momentum
202
00:24:36,690 --> 00:24:42,240
variable of the same dimensionality that has a Gaussian distribution independent of x,
203
00:24:42,240 --> 00:24:50,400
and the transition then consists of two parts. We first sample a value for the momentum from its distribution,
204
00:24:50,400 --> 00:24:59,340
which might be something very simple, like a normal with identity covariance matrix.
205
00:24:59,340 --> 00:25:06,210
And then we do a metropolis update in which the proposal is found by simulating Hamiltonian dynamics.
206
00:25:06,210 --> 00:25:17,160
That takes us from our current point (x, p) to some proposed (x*, p*), and then we accept or reject it in the usual Metropolis fashion.
207
00:25:17,160 --> 00:25:26,730
So this is also a combination of two steps, each of which leaves the distribution invariant and in fact, each of them is reversible.
208
00:25:26,730 --> 00:25:34,080
And so the sequential application of the two also leaves the distribution invariant.
209
00:25:34,080 --> 00:25:38,220
Now, if the Hamiltonian dynamic simulation were exact,
210
00:25:38,220 --> 00:25:45,060
we'd always accept, because that's a consequence of the properties of Hamiltonian dynamics.
211
00:25:45,060 --> 00:25:54,150
But in practice, we can't do the simulation exactly. We instead approximate the dynamics by doing what are called leapfrog steps,
212
00:25:54,150 --> 00:26:03,150
each of which moves us forward in simulated time here by an amount eta.
213
00:26:03,150 --> 00:26:13,500
And so we manage to go for a total time tau by doing L steps, each of which goes for a time eta.
214
00:26:13,500 --> 00:26:18,990
So this is a discretisation of the dynamics that isn't exact.
215
00:26:18,990 --> 00:26:29,990
And because of that, rejection is possible with the rejection probability being higher if we take bigger steps.
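A leapfrog step in code (my own sketch, assuming unit masses, a target with log density log pi, and the talk's notation of eta for the step size and L for the number of steps):

```python
def leapfrog(x, p, grad_log_pi, eta, L):
    """Approximate Hamiltonian dynamics with L leapfrog steps of size eta,
    starting from position x and momentum p (unit mass matrix assumed).
    The potential energy is U(x) = -log pi(x), so the force is grad log pi."""
    x, p = list(x), list(p)
    g = grad_log_pi(x)
    for _ in range(L):
        # Half step for the momentum, full step for the position,
        # then the second half step for the momentum.
        p = [pi + 0.5 * eta * gi for pi, gi in zip(p, g)]
        x = [xi + eta * pi for xi, pi in zip(x, p)]
        g = grad_log_pi(x)
        p = [pi + 0.5 * eta * gi for pi, gi in zip(p, g)]
    return x, p
```

Because this discretisation is only approximate, the Hamiltonian is not exactly conserved, which is where the occasional rejections come from.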
216
00:26:29,990 --> 00:26:42,290
So here is the bivariate Gaussian again. We can see that now we're doing 10 leapfrog steps in the simulation, with a step size of 0.16,
217
00:26:42,290 --> 00:26:47,000
and that gets us directly from our starting point to a point down here.
218
00:26:47,000 --> 00:26:53,780
And then we move around in big steps after that with occasional rejections that you can see with the red lines.
219
00:26:53,780 --> 00:26:58,400
But when we don't reject, we move quite a distance along the distribution.
220
00:26:58,400 --> 00:27:08,000
And therefore, although HMC is actually reversible, it doesn't particularly matter that it's reversible, because it's taking big steps.
221
00:27:08,000 --> 00:27:14,510
And so it doesn't matter that we are then equally likely to go the other direction, because
222
00:27:14,510 --> 00:27:23,340
we don't have to put together a lot of small steps in order to get any appreciable distance.
223
00:27:23,340 --> 00:27:32,430
So here are some trajectories from HMC. They start off with the red trajectory here, which was accepted, and then so on
224
00:27:32,430 --> 00:27:39,060
for the other colours. This blue one was rejected, and so we went back here for the next one.
225
00:27:39,060 --> 00:27:47,160
The dynamics confines these trajectories to the high-probability region, and also keeps them going
226
00:27:47,160 --> 00:27:52,390
in the same direction until it's necessary to turn around when you get to the end of the distribution.
227
00:27:52,390 --> 00:28:03,200
So that momentum keeping you going in the same direction is why there isn't a random walk within a trajectory.
228
00:28:03,200 --> 00:28:12,710
And here we have HMC for the replicated bivariate Gaussian, with 10 copies of the bivariate Gaussian.
229
00:28:12,710 --> 00:28:18,980
We have to reduce the step size in order to get a reasonable rejection rate in higher dimensions.
230
00:28:18,980 --> 00:28:27,110
And so there is some sort of penalty with dimensionality, but the penalty is much less than with the Metropolis methods.
231
00:28:27,110 --> 00:28:33,440
The penalty basically grows as the five-fourths power of the dimensionality,
232
00:28:33,440 --> 00:28:42,710
whereas it grows linearly with the dimensionality for simple metropolis methods.
233
00:28:42,710 --> 00:28:50,540
And now I'll get to Langevin Monte Carlo, which is the method that I'll actually be improving here.
234
00:28:50,540 --> 00:28:58,980
So Langevin Monte Carlo, although it wasn't originally thought of this way, is actually equivalent to doing HMC with a single leapfrog step.
235
00:28:58,980 --> 00:29:09,350
So it's just a special case. Going through this special case in detail here: we once again sample a momentum variable.
236
00:29:09,350 --> 00:29:12,560
And then we compute a proposal with a single leapfrog step,
237
00:29:12,560 --> 00:29:18,590
which I have written out this way involving the gradient of the log of the probability density.
238
00:29:18,590 --> 00:29:31,730
And then we accept or reject that proposal, and we can actually write out this proposal x star, sort of going
239
00:29:31,730 --> 00:29:40,160
through what actually happened here, as the current state plus a movement in the direction of the gradient,
240
00:29:40,160 --> 00:29:48,880
plus noise, which is another way that Langevin methods are thought of.
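As a concrete sketch of this single-leapfrog-step view (my own minimal Python rendering, not the speaker's code; `log_pi`, `grad_log_pi`, and `eta` are illustrative names for the log density, its gradient, and the step size):

```python
import numpy as np

def langevin_step(x, grad_log_pi, log_pi, eta, rng):
    # Sample a fresh momentum, independent of the previous one.
    p = rng.standard_normal(x.shape)
    # One leapfrog step: half-step for p, full step for x, half-step for p.
    p_half = p + 0.5 * eta * grad_log_pi(x)
    x_star = x + eta * p_half
    p_star = p_half + 0.5 * eta * grad_log_pi(x_star)
    # Metropolis accept/reject on the joint (x, p) density.
    log_ratio = (log_pi(x_star) - 0.5 * np.dot(p_star, p_star)) - (
        log_pi(x) - 0.5 * np.dot(p, p))
    return x_star if np.log(rng.uniform()) < log_ratio else x
```

Expanding the leapfrog step gives x_star = x + (eta^2/2) grad_log_pi(x) + eta * noise, which is the "gradient move plus noise" form mentioned above.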
241
00:29:48,880 --> 00:29:51,690
And here's Langevin for a bivariate Gaussian.
242
00:29:51,690 --> 00:30:01,240
We sort of overshoot: the first step overshoots to the opposite side of the distribution, and also a low-probability region,
243
00:30:01,240 --> 00:30:11,950
but it soon manages to get down into the distribution. Once it's there, it again does a random walk, just like random-walk Metropolis.
244
00:30:11,950 --> 00:30:18,700
There is actually an advantage to Langevin over random-walk Metropolis from using the gradient information,
245
00:30:18,700 --> 00:30:24,240
but it still suffers from this random walk inefficiency.
246
00:30:24,240 --> 00:30:34,230
And here we have it for the replicated bivariate Gaussian, and we have to reduce the step size to get a reasonable acceptance rate.
247
00:30:34,230 --> 00:30:37,260
And so things get worse with dimensionality.
248
00:30:37,260 --> 00:30:46,530
This scaling with dimensionality is better than for simple random-walk Metropolis updates, but it's worse than for HMC.
249
00:30:46,530 --> 00:30:51,600
It scales, I think, as the dimensionality to the power four-thirds,
250
00:30:51,600 --> 00:30:59,880
which is not as good as the dimensionality to the power five-fourths, which is what HMC is like.
251
00:30:59,880 --> 00:31:04,650
OK, so that's the background, culminating in the Langevin method,
252
00:31:04,650 --> 00:31:13,330
and I'll discuss how one can improve the Langevin method by making it non-reversible.
253
00:31:13,330 --> 00:31:18,250
So there are two non-reversible improvements here, actually.
254
00:31:18,250 --> 00:31:26,050
This has made it hard for me to come up with a catchy, short description of the new method.
255
00:31:26,050 --> 00:31:31,690
So the first non-reversible improvement was made by Horowitz back in 1991,
256
00:31:31,690 --> 00:31:40,300
and it modifies the Langevin method by only partially changing the momentum variable each time.
257
00:31:40,300 --> 00:31:44,410
So for the original Langevin method,
258
00:31:44,410 --> 00:31:52,750
you can think of the state as consisting only of X, because you resample P independently of the previous P at the start of each update,
259
00:31:52,750 --> 00:31:56,740
so you don't really have to save it from the last one if you don't want to.
260
00:31:56,740 --> 00:32:06,910
For Horowitz's modification, you do keep P in the state all the time, because you don't completely replace P, but instead only slightly modify it.
261
00:32:06,910 --> 00:32:13,570
We change P by an autoregressive update, making it equal to alpha P plus the square root
262
00:32:13,570 --> 00:32:20,440
of one minus alpha squared times some noise. If alpha is only slightly less than one,
263
00:32:20,440 --> 00:32:24,330
this makes only a slight change to P.
264
00:32:24,330 --> 00:32:33,960
Then we propose a new state by doing a leapfrog step as before, and we accept or reject this proposal in the usual way.
265
00:32:33,960 --> 00:32:42,420
A detail that didn't matter in the original Langevin method but does now is that the proposal not only does a leapfrog step from (X, P),
266
00:32:42,420 --> 00:32:47,310
but after doing that it also negates P. That ensures that the proposal is
267
00:32:47,310 --> 00:32:52,170
symmetrical and that we'd go back where we came from if we were to do it again.
268
00:32:52,170 --> 00:32:59,910
That is crucial to it leaving the right distribution invariant, which we prove by proving reversibility,
269
00:32:59,910 --> 00:33:04,350
which requires this negation. For the original Langevin method, you don't
270
00:33:04,350 --> 00:33:08,490
have to actually put that in your programme, because you're going to forget P anyway.
271
00:33:08,490 --> 00:33:17,130
But for this method, you do have to really negate P there, and then the next step is you just negate P.
272
00:33:17,130 --> 00:33:26,010
So there are three updates making up one of these overall persistent-momentum Langevin updates.
273
00:33:26,010 --> 00:33:34,500
Each of these three updates leaves the correct distribution invariant, so in sequence they also leave the correct distribution invariant.
274
00:33:34,500 --> 00:33:43,350
Each of them is actually also reversible, but the sequential combination is not reversible.
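The three sub-updates just described might be sketched like this (a minimal Python rendering of Horowitz's scheme, with illustrative names, not the speaker's own code):

```python
import numpy as np

def persistent_langevin_step(x, p, grad_log_pi, log_pi, eta, alpha, rng):
    # 1. Partial (autoregressive) momentum refresh.
    p = alpha * p + np.sqrt(1.0 - alpha**2) * rng.standard_normal(p.shape)
    # 2. Leapfrog proposal with the momentum negated at the end,
    #    which makes the proposal symmetric.
    p_half = p + 0.5 * eta * grad_log_pi(x)
    x_star = x + eta * p_half
    p_star = -(p_half + 0.5 * eta * grad_log_pi(x_star))
    log_ratio = (log_pi(x_star) - 0.5 * np.dot(p_star, p_star)) - (
        log_pi(x) - 0.5 * np.dot(p, p))
    if np.log(rng.uniform()) < log_ratio:
        x, p = x_star, p_star   # accepted: p is currently negated
    # 3. Negate p: this undoes the negation on acceptance,
    #    but reverses the momentum on rejection.
    return x, -p
```

On acceptance the two negations cancel, so motion continues in roughly the same direction; on rejection the final negation reverses the momentum, which is exactly the behaviour discussed next.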
275
00:33:43,350 --> 00:33:50,130
And in particular, you can see what's going to happen here when you accept: we slightly change P.
276
00:33:50,130 --> 00:33:54,150
So P is mostly pointing in the same direction it used to.
277
00:33:54,150 --> 00:34:01,380
Then we propose a state by a leapfrog step, which will tend to move X in the direction of P, and suppose we accept it.
278
00:34:01,380 --> 00:34:08,190
Then we accepted this proposal, in which we negated P, and then we negate P again.
279
00:34:08,190 --> 00:34:15,240
So we undid the negation here, and P is now still pointing in the direction it was pointing at the start.
280
00:34:15,240 --> 00:34:21,420
Well, maybe, depending on how much it changed in the leapfrog step,
281
00:34:21,420 --> 00:34:27,750
but often it would still be pointing in the same general direction as it had at the very beginning.
282
00:34:27,750 --> 00:34:36,270
However, if we do a rejection, then we didn't negate P here, because we didn't accept that proposal where it was negated.
283
00:34:36,270 --> 00:34:41,760
Instead, we had just the original P after step one, and so in step three, we negate P.
284
00:34:41,760 --> 00:34:45,420
And so now we're going in the opposite direction from where we were before,
285
00:34:45,420 --> 00:34:51,090
which is not very good, because now we're going backwards and largely going over states
286
00:34:51,090 --> 00:35:00,120
we just came from. That isn't very efficient, so we'd like to avoid that.
287
00:35:00,120 --> 00:35:11,220
So we have to set the step size small enough that rejections are quite infrequent, in order that we don't do that reversal of direction.
288
00:35:11,220 --> 00:35:17,900
So here's what we get when we do that for the bivariate Gaussian.
289
00:35:17,900 --> 00:35:24,630
We move from a low-probability point at the start towards the high-probability region,
290
00:35:24,630 --> 00:35:32,250
a bit slower than we did for the regular Langevin method, because we're only slightly changing the momentum now.
291
00:35:32,250 --> 00:35:41,790
So it takes a bit longer to dissipate the energy we had from being at this high-energy point, in statistical physics terms.
292
00:35:41,790 --> 00:35:48,930
But we do after a while and after that, we explore the distribution in small steps.
293
00:35:48,930 --> 00:35:56,910
But with the random walk mostly suppressed, because we don't keep changing direction by sampling a new P.
294
00:35:56,910 --> 00:36:05,970
Instead, we only slightly change P each time and keep going in the same direction, assuming we don't have a rejection, which results in this reversal.
295
00:36:05,970 --> 00:36:11,250
So you can see these successive states here are going in the same direction, mostly.
296
00:36:11,250 --> 00:36:20,040
Eventually we turn around, of course, but we're doing less of a random walk than we were with the simple Langevin method.
297
00:36:20,040 --> 00:36:28,460
But to achieve that, we do have to make the step size pretty small, so we don't get these rejections that reverse things.
298
00:36:28,460 --> 00:36:38,240
And here it is for the replicated bivariate Gaussian. We have to make the step size even smaller to keep the rejection rate low.
299
00:36:38,240 --> 00:36:42,950
But having done that, we can see that we do proceed in the same direction for quite a few iterations,
300
00:36:42,950 --> 00:36:49,070
if you can sort of see these dots in there.
301
00:36:49,070 --> 00:36:57,860
So by this method we have managed to suppress the random walks of the regular Langevin method.
302
00:36:57,860 --> 00:37:07,510
But there's a cost to that in having to choose a really small step size to reduce the rejection rates to a very small value.
303
00:37:07,510 --> 00:37:14,770
So that's sort of the slightly disappointing aspect of this, that having done all this,
304
00:37:14,770 --> 00:37:20,860
we see that we need a very small step size in order to get the rejection rates small.
305
00:37:20,860 --> 00:37:27,610
So even though we then avoid the random walks, it's still pretty slow compared to Hamiltonian Monte Carlo,
306
00:37:27,610 --> 00:37:34,660
which can use a larger eta, and which doesn't reverse in the middle of its trajectories,
307
00:37:34,660 --> 00:37:38,290
except when it has to turn around because it reaches a low probability region.
308
00:37:38,290 --> 00:37:44,680
And so because of that Hamiltonian Monte Carlo with long trajectory lengths,
309
00:37:44,680 --> 00:37:54,040
a large number of leapfrog steps L, tends to be better than Horowitz's persistent-momentum version of Langevin.
310
00:37:54,040 --> 00:38:02,530
So now we come to the new innovation here, which is to say that, well, as well as introducing non-reversibility by
311
00:38:02,530 --> 00:38:14,080
not completely replacing P in each iteration, we'll also introduce non-reversibility into the acceptance decision, which may seem strange.
312
00:38:14,080 --> 00:38:25,410
So what do I mean by that? To decide on accepting a Metropolis proposal to move from X to X star,
313
00:38:25,410 --> 00:38:32,360
we have to accept with probability that's the minimum of one and the ratio pi of X star over pi of X.
314
00:38:32,360 --> 00:38:40,740
An easy way to do that is you just generate a uniform zero-one random variable U, and you check whether U is less than that ratio.
315
00:38:40,740 --> 00:38:43,800
If the ratio is greater than one, of course it's less.
316
00:38:43,800 --> 00:38:50,220
If the ratio is less than one, then you are going to accept with probability equal to that ratio.
317
00:38:50,220 --> 00:38:56,940
Now that's equivalent to saying that you'll check whether pi of X star is greater than S, where S is a random variable
318
00:38:56,940 --> 00:39:00,810
That's uniform on the interval from zero to pi of X.
319
00:39:00,810 --> 00:39:07,860
You can see that by just sort of moving pi of X to the other side of that inequality.
320
00:39:07,860 --> 00:39:11,520
And now, rather than choosing this S randomly, independently,
321
00:39:11,520 --> 00:39:21,930
each time, we can make S part of the state and update it in any way that leaves the joint distribution of X and S invariant,
322
00:39:21,930 --> 00:39:26,770
in which S is independent of X in that joint distribution.
323
00:39:26,770 --> 00:39:35,160
Well, no, it's not independent. It's uniform over zero to pi of X, so it's not independent of X.
324
00:39:35,160 --> 00:39:44,700
Now it's more convenient to actually think in terms of U, that's S over pi of X, and U is independent of X.
325
00:39:44,700 --> 00:39:54,000
So we can think of U as being an independent uniform zero-one random variable that's part of the state along with X.
326
00:39:54,000 --> 00:40:05,430
And then we can make our accept/reject decision in this way, using the U part of the state when we're looking at a proposal to change X.
327
00:40:05,430 --> 00:40:15,540
So now we can do any update to U that leaves this joint distribution invariant.
328
00:40:15,540 --> 00:40:24,210
And one possible update is that, for some constant delta, we either add or subtract delta times pi of X to S,
329
00:40:24,210 --> 00:40:28,470
which is the same as adding or subtracting delta to U.
330
00:40:28,470 --> 00:40:37,440
And we then reflect off the boundaries at zero and pi of X, or zero and one if we're thinking in terms of U.
331
00:40:37,440 --> 00:40:42,210
So here are some implementation details as to how we manage to reflect off the boundaries.
332
00:40:42,210 --> 00:40:54,600
We can let S be pi of X times the absolute value of V, where V now has the uniform distribution from minus one to plus one, independent of X,
333
00:40:54,600 --> 00:41:00,630
and we update V by adding delta and then subtracting two if V ends up greater than one.
334
00:41:00,630 --> 00:41:09,210
So this effectively implements reflecting off the boundaries, at zero and pi of X for S, and zero and one for U.
335
00:41:09,210 --> 00:41:14,760
And now, if X changes to X prime, then we make a corresponding change of V to V prime.
336
00:41:14,760 --> 00:41:24,570
That's the old V times the ratio of pi of X over pi of X prime, and that keeps S unchanged.
337
00:41:24,570 --> 00:41:33,750
And the reason this is a valid procedure is that we're not actually changing S in this change to X.
338
00:41:33,750 --> 00:41:43,790
So that's a bit of a quick explanation as to why this is a valid method, but I'll continue on.
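A minimal sketch of this bookkeeping, assuming a generic symmetric proposal (the function and variable names are mine, not from the talk):

```python
import numpy as np

def nonrev_accept_step(x, v, propose, log_pi, delta, rng):
    # Non-reversibly update v in [-1, 1]; u = |v| plays the role of
    # the uniform acceptance variable.
    v = v + delta
    if v > 1.0:
        v -= 2.0   # wrapping v reflects u = |v| off the boundary at one
    # Symmetric proposal (stand-in for e.g. a Langevin proposal).
    x_star = propose(x, rng)
    ratio = np.exp(log_pi(x_star) - log_pi(x))
    # Accept if u = |v| < pi(x_star)/pi(x).
    if abs(v) < ratio:
        v = v / ratio   # keeps s = pi(x) * |v| unchanged under the move
        x = x_star
    return x, v
```

Note that acceptance requires |v| < ratio, so the rescaled v stays inside (-1, 1), and s = pi(x) * |v| is the same before and after the change to x, as described above.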
339
00:41:43,790 --> 00:41:55,610
So now the question is, why might this be of any use? And the use is that by doing this, we can cluster rejections together.
340
00:41:55,610 --> 00:42:06,380
This isn't going to change the rejection rate, because U still has the same uniform zero-one distribution as it did before.
341
00:42:06,380 --> 00:42:18,410
And so on average, we get the same rejection rate as before. But by moving U slowly from zero to one and back again,
342
00:42:18,410 --> 00:42:24,920
we can end up having rejections clustered together and acceptances clustered together.
343
00:42:24,920 --> 00:42:34,970
So in particular, if the rejection rate isn't high, as would be the case for a Langevin method with a reasonable step size,
344
00:42:34,970 --> 00:42:40,190
then the ratio pi of X star over pi of X is usually going to be close to one.
345
00:42:40,190 --> 00:42:47,420
So U mostly changes as a result of adding delta and reflecting off the boundaries.
346
00:42:47,420 --> 00:42:56,690
So if delta is small, then U is going to be near zero for a while, as we just slowly change it by adding that delta,
347
00:42:56,690 --> 00:43:05,000
then it'll be near one for a while, and then it'll reflect off one and start decreasing and be near zero for a while again.
348
00:43:05,000 --> 00:43:12,950
So that has the effect of clustering the rejections, because we're much more likely to reject when U is near one.
349
00:43:12,950 --> 00:43:20,030
Because remember, we check whether we accept by seeing if U is less than this ratio.
350
00:43:20,030 --> 00:43:26,600
So if U is close to zero, that's much more likely to be true than if U is close to one.
351
00:43:26,600 --> 00:43:36,080
So the effect of this is that we'll tend to get a bunch of rejections when U is near one, but then get lots of acceptances when U is near zero.
352
00:43:36,080 --> 00:43:47,420
And now the benefit of this is that, for Horowitz's Langevin procedure with the momentum only slightly
353
00:43:47,420 --> 00:43:58,850
changed each time, we avoid random-walk behaviour as long as we don't have a rejection that reverses the momentum.
354
00:43:58,850 --> 00:44:06,110
Whereas when we do have a rejection, we reverse the momentum, and things don't work so well anymore.
355
00:44:06,110 --> 00:44:14,810
So by clustering the acceptances together, we'll get more consecutive acceptances that keep moving in the same direction.
356
00:44:14,810 --> 00:44:20,090
And you can see how that is better by a simple sort of model here.
357
00:44:20,090 --> 00:44:29,900
Suppose each accepted move moves the distance D in some direction,
358
00:44:29,900 --> 00:44:35,390
and each rejection randomises the direction, which isn't quite what happens with that method.
359
00:44:35,390 --> 00:44:44,630
But let's suppose that. Then if we do 20 times some number K iterations, which take the form of four acceptances,
360
00:44:44,630 --> 00:44:51,290
one reject, four acceptances, one reject, and so forth, this all repeated K times,
361
00:44:51,290 --> 00:45:03,110
then on average, the distance we move is going to scale as the square root of 4K, because there are 4K groups of four accepts and one reject,
362
00:45:03,110 --> 00:45:09,170
and we do a random walk over those groups.
363
00:45:09,170 --> 00:45:17,570
And so the sort of distance that we move in those 4K groups of moves is the distance moved in one group,
364
00:45:17,570 --> 00:45:24,890
which is 4D, times the square root of the number of groups, which works out to 8D times the square root of K.
365
00:45:24,890 --> 00:45:32,780
But if instead we do these 20K iterations in the form of 16 acceptances followed by four rejections,
366
00:45:32,780 --> 00:45:43,550
which is the same acceptance probability as above, then each of these blocks moves the distance 16D, and we have K of them doing a random walk.
367
00:45:43,550 --> 00:45:49,070
And so we move 16D times the square root of K, which you can see is twice as much as before.
368
00:45:49,070 --> 00:45:57,560
So there's an advantage to clustering the rejections together, because then you also cluster the acceptances together, inevitably.
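This back-of-the-envelope calculation can be checked with a tiny simulation of the toy model (reducing "direction" to +1/-1 on a line is my simplification of the talk's model; the function name is illustrative):

```python
import numpy as np

def mean_distance(pattern, repeats, rng, reps=500, D=1.0):
    # Average |net displacement| when an accept(1)/reject(0) pattern is
    # repeated `repeats` times; accepts move distance D along the current
    # direction (+1 or -1 on a line), rejects randomise the direction.
    total = 0.0
    for _ in range(reps):
        pos, direction = 0.0, 1.0
        for _ in range(repeats):
            for accepted in pattern:
                if accepted:
                    pos += direction * D
                else:
                    direction = 1.0 if rng.uniform() < 0.5 else -1.0
        total += abs(pos)
    return total / reps

rng = np.random.default_rng(3)
K = 50
# Same 80% acceptance rate, 20*K iterations in both cases:
scattered = mean_distance([1, 1, 1, 1, 0], 4 * K, rng)    # 4 accepts, 1 reject
clustered = mean_distance([1] * 16 + [0] * 4, K, rng)     # 16 accepts, 4 rejects
```

In runs of this sketch, `clustered` comes out at roughly twice `scattered`, matching the 16 times root K versus 8 times root K argument above.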
369
00:45:57,560 --> 00:46:04,520
So that's the motivation for trying this non-reversible scheme for doing acceptances,
370
00:46:04,520 --> 00:46:14,410
involving just adding delta to U rather than sampling an independent value of U each time.
371
00:46:14,410 --> 00:46:26,020
So here's an illustration of this for the bivariate Gaussian as before.
372
00:46:26,020 --> 00:46:31,060
Well, you can sort of see that it doesn't do a random walk
373
00:46:31,060 --> 00:46:35,560
by how it goes here before it actually has a rejection and has to reverse there.
374
00:46:35,560 --> 00:46:42,670
There are a few too many points to see things clearly here, but it's not doing a random walk in there,
375
00:46:42,670 --> 00:46:54,350
and that's even though it's using a larger step size than for the previous illustration of Horowitz's version of the Langevin method.
376
00:46:54,350 --> 00:46:58,310
And here we are for the replicated bivariate Gaussian again,
377
00:46:58,310 --> 00:47:07,730
we can use a larger step size than we did before, and we still get a lot of suppression of random walks.
378
00:47:07,730 --> 00:47:13,480
As you can see here, for instance, it consistently goes in the same direction for many iterations.
379
00:47:13,480 --> 00:47:17,830
So we do need to tune things, of course.
380
00:47:17,830 --> 00:47:22,300
So we have to keep eta small enough that we don't have a really high rejection rate,
381
00:47:22,300 --> 00:47:28,240
but we don't have to keep the rejection rate as small as before. And we also have to tune Delta.
382
00:47:28,240 --> 00:47:37,810
It seems that about one minus alpha is right, although I don't have any precise ideas about how delta should be tuned.
383
00:47:37,810 --> 00:47:49,300
And when we do that tuning properly, compared to a properly tuned HMC, we get about the same performance, it seems, on this example.
384
00:47:49,300 --> 00:47:55,330
And if we look at the value of U at each iteration: here is the value of U
385
00:47:55,330 --> 00:48:03,820
at each iteration for the regular persistent-momentum Langevin method that Horowitz came up with, and the rejections are in red here.
386
00:48:03,820 --> 00:48:09,250
So U is sampled independently each iteration, and we get the expected scatter here.
387
00:48:09,250 --> 00:48:19,180
And some of those are rejections. Here's what we get with the non-reversible update of U.
388
00:48:19,180 --> 00:48:25,180
We get a bunch of rejections because we started at a very low-probability point, which causes problems at the beginning.
389
00:48:25,180 --> 00:48:30,850
So you can sort of ignore that and then see what happens after we've gotten to the high probability region.
390
00:48:30,850 --> 00:48:37,660
And you can see we have a lot of consecutive rejections when U is close to one,
391
00:48:37,660 --> 00:48:42,550
but then we have a big stretch of time here where there are no rejections, when U is close to zero.
392
00:48:42,550 --> 00:48:46,570
So we've successfully clustered the rejections together,
393
00:48:46,570 --> 00:48:54,820
making them less damaging as far as randomising the direction, so we can manage to suppress random walks for a
394
00:48:54,820 --> 00:49:09,510
long time here and make better progress than if we have rejections mixed in there that keep reversing the momentum.
395
00:49:09,510 --> 00:49:20,060
OK, so now I have a few minutes to talk about how this method could actually be applied.
396
00:49:20,060 --> 00:49:31,760
Here's a test idea regarding discrete variables, so this is a simple distribution, which means we can actually tell what the right answer is.
397
00:49:31,760 --> 00:49:39,950
That's meant to be a model for more complex situations where we have a bunch of continuous variables that we want to update with HMC.
398
00:49:39,950 --> 00:49:47,480
But we also have some discrete variables which we can't update with HMC.
399
00:49:47,480 --> 00:49:49,880
So we're going to update those with Gibbs sampling,
400
00:49:49,880 --> 00:49:58,400
and we alternate the updates of the discrete variables with Gibbs sampling with updates of the continuous variables with HMC.
401
00:49:58,400 --> 00:50:00,740
Now here there are only two continuous variables,
402
00:50:00,740 --> 00:50:08,720
but for this example that you can imagine that there is lots more and that the updates of the continuous variables are actually the dominant cost.
403
00:50:08,720 --> 00:50:13,120
And so I'm going to assess the methods on that basis:
404
00:50:13,120 --> 00:50:24,900
the HMC updates are the ones that are costly. And so if you do the Langevin method made non-reversible,
405
00:50:24,900 --> 00:50:35,790
both in terms of Horowitz's modification to only partially change P, and the non-reversible change to U for the acceptance decision,
406
00:50:35,790 --> 00:50:38,670
then, if you tune things properly,
407
00:50:38,670 --> 00:50:47,010
you get a method that's 1.83 times more efficient than if you tune HMC optimally, for this problem at least.
408
00:50:47,010 --> 00:50:56,080
Well, I didn't tune them precisely optimally. You can see which values I checked, and then I picked the best for each of these methods.
409
00:50:56,080 --> 00:51:00,850
So that's encouraging. I don't know what the scaling is here,
410
00:51:00,850 --> 00:51:07,390
and this is a relatively low-dimensional problem, and so if you have much higher-
411
00:51:07,390 --> 00:51:12,280
dimensional continuous variables and perhaps more discrete variables for that matter,
412
00:51:12,280 --> 00:51:14,920
how the whole thing scales.
413
00:51:14,920 --> 00:51:23,050
I don't know whether you still get only roughly twice the efficiency, which would be slightly disappointing.
414
00:51:23,050 --> 00:51:26,110
I mean, being twice as efficient is nice,
415
00:51:26,110 --> 00:51:34,210
but you'd like to be even more efficient; or perhaps instead the advantage gets bigger and bigger as the dimensionality goes up.
416
00:51:34,210 --> 00:51:44,020
Or I suppose it's possible it gets smaller as the dimensionality goes up. I don't know the answers to those questions yet.
417
00:51:44,020 --> 00:51:53,470
One can also look at how this method works for Bayesian neural network models, which can be very complex hierarchical models.
418
00:51:53,470 --> 00:52:03,610
So here's a description of some networks of increasing complexity, which are actually the ones I'm playing around with at the moment.
419
00:52:03,610 --> 00:52:08,050
At the simplest level, for a regression network, for instance,
420
00:52:08,050 --> 00:52:14,980
we might have one hidden layer of 10 units, if you're familiar with neural network models.
421
00:52:14,980 --> 00:52:25,120
Tanh is a common, although somewhat old-fashioned, activation function for the hidden units.
422
00:52:25,120 --> 00:52:30,230
So we have one layer of those, and then that means we have three groups of weights.
423
00:52:30,230 --> 00:52:38,500
There's the weights from the inputs to those hidden units. There's what are called the biases or the intercept terms for the hidden units.
424
00:52:38,500 --> 00:52:43,180
And there are the weights from the hidden units to the output.
425
00:52:43,180 --> 00:52:47,110
And in the simplest model, those are the only groups you have.
426
00:52:47,110 --> 00:52:57,100
So you have three hyperparameters controlling those three groups. For a regression model, you also have a parameter.
427
00:52:57,100 --> 00:53:00,130
That's the residual variance for the response,
428
00:53:00,130 --> 00:53:10,540
which effectively plays the role of a hyperparameter as far as its effect on how the model works. To make a more complicated model,
429
00:53:10,540 --> 00:53:15,100
we might say that, well, we're not really sure about how relevant each input is,
430
00:53:15,100 --> 00:53:22,780
so we can instead have K plus three hyperparameters, in which we have a separate
431
00:53:22,780 --> 00:53:27,670
hyperparameter for the weights out of each input into the hidden units.
432
00:53:27,670 --> 00:53:34,660
So we have K such hyperparameters, and since we've got K of them, we're not sure what prior to give them.
433
00:53:34,660 --> 00:53:40,240
So in the same way as we did before, we say, well, since we're not sure, we'll create a higher-level
434
00:53:40,240 --> 00:53:45,250
hyperparameter that controls the distribution of these K lower-level hyperparameters.
435
00:53:45,250 --> 00:53:46,690
So that's another one.
436
00:53:46,690 --> 00:53:56,630
And then we have the hidden-unit biases, which we have a hyperparameter to control, and the hidden-to-output weights. Going up in complexity,
437
00:53:56,630 --> 00:54:05,050
we could say, well, maybe the function here of these inputs has an additive structure, where it's a function
438
00:54:05,050 --> 00:54:10,270
of input one, plus a function of input two, plus a function of input three.
439
00:54:10,270 --> 00:54:17,770
Or maybe it's at least sort of approximately that. So we could have a model in which we have K hidden layers,
440
00:54:17,770 --> 00:54:26,170
each of which looks at only one input, for the K different inputs, and connects to the output, of course.
441
00:54:26,170 --> 00:54:33,520
And those K hidden layers can compute the K components of an additive function.
442
00:54:33,520 --> 00:54:35,380
But maybe it's not completely additive,
443
00:54:35,380 --> 00:54:41,740
so we'll have another hidden layer that looks at all inputs and could model interactions, and that also goes to the output.
444
00:54:41,740 --> 00:54:48,840
And if you add it all up, we have three K plus three hyperparameters there.
445
00:54:48,840 --> 00:54:53,280
So now, how well do the methods work on these models?
446
00:54:53,280 --> 00:55:01,680
Well, that's what I've been running simulations for for the last week or two, and it's actually pretty hard to say.
447
00:55:01,680 --> 00:55:07,470
One reason is we don't know the correct answer.
448
00:55:07,470 --> 00:55:16,030
These are obviously models that are far too complex to know with certainty what the correct answer is.
449
00:55:16,030 --> 00:55:20,770
We could look at the error for predictions on a test set.
450
00:55:20,770 --> 00:55:30,110
But unfortunately, there's no guarantee that the method that's best at sampling from the posterior will actually produce the best predictions.
451
00:55:30,110 --> 00:55:38,740
The model might not fit the data well.
452
00:55:38,740 --> 00:55:46,870
Doing the wrong thing might actually be better if the model is not very good. Or even if the model is good, it's still based on limited data.
453
00:55:46,870 --> 00:55:53,890
You have a posterior spread out someplace, and even if that includes the good values,
454
00:55:53,890 --> 00:55:58,900
it also will include some not so good values if you don't have a lot of data.
455
00:55:58,900 --> 00:56:05,410
And if your sampling method doesn't really explore the full distribution, but happens to be in a good spot instead,
456
00:56:05,410 --> 00:56:09,670
it could look better, even though it's actually worse as far as sampling the posterior.
457
00:56:09,670 --> 00:56:19,240
So it's hard to tell. I previously played around with a simple classification
458
00:56:19,240 --> 00:56:24,220
model, looking at the autocorrelation time for the hyperparameters.
459
00:56:24,220 --> 00:56:32,830
The non-reversible Langevin method was a little bit better than HMC. For the more complex regression models from the previous slide,
460
00:56:32,830 --> 00:56:38,500
I've played around with six inputs, all of which are relevant to varying degrees.
461
00:56:38,500 --> 00:56:47,620
and with the true function having both additive and interaction aspects, and five training cases.
462
00:56:47,620 --> 00:56:52,420
And if you then fit the simplest of the models, with only three hyperparameters,
463
00:56:52,420 --> 00:56:58,810
HMC seems to do a slightly better job at prediction than the Langevin methods.
464
00:56:58,810 --> 00:57:08,920
But for the more complex models, the best predictions are made by non-reversible Langevin methods, although there's a lot of variation here.
465
00:57:08,920 --> 00:57:18,820
You change tuning parameters by what doesn't seem to be drastically large amounts, and you get much,
466
00:57:18,820 --> 00:57:26,800
much different predictive performance. So this makes it complicated to see what's actually going on.
467
00:57:26,800 --> 00:57:35,890
So I'm still playing around with this, to see how much better you can get than HMC by playing around with the non-reversible methods.
468
00:57:35,890 --> 00:57:41,520
So that's it. OK.
469
00:57:41,520 --> 00:57:49,020
Thank you very much for the talk, Radford. And now we'll open the floor to questions.
470
00:57:49,020 --> 00:57:53,940
I should say perhaps that there is a bibliography here if you want to look at the slides at the end.
471
00:57:53,940 --> 00:58:11,490
Okay. Mm-Hmm. OK, thank you. So if you would like to ask a question, please unmute yourself and ask.
472
00:58:11,490 --> 00:58:15,900
Hi, can you hear me? Yes, I can hear you.
473
00:58:15,900 --> 00:58:20,520
Hi, hello, Radford. Very nice talk. Thank you.
474
00:58:20,520 --> 00:58:25,910
My question is whether you tried to do similar things with Hamiltonian Monte Carlo itself.
475
00:58:25,910 --> 00:58:32,010
Like, can you also do a non-reversible, persistent version? Yes.
476
00:58:32,010 --> 00:58:40,050
I've expressed Horowitz's method as applying to Langevin, which is HMC with a single leapfrog step,
477
00:58:40,050 --> 00:58:46,170
but the same thing can be done for HMC with more than one leapfrog step.
478
00:58:46,170 --> 00:58:53,430
You can decide that you'll do trajectories of 10 steps, and then you'll also not completely replace the momentum before the next one.
479
00:58:53,430 --> 00:58:57,210
You'll do the same sort of thing of only partially changing the momentum.
480
00:58:57,210 --> 00:59:04,980
And then you have the same issue, that if you actually rejected that trajectory, you'll end up reversing the momentum.
481
00:59:04,980 --> 00:59:12,690
And so you can do combined ones that sort of try to suppress random walks by a combination
482
00:59:12,690 --> 00:59:20,700
of doing multiple leapfrog steps per trajectory and only partially changing the momentum.
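The combined scheme being described, in the spirit of Horowitz's method, might be sketched as follows. This is an illustrative reading, not code from the talk; the function name, step size `eps`, trajectory length `n_leapfrog`, and refresh parameter `alpha` are all assumed choices:

```python
import numpy as np

def persistent_hmc_step(x, p, log_pi, grad_log_pi, eps=0.1, n_leapfrog=10, alpha=0.9):
    """One HMC update with partial momentum refresh: refresh the momentum
    only partially, run a short leapfrog trajectory, accept or reject,
    and reverse the momentum on rejection.  Sketch under assumptions."""
    # Partially refresh the momentum; this update preserves its N(0, I)
    # stationary distribution for any alpha in [0, 1].
    p = alpha * p + np.sqrt(1.0 - alpha**2) * np.random.standard_normal(p.shape)

    # Leapfrog trajectory of n_leapfrog steps.
    x_new, p_new = x, p
    p_new = p_new + 0.5 * eps * grad_log_pi(x_new)       # initial half step
    for i in range(n_leapfrog):
        x_new = x_new + eps * p_new                       # full position step
        g = grad_log_pi(x_new)
        p_new = p_new + (eps if i < n_leapfrog - 1 else 0.5 * eps) * g

    # Metropolis accept/reject on the joint (x, p) density.
    log_accept = (log_pi(x_new) - 0.5 * p_new @ p_new) - (log_pi(x) - 0.5 * p @ p)
    if np.log(np.random.uniform()) < log_accept:
        return x_new, p_new      # accepted: keep the new momentum
    else:
        return x, -p             # rejected: reverse the momentum
```

The momentum reversal on rejection is what makes partial refresh valid, and it is also the behaviour referred to above: a rejection turns the persistent momentum around.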
483
00:59:20,700 --> 00:59:27,690
Kennedy, I believe, investigated this at one point and found that disappointingly,
484
00:59:27,690 --> 00:59:35,970
you can only get a small benefit from this sort of hybrid method.
485
00:59:35,970 --> 00:59:44,040
There's an extra tuning parameter then: not only do you tune the step size and the number of leapfrog steps, but also how much you replace the momentum.
486
00:59:44,040 --> 00:59:53,070
And with that extra tuning parameter, you can get slightly better than if you stuck to straight HMC.
487
00:59:53,070 --> 01:00:00,750
But not all that much; in particular, it didn't change the scaling with dimensionality.
488
01:00:00,750 --> 01:00:11,520
I think there's a paper by Kennedy and maybe some other people on that. Thank you.
489
01:00:11,520 --> 01:00:22,460
Thank you, Peter. Anybody else? Can you hear me?
490
01:00:22,460 --> 01:00:26,120
Yep. All right. Thanks very much.
491
01:00:26,120 --> 01:00:33,610
That's very interesting. I liked that trick of making u a random variable.
492
01:00:33,610 --> 01:00:42,550
That was really neat. But if I understand correctly, the update was not deterministic, it was stochastic.
493
01:00:42,550 --> 01:00:46,540
Is that correct? Well, no.
494
01:00:46,540 --> 01:00:51,730
It's actually deterministic. I mean, the actual update itself is.
495
01:00:51,730 --> 01:00:57,250
Of course, things change stochastically in a Langevin fashion.
496
01:00:57,250 --> 01:01:01,480
But the actual thing in the state, as I have expressed,
497
01:01:01,480 --> 01:01:09,070
it is v, and u is the absolute value of v, and the density is pi of x times the uniform density for v.
498
01:01:09,070 --> 01:01:17,290
And the update to v itself is to simply add delta and then subtract two if it's greater than one.
499
01:01:17,290 --> 01:01:26,080
But that's the actual update to v. But then v also has to change when x changes, when x changes to x prime.
500
01:01:26,080 --> 01:01:31,270
Then you have to change v to v times pi of x over pi of x prime.
501
01:01:31,270 --> 01:01:37,600
The proposal that resulted in that will have been stochastic.
502
01:01:37,600 --> 01:01:47,650
So v will change stochastically as a result of the changes to x, which is what you see here. As long as we reject,
503
01:01:47,650 --> 01:01:52,270
so we aren't changing x, v proceeds deterministically.
504
01:01:52,270 --> 01:01:57,040
Well, you see this deterministic v going up to one, and then it reflects and comes back down.
505
01:01:57,040 --> 01:02:01,360
That corresponds to V having switched to minus one.
506
01:02:01,360 --> 01:02:05,560
And then the increases in v at that point are decreases in u.
507
01:02:05,560 --> 01:02:13,150
But as soon as there's an acceptance, there's a bit of jitter here because the ratio of Pi x prime to Pi X won't have been one.
508
01:02:13,150 --> 01:02:21,520
And so for the acceptances, there's some slight stochastic aspect resulting from the stochastic changes to x.
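As a concrete sketch of the update being described, here is one way the non-reversible accept/reject variable might be implemented. This is a minimal reading of the scheme as stated in the talk, not code from it; the symmetric `propose` function and the value of `delta` are illustrative assumptions:

```python
import numpy as np

def nonrev_metropolis_step(x, v, log_pi, propose, delta=0.03):
    """One Metropolis update using a non-reversibly updated variable
    v in (-1, 1), with u = |v| playing the role of the Uniform(0,1)
    accept/reject draw.  Sketch only; `propose` must be symmetric."""
    # Non-reversible update to v: add delta, and subtract two if it
    # exceeds one, so v wraps around rather than being resampled.
    v = v + delta
    if v > 1.0:
        v = v - 2.0
    u = abs(v)

    # Ordinary Metropolis proposal and accept/reject test, but using u.
    x_new = propose(x)
    log_ratio = log_pi(x_new) - log_pi(x)          # log of pi(x') / pi(x)
    if log_ratio >= 0.0 or u < np.exp(log_ratio):
        # Accepted: rescale v to v * pi(x)/pi(x'), so u stays in (0, 1).
        return x_new, v * np.exp(-log_ratio)
    return x, v                                    # rejected: x unchanged
```

On rejections, v moves deterministically by delta, so |v| rises to one, reflects, and comes back down, the sawtooth behaviour described here; on acceptances, the rescaling by pi(x)/pi(x') adds the slight stochastic jitter.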
509
01:02:21,520 --> 01:02:31,820
And so is there some constraint there, that delta not perfectly divide the interval, or something like that, I was sort of thinking.
510
01:02:31,820 --> 01:02:44,090
OK, so I expressed it as you just add delta, but you could add delta plus some noise if you felt like it, in fact.
511
01:02:44,090 --> 01:02:48,920
And in fact, the noise distribution can be anything at all.
512
01:02:48,920 --> 01:02:53,750
And you might do that if you were worried about what would happen otherwise.
513
01:02:53,750 --> 01:02:58,280
So, for instance, if you're sampling from a uniform distribution.
514
01:02:58,280 --> 01:03:07,130
So the ratios of pi of x prime over pi of x are always one for accepted moves.
515
01:03:07,130 --> 01:03:15,830
And you chose delta to be exactly one half. Then you might worry that you're not sampling properly, though there it wouldn't actually matter.
516
01:03:15,830 --> 01:03:20,630
But if you made it a slightly more complex example in which, nevertheless,
517
01:03:20,630 --> 01:03:27,550
the ratios of pi of x prime over pi of x take on only some small number of values, then it's conceivable
518
01:03:27,550 --> 01:03:32,180
that you could end up being non-ergodic by just adding delta all the time.
519
01:03:32,180 --> 01:03:37,750
So if you're worried about that, you can add Delta and some amount of noise if you feel like it.
520
01:03:37,750 --> 01:03:42,440
Thanks. OK.
521
01:03:42,440 --> 01:03:47,300
Any last questions for Radford?
522
01:03:47,300 --> 01:04:00,450
Any more questions for Radford? Last call.
523
01:04:00,450 --> 01:04:11,890
Mm hmm. Okay. I should maybe mention that amongst the references is a paper on arXiv on this, which is this one here:
524
01:04:11,890 --> 01:04:23,400
Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions, which is on arXiv.
525
01:04:23,400 --> 01:04:33,630
OK. OK, I guess then we'll wrap up the talk, and I thank you, Radford, for being able to join virtually and for a great talk.
526
01:04:33,630 --> 01:04:37,966
Thank you. Well, thanks for inviting me. OK, thanks.