1
00:00:00,940 --> 00:00:09,220
OK, so welcome everybody to this first virtual edition of the CSIRO talks for this term. I'm very happy to have Radford Neal,
2
00:00:09,220 --> 00:00:19,270
who was my advisor at U of T, presenting remotely, and he's going to talk about some of the recent MCMC work
3
00:00:19,270 --> 00:00:28,950
he has been doing. So again, thank you, Radford, for being able to join us virtually, and I'll hand over to you.
4
00:00:28,950 --> 00:00:34,650
I'd like to talk about a new idea for doing MCMC — in particular, for
5
00:00:34,650 --> 00:00:45,060
the acceptance of proposals in MCMC — and in particular its application to Langevin methods, for which it seems to be of most potential use.
6
00:00:45,060 --> 00:00:56,490
And how this new MCMC method might help with sampling from the posterior distribution in hierarchical Bayesian models.
7
00:00:56,490 --> 00:01:01,030
So, let's see here.
8
00:01:01,030 --> 00:01:04,630
OK, let's see whether the slides are OK here.
9
00:01:04,630 --> 00:01:11,920
OK, now they're working. So I'll start off with some of the motivation here, assuming you know a bit about MCMC. But if you don't,
10
00:01:11,920 --> 00:01:21,490
I'll do a quick review of MCMC after that, and then get to the new MCMC methods.
11
00:01:21,490 --> 00:01:32,450
And then briefly discuss how they actually get applied to the motivating models that I discussed at the beginning.
12
00:01:32,450 --> 00:01:40,790
So hierarchical Bayesian models arise because, for realistic problems,
13
00:01:40,790 --> 00:01:49,100
it's usually the case that you don't want to try to specify the prior directly in terms of the actual model parameters,
14
00:01:49,100 --> 00:01:54,560
but instead it's more convenient to specify it hierarchically, with higher-level hyperparameters
15
00:01:54,560 --> 00:01:59,900
in addition to the lower-level parameters that actually control the distribution of the data.
16
00:01:59,900 --> 00:02:12,170
So, for example, for a simple multivariate regression problem with K covariates, we'll have a model for the response in terms of the covariates,
17
00:02:12,170 --> 00:02:20,180
the covariate vector x_i here — and we can phrase that as a dot product with regression coefficients and an intercept,
18
00:02:20,180 --> 00:02:24,440
and we have a standard deviation of the residuals sigma.
19
00:02:24,440 --> 00:02:33,650
Those are model parameters, and we could try to directly give priors to alpha, the betas, and sigma.
20
00:02:33,650 --> 00:02:40,580
But it's more convenient, when we try to express realistic prior information,
21
00:02:40,580 --> 00:02:44,570
that we instead say that the betas, given a higher-level hyperparameter
22
00:02:44,570 --> 00:02:55,100
tau, are independent, with normal distributions with variance tau squared, and that tau itself is given a prior distribution.
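For concreteness, the hierarchical regression just described can be sketched as an unnormalised log posterior in plain Python. This is my own illustrative code, not from the talk; to keep the sketch short, the priors on alpha, sigma and tau are simply left flat here.

```python
import math

def log_posterior(alpha, beta, sigma, tau, x, y):
    """Unnormalised log posterior for the hierarchical regression sketch:
    y_i ~ N(alpha + beta . x_i, sigma^2), with beta_j ~ N(0, tau^2) given
    the hyperparameter tau.  Priors on alpha, sigma, tau are placeholder
    (flat) priors for illustration."""
    if sigma <= 0 or tau <= 0:
        return -math.inf
    lp = 0.0
    # Likelihood of each observation.
    for xi, yi in zip(x, y):
        mean = alpha + sum(b * v for b, v in zip(beta, xi))
        lp += -0.5 * ((yi - mean) / sigma) ** 2 - math.log(sigma)
    # Hierarchical prior: each beta_j is N(0, tau^2) given tau.
    for b in beta:
        lp += -0.5 * (b / tau) ** 2 - math.log(tau)
    return lp
```

The point of the hierarchy is visible in the last loop: the prior scale of the betas is not fixed but comes from tau, which itself gets a prior.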
23
00:02:55,100 --> 00:03:00,980
The reason this is better is that if we tried to do a direct prior, in terms of saying that
24
00:03:00,980 --> 00:03:07,160
the beta_j's are all independent with normal distributions, mean zero and standard deviation seven,
25
00:03:07,160 --> 00:03:11,090
in practice, we usually don't actually know that.
26
00:03:11,090 --> 00:03:15,410
We don't know that the standard deviation of the betas should be seven. It could be,
27
00:03:15,410 --> 00:03:22,940
before we look at the data, that maybe the covariates aren't actually very useful for predicting y,
28
00:03:22,940 --> 00:03:27,990
so all the betas should be close to zero. But we probably hope that's not the case, if we're doing this at all.
29
00:03:27,990 --> 00:03:34,370
So we think maybe the betas should be bigger, but we're not sure how big; we don't know how predictable y is.
30
00:03:34,370 --> 00:03:39,110
And so we can't plug in any specific standard deviation for the betas.
31
00:03:39,110 --> 00:03:48,800
Instead, we express this in terms of the betas having a standard deviation tau, and then we give some prior distribution to tau.
32
00:03:48,800 --> 00:03:55,370
Of course, we could integrate over tau and come up with some direct distribution for the betas.
33
00:03:55,370 --> 00:04:01,070
But in complex hierarchical models, doing that integration might be difficult,
34
00:04:01,070 --> 00:04:05,300
and in any case it produces a direct prior that's very hard to understand.
35
00:04:05,300 --> 00:04:13,400
And so it's preferable to do things this way.
36
00:04:13,400 --> 00:04:21,720
Right. Now, that's a very simple hierarchical model. For more complex models, we might have more than one group of related parameters,
37
00:04:21,720 --> 00:04:29,220
each of those groups having an unknown variance that we then give a prior to via a higher-level hyperparameter.
38
00:04:29,220 --> 00:04:40,290
So in neural network models in particular, we have perhaps a number of different hidden layers that connect to the inputs and to the outputs.
39
00:04:40,290 --> 00:04:47,340
And so there are groups of weights that are, say, connecting the inputs to this hidden layer, or this hidden layer to that hidden layer.
40
00:04:47,340 --> 00:04:55,650
And each of those groups might be given some prior distribution whose variance depends on a higher-level hyperparameter.
41
00:04:55,650 --> 00:05:05,550
And these higher-level hyperparameters can represent higher-level information about what the problem is like,
42
00:05:05,550 --> 00:05:12,270
such as which covariates are actually relevant, or whether the model should
43
00:05:12,270 --> 00:05:18,300
be an additive model or one that allows for interactions and things like that.
44
00:05:18,300 --> 00:05:25,650
You also get what is mathematically the same setup when you have models with latent variables or random effects,
45
00:05:25,650 --> 00:05:29,430
whatever you want to call them, because in a Bayesian context,
46
00:05:29,430 --> 00:05:38,250
this is basically the same sort of thing as the hierarchical priors for what one might instead call parameters rather than latent variables.
47
00:05:38,250 --> 00:05:44,220
So here, for instance, is a one-way random effects model, in which we have some random effects u_i,
48
00:05:44,220 --> 00:05:56,940
whose distribution is controlled by a higher-level parameter tau, which in turn is given some prior distribution.
49
00:05:56,940 --> 00:06:07,620
Now, if we're doing Bayesian inference for this, we usually end up doing MCMC, and that's OK for simple models.
50
00:06:07,620 --> 00:06:12,090
In fact, Gibbs sampling works OK for such simple hierarchical models.
51
00:06:12,090 --> 00:06:24,480
These were amongst the early examples in statisticians' use of MCMC — in the Gelfand and Smith paper, for instance.
52
00:06:24,480 --> 00:06:31,380
For more complex models, though, we may have to do something more general, because we don't know how to do Gibbs sampling, or it's very slow.
53
00:06:31,380 --> 00:06:37,380
We might use some general random walk Metropolis method, or Hamiltonian Monte Carlo,
54
00:06:37,380 --> 00:06:44,730
but there's a problem with this: if the posterior distribution for a variance hyperparameter is broad,
55
00:06:44,730 --> 00:06:49,050
meaning we don't have so much data that it gets pinned down to exactly what it should be.
56
00:06:49,050 --> 00:06:54,780
Then the Markov chain we use for sampling from the posterior distribution needs to
57
00:06:54,780 --> 00:06:59,940
sample both regions where there are tight constraints on the low-level parameters —
58
00:06:59,940 --> 00:07:06,900
the beta_j, for instance, in that simple regression model; we'll have tight constraints on them when tau is small,
59
00:07:06,900 --> 00:07:14,520
because then the prior will constrain them to be close to zero. But if the distribution for tau is broad,
60
00:07:14,520 --> 00:07:22,350
then we'll also have high-probability regions where tau is large and the beta_j's are much less constrained.
61
00:07:22,350 --> 00:07:28,530
Now, when we try to do MCMC for this with random walk Metropolis or with HMC,
62
00:07:28,530 --> 00:07:34,560
the tuning parameters for these methods that work best are different in these different regions —
63
00:07:34,560 --> 00:07:41,160
either the proposal width for random walk Metropolis, or the step size for HMC —
64
00:07:41,160 --> 00:07:47,730
and if we then use a proposal that's only suitable for the less constrained region,
65
00:07:47,730 --> 00:07:55,830
we can end up with a Markov chain that almost never visits the more constrained region.
66
00:07:55,830 --> 00:08:02,130
And on the other hand, if we tune for the more constrained region, it might be extremely slow in the less constrained region.
67
00:08:02,130 --> 00:08:09,120
So you can see the sometimes horrible consequences of this in my slice sampling paper,
68
00:08:09,120 --> 00:08:21,070
where there's a funnel example — a funnel distribution that is a model of this sort of situation.
69
00:08:21,070 --> 00:08:26,260
So there are various possible solutions to this that we might try.
70
00:08:26,260 --> 00:08:28,000
One is that we could just say, well,
71
00:08:28,000 --> 00:08:36,760
we need different step sizes or proposal widths in different regions, so we can just randomly pick a step size at each step,
72
00:08:36,760 --> 00:08:43,000
from a distribution broad enough that we sometimes get ones appropriate for each region.
73
00:08:43,000 --> 00:08:50,200
The problem with this is that we then end up wasting a lot of time on unsuitable values for the tuning parameters.
74
00:08:50,200 --> 00:08:54,610
Once in a while, we get
75
00:08:54,610 --> 00:09:00,730
tuning parameters that are suitable for the region we're in at the time, but we may waste lots of time.
76
00:09:00,730 --> 00:09:07,360
Now, there might be a way to get around this, discussed in one of my papers, on what I call the short-cut method, which
77
00:09:07,360 --> 00:09:13,390
tries to cut short the computation time you need for the unsuitable values.
78
00:09:13,390 --> 00:09:17,230
But that's not the topic of this talk.
79
00:09:17,230 --> 00:09:26,340
We might decide that instead we should reparameterize the model, saying, for instance, that the beta_j regression coefficients are equal to tau times
80
00:09:26,340 --> 00:09:36,790
zeta_j, and zeta_j is just given a normal(0, 1) distribution, so that we end up with beta_j having standard deviation tau.
81
00:09:36,790 --> 00:09:43,330
And now the prior constraint on zeta_j doesn't vary, so we don't have the problem that,
82
00:09:43,330 --> 00:09:48,010
when tau is small, the zeta_j's are more tightly confined.
83
00:09:48,010 --> 00:09:53,740
However, we have the opposite problem: now, depending on what tau is,
84
00:09:53,740 --> 00:10:01,540
the way that the data constrains zeta_j is going to vary, which might cause similar sorts of problems.
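The two parameterisations can be sketched side by side; this is my own illustration (the function names are mine, not from the talk). Both produce exactly the same prior for beta_j — what differs is which variable the sampler actually works on, and hence whose prior scale depends on tau.

```python
import random

def sample_beta_centered(tau, rng=random):
    # Centred parameterisation: the sampler sees beta_j ~ N(0, tau^2),
    # so the prior scale of the sampled variable changes with tau.
    return rng.gauss(0.0, tau)

def sample_beta_noncentered(tau, rng=random):
    # Reparameterised form: beta_j = tau * zeta_j with zeta_j ~ N(0, 1),
    # so the sampler sees zeta_j, whose prior scale is fixed at 1.
    zeta = rng.gauss(0.0, 1.0)
    return tau * zeta
```

As the talk notes, the reparameterisation moves the tau-dependence from the prior into the likelihood: the data now constrain zeta_j differently depending on tau.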
85
00:10:01,540 --> 00:10:09,610
So another possibility is that we just alternate iterations that update only the low-level parameters,
86
00:10:09,610 --> 00:10:18,480
like the beta_j's, with iterations that update only the high-level hyperparameters, like tau.
87
00:10:18,480 --> 00:10:22,440
If we do that, then when we do the low-level parameter updates,
88
00:10:22,440 --> 00:10:30,600
it's valid to choose the proposal width or the HMC step size based on the current values of
89
00:10:30,600 --> 00:10:35,640
the higher-level hyperparameters, because they're not changing during those updates.
90
00:10:35,640 --> 00:10:40,140
That doesn't undermine the correctness of the methods.
91
00:10:40,140 --> 00:10:41,970
Of course, you have to have some way of doing that,
92
00:10:41,970 --> 00:10:52,620
some scheme for how you change the tuning for the lower-level parameters based on the current values of the higher-level ones — but that sometimes may be doable.
93
00:10:52,620 --> 00:11:02,730
Also, the higher-level updates may be pretty simple. Perhaps a disadvantage, though, is if the lower-level updates are lengthy.
94
00:11:02,730 --> 00:11:09,570
So, for instance, HMC typically spends a lot of computation time computing a long trajectory.
95
00:11:09,570 --> 00:11:21,720
And that means that we do the higher-level updates only infrequently, relative to the amount of movement at the lower level that that complex
96
00:11:21,720 --> 00:11:31,500
update does, and that may be disadvantageous, in that we don't get as frequent interaction between the low-level and high-level parameters,
97
00:11:31,500 --> 00:11:39,570
which is necessary in order to fully explore the posterior distribution.
98
00:11:39,570 --> 00:11:46,050
So that problem is one motivation for this work. Another motivation is handling discrete variables,
99
00:11:46,050 --> 00:11:55,380
which arise in many statistical models — like mixture models, which have component indicators saying which component each data item came from,
100
00:11:55,380 --> 00:11:59,910
or if you're handling missing categorical data and so forth.
101
00:11:59,910 --> 00:12:05,490
Now, these discrete variables aren't a particular problem if you're doing Gibbs sampling or random walk Metropolis:
102
00:12:05,490 --> 00:12:10,890
you just have a proposal distribution that also proposes new values for these discrete parameters,
103
00:12:10,890 --> 00:12:13,500
or that samples from their conditional distribution.
104
00:12:13,500 --> 00:12:22,320
But they can't be directly handled in gradient-based methods such as HMC, simply because they're discrete and don't have derivatives.
105
00:12:22,320 --> 00:12:31,860
So one approach would again be to alternate HMC updates for the continuous variables with other updates that are used only for the discrete variables.
106
00:12:31,860 --> 00:12:39,960
But again, we have the issue that if the HMC updates are complex updates taking lots of computation time,
107
00:12:39,960 --> 00:12:49,110
then we're doing a lot of work at the low level — for the continuous variables, in this case —
108
00:12:49,110 --> 00:12:58,560
before we change the discrete variables. And assuming there are dependencies between these, doing a lot of work
109
00:12:58,560 --> 00:13:04,050
on continuous variables without changing the discrete ones is maybe not too useful because
110
00:13:04,050 --> 00:13:09,450
you really have to look at changing both sets of variables in order to move around the space.
111
00:13:09,450 --> 00:13:21,050
So the question is: can we find some method with the advantages of HMC that avoids this problem?
112
00:13:21,050 --> 00:13:29,330
So before getting to that question, I'll do a bit of background on MCMC.
113
00:13:29,330 --> 00:13:38,390
So this is going to be pretty quick if you've heard of MCMC before, but just to orient people here:
114
00:13:38,390 --> 00:13:46,850
the MCMC approach is that, first of all, we specify our model, including both the likelihood function and the prior.
115
00:13:46,850 --> 00:13:52,190
And we assume these are computable up to some unknown normalising constant.
116
00:13:52,190 --> 00:13:57,710
So the posterior density is also computable, up to an unknown normalising constant.
117
00:13:57,710 --> 00:14:03,650
We then define a Markov chain that will converge to this posterior distribution from any starting point.
118
00:14:03,650 --> 00:14:09,800
And we run this chain, perhaps from many starting points, for long enough for it to approximately converge, and then
119
00:14:09,800 --> 00:14:15,800
use the points that we get after that to estimate expectations with respect to the posterior.
120
00:14:15,800 --> 00:14:16,340
For instance,
121
00:14:16,340 --> 00:14:28,890
posterior means of the parameters, or predictions for new data points, which involve an integration over the posterior distribution of the parameters.
122
00:14:28,890 --> 00:14:36,120
Now, the basis for doing this is that the Markov chain has to converge to our desired distribution,
123
00:14:36,120 --> 00:14:42,570
and for that to be true, it's necessary that it leaves the desired distribution invariant.
124
00:14:42,570 --> 00:14:51,690
So if we want to sample from a distribution with density pi, we need to have it that the integral of pi(x) times
125
00:14:51,690 --> 00:14:59,700
the transition probability for the Markov chain to go from x to x' is equal to pi(x').
126
00:14:59,700 --> 00:15:06,030
So that basically says if ever we actually got to the right distribution, then after we did another transition,
127
00:15:06,030 --> 00:15:13,920
we'd still be in the right distribution, which is a requirement if we're going to ever converge to the right distribution and stay there.
128
00:15:13,920 --> 00:15:24,060
This isn't a sufficient condition because we have to also make sure we don't get trapped in some subset of the state space.
129
00:15:24,060 --> 00:15:33,990
It's typical, for many MCMC methods, that invariance is proved by first proving reversibility of the transitions.
130
00:15:33,990 --> 00:15:45,600
So reversibility for transitions of a Markov chain is that the probability of starting in a state x, if you start according to the distribution
131
00:15:45,600 --> 00:15:55,680
pi, and then moving to x' from x, is the same as the probability of starting in x', if you start from pi, and then moving from x' to x.
132
00:15:55,680 --> 00:16:07,380
So basically, this says that once you reach convergence, the transitions from x to x' happen as often as those from x' to x.
133
00:16:07,380 --> 00:16:15,390
And if this is true, then invariance will also be true, which is easily shown by just integrating both sides.
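In symbols (writing T(x' | x) for the transition probability density — my notation, not from the slides), the reversibility condition and the integrate-both-sides step just described are:

```latex
% Reversibility (detailed balance) with respect to the target density \pi:
\pi(x)\,T(x' \mid x) \;=\; \pi(x')\,T(x \mid x') \qquad \text{for all } x,\, x'.
% Integrating both sides over x gives invariance, since T(\cdot \mid x') integrates to 1:
\int \pi(x)\,T(x' \mid x)\,dx \;=\; \pi(x') \int T(x \mid x')\,dx \;=\; \pi(x').
```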
134
00:16:15,390 --> 00:16:21,930
So this is one way of proving invariance, but it's not a necessary condition for invariance.
135
00:16:21,930 --> 00:16:27,960
There are non-reversible Markov chains, for which this isn't true, that nevertheless leave the
136
00:16:27,960 --> 00:16:37,690
distribution pi invariant, and these non-reversible methods can often be very useful.
137
00:16:37,690 --> 00:16:37,960
Now,
138
00:16:37,960 --> 00:16:49,090
the oldest MCMC method is what's called the Metropolis algorithm, and it's a very general way of defining a transition that's reversible with respect to pi,
139
00:16:49,090 --> 00:16:54,100
involving first proposing a state x* based on the current state
140
00:16:54,100 --> 00:17:02,410
we're in, x, and then either accepting or rejecting that proposal based on the ratio of densities for x* and the current state.
141
00:17:02,410 --> 00:17:06,160
If we reject, the new state is the same as the old state.
142
00:17:06,160 --> 00:17:10,270
Otherwise, the new state is the proposed x*.
143
00:17:10,270 --> 00:17:17,710
The proposals are done according to some distribution for x* given x, and this has to be symmetrical.
144
00:17:17,710 --> 00:17:26,560
And the acceptance probability for this proposal is the minimum of one and the ratio of pi(x*) over pi(x).
145
00:17:26,560 --> 00:17:34,480
Now, it's straightforward to show that this transition is reversible, and therefore it leaves pi invariant.
146
00:17:34,480 --> 00:17:40,180
And since all we need is this ratio, we don't need the normalising constant for pi.
147
00:17:40,180 --> 00:17:51,650
And so this is a very widely applicable method, both for Bayesian inference and for statistical physics, for that matter.
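As a minimal sketch of the algorithm just described (my own code, not from the talk; it assumes a symmetric Gaussian proposal and works with log densities for numerical stability):

```python
import math
import random

def metropolis_step(x, log_pi, proposal_sd, rng=random):
    """One random-walk Metropolis update targeting the density pi,
    supplied via log_pi, which may omit the normalising constant."""
    # Symmetric proposal: add independent Gaussian noise to each coordinate.
    x_star = [xi + rng.gauss(0.0, proposal_sd) for xi in x]
    # Accept with probability min(1, pi(x*) / pi(x)).
    if math.log(rng.random()) < log_pi(x_star) - log_pi(x):
        return x_star   # accepted: move to the proposal
    return x            # rejected: stay at the current state
```

Calling this in a loop gives one Markov chain transition per call — for example, targeting a standard normal with `log_pi = lambda x: -0.5 * x[0] ** 2`.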
148
00:17:51,650 --> 00:18:00,410
So here's an illustration of the Metropolis method applied to a bivariate Gaussian distribution with high correlation, 0.9,
149
00:18:00,410 --> 00:18:10,890
and with the proposal distribution being a normal distribution with mean equal to the current state and with some standard deviation — here,
150
00:18:10,890 --> 00:18:18,860
0.3 — with the proposals being independent for the two coordinates.
151
00:18:18,860 --> 00:18:24,890
You'd do much better if you made proposals that actually obeyed the high correlation.
152
00:18:24,890 --> 00:18:35,360
This example, though, is meant to illustrate more practical problems, in which there may well be these high correlations,
153
00:18:35,360 --> 00:18:37,230
but we don't know what they are to start with,
154
00:18:37,230 --> 00:18:45,250
so we assume we can't carefully tune our proposal distribution to do best with that particular correlation.
155
00:18:45,250 --> 00:18:51,160
So the green points here are an i.i.d. sample from pi, just so you can see what the distribution is like.
156
00:18:51,160 --> 00:18:58,420
And the black points show 250 transitions of the Markov chain, starting from over here.
157
00:18:58,420 --> 00:19:03,310
And we see that initially the starting point has very low probability under this distribution,
158
00:19:03,310 --> 00:19:10,270
and we progress reasonably systematically towards a high-probability region.
159
00:19:10,270 --> 00:19:19,460
But after we get there, we move around by a random walk, which is a rather inefficient way of exploring this distribution.
160
00:19:19,460 --> 00:19:28,700
The red lines show the rejected proposals, where we stayed at the current point rather than moving.
161
00:19:28,700 --> 00:19:32,690
So here's an extension of this example,
162
00:19:32,690 --> 00:19:41,720
where we actually have 20 copies of this bivariate normal distribution, with the copies being independent of each other,
163
00:19:41,720 --> 00:19:46,580
and we'll again assume that we don't know anything about this distribution, so our proposals
164
00:19:46,580 --> 00:19:51,710
are just from a spherical Gaussian, now with a standard deviation of 0.07.
165
00:19:51,710 --> 00:20:01,340
And we now see that things are much, much slower because we had to use a standard deviation of 0.07 rather than 0.3,
166
00:20:01,340 --> 00:20:09,380
because we're proposing changes to all of those dimensions at the same time.
167
00:20:09,380 --> 00:20:17,390
And to get a good acceptance rate while changing all those variables at once, we have to make smaller changes.
168
00:20:17,390 --> 00:20:21,170
And so we see we now take smaller steps to get towards the distribution,
169
00:20:21,170 --> 00:20:26,660
and we still proceed systematically pretty much towards the high probability regions.
170
00:20:26,660 --> 00:20:32,450
But once we get here, we are again doing a random walk, but now with a much smaller step size.
171
00:20:32,450 --> 00:20:45,120
So this now actually explores the distribution 18 times more slowly than when we had only one copy of the bivariate normal.
172
00:20:45,120 --> 00:20:52,860
So this is an illustration of the slowness of exploring things with random walks,
173
00:20:52,860 --> 00:20:57,960
because the random walks have no tendency to keep going in the same direction.
174
00:20:57,960 --> 00:21:06,630
And as a simple example, suppose we did a random walk in which x_{t+1} was just x_t plus some
175
00:21:06,630 --> 00:21:14,190
draw from a standard normal distribution, independently for each time step.
176
00:21:14,190 --> 00:21:26,850
Then you can easily add up the variances of these increments and see that x_{t+k} is likely to be only about the square root of k away from x_t.
177
00:21:26,850 --> 00:21:33,790
The variance of the difference is k, from adding up those k independent normals.
178
00:21:33,790 --> 00:21:39,420
But that means the standard deviation is only root k, and we're only likely to be about that far away.
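This square-root behaviour is easy to check numerically (my own illustration): for standard normal increments, the displacement x_{t+k} − x_t is N(0, k), so the typical distance grows like √k, not k.

```python
import random

def mean_abs_displacement(k, n_walks=4000, rng=random):
    """Average |x_{t+k} - x_t| over many random walks with N(0, 1)
    increments.  Theory: the displacement is N(0, k), so its mean
    absolute value is sqrt(2 * k / pi), i.e. on the order of sqrt(k)."""
    total = 0.0
    for _ in range(n_walks):
        pos = 0.0
        for _ in range(k):
            pos += rng.gauss(0.0, 1.0)   # one random-walk step
        total += abs(pos)
    return total / n_walks
```

After 100 steps, for instance, the walk has typically moved only about 10 units, despite a total path length of roughly 80.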
179
00:21:39,420 --> 00:21:51,210
So we get enormously faster exploration if we can somehow not do this random walk.
180
00:21:51,210 --> 00:22:00,540
Well, there are actually two approaches here. We could try to make it so that the changes we make are big rather than small. If
181
00:22:00,540 --> 00:22:05,040
they're so big that they sort of move you from one side of the distribution to the other,
182
00:22:05,040 --> 00:22:16,680
then we don't care about the random walk anymore. Or, if we still have to make small changes, we want to have them not do a random walk.
183
00:22:16,680 --> 00:22:22,530
And that will imply that the transitions have to be non-reversible, because if the transitions are reversible,
184
00:22:22,530 --> 00:22:29,010
the chance of going from x_t to x_{t+1} has to be the same as the chance of going in the other direction.
185
00:22:29,010 --> 00:22:37,450
So it's inevitable that there's a random walk aspect to things.
186
00:22:37,450 --> 00:22:43,690
Now, just a technical point here: if we have several Markov chain transitions,
187
00:22:43,690 --> 00:22:54,760
all of which leave the distribution pi invariant, then the operation of just applying them in sequence will also leave pi invariant.
188
00:22:54,760 --> 00:23:00,640
Hopefully that's fairly intuitive — that if each of these leaves pi invariant, the combination does too.
189
00:23:00,640 --> 00:23:07,390
But it's not the case that if they're all reversible, then the combination will also be reversible.
190
00:23:07,390 --> 00:23:11,800
That's not generally true, but it's OK that they're not reversible.
191
00:23:11,800 --> 00:23:22,870
That's not a requirement for MCMC. A simple example is the frequently used Gibbs sampling procedure, where we have a
192
00:23:22,870 --> 00:23:28,930
multivariate state and we update each component of it in turn by sampling from its conditional distribution.
193
00:23:28,930 --> 00:23:39,310
And if we're doing it systematically over the K components, then that's a sequential application of K transitions.
194
00:23:39,310 --> 00:23:44,260
And although it leaves the right distribution invariant, it's not reversible.
195
00:23:44,260 --> 00:23:50,980
This non-reversibility for Gibbs sampling seems to have no particularly amazing consequences.
196
00:23:50,980 --> 00:24:02,850
But in other situations, arranging for your transitions to be non reversible can be extremely advantageous.
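A concrete instance of such a systematic scan (my own sketch, not code from the talk): Gibbs sampling for a bivariate normal with correlation rho, zero means and unit variances, where each coordinate's conditional distribution is available in closed form.

```python
import math
import random

def gibbs_sweep(x, rho, rng=random):
    """One systematic-scan Gibbs sweep for a bivariate normal with zero
    means, unit variances and correlation rho.  Each coordinate is drawn
    from its exact conditional: x1 | x2 ~ N(rho * x2, 1 - rho^2), and
    symmetrically for x2.  The sweep leaves the target invariant, but as
    a composition of the two coordinate updates it is not reversible."""
    cond_sd = math.sqrt(1.0 - rho * rho)
    x = list(x)
    x[0] = rng.gauss(rho * x[1], cond_sd)  # update x1 given x2
    x[1] = rng.gauss(rho * x[0], cond_sd)  # update x2 given the new x1
    return x
```

Iterating `gibbs_sweep` gives a chain whose draws converge to the bivariate normal target.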
197
00:24:02,850 --> 00:24:06,030
So I'll now get on to Hamiltonian Monte Carlo,
198
00:24:06,030 --> 00:24:14,970
which is a very popular method of sampling from posterior distributions, now used, for instance, in the Stan package.
199
00:24:14,970 --> 00:24:21,180
The reason it's popular is that it's much more efficient than simple Metropolis methods where,
200
00:24:21,180 --> 00:24:29,220
for instance, you propose a new state by just adding random noise to the current state.
201
00:24:29,220 --> 00:24:36,690
The way HMC works is we augment the variable of interest, which I'll call x, with a momentum
202
00:24:36,690 --> 00:24:42,240
variable of the same dimensionality that has a Gaussian distribution independent of x,
203
00:24:42,240 --> 00:24:50,400
and the transition then consists of two parts. We first sample a value for the momentum from its distribution,
204
00:24:50,400 --> 00:24:59,340
which might be something very simple, like a normal with identity covariance matrix.
205
00:24:59,340 --> 00:25:06,210
And then we do a metropolis update in which the proposal is found by simulating Hamiltonian dynamics.
206
00:25:06,210 --> 00:25:17,160
That takes us from our current point (x, p) to some proposed (x*, p*), and then we accept or reject it in the usual Metropolis fashion.
207
00:25:17,160 --> 00:25:26,730
So this is also a combination of two steps, each of which leaves the distribution invariant and in fact, each of them is reversible.
208
00:25:26,730 --> 00:25:34,080
And so the sequential application of the two also leaves the distribution invariant.
209
00:25:34,080 --> 00:25:38,220
Now, if the Hamiltonian dynamic simulation were exact,
210
00:25:38,220 --> 00:25:45,060
we'd always accept, because that's a consequence of the properties of Hamiltonian dynamics.
211
00:25:45,060 --> 00:25:54,150
But in practice, we can't do the simulation exactly. We instead approximate the dynamics by doing what are called leapfrog steps,
212
00:25:54,150 --> 00:26:03,150
each of which moves us forward in simulated time here by an amount eta.
213
00:26:03,150 --> 00:26:13,500
And so we manage to go for a total time tau by doing L steps, each of which goes for a time eta.
214
00:26:13,500 --> 00:26:18,990
So this is a discretisation of the dynamics that isn't exact.
215
00:26:18,990 --> 00:26:29,990
And because of that, rejection is possible with the rejection probability being higher if we take bigger steps.
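A leapfrog step in code (my own sketch, assuming unit masses, a target with log density log pi, and the talk's notation of eta for the step size and L for the number of steps):

```python
def leapfrog(x, p, grad_log_pi, eta, L):
    """Approximate Hamiltonian dynamics with L leapfrog steps of size eta,
    starting from position x and momentum p (unit mass matrix assumed).
    The potential energy is U(x) = -log pi(x), so the force is grad log pi."""
    x, p = list(x), list(p)
    g = grad_log_pi(x)
    for _ in range(L):
        # Half step for the momentum, full step for the position,
        # then the second half step for the momentum.
        p = [pi + 0.5 * eta * gi for pi, gi in zip(p, g)]
        x = [xi + eta * pi for xi, pi in zip(x, p)]
        g = grad_log_pi(x)
        p = [pi + 0.5 * eta * gi for pi, gi in zip(p, g)]
    return x, p
```

Because this discretisation is only approximate, the Hamiltonian is not exactly conserved, which is where the occasional rejections come from.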
216
00:26:29,990 --> 00:26:42,290
So here is the bivariate Gaussian again. We can see that now we're doing 10 leapfrog steps in the simulation, with a step size of 0.16,
217
00:26:42,290 --> 00:26:47,000
and that gets us directly from our starting point to a point down here.
218
00:26:47,000 --> 00:26:53,780
And then we move around in big steps after that with occasional rejections that you can see with the red lines.
219
00:26:53,780 --> 00:26:58,400
But when we don't reject, we move quite a distance along the distribution.
220
00:26:58,400 --> 00:27:08,000
And therefore, although HMC is actually reversible, it doesn't particularly matter that it's reversible, because it's taking big steps.
221
00:27:08,000 --> 00:27:14,510
And so it doesn't matter that we are then equally likely to go the other direction, because
222
00:27:14,510 --> 00:27:23,340
we don't have to put together a lot of small steps in order to get any appreciable distance.
223
00:27:23,340 --> 00:27:32,430
So here are some trajectories from HMC. They start off with the red trajectory here, which was accepted, and then so on
224
00:27:32,430 --> 00:27:39,060
for the other colours. This blue one was rejected, and so we went back here for the next one.
225
00:27:39,060 --> 00:27:47,160
The dynamics confines these trajectories to the high-probability region, and also keeps them going
226
00:27:47,160 --> 00:27:52,390
in the same direction until it's necessary to turn around when you get to the end of the distribution.
227
00:27:52,390 --> 00:28:03,200
So that momentum keeping you going in the same direction is why there isn't a random walk within a trajectory.
228
00:28:03,200 --> 00:28:12,710
And here we have HMC for the replicated bivariate Gaussian, with 10 copies of the bivariate Gaussian.
229
00:28:12,710 --> 00:28:18,980
We have to reduce the step size in order to get a reasonable rejection rate in higher dimensions.
230
00:28:18,980 --> 00:28:27,110
And so there is some sort of penalty with dimensionality, but the penalty is much less than with the Metropolis methods.
231
00:28:27,110 --> 00:28:33,440
The penalty basically grows as the five-fourths power of the dimensionality,
232
00:28:33,440 --> 00:28:42,710
whereas it grows linearly with the dimensionality for simple metropolis methods.
233
00:28:42,710 --> 00:28:50,540
And now I'll get to Langevin Monte Carlo, which is the method that I'll actually be improving here.
234
00:28:50,540 --> 00:28:58,980
So Langevin Monte Carlo, although it wasn't originally thought of this way, is actually equivalent to doing HMC with a single leapfrog step.
235
00:28:58,980 --> 00:29:09,350
So it's just a special case. Going through this special case in detail here: we once again sample a momentum variable.
236
00:29:09,350 --> 00:29:12,560
And then we compute a proposal with a single leapfrog step,
237
00:29:12,560 --> 00:29:18,590
which I have written out this way involving the gradient of the log of the probability density.
238
00:29:18,590 --> 00:29:31,730
And then we accept or reject that proposal, and we can actually write out this proposal x star, sort of going
239
00:29:31,730 --> 00:29:40,160
through what actually happened here, as the current state plus a movement in the direction of the gradient,
240
00:29:40,160 --> 00:29:48,880
plus noise, which is another way that Langevin methods are thought of.
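As a concrete sketch of this single-leapfrog-step view (my own minimal Python rendering, not the speaker's code; `log_pi`, `grad_log_pi`, and `eta` are illustrative names for the log density, its gradient, and the step size):

```python
import numpy as np

def langevin_step(x, grad_log_pi, log_pi, eta, rng):
    # Sample a fresh momentum, independent of the previous one.
    p = rng.standard_normal(x.shape)
    # One leapfrog step: half-step for p, full step for x, half-step for p.
    p_half = p + 0.5 * eta * grad_log_pi(x)
    x_star = x + eta * p_half
    p_star = p_half + 0.5 * eta * grad_log_pi(x_star)
    # Metropolis accept/reject on the joint (x, p) density.
    log_ratio = (log_pi(x_star) - 0.5 * np.dot(p_star, p_star)) - (
        log_pi(x) - 0.5 * np.dot(p, p))
    return x_star if np.log(rng.uniform()) < log_ratio else x
```

Expanding the leapfrog step gives x_star = x + (eta^2/2) grad_log_pi(x) + eta * noise, which is the "gradient move plus noise" form mentioned above.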
241
00:29:48,880 --> 00:29:51,690
And here's Langevin for a bivariate Gaussian.
242
00:29:51,690 --> 00:30:01,240
We sort of overshoot: the first step overshoots to the opposite side of the distribution, and also a low-probability region,
243
00:30:01,240 --> 00:30:11,950
but it soon manages to get down into the distribution. Once it's there, it again does a random walk, just like random-walk Metropolis.
244
00:30:11,950 --> 00:30:18,700
There is actually an advantage to Langevin over random-walk Metropolis from using the gradient information,
245
00:30:18,700 --> 00:30:24,240
but it still suffers from this random walk inefficiency.
246
00:30:24,240 --> 00:30:34,230
And here we have it for the replicated bivariate Gaussian, and we have to reduce the step size to get a reasonable acceptance rate.
247
00:30:34,230 --> 00:30:37,260
And so things get worse with dimensionality.
248
00:30:37,260 --> 00:30:46,530
This scaling with dimensionality is better than for simple random-walk Metropolis updates, but it's worse than for HMC.
249
00:30:46,530 --> 00:30:51,600
It scales, I think, as the dimensionality to the power four-thirds,
250
00:30:51,600 --> 00:30:59,880
which is not as good as the dimensionality to the power five-fourths, which is what HMC is like.
251
00:30:59,880 --> 00:31:04,650
OK, so that's the background, culminating in the Langevin method,
252
00:31:04,650 --> 00:31:13,330
and I'll discuss how one can improve the Langevin method by making it non-reversible.
253
00:31:13,330 --> 00:31:18,250
So there are two non-reversible improvements here, actually.
254
00:31:18,250 --> 00:31:26,050
This has made it hard for me to come up with a catchy, short description of the new method.
255
00:31:26,050 --> 00:31:31,690
So the first non-reversible improvement was made by Horowitz back in 1991,
256
00:31:31,690 --> 00:31:40,300
and it modifies the Langevin method by only partially changing the momentum variable each time.
257
00:31:40,300 --> 00:31:44,410
So for the original Langevin method,
258
00:31:44,410 --> 00:31:52,750
you can think of the state as consisting only of X, because you resample P independently of the previous P at the start of each update,
259
00:31:52,750 --> 00:31:56,740
so you don't really have to save it from the last one if you don't want to.
260
00:31:56,740 --> 00:32:06,910
For Horowitz's modification, you do keep P in the state all the time, because you don't completely replace P, but instead only slightly modify it.
261
00:32:06,910 --> 00:32:13,570
We change P by an autoregressive update, making it equal to alpha P plus the square root
262
00:32:13,570 --> 00:32:20,440
of one minus alpha squared times some noise. If alpha is only slightly less than one,
263
00:32:20,440 --> 00:32:24,330
this makes only a slight change to P.
264
00:32:24,330 --> 00:32:33,960
Then we propose a new state by doing a leapfrog step as before, and we accept or reject this proposal in the usual way.
265
00:32:33,960 --> 00:32:42,420
A detail that didn't matter in the original Langevin method but does now is that the proposal not only does a leapfrog step from (X, P),
266
00:32:42,420 --> 00:32:47,310
but after doing that it also negates P. That ensures that the proposal is
267
00:32:47,310 --> 00:32:52,170
symmetrical and that we'd go back where we came from if we were to do it again.
268
00:32:52,170 --> 00:32:59,910
That is crucial to it leaving the right distribution invariant, which we prove by proving reversibility,
269
00:32:59,910 --> 00:33:04,350
which requires this negation. For the original Langevin method, you don't
270
00:33:04,350 --> 00:33:08,490
have to actually put that in your programme, because you're going to forget P anyway.
271
00:33:08,490 --> 00:33:17,130
But for this method, you do have to really negate P there, and then the next step is you just negate P.
272
00:33:17,130 --> 00:33:26,010
So there are three updates making up one of these overall persistent-momentum Langevin updates.
273
00:33:26,010 --> 00:33:34,500
Each of these three updates leaves the correct distribution invariant, so in sequence they also leave the correct distribution invariant.
274
00:33:34,500 --> 00:33:43,350
Each of them is actually also reversible, but the sequential combination is not reversible.
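The three sub-updates just described might be sketched like this (a minimal Python rendering of Horowitz's scheme, with illustrative names, not the speaker's own code):

```python
import numpy as np

def persistent_langevin_step(x, p, grad_log_pi, log_pi, eta, alpha, rng):
    # 1. Partial (autoregressive) momentum refresh.
    p = alpha * p + np.sqrt(1.0 - alpha**2) * rng.standard_normal(p.shape)
    # 2. Leapfrog proposal with the momentum negated at the end,
    #    which makes the proposal symmetric.
    p_half = p + 0.5 * eta * grad_log_pi(x)
    x_star = x + eta * p_half
    p_star = -(p_half + 0.5 * eta * grad_log_pi(x_star))
    log_ratio = (log_pi(x_star) - 0.5 * np.dot(p_star, p_star)) - (
        log_pi(x) - 0.5 * np.dot(p, p))
    if np.log(rng.uniform()) < log_ratio:
        x, p = x_star, p_star   # accepted: p is currently negated
    # 3. Negate p: this undoes the negation on acceptance,
    #    but reverses the momentum on rejection.
    return x, -p
```

On acceptance the two negations cancel, so motion continues in roughly the same direction; on rejection the final negation reverses the momentum, which is exactly the behaviour discussed next.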
275
00:33:43,350 --> 00:33:50,130
And in particular, you can see what's going to happen here when you accept: we slightly change P.
276
00:33:50,130 --> 00:33:54,150
So P is mostly pointing in the same direction it used to.
277
00:33:54,150 --> 00:34:01,380
Then we propose a state by a leapfrog step, which will tend to move X in the direction of P, and suppose we accept it.
278
00:34:01,380 --> 00:34:08,190
Then we accepted this proposal, in which we negated P, and then we negate P again.
279
00:34:08,190 --> 00:34:15,240
So we undid the negation here, and P is now still pointing in the direction it was pointing at the start.
280
00:34:15,240 --> 00:34:21,420
Well, maybe, depending on how much it changed in the leapfrog step,
281
00:34:21,420 --> 00:34:27,750
but often it would still be pointing in the same general direction as it had at the very beginning.
282
00:34:27,750 --> 00:34:36,270
However, if we do a rejection, then we didn't negate P here, because we didn't accept that proposal where it was negated.
283
00:34:36,270 --> 00:34:41,760
Instead, we had just the original P after step one, and so in step three, we negate P.
284
00:34:41,760 --> 00:34:45,420
And so now we're going in the opposite direction from where we were before,
285
00:34:45,420 --> 00:34:51,090
which is not very good, because now we're going backwards and largely going over states
286
00:34:51,090 --> 00:35:00,120
we just came from. That isn't very efficient, so we'd like to avoid that.
287
00:35:00,120 --> 00:35:11,220
So we have to set the step size small enough that rejections are quite infrequent, in order that we don't do that reversal of direction.
288
00:35:11,220 --> 00:35:17,900
So here's what we get when we do that for the bivariate Gaussian.
289
00:35:17,900 --> 00:35:24,630
We move from a low-probability point at the start towards the high-probability region,
290
00:35:24,630 --> 00:35:32,250
a bit slower than we did for the regular Langevin method, because we're only slightly changing the momentum now.
291
00:35:32,250 --> 00:35:41,790
So it takes a bit longer to dissipate the energy we had from being at this high-energy point, in statistical physics terms.
292
00:35:41,790 --> 00:35:48,930
But we do after a while and after that, we explore the distribution in small steps.
293
00:35:48,930 --> 00:35:56,910
But with the random walk mostly suppressed, because we don't keep changing direction by sampling a new P.
294
00:35:56,910 --> 00:36:05,970
Instead, we only slightly change P each time and keep going in the same direction, assuming we don't have a rejection, which results in this reversal.
295
00:36:05,970 --> 00:36:11,250
So you can see these successive states here are going in the same direction, mostly.
296
00:36:11,250 --> 00:36:20,040
Eventually we turn around, of course, but we're doing less of a random walk than we were with the simple Langevin method.
297
00:36:20,040 --> 00:36:28,460
But to achieve that, we do have to make the step size pretty small, so we don't get these rejections that reverse things.
298
00:36:28,460 --> 00:36:38,240
And here it is for the replicated bivariate Gaussian. We have to make the step size even smaller to keep the rejection rate low.
299
00:36:38,240 --> 00:36:42,950
But having done that, we can see that we do proceed in the same direction for quite a few iterations,
300
00:36:42,950 --> 00:36:49,070
if you can sort of see these dots in there.
301
00:36:49,070 --> 00:36:57,860
So by this method we have managed to suppress the random walks of the regular Langevin method.
302
00:36:57,860 --> 00:37:07,510
But there's a cost to that in having to choose a really small step size to reduce the rejection rates to a very small value.
303
00:37:07,510 --> 00:37:14,770
So that's sort of the slightly disappointing aspect of this, that having done all this,
304
00:37:14,770 --> 00:37:20,860
we see that we need a very small step size in order to get the rejection rates small.
305
00:37:20,860 --> 00:37:27,610
So even though we then avoid the random walks, it's still pretty slow compared to Hamiltonian Monte Carlo,
306
00:37:27,610 --> 00:37:34,660
which can use a larger eta, and which doesn't reverse in the middle of its trajectories,
307
00:37:34,660 --> 00:37:38,290
except when it has to turn around because it reaches a low probability region.
308
00:37:38,290 --> 00:37:44,680
And so because of that Hamiltonian Monte Carlo with long trajectory lengths,
309
00:37:44,680 --> 00:37:54,040
a large number of leapfrog steps L, tends to be better than Horowitz's persistent-momentum version of Langevin.
310
00:37:54,040 --> 00:38:02,530
So now we come to the new innovation here, which is to say that, well, as well as introducing non-reversibility by
311
00:38:02,530 --> 00:38:14,080
not completely replacing P in each iteration, we'll also introduce non-reversibility into the acceptance decision, which may seem strange.
312
00:38:14,080 --> 00:38:25,410
So what do I mean by that? To decide on accepting a Metropolis proposal to move from X to X star,
313
00:38:25,410 --> 00:38:32,360
we have to accept with probability that's the minimum of one and the ratio pi of X star over pi of X.
314
00:38:32,360 --> 00:38:40,740
An easy way to do that is you just generate a uniform zero-one random variable U, and you check whether U is less than that ratio.
315
00:38:40,740 --> 00:38:43,800
If the ratio is greater than one, of course it's less.
316
00:38:43,800 --> 00:38:50,220
If the ratio is less than one, then you are going to accept with probability equal to that ratio.
317
00:38:50,220 --> 00:38:56,940
Now that's equivalent to saying that you'll check whether pi of X star is greater than S, where S is a random variable
318
00:38:56,940 --> 00:39:00,810
That's uniform on the interval from zero to pi of X.
319
00:39:00,810 --> 00:39:07,860
You can see that by just sort of moving pi of X to the other side of that inequality.
320
00:39:07,860 --> 00:39:11,520
And now, rather than choosing this S randomly, independently,
321
00:39:11,520 --> 00:39:21,930
each time, we can make S part of the state and update it in any way that leaves the joint distribution of X and S invariant,
322
00:39:21,930 --> 00:39:26,770
in which S is independent of X in that joint distribution.
323
00:39:26,770 --> 00:39:35,160
Well, no, it's not independent. It's uniform over zero to pi of X, so it's not independent of X.
324
00:39:35,160 --> 00:39:44,700
Now it's more convenient to actually think in terms of U, that's S over pi of X, and U is independent of X.
325
00:39:44,700 --> 00:39:54,000
So we can think of U as being an independent uniform zero-one random variable that's part of the state along with X.
326
00:39:54,000 --> 00:40:05,430
And then we can make our accept/reject decision in this way, using the U part of the state when we're looking at a proposal to change X.
327
00:40:05,430 --> 00:40:15,540
So now we can do any update to U that leaves this joint distribution invariant.
328
00:40:15,540 --> 00:40:24,210
And one possible update is that, for some constant delta, we either add or subtract delta times pi of X to S,
329
00:40:24,210 --> 00:40:28,470
which is the same as adding or subtracting delta to U.
330
00:40:28,470 --> 00:40:37,440
And we then reflect off the boundaries at zero and pi of X, or zero and one if we're thinking in terms of U.
331
00:40:37,440 --> 00:40:42,210
So here are some implementation details as to how we manage to reflect off the boundaries.
332
00:40:42,210 --> 00:40:54,600
We can let S be pi of X times the absolute value of V, where V now has the uniform distribution from minus one to plus one, independent of X,
333
00:40:54,600 --> 00:41:00,630
and we update V by adding delta and then subtracting two if V ends up greater than one.
334
00:41:00,630 --> 00:41:09,210
So this effectively implements reflecting off the boundaries, at zero and pi of X for S, and zero and one for U.
335
00:41:09,210 --> 00:41:14,760
And now, if X changes to X prime, then we make a corresponding change of V to V prime.
336
00:41:14,760 --> 00:41:24,570
That's the old V times the ratio of pi of X over pi of X prime, and that keeps S unchanged.
337
00:41:24,570 --> 00:41:33,750
And the reason this is a valid procedure is that we're not actually changing S in this change to X.
338
00:41:33,750 --> 00:41:43,790
So that's a bit of a quick explanation as to why this is a valid method, but I'll continue on.
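A minimal sketch of this bookkeeping, assuming a generic symmetric proposal (the function and variable names are mine, not from the talk):

```python
import numpy as np

def nonrev_accept_step(x, v, propose, log_pi, delta, rng):
    # Non-reversibly update v in [-1, 1]; u = |v| plays the role of
    # the uniform acceptance variable.
    v = v + delta
    if v > 1.0:
        v -= 2.0   # wrapping v reflects u = |v| off the boundary at one
    # Symmetric proposal (stand-in for e.g. a Langevin proposal).
    x_star = propose(x, rng)
    ratio = np.exp(log_pi(x_star) - log_pi(x))
    # Accept if u = |v| < pi(x_star)/pi(x).
    if abs(v) < ratio:
        v = v / ratio   # keeps s = pi(x) * |v| unchanged under the move
        x = x_star
    return x, v
```

Note that acceptance requires |v| < ratio, so the rescaled v stays inside (-1, 1), and s = pi(x) * |v| is the same before and after the change to x, as described above.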
339
00:41:43,790 --> 00:41:55,610
So now the question is, why might this be of any use? And the use is that by doing this, we can cluster rejections together.
340
00:41:55,610 --> 00:42:06,380
This isn't going to change the rejection rate, because U still has the same uniform zero-one distribution as it did before.
341
00:42:06,380 --> 00:42:18,410
And so on average, we get the same rejection rate as before. But by moving U slowly from zero to one and back again,
342
00:42:18,410 --> 00:42:24,920
we can end up having rejections clustered together and acceptances clustered together.
343
00:42:24,920 --> 00:42:34,970
So in particular, if the rejection rate isn't high, as would be the case for a Langevin method with a reasonable step size,
344
00:42:34,970 --> 00:42:40,190
then the ratio pi of X star over pi of X is usually going to be close to one.
345
00:42:40,190 --> 00:42:47,420
So U mostly changes as a result of adding delta and reflecting off the boundaries.
346
00:42:47,420 --> 00:42:56,690
So if delta is small, then U is going to be near zero for a while, as we just slowly change it by adding that delta,
347
00:42:56,690 --> 00:43:05,000
then it'll be near one for a while, and then it'll reflect off one and start decreasing and be near zero for a while again.
348
00:43:05,000 --> 00:43:12,950
So that has the effect of clustering the rejections, because we're much more likely to reject when U is near one.
349
00:43:12,950 --> 00:43:20,030
Because remember, we check whether we accept by seeing if U is less than this ratio.
350
00:43:20,030 --> 00:43:26,600
So if U is close to zero, that's much more likely to be true than if U is close to one.
351
00:43:26,600 --> 00:43:36,080
So the effect of this is that we'll tend to get a bunch of rejections when U is near one, but then get lots of acceptances when U is near zero.
352
00:43:36,080 --> 00:43:47,420
And now the benefit of this is that, for Horowitz's Langevin procedure with the momentum only slightly
353
00:43:47,420 --> 00:43:58,850
changed each time, we avoid random-walk behaviour as long as we don't have a rejection that reverses the momentum.
354
00:43:58,850 --> 00:44:06,110
Whereas when we do have a rejection, we reverse the momentum, and things don't work so well anymore.
355
00:44:06,110 --> 00:44:14,810
So by clustering the acceptances together, we'll get more consecutive acceptances that keep moving in the same direction.
356
00:44:14,810 --> 00:44:20,090
And you can see how that is better by a simple sort of model here.
357
00:44:20,090 --> 00:44:29,900
Suppose each accepted move moves the distance D in some direction,
358
00:44:29,900 --> 00:44:35,390
and each rejection randomises the direction, which isn't quite what happens with that method.
359
00:44:35,390 --> 00:44:44,630
But let's suppose that. Then if we do 20 times some number K iterations, which take the form of four acceptances,
360
00:44:44,630 --> 00:44:51,290
one reject, four acceptances, one reject, and so forth, this all repeated K times,
361
00:44:51,290 --> 00:45:03,110
then on average, the distance we move is going to scale as the square root of 4K, because there are 4K groups of four accepts and one reject,
362
00:45:03,110 --> 00:45:09,170
and we do a random walk over those groups.
363
00:45:09,170 --> 00:45:17,570
And so the sort of distance that we move in those 4K groups of moves is the distance moved in one group,
364
00:45:17,570 --> 00:45:24,890
which is 4D, times the square root of the number of groups, which works out to 8D times the square root of K.
365
00:45:24,890 --> 00:45:32,780
But if instead we do these 20K iterations in the form of 16 acceptances followed by four rejections,
366
00:45:32,780 --> 00:45:43,550
which is the same acceptance probability as above, then each of these blocks moves the distance 16D, and we have K of them doing a random walk.
367
00:45:43,550 --> 00:45:49,070
And so we move 16D times the square root of K, which you can see is twice as much as before.
368
00:45:49,070 --> 00:45:57,560
So there's an advantage to clustering the rejections together, because then you also cluster the acceptances together, inevitably.
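This back-of-the-envelope calculation can be checked with a tiny simulation of the toy model (reducing "direction" to +1/-1 on a line is my simplification of the talk's model; the function name is illustrative):

```python
import numpy as np

def mean_distance(pattern, repeats, rng, reps=500, D=1.0):
    # Average |net displacement| when an accept(1)/reject(0) pattern is
    # repeated `repeats` times; accepts move distance D along the current
    # direction (+1 or -1 on a line), rejects randomise the direction.
    total = 0.0
    for _ in range(reps):
        pos, direction = 0.0, 1.0
        for _ in range(repeats):
            for accepted in pattern:
                if accepted:
                    pos += direction * D
                else:
                    direction = 1.0 if rng.uniform() < 0.5 else -1.0
        total += abs(pos)
    return total / reps

rng = np.random.default_rng(3)
K = 50
# Same 80% acceptance rate, 20*K iterations in both cases:
scattered = mean_distance([1, 1, 1, 1, 0], 4 * K, rng)    # 4 accepts, 1 reject
clustered = mean_distance([1] * 16 + [0] * 4, K, rng)     # 16 accepts, 4 rejects
```

In runs of this sketch, `clustered` comes out at roughly twice `scattered`, matching the 16 times root K versus 8 times root K argument above.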
369
00:45:57,560 --> 00:46:04,520
So that's the motivation for trying this non-reversible scheme for doing acceptances,
370
00:46:04,520 --> 00:46:14,410
involving just adding delta to U rather than sampling an independent value of U each time.
371
00:46:14,410 --> 00:46:26,020
So here's an illustration of this for the bivariate Gaussian as before.
372
00:46:26,020 --> 00:46:31,060
Well, you can sort of see that it doesn't do a random walk
373
00:46:31,060 --> 00:46:35,560
by how it goes here before it actually has a rejection and has to reverse there.
374
00:46:35,560 --> 00:46:42,670
There are a few too many points to see things clearly here, but it's not doing a random walk in there,
375
00:46:42,670 --> 00:46:54,350
and that's even though it's using a larger step size than for the previous illustration of Horowitz's version of the Langevin method.
376
00:46:54,350 --> 00:46:58,310
And here we are for the replicated bivariate Gaussian again,
377
00:46:58,310 --> 00:47:07,730
we can use a larger step size than we did before, and we still get a lot of suppression of random walks.
378
00:47:07,730 --> 00:47:13,480
As you can see here, for instance, it consistently goes in the same direction for many iterations.
379
00:47:13,480 --> 00:47:17,830
So we do need to tune things, of course.
380
00:47:17,830 --> 00:47:22,300
So we have to keep eta small enough that we don't have a really high rejection rate,
381
00:47:22,300 --> 00:47:28,240
but we don't have to keep the rejection rate as small as before. And we also have to tune Delta.
382
00:47:28,240 --> 00:47:37,810
It seems that about one minus alpha is right, although I don't have any precise ideas about how delta should be tuned.
383
00:47:37,810 --> 00:47:49,300
And when we do that tuning properly, compared to a properly tuned HMC, we get about the same performance, it seems, on this example.
384
00:47:49,300 --> 00:47:55,330
And if we look at the value of U at each iteration: here is the value of U
385
00:47:55,330 --> 00:48:03,820
at each iteration for the regular persistent-momentum Langevin method that Horowitz came up with, and the rejections are in red here.
386
00:48:03,820 --> 00:48:09,250
So U is sampled independently each iteration, and we get the expected scatter here.
387
00:48:09,250 --> 00:48:19,180
And some of those are rejections. Here's what we get with the non-reversible update of U.
388
00:48:19,180 --> 00:48:25,180
We get a bunch of rejections because we started at a very low-probability point, which causes problems at the beginning.
389
00:48:25,180 --> 00:48:30,850
So you can sort of ignore that and then see what happens after we've gotten to the high probability region.
390
00:48:30,850 --> 00:48:37,660
And you can see we have a lot of consecutive rejections when U is close to one,
391
00:48:37,660 --> 00:48:42,550
but then we have a big stretch of time here where there are no rejections, when U is close to zero.
392
00:48:42,550 --> 00:48:46,570
So we've successfully clustered the rejections together,
393
00:48:46,570 --> 00:48:54,820
making them less damaging as far as randomising the direction, so we can manage to suppress random walks for a
394
00:48:54,820 --> 00:49:09,510
long time here and make better progress than if we have rejections mixed in there that keep reversing the momentum.
395
00:49:09,510 --> 00:49:20,060
OK, so now I have a few minutes to talk about how this method could actually be applied.
396
00:49:20,060 --> 00:49:31,760
Here's a test idea regarding discrete variables, so this is a simple distribution, which means we can actually tell what the right answer is.
397
00:49:31,760 --> 00:49:39,950
That's meant to be a model for more complex situations where we have a bunch of continuous variables that we want to update with HMC.
398
00:49:39,950 --> 00:49:47,480
But we also have some discrete variables which we can't update with HMC.
399
00:49:47,480 --> 00:49:49,880
So we're going to update those with Gibbs sampling,
400
00:49:49,880 --> 00:49:58,400
and we alternate the updates of the discrete variables with Gibbs sampling with updates of the continuous variables with HMC.
401
00:49:58,400 --> 00:50:00,740
Now here there are only two continuous variables,
402
00:50:00,740 --> 00:50:08,720
but for this example that you can imagine that there is lots more and that the updates of the continuous variables are actually the dominant cost.
403
00:50:08,720 --> 00:50:13,120
And so I'm going to assess the methods on that basis:
404
00:50:13,120 --> 00:50:24,900
the HMC updates are the ones that are costly. And so if you do the Langevin method made non-reversible,
405
00:50:24,900 --> 00:50:35,790
both in terms of Horowitz's modification to only partially change P, and the non-reversible change to U for the acceptance decision,
406
00:50:35,790 --> 00:50:38,670
then, if you tune things properly,
407
00:50:38,670 --> 00:50:47,010
you get a method that's 1.83 times more efficient than if you tune HMC optimally, for this problem at least.
408
00:50:47,010 --> 00:50:56,080
Well, I didn't tune them precisely optimally. You can see which values I checked, and then I picked the best for each of these methods.
409
00:50:56,080 --> 00:51:00,850
So that's encouraging. I don't know what the scaling is here,
410
00:51:00,850 --> 00:51:07,390
and this is a relatively low-dimensional problem, and so if you have much higher-
411
00:51:07,390 --> 00:51:12,280
dimensional continuous variables and perhaps more discrete variables for that matter,
412
00:51:12,280 --> 00:51:14,920
how the whole thing scales.
413
00:51:14,920 --> 00:51:23,050
I don't know whether you still get only roughly twice the efficiency, which would be slightly disappointing.
414
00:51:23,050 --> 00:51:26,110
I mean, being twice as efficient is nice,
415
00:51:26,110 --> 00:51:34,210
but you'd like to be even more efficient; or perhaps instead the advantage gets bigger and bigger as the dimensionality goes up.
416
00:51:34,210 --> 00:51:44,020
Or I suppose it's possible it gets smaller as the dimensionality goes up. I don't know the answers to those questions yet.
417
00:51:44,020 --> 00:51:53,470
One can also look at how this method works for Bayesian neural network models, which can be very complex hierarchical models.
418
00:51:53,470 --> 00:52:03,610
So here's a description of some networks of increasing complexity, which are actually the ones I'm playing around with at the moment.
419
00:52:03,610 --> 00:52:08,050
At the simplest level, for a regression network, for instance,
420
00:52:08,050 --> 00:52:14,980
we might have one hidden layer of 10 units, if you're familiar with neural network models.
421
00:52:14,980 --> 00:52:25,120
Tanh is a common, although somewhat old-fashioned, activation function for the hidden units.
422
00:52:25,120 --> 00:52:30,230
So we have one layer of those, and then that means we have three groups of weights.
423
00:52:30,230 --> 00:52:38,500
There's the weights from the inputs to those hidden units. There's what are called the biases or the intercept terms for the hidden units.
424
00:52:38,500 --> 00:52:43,180
And there are the weights from the hidden units to the output.
425
00:52:43,180 --> 00:52:47,110
And in the simplest model, those are the only groups you have.
426
00:52:47,110 --> 00:52:57,100
So you have three hyperparameters controlling those three groups. For a regression model, you also have a parameter.
427
00:52:57,100 --> 00:53:00,130
That's the residual variance for the response,
428
00:53:00,130 --> 00:53:10,540
which effectively plays the role of a hyperparameter as far as its effect on how the model works. To make a more complicated model,
429
00:53:10,540 --> 00:53:15,100
we might say that, well, we're not really sure about how relevant each input is,
430
00:53:15,100 --> 00:53:22,780
so we can instead have K plus three hyperparameters, in which we have a separate
431
00:53:22,780 --> 00:53:27,670
hyperparameter for the weights out of each input into the hidden units.
432
00:53:27,670 --> 00:53:34,660
So we have K such hyperparameters, and since we've got K of them, we're not sure what prior to give them.
433
00:53:34,660 --> 00:53:40,240
So in the same way as we did before, we say, well, since we're not sure, we'll create a higher-level
434
00:53:40,240 --> 00:53:45,250
hyperparameter that controls the distribution of these K lower-level hyperparameters.
435
00:53:45,250 --> 00:53:46,690
So that's another one.
436
00:53:46,690 --> 00:53:56,630
And then we have the hidden-unit biases, which we have a hyperparameter to control, and the hidden-to-output weights. Going up in complexity,
437
00:53:56,630 --> 00:54:05,050
we could say, well, maybe the function here of these inputs has an additive structure, where it's a function
438
00:54:05,050 --> 00:54:10,270
of input one, plus a function of input two, plus a function of input three.
439
00:54:10,270 --> 00:54:17,770
Or maybe it's at least sort of approximately that. So we could have a model in which we have K hidden layers,
440
00:54:17,770 --> 00:54:26,170
each of which looks at only one input, for the K different inputs, and connects to the output, of course.
441
00:54:26,170 --> 00:54:33,520
And those K hidden layers can compute the K components of an additive function.
442
00:54:33,520 --> 00:54:35,380
But maybe it's not completely additive,
443
00:54:35,380 --> 00:54:41,740
so we'll have another hidden layer that looks at all inputs and could model interactions, and that also goes to the output.
444
00:54:41,740 --> 00:54:48,840
And if you add it all up, we have three K plus three hyperparameters there.
445
00:54:48,840 --> 00:54:53,280
So now, how well do the methods work on these models?
446
00:54:53,280 --> 00:55:01,680
Well, that's what I've been running simulations for for the last week or two, and it's actually pretty hard to say.
447
00:55:01,680 --> 00:55:07,470
One reason is we don't know the correct answer.
448
00:55:07,470 --> 00:55:16,030
These are obviously models that are far too complex to know with certainty what the correct answer is.
449
00:55:16,030 --> 00:55:20,770
We could look at the error for predictions on a test set.
450
00:55:20,770 --> 00:55:30,110
But unfortunately, there's no guarantee that the method that's best at sampling from the posterior will actually produce the best predictions.
451
00:55:30,110 --> 00:55:38,740
The model might not fit the data well.
452
00:55:38,740 --> 00:55:46,870
Doing the wrong thing might actually be better if the model is not very good. Or even if the model is good, it's still based on limited data.
453
00:55:46,870 --> 00:55:53,890
You have a posterior spread out someplace, and even if that includes the good values,
454
00:55:53,890 --> 00:55:58,900
it also will include some not so good values if you don't have a lot of data.
455
00:55:58,900 --> 00:56:05,410
And if your sampling method doesn't really explore the full distribution, but happens to be in a good spot instead,
456
00:56:05,410 --> 00:56:09,670
it could look better, even though it's actually worse as far as sampling the posterior.
457
00:56:09,670 --> 00:56:19,240
So it's hard to tell. I previously played around with a simple classification
458
00:56:19,240 --> 00:56:24,220
model, looking at the autocorrelation time for the hyperparameters.
459
00:56:24,220 --> 00:56:32,830
The non-reversible Langevin method was a little bit better than HMC. For the more complex regression models from the previous slide,
460
00:56:32,830 --> 00:56:38,500
I've played around with six inputs, all of which are relevant to varying degrees.
461
00:56:38,500 --> 00:56:47,620
and with the true function having both additive and interaction aspects, and five training cases.
462
00:56:47,620 --> 00:56:52,420
And if you then fit the simplest of the models, with only three hyperparameters,
463
00:56:52,420 --> 00:56:58,810
HMC seems to do a slightly better job at prediction than the Langevin methods.
464
00:56:58,810 --> 00:57:08,920
But for the more complex models, the best predictions are made by non-reversible Langevin methods, although there's a lot of variation here.
465
00:57:08,920 --> 00:57:18,820
You change tuning parameters by what doesn't seem to be drastically large amounts, and you get much,
466
00:57:18,820 --> 00:57:26,800
much different predictive performance. So this makes it complicated to see what's actually going on.
467
00:57:26,800 --> 00:57:35,890
So I'm still playing around with this, to see how much better you can get than HMC by playing around with the non-reversible methods.
468
00:57:35,890 --> 00:57:41,520
So that's it. OK.
469
00:57:41,520 --> 00:57:49,020
Thank you very much for the talk, Radford. And now we'll open the floor to questions.
470
00:57:49,020 --> 00:57:53,940
I should say perhaps that there is a bibliography here if you want to look at the slides at the end.
471
00:57:53,940 --> 00:58:11,490
Okay. Mm-Hmm. OK, thank you. So if you would like to ask a question, please unmute yourself and ask.
472
00:58:11,490 --> 00:58:15,900
Hi, can you hear me? Yes, I can hear you.
473
00:58:15,900 --> 00:58:20,520
Hi, hello, Radford. Very nice talk. Thank you.
474
00:58:20,520 --> 00:58:25,910
My question is whether you tried to do similar things with Hamiltonian Monte Carlo itself.
475
00:58:25,910 --> 00:58:32,010
Like, can you also do a non-reversible, persistent version? Yes.
476
00:58:32,010 --> 00:58:40,050
I've expressed Horowitz's method as applying to Langevin, which is HMC with a single leapfrog step,
477
00:58:40,050 --> 00:58:46,170
but the same thing can be done for HMC with more than one leapfrog step.
478
00:58:46,170 --> 00:58:53,430
You can decide that you'll do trajectories of 10 steps, and then you'll also not completely replace the momentum before the next one.
479
00:58:53,430 --> 00:58:57,210
You'll do the same sort of thing of only partially changing the momentum.
480
00:58:57,210 --> 00:59:04,980
And then you have the same issue, that if you actually rejected that trajectory, you'll end up reversing the momentum.
481
00:59:04,980 --> 00:59:12,690
And so you can do combined ones that sort of try to suppress random walks by a combination
482
00:59:12,690 --> 00:59:20,700
of doing multiple leapfrog steps per trajectory and only partially changing the momentum.
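The combined scheme being described, in the spirit of Horowitz's method, might be sketched as follows. This is an illustrative reading, not code from the talk; the function name, step size `eps`, trajectory length `n_leapfrog`, and refresh parameter `alpha` are all assumed choices:

```python
import numpy as np

def persistent_hmc_step(x, p, log_pi, grad_log_pi, eps=0.1, n_leapfrog=10, alpha=0.9):
    """One HMC update with partial momentum refresh: refresh the momentum
    only partially, run a short leapfrog trajectory, accept or reject,
    and reverse the momentum on rejection.  Sketch under assumptions."""
    # Partially refresh the momentum; this update preserves its N(0, I)
    # stationary distribution for any alpha in [0, 1].
    p = alpha * p + np.sqrt(1.0 - alpha**2) * np.random.standard_normal(p.shape)

    # Leapfrog trajectory of n_leapfrog steps.
    x_new, p_new = x, p
    p_new = p_new + 0.5 * eps * grad_log_pi(x_new)       # initial half step
    for i in range(n_leapfrog):
        x_new = x_new + eps * p_new                       # full position step
        g = grad_log_pi(x_new)
        p_new = p_new + (eps if i < n_leapfrog - 1 else 0.5 * eps) * g

    # Metropolis accept/reject on the joint (x, p) density.
    log_accept = (log_pi(x_new) - 0.5 * p_new @ p_new) - (log_pi(x) - 0.5 * p @ p)
    if np.log(np.random.uniform()) < log_accept:
        return x_new, p_new      # accepted: keep the new momentum
    else:
        return x, -p             # rejected: reverse the momentum
```

The momentum reversal on rejection is what makes partial refresh valid, and it is also the behaviour referred to above: a rejection turns the persistent momentum around.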
483
00:59:20,700 --> 00:59:27,690
Kennedy, I believe, investigated this at one point and found that disappointingly,
484
00:59:27,690 --> 00:59:35,970
you can only get a small benefit from this sort of hybrid method.
485
00:59:35,970 --> 00:59:44,040
There's an extra tuning parameter then: not only do you tune the step size and the number of leapfrog steps, but also how much you replace the momentum.
486
00:59:44,040 --> 00:59:53,070
And with that extra tuning parameter, you can get slightly better than if you stuck to straight HMC.
487
00:59:53,070 --> 01:00:00,750
But not all that much; in particular, it didn't change the scaling with dimensionality.
488
01:00:00,750 --> 01:00:11,520
I think there's a paper by Kennedy and maybe some other people on that. Thank you.
489
01:00:11,520 --> 01:00:22,460
Thank you, Peter. Anybody else? Can you hear me?
490
01:00:22,460 --> 01:00:26,120
Yep. All right. Thanks very much.
491
01:00:26,120 --> 01:00:33,610
That's very interesting. I liked that trick of making u a random variable.
492
01:00:33,610 --> 01:00:42,550
That was really neat. But if I understand correctly, the update was not deterministic, it was stochastic.
493
01:00:42,550 --> 01:00:46,540
Is that correct? Well, no.
494
01:00:46,540 --> 01:00:51,730
It's actually deterministic. I mean, the actual update itself is.
495
01:00:51,730 --> 01:00:57,250
Of course, things change stochastically in a Langevin fashion.
496
01:00:57,250 --> 01:01:01,480
But the actual thing in the state, as I have expressed,
497
01:01:01,480 --> 01:01:09,070
it is v, and u is the absolute value of v, and the density is pi of x times the uniform density for v.
498
01:01:09,070 --> 01:01:17,290
And the update to v itself is to simply add delta and then subtract two if it's greater than one.
499
01:01:17,290 --> 01:01:26,080
But that's the actual update to v. But then v also has to change when x changes, when x changes to x prime.
500
01:01:26,080 --> 01:01:31,270
Then you have to change v to v times pi of x over pi of x prime.
501
01:01:31,270 --> 01:01:37,600
The proposal that resulted in that will have been stochastic.
502
01:01:37,600 --> 01:01:47,650
So v will change stochastically as a result of the changes to x, which is what you see here. As long as we reject,
503
01:01:47,650 --> 01:01:52,270
so we aren't changing x, v proceeds deterministically.
504
01:01:52,270 --> 01:01:57,040
Well, you see this deterministic v going up to one, and then it reflects and comes back down.
505
01:01:57,040 --> 01:02:01,360
That corresponds to V having switched to minus one.
506
01:02:01,360 --> 01:02:05,560
And then the increases in v at that point are decreases in u.
507
01:02:05,560 --> 01:02:13,150
But as soon as there's an acceptance, there's a bit of jitter here because the ratio of Pi x prime to Pi X won't have been one.
508
01:02:13,150 --> 01:02:21,520
And so for the acceptances, there's some slight stochastic aspect resulting from the stochastic changes to x.
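As a concrete sketch of the update being described, here is one way the non-reversible accept/reject variable might be implemented. This is a minimal reading of the scheme as stated in the talk, not code from it; the symmetric `propose` function and the value of `delta` are illustrative assumptions:

```python
import numpy as np

def nonrev_metropolis_step(x, v, log_pi, propose, delta=0.03):
    """One Metropolis update using a non-reversibly updated variable
    v in (-1, 1), with u = |v| playing the role of the Uniform(0,1)
    accept/reject draw.  Sketch only; `propose` must be symmetric."""
    # Non-reversible update to v: add delta, and subtract two if it
    # exceeds one, so v wraps around rather than being resampled.
    v = v + delta
    if v > 1.0:
        v = v - 2.0
    u = abs(v)

    # Ordinary Metropolis proposal and accept/reject test, but using u.
    x_new = propose(x)
    log_ratio = log_pi(x_new) - log_pi(x)          # log of pi(x') / pi(x)
    if log_ratio >= 0.0 or u < np.exp(log_ratio):
        # Accepted: rescale v to v * pi(x)/pi(x'), so u stays in (0, 1).
        return x_new, v * np.exp(-log_ratio)
    return x, v                                    # rejected: x unchanged
```

On rejections, v moves deterministically by delta, so |v| rises to one, reflects, and comes back down, the sawtooth behaviour described here; on acceptances, the rescaling by pi(x)/pi(x') adds the slight stochastic jitter.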
509
01:02:21,520 --> 01:02:31,820
And so is there some constraint there, that delta not perfectly divide the interval, or something like that, I was sort of thinking.
510
01:02:31,820 --> 01:02:44,090
OK, so I expressed it as you just add delta, but you could add delta plus some noise if you felt like it, in fact.
511
01:02:44,090 --> 01:02:48,920
And in fact, the noise distribution can be anything at all.
512
01:02:48,920 --> 01:02:53,750
And you might do that if you were worried about what would happen otherwise.
513
01:02:53,750 --> 01:02:58,280
So, for instance, if you're sampling from a uniform distribution.
514
01:02:58,280 --> 01:03:07,130
So the ratios of pi of x prime over pi of x are always one for accepted moves.
515
01:03:07,130 --> 01:03:15,830
And you chose delta to be exactly one half. Then you might worry that you're not sampling properly, though there it wouldn't actually matter.
516
01:03:15,830 --> 01:03:20,630
But if you made it a slightly more complex example in which, nevertheless,
517
01:03:20,630 --> 01:03:27,550
the ratios of pi of x prime over pi of x take on only some small number of values, then it's conceivable
518
01:03:27,550 --> 01:03:32,180
that you could end up being non-ergodic by just adding delta all the time.
519
01:03:32,180 --> 01:03:37,750
So if you're worried about that, you can add Delta and some amount of noise if you feel like it.
520
01:03:37,750 --> 01:03:42,440
Thanks. OK.
521
01:03:42,440 --> 01:03:47,300
Any last questions for Radford?
522
01:03:47,300 --> 01:04:00,450
Any more questions for Radford? Last call.
523
01:04:00,450 --> 01:04:11,890
Mm hmm. Okay. I should maybe mention that amongst the references is a paper on arXiv on this, which is this one here:
524
01:04:11,890 --> 01:04:23,400
Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions, which is on arXiv.
525
01:04:23,400 --> 01:04:33,630
OK. OK, I guess then we'll wrap up the talk, and I thank you, Radford, for being able to join virtually and for a great talk.
526
01:04:33,630 --> 01:04:37,966
Thank you. Well, thanks for inviting me. OK, thanks.