1
00:00:04,780 --> 00:00:11,890
OK, so let's start. So welcome, everyone, to this Friday's seminar talk.
2
00:00:11,890 --> 00:00:18,220
So this is our second seminar of this term.
3
00:00:18,220 --> 00:00:23,770
So the speaker today is Quan Zhou from Texas A&M University.
4
00:00:23,770 --> 00:00:31,660
So I met him back in 2018, when he was a postdoc at Rice University and he visited the University of Toronto and
5
00:00:31,660 --> 00:00:40,180
my supervisor, Jeffrey Rosenthal. Starting from that time, we have worked together.
6
00:00:40,180 --> 00:00:50,350
I think two years ago he moved to Texas A&M University, and now he's an assistant professor there.
7
00:00:50,350 --> 00:00:55,330
So our collaboration has been incredible and rather successful, I would say.
8
00:00:55,330 --> 00:01:01,270
So I highly recommend you talk to him if you have some common interests with him.
9
00:01:01,270 --> 00:01:11,650
He has very broad research interests beyond the Bayesian methodology and computation we will hear about today.
10
00:01:11,650 --> 00:01:20,710
He also works on statistical genetics, graphical models, stochastic control and experimental design.
11
00:01:20,710 --> 00:01:26,530
So today he will talk about the complexity of local MCMC search for high-dimensional model selection.
12
00:01:26,530 --> 00:01:34,510
I believe it includes some of our joint work, and also his own work with his students at Texas.
13
00:01:34,510 --> 00:01:39,850
So I will leave it to him; that's it for the introduction.
14
00:01:39,850 --> 00:01:49,240
Thank you so much for having me here. It's really a huge pleasure and a huge honour for me to speak about my research at this seminar.
15
00:01:49,240 --> 00:02:00,050
So today I would like to talk about some of my recent work on locally informed MCMC methods for high-dimensional model selection.
16
00:02:00,050 --> 00:02:10,100
Actually, the talk is mostly based on joint work with Jun Yang, Gareth Roberts,
17
00:02:10,100 --> 00:02:17,260
Jeffrey Rosenthal and Dootika Vats, and some results are based on joint work with my students.
18
00:02:17,260 --> 00:02:25,120
And I will also mention one result which comes from an ongoing project.
19
00:02:25,120 --> 00:02:33,340
So before I start my talk, I would like to share with you a story behind the birth of MCMC.
20
00:02:33,340 --> 00:02:35,620
Perhaps a lot of you already know this.
21
00:02:35,620 --> 00:02:48,370
So the Metropolis algorithm was first implemented by Dr. Arianna Rosenbluth, the woman in this picture, in 1953.
22
00:02:48,370 --> 00:02:54,070
She was actually from Texas, and she received her Ph.D. at the age of twenty-two.
23
00:02:54,070 --> 00:03:02,680
Shortly after that, she implemented the Metropolis algorithm. And at that time, of course, you had to code everything in machine language.
24
00:03:02,680 --> 00:03:10,150
But after she implemented the Metropolis algorithm, she left academia and probably never worked on research again.
25
00:03:10,150 --> 00:03:20,110
And in 2003, someone interviewed her about this very first implementation of the algorithm.
26
00:03:20,110 --> 00:03:28,600
And she said that she was very surprised that anyone still remembered this algorithm.
27
00:03:28,600 --> 00:03:34,600
But as we all know, not only do we still remember this algorithm,
28
00:03:34,600 --> 00:03:42,380
we have been working extensively on the theory and methodology of MCMC for the past maybe 30 years.
29
00:03:42,380 --> 00:03:48,290
Unfortunately, Dr. Rosenbluth passed away at the end of last year.
30
00:03:48,290 --> 00:03:52,390
This is the obituary posted on the IMS website,
31
00:03:52,390 --> 00:03:56,650
and you can find the complete story of her life in The New York Times.
32
00:03:56,650 --> 00:04:01,930
I think it's a very inspiring, amazing and unique story.
33
00:04:01,930 --> 00:04:09,030
Well, anyway, let's start my talk, and let's first quickly review the Metropolis-Hastings algorithm.
34
00:04:09,030 --> 00:04:14,490
So in this talk we will be mostly concerned with finite state spaces,
35
00:04:14,490 --> 00:04:26,280
so I'm just going to use this big S to denote an arbitrary finite state space, and I will use π to denote our target distribution on this space.
36
00:04:26,280 --> 00:04:32,500
Let's assume the support of π is all of S. Then, given any transition matrix K,
37
00:04:32,500 --> 00:04:39,150
we can construct another transition matrix P according to this formula, and we all know that,
38
00:04:39,150 --> 00:04:45,780
just by construction, this new transition matrix is reversible with respect to π.
39
00:04:45,780 --> 00:04:54,630
So if it is further irreducible and aperiodic, then this Markov chain will converge to π, and if we can simulate this Markov chain,
40
00:04:54,630 --> 00:05:02,130
then the samples generated from this Markov chain can be used to approximate the target distribution π.
41
00:05:02,130 --> 00:05:11,370
And the well-known observation here is that to simulate this Markov chain, we only need to know π up to a normalising constant.
42
00:05:11,370 --> 00:05:20,640
And we say this P defines the Metropolis-Hastings algorithm for sampling from π.
43
00:05:20,640 --> 00:05:28,400
We say this matrix K is the proposal, and this minimum term is called the acceptance probability.
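For reference, the construction described here is presumably the standard Metropolis-Hastings kernel; written out (my reconstruction, not copied from the slide):

```latex
P(x, y) = K(x, y)\,\min\!\left\{1,\ \frac{\pi(y)\,K(y, x)}{\pi(x)\,K(x, y)}\right\}
\quad \text{for } y \neq x,
\qquad
P(x, x) = 1 - \sum_{y \neq x} P(x, y).
```

Note that π enters only through the ratio π(y)/π(x), which is why the normalising constant is never needed.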
44
00:05:28,400 --> 00:05:30,210
The meaning is pretty straightforward.
45
00:05:30,210 --> 00:05:39,180
So if we want to simulate this chain P, then in each iteration we propose a new state according to the proposal distribution,
46
00:05:39,180 --> 00:05:45,900
and then we decide whether to accept it or not according to the acceptance probability.
47
00:05:45,900 --> 00:05:50,370
Metropolis-Hastings algorithms, and more generally speaking,
48
00:05:50,370 --> 00:05:57,480
MCMC methods, have been widely used in statistics, because in statistics very often we are
49
00:05:57,480 --> 00:06:06,660
faced with very complicated posterior distributions whose normalising constant is not tractable.
50
00:06:06,660 --> 00:06:12,500
Henceforth I will just refer to π as the posterior distribution.
51
00:06:12,500 --> 00:06:18,170
In this talk, we are mostly interested in a class of problems called model selection,
52
00:06:18,170 --> 00:06:25,880
and perhaps this is the most important class of statistical problems defined on discrete state spaces.
53
00:06:25,880 --> 00:06:29,330
For these problems, the size of the state space
54
00:06:29,330 --> 00:06:40,600
S typically depends on a parameter p, which is just the number of variables or the number of features we have in our dataset.
55
00:06:40,600 --> 00:06:47,560
One example is variable selection. For variable selection, the state space
56
00:06:47,560 --> 00:06:56,150
S is the collection of all binary sequences of length p, so the size of S is 2 to the p.
57
00:06:56,150 --> 00:07:02,020
Another example is the structure learning problem, where we want to identify
58
00:07:02,020 --> 00:07:09,460
the so-called DAG, the directed acyclic graph, underlying our multivariate data.
59
00:07:09,460 --> 00:07:11,710
For this problem, the state space
60
00:07:11,710 --> 00:07:23,940
S is the collection of all DAGs with p nodes, and the size of S grows super-exponentially with p.
61
00:07:23,940 --> 00:07:30,390
We will be mostly interested in high-dimensional settings where p is much larger than the sample size n.
62
00:07:30,390 --> 00:07:36,540
So we have to impose some sparsity constraint so that the model is identifiable.
63
00:07:36,540 --> 00:07:46,140
But even if we take into account the sparsity constraint, in most cases the size of S still grows super-polynomially with p.
64
00:07:46,140 --> 00:07:54,270
So of course we may wonder: are MCMC algorithms efficient enough
65
00:07:54,270 --> 00:07:58,200
for these high-dimensional problems? Because, theoretically speaking,
66
00:07:58,200 --> 00:08:08,250
we can compare MCMC with approximate methods like variational Bayes or maybe ABC methods. MCMC is asymptotically exact,
67
00:08:08,250 --> 00:08:15,090
because if we can generate infinitely many samples from MCMC, we are able to exactly recover the posterior.
68
00:08:15,090 --> 00:08:24,180
But of course, in reality, especially for high-dimensional problems, we really wonder how fast these MCMC algorithms can converge.
69
00:08:24,180 --> 00:08:32,320
If the chain cannot converge in a reasonable amount of time, then it's probably not that useful.
70
00:08:32,320 --> 00:08:42,640
A very useful property that we often want to verify for our algorithm is called rapid mixing:
71
00:08:42,640 --> 00:08:54,250
we say an algorithm is rapidly mixing if its mixing time grows only polynomially with the complexity parameters n and p.
72
00:08:54,250 --> 00:09:02,100
Because the state space can grow super-polynomially with p, rapid mixing is actually a very informative property.
73
00:09:02,100 --> 00:09:14,550
It tells us that the complexity of our algorithm grows much more slowly than the problem size.
74
00:09:14,550 --> 00:09:26,730
In recent years, locally informed MCMC has become pretty popular. To explain what locally informed MCMC is,
75
00:09:26,730 --> 00:09:36,930
let's first introduce some notation. I'm going to use N(x) to denote the so-called neighbourhood of the point x,
76
00:09:36,930 --> 00:09:46,640
which is just the collection of all the points that I can propose given the current point x.
77
00:09:46,640 --> 00:09:55,740
In practice, I would say that for almost every MCMC algorithm defined on a discrete state space,
78
00:09:55,740 --> 00:09:58,830
the size of the neighbourhood
79
00:09:58,830 --> 00:10:10,440
only grows polynomially with p. And the standard way to implement Metropolis-Hastings is to use the so-called random walk proposal,
80
00:10:10,440 --> 00:10:12,690
which just means that, given the neighbourhood,
81
00:10:12,690 --> 00:10:27,440
I'm going to propose one neighbouring state randomly with equal probability. So we can write K as this; the 1 here, of course, is the indicator function.
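As a concrete illustration (my sketch, not code shown in the talk), here is a minimal random-walk Metropolis-Hastings sampler on a finite state space; `neighbors` and `log_post` are placeholder callables supplied by the user.

```python
import math
import random

def random_walk_mh(x0, neighbors, log_post, n_iter, rng=random.Random(0)):
    """Random-walk Metropolis-Hastings on a finite state space.

    neighbors(x) returns the list N(x) of states reachable from x;
    log_post(x) is log pi(x) up to an additive constant, so the
    normalising constant of pi is never needed.
    """
    x, chain = x0, [x0]
    for _ in range(n_iter):
        nbrs = neighbors(x)
        y = rng.choice(nbrs)                   # K(x, y) = 1 / |N(x)|
        # Hastings correction: K(y, x) / K(x, y) = |N(x)| / |N(y)|
        log_alpha = (log_post(y) - log_post(x)
                     + math.log(len(nbrs)) - math.log(len(neighbors(y))))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y                              # accept; otherwise stay at x
        chain.append(x)
    return chain
```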
82
00:10:27,440 --> 00:10:32,120
For Metropolis-Hastings algorithms on finite state spaces, actually for any Markov chain
83
00:10:32,120 --> 00:10:37,880
on a finite state space, we can describe the Markov chain as a graph.
84
00:10:37,880 --> 00:10:45,260
Here in this picture, I'm using these black dots to represent the states of my state space.
85
00:10:45,260 --> 00:10:50,750
And if two states are neighbours, we connect them with an edge.
86
00:10:50,750 --> 00:10:55,670
So if we look at this state x0, then x0 has six neighbours: x1,
87
00:10:55,670 --> 00:11:05,270
x2, and so on up to x6. And these blue bars indicate how large the posterior probability is at each point.
88
00:11:05,270 --> 00:11:11,960
Now, if we use the random walk Metropolis-Hastings algorithm, then at the point x0,
89
00:11:11,960 --> 00:11:19,550
I'm just going to draw one neighbour amongst these six states randomly with equal probability.
90
00:11:19,550 --> 00:11:25,820
But for this distribution, we would probably say that at the point x0,
91
00:11:25,820 --> 00:11:35,450
maybe we want to move to x4, because the posterior mass looks pretty large at the point x4, while the posterior mass on the other five points,
92
00:11:35,450 --> 00:11:43,920
I mean x1, x2, x3, x5 and x6, is actually smaller than the probability of x0.
93
00:11:43,920 --> 00:11:52,350
So intuitively speaking, the best move seems to be from x0 to x4, and the probability
94
00:11:52,350 --> 00:11:58,560
of this move is just one sixth, because we are using the random walk proposal.
95
00:11:58,560 --> 00:12:03,510
So it means that, on average, we probably need, say,
96
00:12:03,510 --> 00:12:16,230
around six iterations to move from x0 to x4. And this heuristic reasoning also tells us that the mixing time
97
00:12:16,230 --> 00:12:31,360
of this random walk Metropolis-Hastings algorithm on a finite state space needs to be at least as large as the neighbourhood size,
98
00:12:31,360 --> 00:12:39,280
because we need approximately, for example, six iterations here to move from x0 to x4. And I am sorry,
99
00:12:39,280 --> 00:12:43,630
I forgot to define what the mixing time is; I will define that later.
100
00:12:43,630 --> 00:12:54,100
But roughly speaking, the mixing time just expresses how many iterations we need for the chain to reach stationarity.
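For completeness, the standard definition the speaker defers is (my addition): the mixing time is the number of iterations needed to get within total-variation distance 1/4 of π from the worst starting state,

```latex
t_{\mathrm{mix}} = \min\left\{ t \ge 0 \;:\; \max_{x \in S} \big\| P^t(x, \cdot) - \pi \big\|_{\mathrm{TV}} \le \tfrac{1}{4} \right\},
```

and rapid mixing means that t_mix is bounded by a polynomial in n and p.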
101
00:12:54,100 --> 00:12:59,950
And the idea of locally informed proposals is that, well,
102
00:12:59,950 --> 00:13:11,140
since we can actually tell that this x4 is a much better move than all the other five neighbouring states,
103
00:13:11,140 --> 00:13:16,500
then why not just propose x4 with a larger probability?
104
00:13:16,500 --> 00:13:21,910
And this is kind of the motivation behind the so-called locally informed proposals.
105
00:13:21,910 --> 00:13:25,570
So given the neighbourhood, let's tune our proposal probabilities.
106
00:13:25,570 --> 00:13:36,810
Let's evaluate the posterior landscape π in the neighbourhood, and then preferentially propose those states with larger posterior probabilities.
107
00:13:36,810 --> 00:13:47,520
So at the point x0 in this example, we would probably propose x4 in the next step if we use locally informed proposals.
108
00:13:47,520 --> 00:13:51,730
So here is the mathematical definition of the locally informed proposal.
109
00:13:51,730 --> 00:14:00,450
Let's use f to denote an arbitrary non-negative function, and then let's define a new proposal matrix
110
00:14:00,450 --> 00:14:03,090
according to this formula.
111
00:14:03,090 --> 00:14:11,910
This indicator function, 1 of N(x), just means that we are still considering the same neighbourhood as the previous random walk Metropolis-Hastings.
112
00:14:11,910 --> 00:14:18,450
And according to this formula, we are going to propose a neighbouring state y
113
00:14:18,450 --> 00:14:25,020
with probability proportional to f(π(y)/π(x)).
114
00:14:25,020 --> 00:14:28,880
And because we have introduced these proposal weights,
115
00:14:28,880 --> 00:14:35,430
we need to calculate the normalising constant Z(x) in the denominator.
116
00:14:35,430 --> 00:14:45,760
So this Z(x) is the normalising constant for our informed proposal distribution.
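In symbols, the proposal described here (consistent with Zanella's informed-proposal framework) is

```latex
K_f(x, y) = \frac{1}{Z(x)}\, f\!\left(\frac{\pi(y)}{\pi(x)}\right) \mathbf{1}\{y \in N(x)\},
\qquad
Z(x) = \sum_{y' \in N(x)} f\!\left(\frac{\pi(y')}{\pi(x)}\right).
```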
117
00:14:45,760 --> 00:14:51,940
And according to our previous heuristic argument, of course,
118
00:14:51,940 --> 00:14:57,740
we want to choose a function f that is non-decreasing, because we want to assign
119
00:14:57,740 --> 00:15:04,600
larger proposal weights to those states with larger posterior probabilities.
120
00:15:04,600 --> 00:15:13,510
But one main disadvantage of this locally informed scheme is that to implement these
121
00:15:13,510 --> 00:15:20,200
informed proposals, we have to evaluate the posterior landscape π in the current neighbourhood.
122
00:15:20,200 --> 00:15:25,560
We have to evaluate π(y) for every y in the current neighbourhood
123
00:15:25,560 --> 00:15:33,720
so that we can calculate this normalising constant, and then we can exactly calculate the proposal probabilities.
124
00:15:33,720 --> 00:15:38,160
But of course this can take a lot of time. Although the neighbourhood size,
125
00:15:38,160 --> 00:15:44,130
well, it's much smaller compared to the whole state space,
126
00:15:44,130 --> 00:15:51,470
if p is really large, then this local evaluation might take a lot of time.
127
00:15:51,470 --> 00:15:53,480
And I want to say that
128
00:15:53,480 --> 00:16:06,830
although this locally informed framework was only recently proposed by Zanella in a paper in 2020,
129
00:16:06,830 --> 00:16:18,050
this idea of locally evaluating π in the current neighbourhood has been widely used in many other methods,
130
00:16:18,050 --> 00:16:22,520
and it has also been used in other non-MCMC methods.
131
00:16:22,520 --> 00:16:31,190
So for all these methods that rely on this idea of locally evaluating π, we can ask the following question.
132
00:16:31,190 --> 00:16:39,680
Can these methods achieve a sufficiently fast convergence rate that offsets the cost of computing
133
00:16:39,680 --> 00:16:46,190
this normalising constant in every iteration? To the best of our knowledge, this question
134
00:16:46,190 --> 00:16:52,970
has never been rigorously addressed for high-dimensional statistical problems.
135
00:16:52,970 --> 00:17:02,710
And this is what we are going to do in today's talk: we are going to answer this question for the variable selection problem.
136
00:17:02,710 --> 00:17:12,010
Before I introduce our algorithm and talk about how we do the proof of the mixing time bound,
137
00:17:12,010 --> 00:17:17,110
I would like to first offer some general intuition about this problem.
138
00:17:17,110 --> 00:17:25,880
To be able to analyse the convergence rate of an MCMC algorithm for a model selection problem,
139
00:17:25,880 --> 00:17:30,890
we want to know something about this posterior distribution π;
140
00:17:30,890 --> 00:17:38,960
we want to be able to say something about what π looks like, because the state space grows
141
00:17:38,960 --> 00:17:45,140
very quickly, at an exponential or even super-exponential rate with p.
142
00:17:45,140 --> 00:17:50,480
So of course we don't expect rapid mixing to hold for any π.
143
00:17:50,480 --> 00:17:59,330
We have to be able to say something about π so that it is possible to prove something like rapid mixing.
144
00:17:59,330 --> 00:18:04,880
One thing that is very natural to assume from a statistical point of view
145
00:18:04,880 --> 00:18:11,630
is the so-called strong selection consistency of the Bayesian model selection procedure.
146
00:18:11,630 --> 00:18:15,590
We say that it has strong selection consistency
147
00:18:15,590 --> 00:18:27,560
if, for some point x*, the posterior probability of x* goes to one in probability with respect to the true data-generating process.
148
00:18:27,560 --> 00:18:32,900
And of course, this x* can be interpreted as the true model.
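In symbols (my rendering of the definition just given), strong selection consistency says

```latex
\pi\big(x^* \mid \text{data}\big) \;\longrightarrow\; 1
\quad \text{in probability under the true data-generating process, as } n \to \infty.
```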
149
00:18:32,900 --> 00:18:38,000
If we think about the classical asymptotic regime where p is fixed and n goes to infinity,
150
00:18:38,000 --> 00:18:46,910
such consistency results have been studied extensively, and in high-dimensional settings things are actually pretty similar.
151
00:18:46,910 --> 00:18:54,500
All we need is that the sample size is not too small compared to p;
152
00:18:54,500 --> 00:19:03,390
then we should have this kind of consistency result: we should be able to recover the true model from the data.
153
00:19:03,390 --> 00:19:10,530
And if the posterior distribution really concentrates on a single best model x*, then according to Markov chain theory,
154
00:19:10,530 --> 00:19:18,390
we actually know that the mixing time of the MCMC algorithm is essentially equivalent to the hitting time of this best model.
155
00:19:18,390 --> 00:19:20,790
And intuitively, this seems to be pretty reasonable.
156
00:19:20,790 --> 00:19:27,360
And actually it makes a lot of practical sense as well, because if π concentrates on one best model,
157
00:19:27,360 --> 00:19:35,970
then our goal is just to identify that single best model.
158
00:19:35,970 --> 00:19:39,080
So how do we prove this strong selection consistency?
159
00:19:39,080 --> 00:19:49,070
One way is to show that this posterior distribution π is unimodal and the unique mode is x*.
160
00:19:49,070 --> 00:19:58,860
And I guess probably a lot of you have some questions about this,
161
00:19:58,860 --> 00:20:06,780
about this consistency and this unimodality condition, and I will return to the unimodality condition in the last section of my talk.
162
00:20:06,780 --> 00:20:12,030
But for the moment, I just want to make one remark: even if we can prove,
163
00:20:12,030 --> 00:20:17,820
even if we can show, that the posterior distribution is unimodal on the model space,
164
00:20:17,820 --> 00:20:24,720
we actually cannot say that much about what the posterior distribution looks like.
165
00:20:24,720 --> 00:20:35,130
Let me explain this point using some examples. This is a bivariate normal distribution with two independent coordinates,
166
00:20:35,130 --> 00:20:39,000
and here I only plot this distribution in the first quadrant.
167
00:20:39,000 --> 00:20:46,560
This is a pretty perfect unimodal distribution, and when we say unimodal distributions,
168
00:20:46,560 --> 00:20:53,850
I guess a lot of us are thinking about distributions that look like this.
169
00:20:53,850 --> 00:20:55,530
Now, let's consider another example.
170
00:20:55,530 --> 00:21:06,090
OK, so let the state space just be the collection of nine points, {1, 2, 3} times {1, 2, 3}.
171
00:21:06,090 --> 00:21:12,370
And then let's consider this posterior distribution π.
172
00:21:12,370 --> 00:21:19,150
For this state space, we will just use the most natural neighbourhood relation, that is, I will say
173
00:21:19,150 --> 00:21:27,530
two points are neighbours if the Euclidean distance between the two points is equal to one.
174
00:21:27,530 --> 00:21:34,880
Then, using this neighbourhood relation, if we look at this posterior landscape,
175
00:21:34,880 --> 00:21:45,410
we can actually tell that this distribution is also unimodal. Although it may look a little strange, it is indeed unimodal.
176
00:21:45,410 --> 00:21:53,980
Let's first look at this point (1,1); the point (1,1) has two neighbours, (1,2) and (2,1).
177
00:21:53,980 --> 00:21:59,360
(1,1) is not a local mode because the point (2,1) has a larger probability.
178
00:21:59,360 --> 00:22:06,250
The point (2,1) is not a local mode either, because the point (3,1) has a larger posterior probability.
179
00:22:06,250 --> 00:22:13,330
And actually, if we simulate a Metropolis-Hastings algorithm to sample from this posterior distribution,
180
00:22:13,330 --> 00:22:25,000
and let's say we initialise the chain at the point (1,1), we will probably see that the chain moves along this black path.
181
00:22:25,000 --> 00:22:34,790
It will move from (1,1) to (2,1) and then (3,1), and eventually it will reach the mode.
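To make this concrete, here is a small sketch that checks unimodality on the 3x3 grid under the distance-one neighbourhood relation; the probability values are made up to mimic the described path, since the slide's actual numbers are not in the transcript.

```python
# Hypothetical unnormalised probabilities increasing along a snake path
# starting (1,1) -> (2,1) -> (3,1) -> ...; the slide's numbers are not in
# the transcript, so these values are purely illustrative.
path = [(1,1), (2,1), (3,1), (3,2), (2,2), (1,2), (1,3), (2,3), (3,3)]
prob = {pt: 2.0 ** i for i, pt in enumerate(path)}

def neighbors(pt):
    """Points at Euclidean distance exactly 1 within the grid."""
    x, y = pt
    return [c for c in [(x+1,y), (x-1,y), (x,y+1), (x,y-1)] if c in prob]

def local_modes(prob):
    """States at least as probable as every neighbour."""
    return [pt for pt in prob
            if all(prob[pt] >= prob[nb] for nb in neighbors(pt))]

print(local_modes(prob))   # a single state => the distribution is unimodal
```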
182
00:22:34,790 --> 00:22:44,090
So this distribution is also unimodal, but it looks very different from the previous bivariate normal example. And of
183
00:22:44,090 --> 00:22:48,530
course we can make it continuous, and if we make this distribution continuous,
184
00:22:48,530 --> 00:22:55,640
then we will have something that looks like this, which is very different from that perfect unimodal distribution.
185
00:22:55,640 --> 00:23:04,160
But it is still unimodal. This is what I mean when I say that even if we can show the posterior distribution is unimodal,
186
00:23:04,160 --> 00:23:09,380
we don't really know that much about the shape of the distribution.
187
00:23:09,380 --> 00:23:16,310
It can still be pretty irregular. And although you may say this example looks pretty artificial,
188
00:23:16,310 --> 00:23:23,120
I would say that for model selection problems like variable selection,
189
00:23:23,120 --> 00:23:33,560
this example really gives a pretty typical scenario that we will encounter, because the variables we have in the high-dimensional setting
190
00:23:33,560 --> 00:23:41,030
are correlated, and the correlation structure amongst these variables can be pretty complicated.
191
00:23:41,030 --> 00:23:46,880
And I want to make this point because in the literature very often we see that
192
00:23:46,880 --> 00:23:53,660
people justify why their algorithms are useful by considering,
193
00:23:53,660 --> 00:24:02,690
say, a high-dimensional setting and assuming that the distribution has independent coordinates.
194
00:24:02,690 --> 00:24:07,070
Of course, those analyses are pretty useful;
195
00:24:07,070 --> 00:24:11,650
they provide us with a lot of insight into the behaviour of the algorithms.
196
00:24:11,650 --> 00:24:15,320
But an assumption like independent coordinates
197
00:24:15,320 --> 00:24:25,100
is not really realistic in the high-dimensional setting, and it is something that we probably can never justify using statistical theory.
198
00:24:25,100 --> 00:24:33,320
On the contrary, this unimodality assumption is something that we can prove, something we can justify according to statistical theory,
199
00:24:33,320 --> 00:24:39,330
because all we need to require is that the sample size be sufficiently large.
200
00:24:39,330 --> 00:24:47,610
And the point is that although the unimodality assumption may seem restrictive at first glance,
201
00:24:47,610 --> 00:24:52,800
the posterior distribution can still look pretty complicated,
202
00:24:52,800 --> 00:25:00,720
pretty irregular, and that can give rise to a lot of difficulties in the theoretical
203
00:25:00,720 --> 00:25:08,650
analysis of MCMC algorithms on these posterior distributions.
204
00:25:08,650 --> 00:25:12,220
Next, we will just focus on the variable selection problem.
205
00:25:12,220 --> 00:25:17,310
Now I'll show you the algorithm we propose for the variable selection problem,
206
00:25:17,310 --> 00:25:25,240
and I will show that our algorithm can achieve a surprisingly fast mixing rate.
207
00:25:25,240 --> 00:25:30,580
So let's just consider the standard linear regression model: y equals X times β
208
00:25:30,580 --> 00:25:40,330
plus ε, where ε represents the normal errors and X is the n-by-p design matrix, with n the sample size and p the number of variables.
209
00:25:40,330 --> 00:25:47,500
Because we are interested in high-dimensional settings where p is much greater than n, we will impose a sparsity constraint.
210
00:25:47,500 --> 00:25:54,970
So we assume that most entries of β are zero, or at least most entries of β are negligible.
211
00:25:54,970 --> 00:26:01,970
For this variable selection problem, I'm going to use γ to denote a state of our state space.
212
00:26:01,970 --> 00:26:06,250
This is the index set of the nonzero entries of β.
213
00:26:06,250 --> 00:26:12,550
In other words, γ is the set of variables that have non-negligible effects.
214
00:26:12,550 --> 00:26:20,300
And the goal of variable selection, of course, is to learn this γ from the observed data.
215
00:26:20,300 --> 00:26:26,980
I will use S₀ to denote our model space for this variable selection problem.
216
00:26:26,980 --> 00:26:38,390
So S₀ is the collection of all subsets of the integers 1, 2, ..., p with size less than or equal to s₀.
217
00:26:38,390 --> 00:26:45,080
Here s₀ is the sparsity parameter, and it may grow as well;
218
00:26:45,080 --> 00:26:55,250
it may grow with n and p, and actually we want to allow this s₀ to go to infinity as n goes to infinity.
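In symbols, the setup just described is

```latex
y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n),
\qquad
\gamma = \{\, j : \beta_j \neq 0 \,\},
\qquad
\mathcal{S}_0 = \{\, \gamma \subseteq \{1, \dots, p\} : |\gamma| \le s_0 \,\}.
```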
219
00:26:55,250 --> 00:27:01,660
So this is the model space for our problem. And for the variable selection problem,
220
00:27:01,660 --> 00:27:12,890
I guess we all know the most classical MCMC solution, which is called the add-delete-swap sampler.
221
00:27:12,890 --> 00:27:15,920
For every state γ in our state space,
222
00:27:15,920 --> 00:27:26,150
let's define three different neighbourhoods, and let me use 'add', 'delete' and 'swap' to denote these three neighbourhoods.
223
00:27:26,150 --> 00:27:37,130
The addition neighbourhood, of course, is the collection of all the neighbouring states that we can move to by adding one covariate to the current model.
224
00:27:37,130 --> 00:27:41,750
The deletion neighbourhood is the set that we can move to by removing one covariate.
225
00:27:41,750 --> 00:27:52,120
And the swap neighbourhood is the set that we can move to by swapping one covariate in the current model with one covariate that is not in the current model.
226
00:27:52,120 --> 00:27:54,880
And of course, we can check that
227
00:27:54,880 --> 00:28:10,110
the combined size of the addition and deletion neighbourhoods is just equal to p, and the size of the swap neighbourhood is |γ| times (p minus |γ|).
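A minimal sketch of these three neighbourhoods (my illustration; models are represented as sets of covariate indices):

```python
def add_moves(gamma, p):
    """N_add: models obtained by adding one covariate not in gamma."""
    return [frozenset(gamma) | {j} for j in range(1, p + 1) if j not in gamma]

def delete_moves(gamma):
    """N_del: models obtained by removing one covariate from gamma."""
    return [frozenset(gamma) - {j} for j in gamma]

def swap_moves(gamma, p):
    """N_swap: swap one covariate in gamma with one outside it."""
    outside = [k for k in range(1, p + 1) if k not in gamma]
    return [(frozenset(gamma) - {j}) | {k} for j in gamma for k in outside]

gamma, p = {1, 2}, 5
assert len(add_moves(gamma, p)) + len(delete_moves(gamma)) == p          # = p
assert len(swap_moves(gamma, p)) == len(gamma) * (p - len(gamma))        # = |g|(p-|g|)
```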
228
00:28:10,110 --> 00:28:18,060
Using these add-delete-swap moves, we can define a random walk algorithm.
229
00:28:18,060 --> 00:28:23,430
For example, let's consider this proposal scheme:
230
00:28:23,430 --> 00:28:30,420
with probability one half, we randomly propose one addition or deletion move,
231
00:28:30,420 --> 00:28:37,490
and with probability one half we propose one swap move, chosen uniformly at random.
232
00:28:37,490 --> 00:28:44,030
In this case, the algorithm is also called symmetric random walk Metropolis-Hastings,
233
00:28:44,030 --> 00:28:54,170
because you can check that the proposal probability from γ to γ′ is always equal to the proposal probability from γ′ to γ.
234
00:28:54,170 --> 00:29:01,220
Although this looks pretty naive, I want to say that because of its simplicity,
235
00:29:01,220 --> 00:29:07,100
it's actually widely used amongst practitioners; it is pretty straightforward to implement,
236
00:29:07,100 --> 00:29:10,700
And the idea behind it is also easy to understand.
237
00:29:10,700 --> 00:29:20,840
In a seminal paper published in 2016 in the Annals of Statistics by Yang, Wainwright and Jordan,
238
00:29:20,840 --> 00:29:27,410
it is shown that under mild high-dimensional assumptions, this kind of naive
239
00:29:27,410 --> 00:29:34,040
symmetric random walk Metropolis-Hastings algorithm is actually rapidly mixing.
240
00:29:34,040 --> 00:29:41,710
So the complexity of this random walk algorithm only grows polynomially with p.
241
00:29:41,710 --> 00:29:49,940
And the authors derived a mixing time bound, which can be interpreted as the complexity of this random
242
00:29:49,940 --> 00:29:58,300
walk algorithm; this mixing time bound is roughly given by this expression. I say roughly here
243
00:29:58,300 --> 00:30:06,430
because I have omitted some hyperparameters which are not really very important.
244
00:30:06,430 --> 00:30:17,590
So it is roughly p times s₀-squared times n. The proof of this mixing time bound relies on the canonical path method,
245
00:30:17,590 --> 00:30:25,840
which is kind of the most widely used method for studying the convergence of Markov chains on finite state spaces.
246
00:30:25,840 --> 00:30:33,280
And the main idea of the proof is actually based on that strong selection consistency.
247
00:30:33,280 --> 00:30:37,210
So one verifies strong selection consistency for the variable selection problem
248
00:30:37,210 --> 00:30:42,880
and proves that, with high probability, the posterior distribution is unimodal.
249
00:30:42,880 --> 00:30:54,270
Then one can use the canonical path method to show that the random walk algorithm is rapidly mixing.
250
00:30:54,270 --> 00:30:59,910
But of course, given that the random walk algorithm already performs pretty well,
251
00:30:59,910 --> 00:31:08,270
we may ask: how fast can the mixing time of an informed MCMC algorithm for variable selection be?
252
00:31:08,270 --> 00:31:19,050
And of course, we want to prove that the mixing time of an informed algorithm is much smaller than the mixing time of the random walk algorithm.
253
00:31:19,050 --> 00:31:29,410
Otherwise, theoretically, we cannot really say why we should prefer a locally informed proposal scheme.
254
00:31:29,410 --> 00:31:34,690
So what are the main challenges for this analysis?
255
00:31:34,690 --> 00:31:44,530
I guess there are two main challenges in the analysis of informed algorithms for this variable selection problem.
256
00:31:44,530 --> 00:31:49,420
First of all, although under some high-dimensional assumptions
257
00:31:49,420 --> 00:31:55,150
we can prove the posterior distribution is unimodal, unimodality only tells us the following:
258
00:31:55,150 --> 00:32:04,120
if we are at a model which is not the true model,
259
00:32:04,120 --> 00:32:12,350
then we can always move to a better model by adding or removing one covariate.
260
00:32:12,350 --> 00:32:17,650
Here, let me use γ* to denote the true model,
261
00:32:17,650 --> 00:32:22,510
and I will refer to the covariates in γ* as influential covariates;
262
00:32:22,510 --> 00:32:33,340
the other covariates will just be called non-influential. So the idea is that if the current model is missing some influential covariate,
263
00:32:33,340 --> 00:32:38,590
then maybe we can use an addition move to increase the posterior probability.
264
00:32:38,590 --> 00:32:45,790
And if we have already included all the influential covariates but have also included some non-influential
265
00:32:45,790 --> 00:32:52,840
ones, then maybe we can remove some non-influential covariates and arrive at a better model.
266
00:32:52,840 --> 00:32:59,470
But the problem here is that because of the dependency structure amongst these variables,
267
00:32:59,470 --> 00:33:08,890
the posterior landscape can still be pretty irregular, in the sense that
268
00:33:08,890 --> 00:33:13,900
we cannot really say much beyond this unimodality.
269
00:33:13,900 --> 00:33:22,300
For example, intuitively we may conjecture that if the sample size is really large,
270
00:33:22,300 --> 00:33:28,270
then perhaps the posterior probability will always increase if we add an influential
271
00:33:28,270 --> 00:33:34,120
covariate, and will always increase if we remove a non-influential one.
272
00:33:34,120 --> 00:33:44,740
But actually, in general, these claims are not right. So in general, we cannot say that adding any influential covariate
273
00:33:44,740 --> 00:33:50,740
will increase the posterior probability. We can say that, with high probability,
274
00:33:50,740 --> 00:33:57,070
there always exists one such good move. But we cannot make claims like this:
275
00:33:57,070 --> 00:34:04,720
we cannot say that the addition of any influential covariate will increase the posterior probability.
276
00:34:04,720 --> 00:34:09,340
And this is just because the variables can be correlated.
277
00:34:09,340 --> 00:34:18,310
Perhaps it is helpful to recall the previous example. The point here is that although the posterior distribution is unimodal,
278
00:34:18,310 --> 00:34:27,560
it doesn't mean that the posterior probability will increase whenever we move in a direction pointing to the mode.
279
00:34:27,560 --> 00:34:35,710
For example, in this example, if we move from (1,1) to (1,2), the probability actually decreases a lot.
280
00:34:35,710 --> 00:34:40,810
So here we cannot always move in the direction pointing to the mode.
281
00:34:40,810 --> 00:34:50,260
And this intuition applies to our variable selection problem as well.
282
00:34:50,260 --> 00:34:57,250
The other challenge is that we have to choose this locally informed proposal scheme carefully,
283
00:34:57,250 --> 00:35:09,400
because if we choose it carelessly, very often we will end up with an algorithm that can get stuck at local modes.
284
00:35:09,400 --> 00:35:18,310
So let's consider this very naive way of implementing an informed proposal.
285
00:35:18,310 --> 00:35:25,090
Given the neighbourhood, which is the collection of all the states that I can move to by one addition or deletion
286
00:35:25,090 --> 00:35:31,750
move, let's propose a new state with probability proportional to the posterior.
287
00:35:31,750 --> 00:35:39,580
This is kind of the most naive, most straightforward way of implementing a locally informed proposal scheme.
288
00:35:39,580 --> 00:35:45,190
But actually, it's not going to work. It's going to fail completely. So we can consider a very simple scenario.
289
00:35:45,190 --> 00:35:49,810
Let's say the true model γ* just has two variables, 1 and 2,
290
00:35:49,810 --> 00:35:52,810
and let's say the current model is the empty set.
291
00:35:52,810 --> 00:36:00,820
Then of course we care about the transition probability of moving from the empty set to the model
292
00:36:00,820 --> 00:36:07,480
with just covariate 1, or maybe the model with just covariate 2. And we can actually
293
00:36:07,480 --> 00:36:13,940
calculate that the transition probability from the empty set to the model with only covariate
294
00:36:13,940 --> 00:36:25,420
1 is bounded by this ratio: the ratio of the posterior probability of the single-covariate model to the posterior probability of the true model.
295
00:36:25,420 --> 00:36:28,780
Now if the sample size is sufficiently large, then of course
296
00:36:28,780 --> 00:36:33,700
the posterior probability of the true model will go to one,
297
00:36:33,700 --> 00:36:38,150
so this ratio will go to zero.
298
00:36:38,150 --> 00:36:44,690
And this will imply that, at the empty model,
299
00:36:44,690 --> 00:36:56,960
we will keep proposing to add covariate 1 or maybe covariate 2, but these proposals will almost always get rejected.
300
00:36:56,960 --> 00:37:03,440
In effect, we will just stay at this empty set for a lot of iterations,
301
00:37:03,440 --> 00:37:08,780
and the chain is not really moving. So this is not going to be a good algorithm.
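One way to reconstruct the bound mentioned here (my derivation, under the naive weights, so that K(x, y) = π(y)/Z(x) with Z(x) the sum of π over N(x)): since the true model {1,2} is a neighbour of {1}, we have Z({1}) ≥ π({1,2}), and therefore

```latex
P(\emptyset \to \{1\})
= K(\emptyset, \{1\})\,\min\!\left\{1,\ \frac{\pi(\{1\})\,K(\{1\}, \emptyset)}{\pi(\emptyset)\,K(\emptyset, \{1\})}\right\}
\;\le\; \frac{\pi(\{1\})}{Z(\{1\})}
\;\le\; \frac{\pi(\{1\})}{\pi(\{1,2\})}
\;\longrightarrow\; 0.
```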
302
00:37:08,780 --> 00:37:15,500
Actually, if we think about the general definition of the locally informed proposal scheme,
303
00:37:15,500 --> 00:37:25,550
we will see that the main difficulty here is that we actually cannot say much about this
304
00:37:25,550 --> 00:37:34,460
normalising constant Z(γ). And this, again, is because we are considering
305
00:37:34,460 --> 00:37:43,010
a discrete-space problem, where the landscape can be very irregular. Or, more precisely speaking,
306
00:37:43,010 --> 00:37:52,100
we can say that the posterior landscape can change drastically if we just move from the current state to a neighbouring state.
307
00:37:52,100 --> 00:37:58,370
This is totally different from a continuous state space, where we have enough continuity
308
00:37:58,370 --> 00:38:04,790
and we can say that if we move locally around the current point, the posterior landscape should only change mildly.
309
00:38:04,790 --> 00:38:09,740
But for model selection problems, for posterior distributions defined on discrete
310
00:38:09,740 --> 00:38:14,000
state spaces, the landscape can be pretty irregular, and the normalising
311
00:38:14,000 --> 00:38:21,450
constant can change drastically if we just move, say, from the current model to a neighbouring model.
312
00:38:21,450 --> 00:38:29,730
And this creates a lot of difficulty in designing efficient, locally informed proposal schemes.
313
00:38:29,730 --> 00:38:34,290
So we tried a lot of different ways to design these locally informed proposals,
314
00:38:34,290 --> 00:38:41,090
and eventually we found that a simple solution was just to use bounded proposal weights.
315
00:38:41,090 --> 00:38:47,310
So now I'm going to introduce our algorithm, which we call LIT-MH:
316
00:38:47,310 --> 00:38:55,860
Metropolis-Hastings with locally informed and thresholded proposals. The idea of our algorithm is pretty simple.
317
00:38:55,860 --> 00:39:03,870
First of all, let's partition the neighbourhood of γ into three subsets: the addition neighbourhood, the deletion neighbourhood and the swap neighbourhood.
318
00:39:03,870 --> 00:39:09,270
And then let's perform the proposal weighting within each neighbourhood separately.
319
00:39:09,270 --> 00:39:17,240
Here I'm using this function w to denote the proposal weight that I assign to a neighbouring state.
320
00:39:17,240 --> 00:39:23,060
For every neighbouring state γ′ in the neighbourhood,
321
00:39:23,060 --> 00:39:29,390
I'm just going to use bounded proposal weights: I will weight
322
00:39:29,390 --> 00:39:36,860
each proposal using the ratio π(γ′)/π(γ), truncated on both sides.
323
00:39:36,860 --> 00:39:40,820
And by properly choosing these truncation thresholds,
324
00:39:40,820 --> 00:39:51,500
we will obtain a locally informed algorithm that is very, very efficient, as we will see shortly.
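A minimal sketch of this thresholded weighting (my illustration, reusing the neighbourhood helpers sketched earlier; the truncation thresholds and the rule for picking a sub-neighbourhood are placeholders, since the talk does not give the specific choices):

```python
import math
import random

def lit_weight(log_ratio, lower=1e-4, upper=1e4):
    """Thresholded weight: pi(g')/pi(g) truncated to [lower, upper]."""
    if log_ratio >= math.log(upper):
        return upper                      # cap large ratios (also avoids overflow)
    return max(math.exp(log_ratio), lower)

def lit_propose(gamma, p, log_post, rng=random.Random(0)):
    """LIT-MH-style proposal: weight each sub-neighbourhood separately."""
    subnbhds = [s for s in (add_moves(gamma, p), delete_moves(gamma),
                            swap_moves(gamma, p)) if s]
    sub = rng.choice(subnbhds)            # placeholder: uniform over non-empty ones
    lp = log_post(gamma)
    w = [lit_weight(log_post(g) - lp) for g in sub]
    return rng.choices(sub, weights=w)[0] # normalised within the sub-neighbourhood
```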
325
00:39:51,500 --> 00:40:02,560
The main result of our paper is that we can actually prove, under essentially the same assumptions used by the paper of Yang et al.,
326
00:40:02,560 --> 00:40:08,300
the paper on random walk Metropolis-Hastings for the variable selection problem, that
327
00:40:08,300 --> 00:40:18,770
the mixing time of our algorithm is bounded by C times n, where C is just a universal constant.
328
00:40:18,770 --> 00:40:26,800
So the point here is that this mixing time bound does not depend on the parameter p.
329
00:40:26,800 --> 00:40:32,770
We call this dimension-free mixing because it doesn't depend on p.
330
00:40:32,770 --> 00:40:36,280
And the point is that in high-dimensional settings p is much larger than n, so
331
00:40:36,280 --> 00:40:42,490
we care much more about the dependence on p than on the sample size n.
332
00:40:42,490 --> 00:40:49,360
And if we compare this mixing time bound of our algorithm
333
00:40:49,360 --> 00:40:57,010
with the mixing time bound of the random walk algorithm in the paper by Yang et al.,
334
00:40:57,010 --> 00:41:02,620
then we'll see that even after taking into account the
335
00:41:02,620 --> 00:41:08,890
fact that we need to locally evaluate the posterior landscape in every iteration,
336
00:41:08,890 --> 00:41:21,060
the total complexity of our LIT-MH algorithm is still smaller than the mixing time bound provided by Yang et al.'s paper.
337
00:41:21,060 --> 00:41:28,170
And if you check our paper, you will see that the mixing time bound for our algorithm
338
00:41:28,170 --> 00:41:33,810
derived in our paper is even slightly smaller than this.
339
00:41:33,810 --> 00:41:40,970
But anyway, the difference is not so important, and I don't want to introduce those heavy notations.
340
00:41:40,970 --> 00:41:45,750
So that's why I just presented this simple version of our result.
341
00:41:45,750 --> 00:41:56,120
The mixing time can be bounded by a multiple of n, which arguably is kind of a surprising result.
342
00:41:56,120 --> 00:42:05,160
Before I talk about the proof, let's do some simulations to check whether our algorithm works well in practice.
343
00:42:05,160 --> 00:42:09,240
So we first consider the simulation settings of Yang et al.'s paper.
344
00:42:09,240 --> 00:42:19,530
It's a pretty simple setting where we always sample the true model γ* such that the size of γ* is equal to 10.
345
00:42:19,530 --> 00:42:30,060
So there are always just 10 influential covariates, and we always initialise our algorithms with a randomly generated model with 10 covariates.
346
00:42:30,060 --> 00:42:35,550
When the signal-to-noise ratio is sufficiently large,
347
00:42:35,550 --> 00:42:44,460
we expect that both random walk Metropolis-Hastings and LIT-MH will be able to identify the posterior mode in a reasonable amount of time.
348
00:42:44,460 --> 00:42:51,390
And we observed that LIT-MH is actually much, much faster than the random walk MH.
349
00:42:51,390 --> 00:42:57,960
For example, with p equal to five thousand and an independent design matrix, random walk MH
350
00:42:57,960 --> 00:43:04,980
takes approximately 15 seconds to find the mode, while LIT-MH only takes about 0.1 seconds.
351
00:43:04,980 --> 00:43:16,250
And if we consider a correlated design matrix, then the advantage of LIT-MH is actually slightly more substantial.
352
00:43:16,250 --> 00:43:25,610
But if the true model is the null model, then we will observe that the landscape tends to be flat.
353
00:43:25,610 --> 00:43:32,990
In that case, random walk Metropolis-Hastings also performs pretty well, and it is actually better than
354
00:43:32,990 --> 00:43:42,440
LIT-MH. And I want to make a point here on the computational aspects of LIT-MH:
355
00:43:42,440 --> 00:43:48,560
recall that in every iteration of LIT-MH, we need to evaluate the landscape in the neighbourhood.
356
00:43:48,560 --> 00:43:57,860
And this actually makes the so-called Rao-Blackwellised estimation of β
357
00:43:57,860 --> 00:44:07,550
easy to do. Rao-Blackwellisation means that we can estimate β using conditioning.
358
00:44:07,550 --> 00:44:15,350
So we look at every entry of β separately. For example, let's look at the first entry of β, and let's condition
359
00:44:15,350 --> 00:44:27,290
on the other p minus one coordinates of the current model γ, that is, condition on
360
00:44:27,290 --> 00:44:32,960
whether we have selected each of the other p minus one covariates in our current model,
361
00:44:32,960 --> 00:44:40,820
and let's see how the posterior probability changes if we include this variable or if we remove this variable.
362
00:44:40,820 --> 00:44:48,380
By doing these conditioning calculations, we will be able to obtain a so-called Rao-Blackwellised estimate of β,
363
00:44:48,380 --> 00:44:52,610
which has a much smaller variance than the naive estimate of β that
364
00:44:52,610 --> 00:44:58,340
we get by simply averaging the samples from the Metropolis-Hastings algorithm. And if
365
00:44:58,340 --> 00:45:05,750
we look at the performance of this Rao-Blackwellised estimator of β, we will see
366
00:45:05,750 --> 00:45:16,710
that it's just way better than the estimates that we obtain from the random walk MH samples.
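Here is a minimal sketch of the kind of Rao-Blackwellised estimate described here, for the posterior inclusion probability of variable j (my illustration; the paper's exact estimator may differ):

```python
import math

def rb_inclusion_prob(chain, j, log_post):
    """Rao-Blackwellised estimate of P(j in gamma | data).

    For each sampled model, compare the two models that agree with it on
    every coordinate except j; these are exactly the kinds of local
    posterior evaluations LIT-MH already performs in each iteration.
    """
    total = 0.0
    for gamma in chain:
        lw = log_post(frozenset(gamma) | {j})    # model including j
        lwo = log_post(frozenset(gamma) - {j})   # model excluding j
        m = max(lw, lwo)                         # stabilise the softmax
        total += math.exp(lw - m) / (math.exp(lw - m) + math.exp(lwo - m))
    return total / len(chain)
```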
367
00:45:16,710 --> 00:45:21,530
But I don't want to over-emphasise the advantages of this Rao-Blackwellisation
368
00:45:21,530 --> 00:45:25,940
scheme, because it is probably a little bit controversial in the sense that
369
00:45:25,940 --> 00:45:38,360
we cannot expect that in general settings Rao-Blackwellisation is always going to help us in terms of the estimation of β.
370
00:45:38,360 --> 00:45:46,460
For example, if the posterior is really multimodal, then perhaps Rao-Blackwellisation is not going to help us that much.
371
00:45:46,460 --> 00:45:52,640
The point I want to make here is that when we implement locally informed proposal schemes,
372
00:45:52,640 --> 00:45:59,240
when we do all those local evaluations of the landscape in the neighbourhood, these calculations,
373
00:45:59,240 --> 00:46:04,730
although they may look expensive, can be made use of in multiple different ways:
374
00:46:04,730 --> 00:46:15,540
we can use them to compute our proposal probabilities, and we can use them to calculate, for example, Rao-Blackwellised estimates of β.
375
00:46:15,540 --> 00:46:18,960
And of course, we may say that in reality
376
00:46:18,960 --> 00:46:28,290
we would probably never encounter a unimodal posterior distribution; we would have a distribution that is somewhat multimodal.
377
00:46:28,290 --> 00:46:34,180
So let's consider how our algorithm works in more realistic settings.
378
00:46:34,180 --> 00:46:38,640
For the second simulation setting,
379
00:46:38,640 --> 00:46:42,330
I'm going to consider a much more complicated
380
00:46:42,330 --> 00:46:51,180
dependence structure for our design matrix, and I'm going to sample 100 influential covariates randomly, with their effect sizes
381
00:46:51,180 --> 00:46:55,680
sampled from normal distributions. So some influential covariates
382
00:46:55,680 --> 00:47:01,260
may have larger effects and some influential covariates may have smaller effects.
383
00:47:01,260 --> 00:47:08,040
And because we have 100 influential covariates and the design matrix has a very complicated dependence structure,
384
00:47:08,040 --> 00:47:12,570
we expect that the posterior landscape will be multimodal.
385
00:47:12,570 --> 00:47:22,050
Now, for this simulation setting, we still observe that LIT-MH performs much better than random walk MH. We
386
00:47:22,050 --> 00:47:27,420
can measure the performance of the two algorithms using the effective sample size,
387
00:47:27,420 --> 00:47:31,770
divided by the runtime.
388
00:47:31,770 --> 00:47:43,470
And then we see that the effective sample size of the LIT-MH algorithm is always much larger than that of the random walk
389
00:47:43,470 --> 00:47:52,830
MH algorithm. And this says that our LIT-MH algorithm is also pretty efficient in terms of exploring,
390
00:47:52,830 --> 00:47:58,650
say, a realistic multimodal posterior distribution for the variable selection problem.
391
00:47:58,650 --> 00:48:06,270
Finally, we also consider a real data analysis on a genetic dataset with approximately
392
00:48:06,270 --> 00:48:13,230
5000 samples and p around three hundred thousand. For this very large dataset,
393
00:48:13,230 --> 00:48:21,330
we still observe that LIT-MH performs much better than the random walk MH. The effective sample size of random walk MH
394
00:48:21,330 --> 00:48:28,320
is just about two per minute, but LIT-MH attains an effective sample size of approximately
395
00:48:28,320 --> 00:48:32,760
three hundred per minute.
396
00:48:32,760 --> 00:48:46,290
OK, finally, let me just briefly discuss how we theoretically study the mixing time of our LIT-MH algorithm.
397
00:48:46,290 --> 00:48:51,510
You have about a minute left for the proof. Thank you very much. Yeah.
398
00:48:51,510 --> 00:49:03,610
So we'll use a drift condition approach. OK, so this is what I mean by a drift condition.
399
00:49:03,610 --> 00:49:15,690
I guess I need to skip the technical details here. But recall that previously I was saying that for the random walk algorithm,
400
00:49:15,690 --> 00:49:19,410
Yang et al. proved the mixing rate using the canonical path method.
401
00:49:19,410 --> 00:49:26,910
And I said that path methods are kind of the most widely used techniques for Markov chains on finite spaces.
402
00:49:26,910 --> 00:49:34,590
But here we take a somewhat unusual drift condition approach. I say it is somewhat
403
00:49:34,590 --> 00:49:41,500
unusual because drift conditions have rarely been used in the mixing time analysis of MCMC algorithms on
404
00:49:41,500 --> 00:49:51,390
discrete state spaces; drift condition approaches have been widely used for Gibbs samplers on continuous state spaces.
405
00:49:51,390 --> 00:49:55,770
But let's try a drift condition approach for our algorithm.
406
00:49:55,770 --> 00:50:06,990
Another kind of unusual feature of our approach is that we actually prove two different drift conditions for our algorithm on the state space,
407
00:50:06,990 --> 00:50:15,000
and we then try to combine these two conditions to find a mixing time bound for our LIT-MH algorithm.
408
00:50:15,000 --> 00:50:21,600
To explain this two-stage drift condition, I want to first partition our state space.
409
00:50:21,600 --> 00:50:26,670
So I'm going to classify all the models into two groups: one is called underfitted models,
410
00:50:26,670 --> 00:50:28,150
and the other is called overfitted
411
00:50:28,150 --> 00:50:37,060
models. Of course, overfitted models just refer to the models that have already included all the influential covariates.
412
00:50:37,060 --> 00:50:43,240
Now we are going to prove one drift condition for all the overfitted models,
413
00:50:43,240 --> 00:50:51,580
then we'll prove another drift condition for all the underfitted models,
414
00:50:51,580 --> 00:50:58,810
and then we'll try to combine these two conditions and show that the chain is going to mix quickly.
415
00:50:58,810 --> 00:51:02,780
So this picture briefly illustrates the idea of our proof.
416
00:51:02,780 --> 00:51:09,940
This is our model space, and most of the models are underfitted, represented by this grey area.
417
00:51:09,940 --> 00:51:16,930
And a small subset of this model space are overfitted models, represented by this blue circle.
418
00:51:16,930 --> 00:51:24,670
What we propose is a two-stage drift condition. Of course, it is based on some theoretical insights into the Bayesian variable selection problem.
419
00:51:24,670 --> 00:51:31,510
There exist a lot of high-dimensional results for stepwise procedures for variable selection.
420
00:51:31,510 --> 00:51:36,640
And those results tell us that if the current model is underfitted,
421
00:51:36,640 --> 00:51:45,910
then we can always add some covariate to the current model so that we will be able to better explain the variation in our
422
00:51:45,910 --> 00:51:54,760
response vector y. And once we have included all the influential covariates and the model becomes overfitted,
423
00:51:54,760 --> 00:52:02,290
then we will be able to remove those non-influential covariates, because they become kind of useless.
424
00:52:02,290 --> 00:52:08,560
So this intuition tells us that if we consider the behaviour of LIT-MH, and actually
425
00:52:08,560 --> 00:52:13,420
also the behaviour of the random walk algorithm for the variable selection problem,
426
00:52:13,420 --> 00:52:16,210
then if we start from an underfitted model,
427
00:52:16,210 --> 00:52:28,390
the chain should first kind of drift towards the set of all the overfitted models, so that we can explain the variation in y.
428
00:52:28,390 --> 00:52:37,240
And once the chain arrives at this set of overfitted models, it will drift towards the true model.
429
00:52:37,240 --> 00:52:45,490
So this is the main intuition behind our two-stage drift condition: one condition shows that the chain moves quickly from,
430
00:52:45,490 --> 00:52:48,580
say, any underfitted model to an overfitted model,
431
00:52:48,580 --> 00:52:59,430
and the other condition shows that once we arrive at an overfitted model, we will quickly move towards the true model.
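Schematically (my reconstruction of the shape of the argument, not the paper's exact statement), the two conditions take the usual geometric drift form: for drift functions V1, V2 and constants ρ1, ρ2 in (0, 1),

```latex
\mathbb{E}\big[ V_1(\gamma_{t+1}) \mid \gamma_t = \gamma \big] \le (1 - \rho_1)\, V_1(\gamma) \ \ \text{for underfitted } \gamma,
\qquad
\mathbb{E}\big[ V_2(\gamma_{t+1}) \mid \gamma_t = \gamma \big] \le (1 - \rho_2)\, V_2(\gamma) \ \ \text{for overfitted } \gamma.
```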
432
00:52:59,430 --> 00:53:03,470
I'm going to skip the technical details; in our paper,
433
00:53:03,470 --> 00:53:14,720
you will find some general results on how to efficiently combine two drift conditions on a state space.
434
00:53:14,720 --> 00:53:18,890
we actually derived these results for general state spaces.
435
00:53:18,890 --> 00:53:22,550
So these results, I mean, they are not limited to discrete state spaces.
436
00:53:22,550 --> 00:53:25,100
And actually, you can also generalise, again,
437
00:53:25,100 --> 00:53:32,270
our argument to the case of more than two conditions: three conditions, four conditions, et cetera.
438
00:53:32,270 --> 00:53:35,720
As long as the state space has such a nested structure,
439
00:53:35,720 --> 00:53:43,520
then you can use our argument to combine multiple different conditions and then obtain a mixing time bound.
440
00:53:43,520 --> 00:53:52,070
So I think I will maybe use another two minutes to briefly discuss some generalisations of the results that I have discussed so far.
441
00:53:52,070 --> 00:54:00,980
So this LIT-MH algorithm, and all the intuition that I have been talking about,
442
00:54:00,980 --> 00:54:05,780
of course, can be applied to other model selection problems.
443
00:54:05,780 --> 00:54:13,970
I mean, other than the variable selection problem. And actually, for any general discrete state space,
444
00:54:13,970 --> 00:54:23,120
we can prove some useful results for these locally informed MCMC schemes.
445
00:54:23,120 --> 00:54:32,270
We can actually prove that for any discrete space, as long as the posterior distribution satisfies a unimodal condition,
446
00:54:32,270 --> 00:54:40,490
then we will be able to construct a locally informed MCMC algorithm such that its convergence
447
00:54:40,490 --> 00:54:48,620
rate can be bounded by a universal constant that doesn't depend on the number of variables.
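A hedged paraphrase of what such a unimodal condition typically requires, in assumed notation: every model other than the mode should see an uphill neighbour,

```latex
\text{for every } \gamma \neq \gamma^* : \qquad
\max_{\gamma' \in \mathcal{N}(\gamma)} \frac{\pi(\gamma')}{\pi(\gamma)} \;\ge\; R
\quad \text{for some sufficiently large constant } R .
```

A locally informed proposal then concentrates on such uphill moves, and the truncation keeps their acceptance probabilities bounded away from zero, which is, roughly, where the dimension-free constant comes from.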
448
00:54:48,620 --> 00:54:55,250
So this dimension-free mixing can always be achieved for other model selection problems,
449
00:54:55,250 --> 00:55:00,890
not only variable selection, and we can always devise such locally
450
00:55:00,890 --> 00:55:06,440
informed proposal schemes that can achieve these dimension-free mixing rates.
451
00:55:06,440 --> 00:55:14,430
And the critical step in the devising of these locally informed proposals and
452
00:55:14,430 --> 00:55:21,100
the theoretical analysis of these locally informed MCMC schemes is always to verify this
453
00:55:21,100 --> 00:55:25,760
unimodal condition: you want to prove that the posterior distribution is unimodal,
454
00:55:25,760 --> 00:55:34,880
and then you would be able to use some arguments to establish that dimension-free mixing rate.
455
00:55:34,880 --> 00:55:40,520
So, limited by time, I guess I'm just going to quickly wrap up my talk.
456
00:55:40,520 --> 00:55:45,880
OK, so today I have been talking about
457
00:55:45,880 --> 00:55:55,730
this LIT-MH algorithm, which is kind of a simple but, we want to say, pretty efficient solution to the variable selection problem,
458
00:55:55,730 --> 00:55:59,030
and it can attain a provable dimension-free mixing rate.
459
00:55:59,030 --> 00:56:06,080
And I want to say that for these locally informed proposal schemes,
460
00:56:06,080 --> 00:56:10,760
the implementation of the proposal scheme can always be easily parallelised.
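To make the parallelisation claim concrete, here is a minimal sketch with assumed names (`log_post`, the single-flip neighbourhood, and the truncation interval are illustrative choices, not necessarily the exact scheme in the paper):

```python
import numpy as np

def informed_weights(gamma, log_post, lo=-5.0, hi=5.0):
    """Truncated locally informed proposal weights over all single-flip
    neighbours of the 0/1 inclusion vector gamma. Each neighbour's
    weight depends only on its own posterior value, so the loop is
    embarrassingly parallel (threads, vectorisation, or GPU)."""
    p = len(gamma)
    cur = log_post(gamma)
    log_w = np.empty(p)
    for j in range(p):                  # independent tasks
        nb = gamma.copy()
        nb[j] = 1 - nb[j]               # flip covariate j (add or delete)
        # truncate the log posterior ratio so no weight is too extreme
        log_w[j] = np.clip(log_post(nb) - cur, lo, hi)
    w = np.exp(log_w - log_w.max())     # stable normalisation
    return w / w.sum()

# One proposal draw, with rng = np.random.default_rng():
#   j = rng.choice(len(gamma), p=informed_weights(gamma, log_post))
```

All p posterior evaluations are mutually independent, which is why a parallel implementation can run much faster than serial timings would suggest.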
461
00:56:10,760 --> 00:56:18,120
So actually, in reality, it can be implemented in a much faster way than even in the
462
00:56:18,120 --> 00:56:24,260
simulation results that I have shown. And because of the simplicity of our algorithm,
463
00:56:24,260 --> 00:56:31,100
you can also combine our proposal schemes with other more advanced techniques such as blocking,
464
00:56:31,100 --> 00:56:38,690
tempering, lifting, etc. But whether these advanced techniques are going to further improve the mixing rate,
465
00:56:38,690 --> 00:56:44,720
or whether these techniques are going to further reduce the total complexity of the algorithm;
466
00:56:44,720 --> 00:56:52,330
well, I guess it's pretty hard to say, and we need to do simulations to verify.
467
00:56:52,330 --> 00:57:00,980
And this methodology can be generalised to other model selection problems, as long as we can verify the so-called unimodal
468
00:57:00,980 --> 00:57:05,000
condition. And, limited by time,
469
00:57:05,000 --> 00:57:11,480
I cannot talk about the other problems, but in another work, a joint work with my student,
470
00:57:11,480 --> 00:57:17,180
we have verified this unimodal condition for the structure learning problem.
471
00:57:17,180 --> 00:57:21,320
So for this structure learning problem, the graphical model problem,
472
00:57:21,320 --> 00:57:27,530
you can also devise such locally informed MCMC methods which will
473
00:57:27,530 --> 00:57:32,720
achieve a dimension-free mixing rate under some high-dimensional assumptions.
474
00:57:32,720 --> 00:57:37,160
So that's kind of all I want to talk about.
475
00:57:37,160 --> 00:57:44,030
That's all I want to talk about today. Thank you so much for your attention. I hope we still have time for questions.
476
00:57:44,030 --> 00:57:52,310
Thank you for the great talk. So we still have a minute, perhaps for one or two quick questions from the audience.
477
00:57:52,310 --> 00:57:59,020
Jeff? OK, so any questions from the audience?
478
00:57:59,020 --> 00:58:03,000
I was actually clapping, not putting my hand up, but I did have one.
479
00:58:03,000 --> 00:58:08,800
Thank you. That was a lovely talk, very nicely explained. So I did have a question. At first sight,
480
00:58:08,800 --> 00:58:14,010
it looks like I need to calculate the marginal likelihoods to calculate those weights.
481
00:58:14,010 --> 00:58:21,330
Right. So I'm thinking, do you have any suggestions: is there like a randomised version
482
00:58:21,330 --> 00:58:27,810
of this, or, like, is it enough to have unbiased estimates or something?
483
00:58:27,810 --> 00:58:35,670
So in our paper, we consider, kind of for simplicity, the setting where the marginal likelihoods can be evaluated in closed form.
484
00:58:35,670 --> 00:58:38,310
So I assume they are available.
485
00:58:38,310 --> 00:58:46,590
Essentially, this means that we are not considering a hierarchical Bayesian model; we assume that all the hyperparameters are given.
486
00:58:46,590 --> 00:58:51,840
So we can always evaluate this pretty easily.
487
00:58:51,840 --> 00:59:01,380
But if we have, say, unknown hyperparameters and the likelihoods are not tractable,
488
00:59:01,380 --> 00:59:07,170
then maybe you can use, say, those pseudo-marginal algorithms and then try to
489
00:59:07,170 --> 00:59:11,850
combine the pseudo-marginal MCMC with our locally informed proposals,
490
00:59:11,850 --> 00:59:14,970
and then you will still be able to obtain similar results.
491
00:59:14,970 --> 00:59:22,020
I guess the main idea is that, essentially, in this talk I was just focussing on how to make the proposals.
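For readers unfamiliar with the pseudo-marginal idea invoked in this answer, a minimal generic sketch (the function names are placeholders; this is the standard construction, not something specific to the paper): replace the intractable marginal likelihood with an unbiased estimate and recycle the current state's estimate.

```python
def pseudo_marginal_step(gamma, Lhat, propose, estimate_lik, prior, rng):
    """One pseudo-marginal MH step. estimate_lik(gamma) returns an
    unbiased estimate of the intractable marginal likelihood; Lhat is
    the estimate carried for the current state and must be reused on
    rejection for the chain to target the correct posterior."""
    gamma_new, q_ratio = propose(gamma, rng)   # q(gamma|new) / q(new|gamma)
    Lhat_new = estimate_lik(gamma_new)         # fresh estimate at the proposal
    accept = (Lhat_new * prior(gamma_new)) / (Lhat * prior(gamma)) * q_ratio
    if rng.random() < min(1.0, accept):        # rng: numpy.random.default_rng()
        return gamma_new, Lhat_new             # accept: adopt the new estimate
    return gamma, Lhat                         # reject: keep the old estimate
```

The locally informed weights would then be built from these estimates as well, which is roughly the combination alluded to above.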
492
00:59:22,020 --> 00:59:28,020
So I appreciate that; I think that what you're doing is very useful.
493
00:59:28,020 --> 00:59:41,100
And if I may, a second question would be: so that algorithm you wrote down with that truncation, you truncated the top and bottom weights.
494
00:59:41,100 --> 00:59:48,000
I mean, that kind of came out of nowhere. Can you do better?
495
00:59:48,000 --> 00:59:55,150
Is that like, do you have a feeling that's the best you can do, or what's the intuition?
496
00:59:55,150 --> 00:59:59,160
So it's a great question.
497
00:59:59,160 --> 01:00:07,050
I forgot to explain the intuition. So I guess the main difficulty in designing these locally informed proposals is
498
01:00:07,050 --> 01:00:12,840
that although it's pretty easy to find a proposal that, with high probability,
499
01:00:12,840 --> 01:00:21,030
is going to propose those states with larger posterior probabilities, the difficulty is: how can we make sure those proposals will get accepted?
500
01:00:21,030 --> 01:00:27,480
So that's the difficulty. And to make the proposals get accepted,
501
01:00:27,480 --> 01:00:38,640
you want to kind of make sure that you also take into account the proposal probability of the process moving back.
502
01:00:38,640 --> 01:00:42,720
Right. That proposal probability cannot be too small.
503
01:00:42,720 --> 01:00:49,260
So that's why eventually we found this truncation scheme, because if we use this truncation scheme,
504
01:00:49,260 --> 01:00:54,240
then the proposal probability of any state can be bounded from below.
505
01:00:54,240 --> 01:01:02,370
So theoretically, this is a great advantage, because it makes the proof possible.
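In symbols, one hedged reading of this answer: if each neighbour's weight is truncated to an interval $[\ell, u]$,

```latex
w(\gamma \to \gamma') \;=\; \min\Big\{ \max\Big\{ \frac{\pi(\gamma')}{\pi(\gamma)},\; \ell \Big\},\; u \Big\},
\qquad 0 < \ell \le u < \infty ,
```

then after normalisation every neighbour, including the reverse move, is proposed with probability at least $\ell / (u\,|\mathcal{N}(\gamma)|)$, and this lower bound is what keeps the acceptance ratio under control in the proof.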
506
01:01:02,370 --> 01:01:07,620
And yes.
507
01:01:07,620 --> 01:01:21,800
That's kind of the main issue. I hope it is helpful. Yeah. OK, so any other quick question from the audience?
508
01:01:21,800 --> 01:01:25,850
OK, I think we can stop today, so thank you, everyone, for coming.
509
01:01:25,850 --> 01:01:31,400
Also, thanks again to our speaker for the great talk. So this is the second-to-last seminar this term.
510
01:01:31,400 --> 01:01:37,430
So we'll see you next Friday for the last seminar this term.
511
01:01:37,430 --> 01:01:41,540
So have a great weekend, everyone. Thank you very much.
512
01:01:41,540 --> 01:01:47,870
Thank you so much for the invitation. Thank you very much for the great questions. And thank you for your attention.
513
01:01:47,870 --> 01:01:51,791
Let's thank the speaker again.