1
00:00:04,780 --> 00:00:11,890
OK, so let's start. So welcome, everyone, to this Friday's seminar talk.
2
00:00:11,890 --> 00:00:18,220
So this is our second seminar of this term.
3
00:00:18,220 --> 00:00:23,770
So the speaker today is Quan Zhou from Texas A&M University.
4
00:00:23,770 --> 00:00:31,660
So I met him back in 2018, when he was a postdoc at Rice University and he visited the University of Toronto and
5
00:00:31,660 --> 00:00:40,180
my supervisor, Jeffrey Rosenthal. Starting from that time, we have worked together.
6
00:00:40,180 --> 00:00:50,350
I think two years ago he moved to Texas A&M University, and now he's an assistant professor there.
7
00:00:50,350 --> 00:00:55,330
So our collaboration has been incredible and rather successful, I would say.
8
00:00:55,330 --> 00:01:01,270
So I highly recommend you talk to him if you have some common interests with him.
9
00:01:01,270 --> 00:01:11,650
He has very broad research interests beyond the Bayesian methodology and computation we will hear about today.
10
00:01:11,650 --> 00:01:20,710
He also works on statistical genetics, graphical models, stochastic control and experimental design.
11
00:01:20,710 --> 00:01:26,530
So today he will talk about the complexity of local MCMC search for high-dimensional model selection.
12
00:01:26,530 --> 00:01:34,510
I believe it includes some of our joint work, and also his own work with his students at Texas.
13
00:01:34,510 --> 00:01:39,850
So I will leave it to him; that's it for the introduction.
14
00:01:39,850 --> 00:01:49,240
Thank you so much for having me here. It's really a huge pleasure and a huge honour for me to speak about my research at this seminar.
15
00:01:49,240 --> 00:02:00,050
So today I would like to talk about some of my recent work on locally informed MCMC methods for high-dimensional model selection.
16
00:02:00,050 --> 00:02:10,100
Actually, the talk is mostly based on joint work with Jun Yang, Gareth Roberts,
17
00:02:10,100 --> 00:02:17,260
Jeffrey Rosenthal and Dootika Vats, and some results are based on joint work with my students.
18
00:02:17,260 --> 00:02:25,120
And I will also mention one result which comes from an ongoing project.
19
00:02:25,120 --> 00:02:33,340
So before I start my talk, I would like to share with you a story behind the birth of MCMC.
20
00:02:33,340 --> 00:02:35,620
Perhaps a lot of you already know this.
21
00:02:35,620 --> 00:02:48,370
So the Metropolis algorithm was first implemented by Dr. Arianna Rosenbluth, the woman in this picture, in 1953.
22
00:02:48,370 --> 00:02:54,070
She was actually from Texas, and she received her Ph.D. at the age of twenty-two.
23
00:02:54,070 --> 00:03:02,680
Shortly after that, she implemented the Metropolis algorithm. And at that time, of course, you had to code everything in machine language.
24
00:03:02,680 --> 00:03:10,150
But after she implemented the Metropolis algorithm, she left academia and probably never worked on research again.
25
00:03:10,150 --> 00:03:20,110
And in 2003, someone interviewed her about this very first implementation of the algorithm.
26
00:03:20,110 --> 00:03:28,600
And she said that she was very surprised that anyone still remembered this algorithm.
27
00:03:28,600 --> 00:03:34,600
But as we all know, not only do we still remember this algorithm,
28
00:03:34,600 --> 00:03:42,380
we have been working extensively on the theory and methodology of MCMC for the past maybe 30 years.
29
00:03:42,380 --> 00:03:48,290
Unfortunately, Dr. Rosenbluth passed away at the end of last year.
30
00:03:48,290 --> 00:03:52,390
This is the obituary posted on the IMS website,
31
00:03:52,390 --> 00:03:56,650
and you can find the complete story of her life in The New York Times.
32
00:03:56,650 --> 00:04:01,930
I think it's a very inspiring, amazing and unique story.
33
00:04:01,930 --> 00:04:09,030
Well, anyway, let's start my talk, and let's first quickly review the Metropolis-Hastings algorithm.
34
00:04:09,030 --> 00:04:14,490
So in this talk we will be mostly concerned with finite state spaces,
35
00:04:14,490 --> 00:04:26,280
so I'm just going to use this big S to denote an arbitrary finite state space, and I will use π to denote our target distribution on this space.
36
00:04:26,280 --> 00:04:32,500
Let's assume the support of π is all of S. Then, given any transition matrix K,
37
00:04:32,500 --> 00:04:39,150
we can construct another transition matrix P according to this formula, and we all know that,
38
00:04:39,150 --> 00:04:45,780
just by construction, this new transition matrix is reversible with respect to π.
39
00:04:45,780 --> 00:04:54,630
So if it is further irreducible and aperiodic, then this Markov chain will converge to π, and if we can simulate this Markov chain,
40
00:04:54,630 --> 00:05:02,130
then the samples generated from this Markov chain can be used to approximate the target distribution π.
41
00:05:02,130 --> 00:05:11,370
And the well-known observation here is that to simulate this Markov chain, we only need to know π up to a normalising constant.
42
00:05:11,370 --> 00:05:20,640
And we say this P defines the Metropolis-Hastings algorithm for sampling from π.
43
00:05:20,640 --> 00:05:28,400
We say this matrix K is the proposal, and this minimum term is called the acceptance probability.
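For reference, the construction described here is presumably the standard Metropolis-Hastings kernel; written out (my reconstruction, not copied from the slide):

```latex
P(x, y) = K(x, y)\,\min\!\left\{1,\ \frac{\pi(y)\,K(y, x)}{\pi(x)\,K(x, y)}\right\}
\quad \text{for } y \neq x,
\qquad
P(x, x) = 1 - \sum_{y \neq x} P(x, y).
```

Note that π enters only through the ratio π(y)/π(x), which is why the normalising constant is never needed.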
44
00:05:28,400 --> 00:05:30,210
The meaning is pretty straightforward.
45
00:05:30,210 --> 00:05:39,180
So if we want to simulate this chain P, then in each iteration we propose a new state according to the proposal distribution,
46
00:05:39,180 --> 00:05:45,900
and then we decide whether to accept it or not according to the acceptance probability.
47
00:05:45,900 --> 00:05:50,370
Metropolis-Hastings algorithms, and more generally speaking,
48
00:05:50,370 --> 00:05:57,480
MCMC methods, have been widely used in statistics, because in statistics very often we are
49
00:05:57,480 --> 00:06:06,660
faced with very complicated posterior distributions whose normalising constant is not tractable.
50
00:06:06,660 --> 00:06:12,500
Henceforth I will just refer to π as the posterior distribution.
51
00:06:12,500 --> 00:06:18,170
In this talk, we are mostly interested in a class of problems called model selection,
52
00:06:18,170 --> 00:06:25,880
and perhaps this is the most important class of statistical problems defined on discrete state spaces.
53
00:06:25,880 --> 00:06:29,330
For these problems, the size of the state space
54
00:06:29,330 --> 00:06:40,600
S typically depends on a parameter p, which is just the number of variables or the number of features we have in our dataset.
55
00:06:40,600 --> 00:06:47,560
One example is variable selection. For variable selection, the state space
56
00:06:47,560 --> 00:06:56,150
S is the collection of all binary sequences of length p, so the size of S is 2 to the p.
57
00:06:56,150 --> 00:07:02,020
Another example is the structure learning problem, where we want to identify
58
00:07:02,020 --> 00:07:09,460
the so-called DAG, the directed acyclic graph, underlying our multivariate data.
59
00:07:09,460 --> 00:07:11,710
For this problem, the state space
60
00:07:11,710 --> 00:07:23,940
S is the collection of all DAGs with p nodes, and the size of S grows super-exponentially with p.
61
00:07:23,940 --> 00:07:30,390
We will be mostly interested in high-dimensional settings where p is much larger than the sample size n.
62
00:07:30,390 --> 00:07:36,540
So we have to impose some sparsity constraint so that the model is identifiable.
63
00:07:36,540 --> 00:07:46,140
But even if we take into account the sparsity constraint, in most cases the size of S still grows super-polynomially with p.
64
00:07:46,140 --> 00:07:54,270
So of course we may wonder: are MCMC algorithms efficient enough
65
00:07:54,270 --> 00:07:58,200
for these high-dimensional problems? Because, theoretically speaking,
66
00:07:58,200 --> 00:08:08,250
we can compare MCMC with approximate methods like variational Bayes or maybe ABC methods. MCMC is asymptotically exact,
67
00:08:08,250 --> 00:08:15,090
because if we can generate infinitely many samples from MCMC, we are able to exactly recover the posterior.
68
00:08:15,090 --> 00:08:24,180
But of course, in reality, especially for high-dimensional problems, we really wonder how fast these MCMC algorithms can converge.
69
00:08:24,180 --> 00:08:32,320
If the chain cannot converge in a reasonable amount of time, then it's probably not that useful.
70
00:08:32,320 --> 00:08:42,640
A very useful property that we often want to verify for our algorithm is called rapid mixing:
71
00:08:42,640 --> 00:08:54,250
we say an algorithm is rapidly mixing if its mixing time grows only polynomially with the complexity parameters n and p.
72
00:08:54,250 --> 00:09:02,100
Because the state space can grow super-polynomially with p, rapid mixing is actually a very informative property.
73
00:09:02,100 --> 00:09:14,550
It tells us that the complexity of our algorithm grows much more slowly than the problem size.
74
00:09:14,550 --> 00:09:26,730
In recent years, locally informed MCMC has become pretty popular. To explain what locally informed MCMC is,
75
00:09:26,730 --> 00:09:36,930
let's first introduce some notation. I'm going to use N(x) to denote the so-called neighbourhood of the point x,
76
00:09:36,930 --> 00:09:46,640
which is just the collection of all the points that I can propose given the current point x.
77
00:09:46,640 --> 00:09:55,740
In practice, I would say that for almost every MCMC algorithm defined on a discrete state space,
78
00:09:55,740 --> 00:09:58,830
the size of the neighbourhood
79
00:09:58,830 --> 00:10:10,440
only grows polynomially with p. And the standard way to implement Metropolis-Hastings is to use the so-called random walk proposal,
80
00:10:10,440 --> 00:10:12,690
which just means that, given the neighbourhood,
81
00:10:12,690 --> 00:10:27,440
I'm going to propose one neighbouring state randomly with equal probability. So we can write K as this; the 1 here, of course, is the indicator function.
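As a concrete illustration (my sketch, not code shown in the talk), here is a minimal random-walk Metropolis-Hastings sampler on a finite state space; `neighbors` and `log_post` are placeholder callables supplied by the user.

```python
import math
import random

def random_walk_mh(x0, neighbors, log_post, n_iter, rng=random.Random(0)):
    """Random-walk Metropolis-Hastings on a finite state space.

    neighbors(x) returns the list N(x) of states reachable from x;
    log_post(x) is log pi(x) up to an additive constant, so the
    normalising constant of pi is never needed.
    """
    x, chain = x0, [x0]
    for _ in range(n_iter):
        nbrs = neighbors(x)
        y = rng.choice(nbrs)                   # K(x, y) = 1 / |N(x)|
        # Hastings correction: K(y, x) / K(x, y) = |N(x)| / |N(y)|
        log_alpha = (log_post(y) - log_post(x)
                     + math.log(len(nbrs)) - math.log(len(neighbors(y))))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y                              # accept; otherwise stay at x
        chain.append(x)
    return chain
```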
82
00:10:27,440 --> 00:10:32,120
For Metropolis-Hastings algorithms on finite state spaces, actually for any Markov chain
83
00:10:32,120 --> 00:10:37,880
on a finite state space, we can describe the Markov chain as a graph.
84
00:10:37,880 --> 00:10:45,260
Here in this picture, I'm using these black dots to represent the states of my state space.
85
00:10:45,260 --> 00:10:50,750
And if two states are neighbours, we connect them with an edge.
86
00:10:50,750 --> 00:10:55,670
So if we look at this state x0, then x0 has six neighbours: x1,
87
00:10:55,670 --> 00:11:05,270
x2, and so on up to x6. And these blue bars indicate how large the posterior probability is at each point.
88
00:11:05,270 --> 00:11:11,960
Now, if we use the random walk Metropolis-Hastings algorithm, then at the point x0,
89
00:11:11,960 --> 00:11:19,550
I'm just going to draw one neighbour amongst these six states randomly with equal probability.
90
00:11:19,550 --> 00:11:25,820
But for this distribution, we would probably say that at the point x0,
91
00:11:25,820 --> 00:11:35,450
maybe we want to move to x4, because the posterior mass looks pretty large at the point x4, while the posterior mass on the other five points,
92
00:11:35,450 --> 00:11:43,920
I mean x1, x2, x3, x5 and x6, is actually smaller than the probability of x0.
93
00:11:43,920 --> 00:11:52,350
So intuitively speaking, the best move seems to be from x0 to x4, and the probability
94
00:11:52,350 --> 00:11:58,560
of this move is just one sixth, because we are using the random walk proposal.
95
00:11:58,560 --> 00:12:03,510
So it means that, on average, we probably need, say,
96
00:12:03,510 --> 00:12:16,230
around six iterations to move from x0 to x4. And this heuristic reasoning also tells us that the mixing time
97
00:12:16,230 --> 00:12:31,360
of this random walk Metropolis-Hastings algorithm on a finite state space needs to be at least as large as the neighbourhood size,
98
00:12:31,360 --> 00:12:39,280
because we need approximately, for example, six iterations here to move from x0 to x4. And I am sorry,
99
00:12:39,280 --> 00:12:43,630
I forgot to define what the mixing time is; I will define that later.
100
00:12:43,630 --> 00:12:54,100
But roughly speaking, the mixing time just expresses how many iterations we need for the chain to reach stationarity.
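For completeness, the standard definition the speaker defers is (my addition): the mixing time is the number of iterations needed to get within total-variation distance 1/4 of π from the worst starting state,

```latex
t_{\mathrm{mix}} = \min\left\{ t \ge 0 \;:\; \max_{x \in S} \big\| P^t(x, \cdot) - \pi \big\|_{\mathrm{TV}} \le \tfrac{1}{4} \right\},
```

and rapid mixing means that t_mix is bounded by a polynomial in n and p.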
101
00:12:54,100 --> 00:12:59,950
And the idea of locally informed proposals is that, well,
102
00:12:59,950 --> 00:13:11,140
since we can actually tell that this x4 is a much better move than all the other five neighbouring states,
103
00:13:11,140 --> 00:13:16,500
then why not just propose x4 with a larger probability?
104
00:13:16,500 --> 00:13:21,910
And this is kind of the motivation behind the so-called locally informed proposals.
105
00:13:21,910 --> 00:13:25,570
So given the neighbourhood, let's tune our proposal probabilities.
106
00:13:25,570 --> 00:13:36,810
Let's evaluate the posterior landscape π in the neighbourhood, and then preferentially propose those states with larger posterior probabilities.
107
00:13:36,810 --> 00:13:47,520
So at the point x0 in this example, we would probably propose x4 in the next step if we use locally informed proposals.
108
00:13:47,520 --> 00:13:51,730
So here is the mathematical definition of the locally informed proposal.
109
00:13:51,730 --> 00:14:00,450
Let's use f to denote an arbitrary non-negative function, and then let's define a new proposal matrix
110
00:14:00,450 --> 00:14:03,090
according to this formula.
111
00:14:03,090 --> 00:14:11,910
This indicator function, 1 of N(x), just means that we are still considering the same neighbourhood as the previous random walk Metropolis-Hastings.
112
00:14:11,910 --> 00:14:18,450
And according to this formula, we are going to propose a neighbouring state y
113
00:14:18,450 --> 00:14:25,020
with probability proportional to f(π(y)/π(x)).
114
00:14:25,020 --> 00:14:28,880
And because we have introduced these proposal weights,
115
00:14:28,880 --> 00:14:35,430
we need to calculate the normalising constant Z(x) in the denominator.
116
00:14:35,430 --> 00:14:45,760
So this Z(x) is the normalising constant for our informed proposal distribution.
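In symbols, the proposal described here (consistent with Zanella's informed-proposal framework) is

```latex
K_f(x, y) = \frac{1}{Z(x)}\, f\!\left(\frac{\pi(y)}{\pi(x)}\right) \mathbf{1}\{y \in N(x)\},
\qquad
Z(x) = \sum_{y' \in N(x)} f\!\left(\frac{\pi(y')}{\pi(x)}\right).
```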
117
00:14:45,760 --> 00:14:51,940
And according to our previous heuristic argument, of course,
118
00:14:51,940 --> 00:14:57,740
we want to choose a function f that is non-decreasing, because we want to assign
119
00:14:57,740 --> 00:15:04,600
larger proposal weights to those states with larger posterior probabilities.
120
00:15:04,600 --> 00:15:13,510
But one main disadvantage of this locally informed scheme is that to implement these
121
00:15:13,510 --> 00:15:20,200
informed proposals, we have to evaluate the posterior landscape π in the current neighbourhood.
122
00:15:20,200 --> 00:15:25,560
We have to evaluate π(y) for every y in the current neighbourhood
123
00:15:25,560 --> 00:15:33,720
so that we can calculate this normalising constant, and then we can exactly calculate the proposal probabilities.
124
00:15:33,720 --> 00:15:38,160
But of course this can take a lot of time. Although the neighbourhood size,
125
00:15:38,160 --> 00:15:44,130
well, it's much smaller compared to the whole state space,
126
00:15:44,130 --> 00:15:51,470
if p is really large, then this local evaluation might take a lot of time.
127
00:15:51,470 --> 00:15:53,480
And I want to say that
128
00:15:53,480 --> 00:16:06,830
although this locally informed framework was only recently proposed by Zanella in a paper in 2020,
129
00:16:06,830 --> 00:16:18,050
this idea of locally evaluating π in the current neighbourhood has been widely used in many other methods,
130
00:16:18,050 --> 00:16:22,520
and it has also been used in other non-MCMC methods.
131
00:16:22,520 --> 00:16:31,190
So for all these methods that rely on this idea of locally evaluating π, we can ask the following question.
132
00:16:31,190 --> 00:16:39,680
Can these methods achieve a sufficiently fast convergence rate that offsets the cost of computing
133
00:16:39,680 --> 00:16:46,190
this normalising constant in every iteration? To the best of our knowledge, this question
134
00:16:46,190 --> 00:16:52,970
has never been rigorously addressed for high-dimensional statistical problems.
135
00:16:52,970 --> 00:17:02,710
And this is what we are going to do in today's talk: we are going to answer this question for the variable selection problem.
136
00:17:02,710 --> 00:17:12,010
Before I introduce our algorithm and talk about how we do the proof of the mixing time bound,
137
00:17:12,010 --> 00:17:17,110
I would like to first offer some general intuition about this problem.
138
00:17:17,110 --> 00:17:25,880
To be able to analyse the convergence rate of an MCMC algorithm for a model selection problem,
139
00:17:25,880 --> 00:17:30,890
we want to know something about this posterior distribution π;
140
00:17:30,890 --> 00:17:38,960
we want to be able to say something about what π looks like, because the state space grows
141
00:17:38,960 --> 00:17:45,140
very quickly, at an exponential or even super-exponential rate with p.
142
00:17:45,140 --> 00:17:50,480
So of course we don't expect rapid mixing to hold for any π.
143
00:17:50,480 --> 00:17:59,330
We have to be able to say something about π so that it is possible to prove something like rapid mixing.
144
00:17:59,330 --> 00:18:04,880
One thing that is very natural to assume from a statistical point of view
145
00:18:04,880 --> 00:18:11,630
is the so-called strong selection consistency of the Bayesian model selection procedure.
146
00:18:11,630 --> 00:18:15,590
We say that it has strong selection consistency
147
00:18:15,590 --> 00:18:27,560
if, for some point x*, the posterior probability of x* goes to one in probability with respect to the true data-generating process.
148
00:18:27,560 --> 00:18:32,900
And of course, this x* can be interpreted as the true model.
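In symbols (my rendering of the definition just given), strong selection consistency says

```latex
\pi\big(x^* \mid \text{data}\big) \;\longrightarrow\; 1
\quad \text{in probability under the true data-generating process, as } n \to \infty.
```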
149
00:18:32,900 --> 00:18:38,000
If we think about the classical asymptotic regime where p is fixed and n goes to infinity,
150
00:18:38,000 --> 00:18:46,910
such consistency results have been studied extensively, and in high-dimensional settings things are actually pretty similar.
151
00:18:46,910 --> 00:18:54,500
All we need is that the sample size is not too small compared to p;
152
00:18:54,500 --> 00:19:03,390
then we should have this kind of consistency result: we should be able to recover the true model from the data.
153
00:19:03,390 --> 00:19:10,530
And if the posterior distribution really concentrates on a single best model x*, then according to Markov chain theory,
154
00:19:10,530 --> 00:19:18,390
we actually know that the mixing time of the MCMC algorithm is essentially equivalent to the hitting time of this best model.
155
00:19:18,390 --> 00:19:20,790
And intuitively, this seems to be pretty reasonable.
156
00:19:20,790 --> 00:19:27,360
And actually it makes a lot of practical sense as well, because if π concentrates on one best model,
157
00:19:27,360 --> 00:19:35,970
then our goal is just to identify that single best model.
158
00:19:35,970 --> 00:19:39,080
So how do we prove this strong selection consistency?
159
00:19:39,080 --> 00:19:49,070
One way is to show that this posterior distribution π is unimodal and the unique mode is x*.
160
00:19:49,070 --> 00:19:58,860
And I guess probably a lot of you have some questions about this,
161
00:19:58,860 --> 00:20:06,780
about this consistency and this unimodality condition, and I will return to the unimodality condition in the last section of my talk.
162
00:20:06,780 --> 00:20:12,030
But for the moment, I just want to make one remark: even if we can prove,
163
00:20:12,030 --> 00:20:17,820
even if we can show, that the posterior distribution is unimodal on the model space,
164
00:20:17,820 --> 00:20:24,720
we actually cannot say that much about what the posterior distribution looks like.
165
00:20:24,720 --> 00:20:35,130
Let me explain this point using some examples. This is a bivariate normal distribution with two independent coordinates,
166
00:20:35,130 --> 00:20:39,000
and here I only plot this distribution in the first quadrant.
167
00:20:39,000 --> 00:20:46,560
This is a pretty perfect unimodal distribution, and when we say unimodal distributions,
168
00:20:46,560 --> 00:20:53,850
I guess a lot of us are thinking about distributions that look like this.
169
00:20:53,850 --> 00:20:55,530
Now, let's consider another example.
170
00:20:55,530 --> 00:21:06,090
OK, so let the state space just be the collection of nine points, {1, 2, 3} times {1, 2, 3}.
171
00:21:06,090 --> 00:21:12,370
And then let's consider this posterior distribution π.
172
00:21:12,370 --> 00:21:19,150
For this state space, we will just use the most natural neighbourhood relation, that is, I will say
173
00:21:19,150 --> 00:21:27,530
two points are neighbours if the Euclidean distance between the two points is equal to one.
174
00:21:27,530 --> 00:21:34,880
Then, using this neighbourhood relation, if we look at this posterior landscape,
175
00:21:34,880 --> 00:21:45,410
we can actually tell that this distribution is also unimodal. Although it may look a little strange, it is indeed unimodal.
176
00:21:45,410 --> 00:21:53,980
Let's first look at this point (1,1); the point (1,1) has two neighbours, (1,2) and (2,1).
177
00:21:53,980 --> 00:21:59,360
(1,1) is not a local mode because the point (2,1) has a larger probability.
178
00:21:59,360 --> 00:22:06,250
The point (2,1) is not a local mode either, because the point (3,1) has a larger posterior probability.
179
00:22:06,250 --> 00:22:13,330
And actually, if we simulate a Metropolis-Hastings algorithm to sample from this posterior distribution,
180
00:22:13,330 --> 00:22:25,000
and let's say we initialise the chain at the point (1,1), we will probably see that the chain moves along this black path.
181
00:22:25,000 --> 00:22:34,790
It will move from (1,1) to (2,1) and then (3,1), and eventually it will reach the mode.
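To make this concrete, here is a small sketch that checks unimodality on the 3x3 grid under the distance-one neighbourhood relation; the probability values are made up to mimic the described path, since the slide's actual numbers are not in the transcript.

```python
# Hypothetical unnormalised probabilities increasing along a snake path
# starting (1,1) -> (2,1) -> (3,1) -> ...; the slide's numbers are not in
# the transcript, so these values are purely illustrative.
path = [(1,1), (2,1), (3,1), (3,2), (2,2), (1,2), (1,3), (2,3), (3,3)]
prob = {pt: 2.0 ** i for i, pt in enumerate(path)}

def neighbors(pt):
    """Points at Euclidean distance exactly 1 within the grid."""
    x, y = pt
    return [c for c in [(x+1,y), (x-1,y), (x,y+1), (x,y-1)] if c in prob]

def local_modes(prob):
    """States at least as probable as every neighbour."""
    return [pt for pt in prob
            if all(prob[pt] >= prob[nb] for nb in neighbors(pt))]

print(local_modes(prob))   # a single state => the distribution is unimodal
```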
182
00:22:34,790 --> 00:22:44,090
So this distribution is also unimodal, but it looks very different from the previous bivariate normal example. And of
183
00:22:44,090 --> 00:22:48,530
course we can make it continuous, and if we make this distribution continuous,
184
00:22:48,530 --> 00:22:55,640
then we will have something that looks like this, which is very different from that perfect unimodal distribution.
185
00:22:55,640 --> 00:23:04,160
But it is still unimodal. This is what I mean when I say that even if we can show the posterior distribution is unimodal,
186
00:23:04,160 --> 00:23:09,380
we don't really know that much about the shape of the distribution.
187
00:23:09,380 --> 00:23:16,310
It can still be pretty irregular. And although you may say this example looks pretty artificial,
188
00:23:16,310 --> 00:23:23,120
I would say that for model selection problems like variable selection,
189
00:23:23,120 --> 00:23:33,560
this example really gives a pretty typical scenario that we will encounter, because the variables we have in the high-dimensional setting
190
00:23:33,560 --> 00:23:41,030
are correlated, and the correlation structure amongst these variables can be pretty complicated.
191
00:23:41,030 --> 00:23:46,880
And I want to make this point because in the literature very often we see that
192
00:23:46,880 --> 00:23:53,660
people justify why their algorithms are useful by considering,
193
00:23:53,660 --> 00:24:02,690
say, a high-dimensional setting and assuming that the distribution has independent coordinates.
194
00:24:02,690 --> 00:24:07,070
Of course, those analyses are pretty useful;
195
00:24:07,070 --> 00:24:11,650
they provide us with a lot of insight into the behaviour of the algorithms.
196
00:24:11,650 --> 00:24:15,320
But an assumption like independent coordinates
197
00:24:15,320 --> 00:24:25,100
is not really realistic in the high-dimensional setting, and it is something that we probably can never justify using statistical theory.
198
00:24:25,100 --> 00:24:33,320
On the contrary, this unimodality assumption is something that we can prove, something we can justify according to statistical theory,
199
00:24:33,320 --> 00:24:39,330
because all we need to require is that the sample size be sufficiently large.
200
00:24:39,330 --> 00:24:47,610
And the point is that although the unimodality assumption may seem restrictive at first glance,
201
00:24:47,610 --> 00:24:52,800
the posterior distribution can still look pretty complicated,
202
00:24:52,800 --> 00:25:00,720
pretty irregular, and that can give rise to a lot of difficulties in the theoretical
203
00:25:00,720 --> 00:25:08,650
analysis of MCMC algorithms on these posterior distributions.
204
00:25:08,650 --> 00:25:12,220
Next, we will just focus on the variable selection problem.
205
00:25:12,220 --> 00:25:17,310
Now I'll show you the algorithm we propose for the variable selection problem,
206
00:25:17,310 --> 00:25:25,240
and I will show that our algorithm can achieve a surprisingly fast mixing rate.
207
00:25:25,240 --> 00:25:30,580
So let's just consider the standard linear regression model: y equals X times β
208
00:25:30,580 --> 00:25:40,330
plus ε, where ε represents the normal errors and X is the n-by-p design matrix, with n the sample size and p the number of variables.
209
00:25:40,330 --> 00:25:47,500
Because we are interested in high-dimensional settings where p is much greater than n, we will impose a sparsity constraint.
210
00:25:47,500 --> 00:25:54,970
So we assume that most entries of β are zero, or at least most entries of β are negligible.
211
00:25:54,970 --> 00:26:01,970
For this variable selection problem, I'm going to use γ to denote a state of our state space.
212
00:26:01,970 --> 00:26:06,250
This is the index set of the nonzero entries of β.
213
00:26:06,250 --> 00:26:12,550
In other words, γ is the set of variables that have non-negligible effects.
214
00:26:12,550 --> 00:26:20,300
And the goal of variable selection, of course, is to learn this γ from the observed data.
215
00:26:20,300 --> 00:26:26,980
I will use S₀ to denote our model space for this variable selection problem.
216
00:26:26,980 --> 00:26:38,390
So S₀ is the collection of all subsets of the integers 1, 2, ..., p with size less than or equal to s₀.
217
00:26:38,390 --> 00:26:45,080
Here s₀ is the sparsity parameter, and it may grow as well;
218
00:26:45,080 --> 00:26:55,250
it may grow with n and p, and actually we want to allow this s₀ to go to infinity as n goes to infinity.
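In symbols, the setup just described is

```latex
y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n),
\qquad
\gamma = \{\, j : \beta_j \neq 0 \,\},
\qquad
\mathcal{S}_0 = \{\, \gamma \subseteq \{1, \dots, p\} : |\gamma| \le s_0 \,\}.
```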
219
00:26:55,250 --> 00:27:01,660
So this is the model space for our problem. And for the variable selection problem,
220
00:27:01,660 --> 00:27:12,890
I guess we all know the most classical MCMC solution, which is called the add-delete-swap sampler.
221
00:27:12,890 --> 00:27:15,920
For every state γ in our state space,
222
00:27:15,920 --> 00:27:26,150
let's define three different neighbourhoods, and let me use 'add', 'delete' and 'swap' to denote these three neighbourhoods.
223
00:27:26,150 --> 00:27:37,130
The addition neighbourhood, of course, is the collection of all the neighbouring states that we can move to by adding one covariate to the current model.
224
00:27:37,130 --> 00:27:41,750
The deletion neighbourhood is the set that we can move to by removing one covariate.
225
00:27:41,750 --> 00:27:52,120
And the swap neighbourhood is the set that we can move to by swapping one covariate in the current model with one covariate that is not in the current model.
226
00:27:52,120 --> 00:27:54,880
And of course, we can check that
227
00:27:54,880 --> 00:28:10,110
the combined size of the addition and deletion neighbourhoods is just equal to p, and the size of the swap neighbourhood is |γ| times (p minus |γ|).
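A minimal sketch of these three neighbourhoods (my illustration; models are represented as sets of covariate indices):

```python
def add_moves(gamma, p):
    """N_add: models obtained by adding one covariate not in gamma."""
    return [frozenset(gamma) | {j} for j in range(1, p + 1) if j not in gamma]

def delete_moves(gamma):
    """N_del: models obtained by removing one covariate from gamma."""
    return [frozenset(gamma) - {j} for j in gamma]

def swap_moves(gamma, p):
    """N_swap: swap one covariate in gamma with one outside it."""
    outside = [k for k in range(1, p + 1) if k not in gamma]
    return [(frozenset(gamma) - {j}) | {k} for j in gamma for k in outside]

gamma, p = {1, 2}, 5
assert len(add_moves(gamma, p)) + len(delete_moves(gamma)) == p          # = p
assert len(swap_moves(gamma, p)) == len(gamma) * (p - len(gamma))        # = |g|(p-|g|)
```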
228
00:28:10,110 --> 00:28:18,060
Using these add-delete-swap moves, we can define a random walk algorithm.
229
00:28:18,060 --> 00:28:23,430
For example, let's consider this proposal scheme:
230
00:28:23,430 --> 00:28:30,420
with probability one half, we randomly propose one addition or deletion move,
231
00:28:30,420 --> 00:28:37,490
and with probability one half we propose one swap move, chosen uniformly at random.
232
00:28:37,490 --> 00:28:44,030
In this case, the algorithm is also called symmetric random walk Metropolis-Hastings,
233
00:28:44,030 --> 00:28:54,170
because you can check that the proposal probability from γ to γ′ is always equal to the proposal probability from γ′ to γ.
234
00:28:54,170 --> 00:29:01,220
Although this looks pretty naive, I want to say that because of its simplicity,
235
00:29:01,220 --> 00:29:07,100
it's actually widely used amongst practitioners; it is pretty straightforward to implement,
236
00:29:07,100 --> 00:29:10,700
And the idea behind it is also easy to understand.
237
00:29:10,700 --> 00:29:20,840
In a seminal paper published in 2016 in the Annals of Statistics by Yang, Wainwright and Jordan,
238
00:29:20,840 --> 00:29:27,410
it is shown that under mild high-dimensional assumptions, this kind of naive
239
00:29:27,410 --> 00:29:34,040
symmetric random walk Metropolis-Hastings algorithm is actually rapidly mixing.
240
00:29:34,040 --> 00:29:41,710
So the complexity of this random walk algorithm only grows polynomially with p.
241
00:29:41,710 --> 00:29:49,940
And the authors derived a mixing time bound, which can be interpreted as the complexity of this random
242
00:29:49,940 --> 00:29:58,300
walk algorithm; this mixing time bound is roughly given by this expression. I say roughly here
243
00:29:58,300 --> 00:30:06,430
because I have omitted some hyperparameters which are not really very important.
244
00:30:06,430 --> 00:30:17,590
So it is roughly p times s₀-squared times n. The proof of this mixing time bound relies on the canonical path method,
245
00:30:17,590 --> 00:30:25,840
which is kind of the most widely used method for studying the convergence of Markov chains on finite state spaces.
246
00:30:25,840 --> 00:30:33,280
And the main idea of the proof is actually based on that strong selection consistency.
247
00:30:33,280 --> 00:30:37,210
So one verifies strong selection consistency for the variable selection problem
248
00:30:37,210 --> 00:30:42,880
and proves that, with high probability, the posterior distribution is unimodal.
249
00:30:42,880 --> 00:30:54,270
Then one can use the canonical path method to show that the random walk algorithm is rapidly mixing.
250
00:30:54,270 --> 00:30:59,910
But of course, given that the random walk algorithm already performs pretty well,
251
00:30:59,910 --> 00:31:08,270
we may ask: how fast can the mixing time of an informed MCMC algorithm for variable selection be?
252
00:31:08,270 --> 00:31:19,050
And of course, we want to prove that the mixing time of an informed algorithm is much smaller than the mixing time of the random walk algorithm.
253
00:31:19,050 --> 00:31:29,410
Otherwise, theoretically, we cannot really say why we should prefer a locally informed proposal scheme.
254
00:31:29,410 --> 00:31:34,690
So what are the main challenges for this analysis?
255
00:31:34,690 --> 00:31:44,530
I guess there are two main challenges in the analysis of informed algorithms for this variable selection problem.
256
00:31:44,530 --> 00:31:49,420
First of all, although under some high-dimensional assumptions
257
00:31:49,420 --> 00:31:55,150
we can prove the posterior distribution is unimodal, unimodality only tells us the following:
258
00:31:55,150 --> 00:32:04,120
if we are at a model which is not the true model,
259
00:32:04,120 --> 00:32:12,350
then we can always move to a better model by adding or removing one covariate.
260
00:32:12,350 --> 00:32:17,650
Here, let me use γ* to denote the true model,
261
00:32:17,650 --> 00:32:22,510
and I will refer to the covariates in γ* as influential covariates;
262
00:32:22,510 --> 00:32:33,340
the other covariates will just be called non-influential. So the idea is that if the current model is missing some influential covariate,
263
00:32:33,340 --> 00:32:38,590
then maybe we can use an addition move to increase the posterior probability.
264
00:32:38,590 --> 00:32:45,790
And if we have already included all the influential covariates but have also included some non-influential
265
00:32:45,790 --> 00:32:52,840
ones, then maybe we can remove some non-influential covariates and arrive at a better model.
266
00:32:52,840 --> 00:32:59,470
But the problem here is that because of the dependency structure amongst these variables,
267
00:32:59,470 --> 00:33:08,890
the posterior landscape can still be pretty irregular, in the sense that
268
00:33:08,890 --> 00:33:13,900
we cannot really say much beyond this unimodality.
269
00:33:13,900 --> 00:33:22,300
For example, intuitively we may conjecture that if the sample size is really large,
270
00:33:22,300 --> 00:33:28,270
then perhaps the posterior probability will always increase if we add an influential
271
00:33:28,270 --> 00:33:34,120
covariate, and will always increase if we remove a non-influential one.
272
00:33:34,120 --> 00:33:44,740
But actually, in general, these claims are not right. So in general, we cannot say that adding any influential covariate
273
00:33:44,740 --> 00:33:50,740
will increase the posterior probability. We can say that, with high probability,
274
00:33:50,740 --> 00:33:57,070
there always exists one such good move. But we cannot make claims like this:
275
00:33:57,070 --> 00:34:04,720
we cannot say that the addition of any influential covariate will increase the posterior probability.
276
00:34:04,720 --> 00:34:09,340
And this is just because the variables can be correlated.
277
00:34:09,340 --> 00:34:18,310
Perhaps it is helpful to recall the previous example. The point here is that although the posterior distribution is unimodal,
278
00:34:18,310 --> 00:34:27,560
it doesn't mean that the posterior probability will increase whenever we move in a direction pointing to the mode.
279
00:34:27,560 --> 00:34:35,710
For example, in this example, if we move from (1,1) to (1,2), the probability actually decreases a lot.
280
00:34:35,710 --> 00:34:40,810
So here we cannot always move in the direction pointing to the mode.
281
00:34:40,810 --> 00:34:50,260
And this intuition applies to our variable selection problem as well.
282
00:34:50,260 --> 00:34:57,250
The other challenge is that we have to choose this locally informed proposal scheme carefully,
283
00:34:57,250 --> 00:35:09,400
because if we choose it carelessly, very often we will end up with an algorithm that can get stuck at local modes.
284
00:35:09,400 --> 00:35:18,310
So let's consider this very naive way of implementing an informed proposal.
285
00:35:18,310 --> 00:35:25,090
Given the neighbourhood, which is the collection of all the states that I can move to by one addition or deletion
286
00:35:25,090 --> 00:35:31,750
move, let's propose a new state with probability proportional to the posterior.
287
00:35:31,750 --> 00:35:39,580
This is kind of the most naive, most straightforward way of implementing a locally informed proposal scheme.
288
00:35:39,580 --> 00:35:45,190
But actually, it's not going to work. It's going to fail completely. So we can consider a very simple scenario.
289
00:35:45,190 --> 00:35:49,810
Let's say the true model γ* just has two variables, 1 and 2,
290
00:35:49,810 --> 00:35:52,810
and let's say the current model is the empty set.
291
00:35:52,810 --> 00:36:00,820
Then of course we care about the transition probability of moving from the empty set to the model
292
00:36:00,820 --> 00:36:07,480
with just covariate 1, or maybe the model with just covariate 2. And we can actually
293
00:36:07,480 --> 00:36:13,940
calculate that the transition probability from the empty set to the model with only covariate
294
00:36:13,940 --> 00:36:25,420
1 is bounded by this ratio: the ratio of the posterior probability of the single-covariate model to the posterior probability of the true model.
295
00:36:25,420 --> 00:36:28,780
Now if the sample size is sufficiently large, then of course
296
00:36:28,780 --> 00:36:33,700
the posterior probability of the true model will go to one,
297
00:36:33,700 --> 00:36:38,150
so this ratio will go to zero.
298
00:36:38,150 --> 00:36:44,690
And this will imply that, at the empty model,
299
00:36:44,690 --> 00:36:56,960
we will keep proposing to add covariate 1 or maybe covariate 2, but these proposals will almost always get rejected.
300
00:36:56,960 --> 00:37:03,440
In effect, we will just stay at this empty set for a lot of iterations,
301
00:37:03,440 --> 00:37:08,780
and the chain is not really moving. So this is not going to be a good algorithm.
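One way to reconstruct the bound mentioned here (my derivation, under the naive weights, so that K(x, y) = π(y)/Z(x) with Z(x) the sum of π over N(x)): since the true model {1,2} is a neighbour of {1}, we have Z({1}) ≥ π({1,2}), and therefore

```latex
P(\emptyset \to \{1\})
= K(\emptyset, \{1\})\,\min\!\left\{1,\ \frac{\pi(\{1\})\,K(\{1\}, \emptyset)}{\pi(\emptyset)\,K(\emptyset, \{1\})}\right\}
\;\le\; \frac{\pi(\{1\})}{Z(\{1\})}
\;\le\; \frac{\pi(\{1\})}{\pi(\{1,2\})}
\;\longrightarrow\; 0.
```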
302
00:37:08,780 --> 00:37:15,500
Actually, if we think about the general definition of the locally informed proposal scheme,
303
00:37:15,500 --> 00:37:25,550
we will see that the main difficulty here is that we actually cannot say much about this
304
00:37:25,550 --> 00:37:34,460
normalising constant Z(γ). And this, again, is because we are considering
305
00:37:34,460 --> 00:37:43,010
a discrete-space problem, where the landscape can be very irregular. Or, more precisely speaking,
306
00:37:43,010 --> 00:37:52,100
we can say that the posterior landscape can change drastically if we just move from the current state to a neighbouring state.
307
00:37:52,100 --> 00:37:58,370
This is totally different from a continuous state space, where we have enough continuity
308
00:37:58,370 --> 00:38:04,790
and we can say that if we move locally around the current point, the posterior landscape should only change mildly.
309
00:38:04,790 --> 00:38:09,740
But for model selection problems, for posterior distributions defined on discrete
310
00:38:09,740 --> 00:38:14,000
state spaces, the landscape can be pretty irregular, and the normalising
311
00:38:14,000 --> 00:38:21,450
constant can change drastically if we just move, say, from the current model to a neighbouring model.
312
00:38:21,450 --> 00:38:29,730
And this creates a lot of difficulty in designing efficient, locally informed proposal schemes.
313
00:38:29,730 --> 00:38:34,290
So we tried a lot of different ways to design these locally informed proposals,
314
00:38:34,290 --> 00:38:41,090
and eventually we found that a simple solution was just to use bounded proposal weights.
315
00:38:41,090 --> 00:38:47,310
So now I'm going to introduce our algorithm, which we call LIT-MH:
316
00:38:47,310 --> 00:38:55,860
Metropolis-Hastings with locally informed and thresholded proposals. The idea of our algorithm is pretty simple.
317
00:38:55,860 --> 00:39:03,870
First of all, let's partition the neighbourhood of γ into three subsets: the addition neighbourhood, the deletion neighbourhood and the swap neighbourhood.
318
00:39:03,870 --> 00:39:09,270
And then let's perform the proposal weighting within each neighbourhood separately.
319
00:39:09,270 --> 00:39:17,240
Here I'm using this function w to denote the proposal weight that I assign to a neighbouring state.
320
00:39:17,240 --> 00:39:23,060
For every neighbouring state γ′ in the neighbourhood,
321
00:39:23,060 --> 00:39:29,390
I'm just going to use bounded proposal weights: I will weight
322
00:39:29,390 --> 00:39:36,860
each proposal using the ratio π(γ′)/π(γ), truncated on both sides.
323
00:39:36,860 --> 00:39:40,820
And by properly choosing these truncation thresholds,
324
00:39:40,820 --> 00:39:51,500
we will obtain a locally informed algorithm that is very, very efficient, as we will see shortly.
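A minimal sketch of this thresholded weighting (my illustration, reusing the neighbourhood helpers sketched earlier; the truncation thresholds and the rule for picking a sub-neighbourhood are placeholders, since the talk does not give the specific choices):

```python
import math
import random

def lit_weight(log_ratio, lower=1e-4, upper=1e4):
    """Thresholded weight: pi(g')/pi(g) truncated to [lower, upper]."""
    if log_ratio >= math.log(upper):
        return upper                      # cap large ratios (also avoids overflow)
    return max(math.exp(log_ratio), lower)

def lit_propose(gamma, p, log_post, rng=random.Random(0)):
    """LIT-MH-style proposal: weight each sub-neighbourhood separately."""
    subnbhds = [s for s in (add_moves(gamma, p), delete_moves(gamma),
                            swap_moves(gamma, p)) if s]
    sub = rng.choice(subnbhds)            # placeholder: uniform over non-empty ones
    lp = log_post(gamma)
    w = [lit_weight(log_post(g) - lp) for g in sub]
    return rng.choices(sub, weights=w)[0] # normalised within the sub-neighbourhood
```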
325
00:39:51,500 --> 00:40:02,560
The main result of our paper is that we can actually prove, under essentially the same assumptions used by the paper of Yang et al.,
326
00:40:02,560 --> 00:40:08,300
the paper on random walk Metropolis-Hastings for the variable selection problem, that
327
00:40:08,300 --> 00:40:18,770
the mixing time of our algorithm is bounded by C times n, where C is just a universal constant.
328
00:40:18,770 --> 00:40:26,800
So the point here is that this mixing time bound does not depend on the parameter p.
329
00:40:26,800 --> 00:40:32,770
We call this dimension-free mixing because it doesn't depend on p.
330
00:40:32,770 --> 00:40:36,280
And the point is that in high-dimensional settings p is much larger than n, so
331
00:40:36,280 --> 00:40:42,490
we care much more about the dependence on p than on the sample size n.
332
00:40:42,490 --> 00:40:49,360
And if we compare this mixing time bound of our algorithm
333
00:40:49,360 --> 00:40:57,010
with the mixing time bound of the random walk algorithm in the paper by Yang et al.,
334
00:40:57,010 --> 00:41:02,620
then we'll see that even after taking into account the
335
00:41:02,620 --> 00:41:08,890
fact that we need to locally evaluate the posterior landscape in every iteration,
336
00:41:08,890 --> 00:41:21,060
the total complexity of our LIT-MH algorithm is still smaller than the mixing time bound provided by Yang et al.'s paper.
337
00:41:21,060 --> 00:41:28,170
And if you check our paper, you will see that the mixing time bound for our algorithm
338
00:41:28,170 --> 00:41:33,810
derived in our paper is even slightly smaller than this.
339
00:41:33,810 --> 00:41:40,970
But anyway, the difference is not so important, and I don't want to introduce those heavy notations.
340
00:41:40,970 --> 00:41:45,750
So that's why I just presented this simple version of our result.
341
00:41:45,750 --> 00:41:56,120
The mixing time can be bounded by a multiple of n, which arguably is kind of a surprising result.
342
00:41:56,120 --> 00:42:05,160
Before I talk about the proof, let's do some simulations to check whether our algorithm works well in practice.
343
00:42:05,160 --> 00:42:09,240
So we first consider the simulation settings of Yang et al.'s paper.
344
00:42:09,240 --> 00:42:19,530
It's a pretty simple setting where we always sample the true model γ* such that the size of γ* is equal to 10.
345
00:42:19,530 --> 00:42:30,060
So there are always just 10 influential covariates, and we always initialise our algorithms with a randomly generated model with 10 covariates.
346
00:42:30,060 --> 00:42:35,550
When the signal-to-noise ratio is sufficiently large,
347
00:42:35,550 --> 00:42:44,460
we expect that both random walk Metropolis-Hastings and LIT-MH will be able to identify the posterior mode in a reasonable amount of time.
348
00:42:44,460 --> 00:42:51,390
And we observed that LIT-MH is actually much, much faster than the random walk MH.
349
00:42:51,390 --> 00:42:57,960
For example, with p equal to five thousand and an independent design matrix, random walk MH
350
00:42:57,960 --> 00:43:04,980
takes approximately 15 seconds to find the mode, while LIT-MH only takes about 0.1 seconds.
351
00:43:04,980 --> 00:43:16,250
And if we consider a correlated design matrix, then the advantage of LIT-MH is actually slightly more substantial.
352
00:43:16,250 --> 00:43:25,610
But if the true model is the null model, then we will observe that the landscape tends to be flat.
353
00:43:25,610 --> 00:43:32,990
In that case, random walk Metropolis-Hastings also performs pretty well, and it is actually better than
354
00:43:32,990 --> 00:43:42,440
LIT-MH. And I want to make a point here on the computational aspects of LIT-MH:
355
00:43:42,440 --> 00:43:48,560
recall that in every iteration of LIT-MH, we need to evaluate the landscape in the neighbourhood.
356
00:43:48,560 --> 00:43:57,860
And this actually makes the so-called Rao-Blackwellised estimation of β
357
00:43:57,860 --> 00:44:07,550
easy to do. Rao-Blackwellisation means that we can estimate β using conditioning.
358
00:44:07,550 --> 00:44:15,350
So we look at every entry of β separately. For example, let's look at the first entry of β, and let's condition
359
00:44:15,350 --> 00:44:27,290
on the other p minus one coordinates of the current model γ, that is, condition on
360
00:44:27,290 --> 00:44:32,960
whether we have selected each of the other p minus one covariates in our current model,
361
00:44:32,960 --> 00:44:40,820
and let's see how the posterior probability changes if we include this variable or if we remove this variable.
362
00:44:40,820 --> 00:44:48,380
By doing these conditioning calculations, we will be able to obtain a so-called Rao-Blackwellised estimate of β,
363
00:44:48,380 --> 00:44:52,610
which has a much smaller variance than the naive estimate of β that
364
00:44:52,610 --> 00:44:58,340
we get by simply averaging the samples from the Metropolis-Hastings algorithm. And if
365
00:44:58,340 --> 00:45:05,750
we look at the performance of this Rao-Blackwellised estimator of β, we will see
366
00:45:05,750 --> 00:45:16,710
that it's just way better than the estimates that we obtain from the random walk MH samples.
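Here is a minimal sketch of the kind of Rao-Blackwellised estimate described here, for the posterior inclusion probability of variable j (my illustration; the paper's exact estimator may differ):

```python
import math

def rb_inclusion_prob(chain, j, log_post):
    """Rao-Blackwellised estimate of P(j in gamma | data).

    For each sampled model, compare the two models that agree with it on
    every coordinate except j; these are exactly the kinds of local
    posterior evaluations LIT-MH already performs in each iteration.
    """
    total = 0.0
    for gamma in chain:
        lw = log_post(frozenset(gamma) | {j})    # model including j
        lwo = log_post(frozenset(gamma) - {j})   # model excluding j
        m = max(lw, lwo)                         # stabilise the softmax
        total += math.exp(lw - m) / (math.exp(lw - m) + math.exp(lwo - m))
    return total / len(chain)
```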
367
00:45:16,710 --> 00:45:21,530
But I don't want to over-emphasise the advantages of this Rao-Blackwellisation
368
00:45:21,530 --> 00:45:25,940
scheme, because it is probably a little bit controversial in the sense that
369
00:45:25,940 --> 00:45:38,360
we cannot expect that in general settings Rao-Blackwellisation is always going to help us in terms of the estimation of β.
370
00:45:38,360 --> 00:45:46,460
For example, if the posterior is really multimodal, then perhaps Rao-Blackwellisation is not going to help us that much.
371
00:45:46,460 --> 00:45:52,640
The point I want to make here is that when we implement locally informed proposal schemes,
372
00:45:52,640 --> 00:45:59,240
when we do all those local evaluations of the landscape in the neighbourhood, these calculations,
373
00:45:59,240 --> 00:46:04,730
although they may look expensive, can be made use of in multiple different ways:
374
00:46:04,730 --> 00:46:15,540
we can use them to compute our proposal probabilities, and we can use them to calculate, for example, Rao-Blackwellised estimates of β.
375
00:46:15,540 --> 00:46:18,960
And of course, we may say that in reality
376
00:46:18,960 --> 00:46:28,290
we would probably never encounter a unimodal posterior distribution; we would have a distribution that is somewhat multimodal.
377
00:46:28,290 --> 00:46:34,180
So let's consider how our algorithm works in more realistic settings.
378
00:46:34,180 --> 00:46:38,640
For the second simulation setting,
379
00:46:38,640 --> 00:46:42,330
I'm going to consider a much more complicated
380
00:46:42,330 --> 00:46:51,180
dependence structure for our design matrix, and I'm going to sample 100 influential covariates randomly, with their effect sizes
381
00:46:51,180 --> 00:46:55,680
sampled from normal distributions. So some influential covariates
382
00:46:55,680 --> 00:47:01,260
may have larger effects and some influential covariates may have smaller effects.
383
00:47:01,260 --> 00:47:08,040
And because we have 100 influential covariates and the design matrix has a very complicated dependence structure,
384
00:47:08,040 --> 00:47:12,570
we expect that the posterior landscape will be multimodal.
385
00:47:12,570 --> 00:47:22,050
Now, for this simulation setting, we still observe that LIT-MH performs much better than random walk MH. We
386
00:47:22,050 --> 00:47:27,420
can measure the performance of the two algorithms using the effective sample size,
387
00:47:27,420 --> 00:47:31,770
divided by the runtime.
388
00:47:31,770 --> 00:47:43,470
And then we see that the effective sample size of the LIT-MH algorithm is always much larger than that of the random walk
389
00:47:43,470 --> 00:47:52,830
MH algorithm. And this says that our LIT-MH algorithm is also pretty efficient in terms of exploring,
390
00:47:52,830 --> 00:47:58,650
say, a realistic multimodal posterior distribution for the variable selection problem.
391
00:47:58,650 --> 00:48:06,270
Finally, we also consider a real data analysis on a genetic dataset with approximately
392
00:48:06,270 --> 00:48:13,230
5000 samples and p around three hundred thousand. For this very large dataset,
393
00:48:13,230 --> 00:48:21,330
we still observe that LIT-MH performs much better than the random walk MH. The effective sample size of random walk MH
394
00:48:21,330 --> 00:48:28,320
is just about two per minute, but LIT-MH attains an effective sample size of approximately
395
00:48:28,320 --> 00:48:32,760
three hundred per minute.
396
00:48:32,760 --> 00:48:46,290
OK, finally, let me just briefly discuss how we theoretically study the mixing time of our LIT-MH algorithm.
397
00:48:46,290 --> 00:48:51,510
You have about a minute left for the proof. Thank you very much. Yeah.
398
00:48:51,510 --> 00:49:03,610
So we'll use a drift condition approach. OK, so this is what I mean by a drift condition.
399
00:49:03,610 --> 00:49:15,690
I guess I need to skip the technical details here. But recall that previously I was saying that for the random walk algorithm,
400
00:49:15,690 --> 00:49:19,410
Yang et al. proved the mixing rate using the canonical path method.
401
00:49:19,410 --> 00:49:26,910
And I said that path methods are kind of the most widely used techniques for Markov chains on finite spaces.
402
00:49:26,910 --> 00:49:34,590
But here we take a somewhat unusual drift condition approach. I say it is somewhat
403
00:49:34,590 --> 00:49:41,500
unusual because drift conditions have rarely been used in the mixing time analysis of MCMC algorithms on
404
00:49:41,500 --> 00:49:51,390
discrete state spaces; drift condition approaches have been widely used for Gibbs samplers on continuous state spaces.
405
00:49:51,390 --> 00:49:55,770
But let's try a drift condition approach for our algorithm.
406
00:49:55,770 --> 00:50:06,990
Another kind of unusual feature of our approach is that we actually prove two different drift conditions for our algorithm on the state space,
407
00:50:06,990 --> 00:50:15,000
and we then try to combine these two conditions to find a mixing time bound for our LIT-MH algorithm.
408
00:50:15,000 --> 00:50:21,600
To explain this two-stage drift condition, I want to first partition our state space.
409
00:50:21,600 --> 00:50:26,670
So I'm going to classify all the models into two groups: one is called underfitted models,
410
00:50:26,670 --> 00:50:28,150
and the other is called overfitted
411
00:50:28,150 --> 00:50:37,060
models. Of course, overfitted models just refer to the models that have already included all the influential covariates.
412
00:50:37,060 --> 00:50:43,240
Now we are going to prove one drift condition for all the overfitted models,
413
00:50:43,240 --> 00:50:51,580
then we'll prove another drift condition for all the underfitted models,
414
00:50:51,580 --> 00:50:58,810
and then we'll try to combine these two conditions and show that the chain is going to mix quickly.
415
00:50:58,810 --> 00:51:02,780
So this picture briefly illustrates the idea of our proof.
416
00:51:02,780 --> 00:51:09,940
This is our model space, and most of the models are underfitted, represented by this grey area.
417
00:51:09,940 --> 00:51:16,930
And a small subset of this model space are overfitted models, represented by this blue circle.
418
00:51:16,930 --> 00:51:24,670
What we propose is a two-stage drift condition. Of course, it is based on some theoretical insights into the Bayesian variable selection problem.
419
00:51:24,670 --> 00:51:31,510
There exist a lot of high-dimensional results for stepwise procedures for variable selection.
420
00:51:31,510 --> 00:51:36,640
And those results tell us that if the current model is underfitted,
421
00:51:36,640 --> 00:51:45,910
then we can always add some covariate to the current model so that we will be able to better explain the variation in our
422
00:51:45,910 --> 00:51:54,760
response vector y. And once we have included all the influential covariates and the model becomes overfitted,
423
00:51:54,760 --> 00:52:02,290
then we will be able to remove those non-influential covariates, because they become kind of useless.
424
00:52:02,290 --> 00:52:08,560
So this intuition tells us that if we consider the behaviour of LIT-MH, and actually
425
00:52:08,560 --> 00:52:13,420
also the behaviour of the random walk algorithm for the variable selection problem,
426
00:52:13,420 --> 00:52:16,210
then if we start from an underfitted model,
427
00:52:16,210 --> 00:52:28,390
the chain should first kind of drift towards the set of all the overfitted models, so that we can explain the variation in y.
428
00:52:28,390 --> 00:52:37,240
And once the chain arrives at this set of overfitted models, it will drift towards the true model.
429
00:52:37,240 --> 00:52:45,490
So this is the main intuition behind our two-stage drift condition: one condition shows that the chain moves quickly from,
430
00:52:45,490 --> 00:52:48,580
say, any underfitted model to an overfitted model,
431
00:52:48,580 --> 00:52:59,430
and the other condition shows that once we arrive at an overfitted model, we will quickly move towards the true model.
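Schematically (my reconstruction of the shape of the argument, not the paper's exact statement), the two conditions take the usual geometric drift form: for drift functions V1, V2 and constants ρ1, ρ2 in (0, 1),

```latex
\mathbb{E}\big[ V_1(\gamma_{t+1}) \mid \gamma_t = \gamma \big] \le (1 - \rho_1)\, V_1(\gamma) \ \ \text{for underfitted } \gamma,
\qquad
\mathbb{E}\big[ V_2(\gamma_{t+1}) \mid \gamma_t = \gamma \big] \le (1 - \rho_2)\, V_2(\gamma) \ \ \text{for overfitted } \gamma.
```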
432
00:52:59,430 --> 00:53:03,470
I'm going to skip the technical details; in our paper,
433
00:53:03,470 --> 00:53:14,720
you will find some general results on how to efficiently combine two drift conditions on a state space.
434
00:53:14,720 --> 00:53:18,890
we actually derived these results for general state spaces.
435
00:53:18,890 --> 00:53:22,550
So these results, I mean, they are not limited to discrete state spaces.
436
00:53:22,550 --> 00:53:25,100
And actually, you can also generalise, again,
437
00:53:25,100 --> 00:53:32,270
our argument to the case of more than two conditions: three conditions, four conditions, et cetera.
438
00:53:32,270 --> 00:53:35,720
As long as the state space has such a nested structure,
439
00:53:35,720 --> 00:53:43,520
then you can use our argument to combine multiple different conditions and then obtain a mixing time bound.
440
00:53:43,520 --> 00:53:52,070
So I think I will maybe use another two minutes to briefly discuss some generalisations of the results that I have discussed so far.
441
00:53:52,070 --> 00:54:00,980
So this LIT-MH algorithm, and all the intuition that I have been talking about,
442
00:54:00,980 --> 00:54:05,780
of course, can be applied to other model selection problems.
443
00:54:05,780 --> 00:54:13,970
I mean, other than the variable selection problem. And actually, for any general discrete state space,
444
00:54:13,970 --> 00:54:23,120
we can prove some useful results for these locally informed MCMC schemes.
445
00:54:23,120 --> 00:54:32,270
We can actually prove that for any discrete space, as long as the posterior distribution satisfies a unimodal condition,
446
00:54:32,270 --> 00:54:40,490
then we will be able to construct a locally informed MCMC algorithm such that its convergence
447
00:54:40,490 --> 00:54:48,620
rate can be bounded by a universal constant that doesn't depend on the number of variables.
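A hedged paraphrase of what such a unimodal condition typically requires, in assumed notation: every model other than the mode should see an uphill neighbour,

```latex
\text{for every } \gamma \neq \gamma^* : \qquad
\max_{\gamma' \in \mathcal{N}(\gamma)} \frac{\pi(\gamma')}{\pi(\gamma)} \;\ge\; R
\quad \text{for some sufficiently large constant } R .
```

A locally informed proposal then concentrates on such uphill moves, and the truncation keeps their acceptance probabilities bounded away from zero, which is, roughly, where the dimension-free constant comes from.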
448
00:54:48,620 --> 00:54:55,250
So this dimension-free mixing can always be achieved for other model selection problems,
449
00:54:55,250 --> 00:55:00,890
not only variable selection, and we can always devise such locally
450
00:55:00,890 --> 00:55:06,440
informed proposal schemes that can achieve these dimension-free mixing rates.
451
00:55:06,440 --> 00:55:14,430
And the critical step in the devising of these locally informed proposals and
452
00:55:14,430 --> 00:55:21,100
the theoretical analysis of these locally informed MCMC schemes is always to verify this
453
00:55:21,100 --> 00:55:25,760
unimodal condition: you want to prove that the posterior distribution is unimodal,
454
00:55:25,760 --> 00:55:34,880
and then you would be able to use some arguments to establish that dimension-free mixing rate.
455
00:55:34,880 --> 00:55:40,520
So, limited by time, I guess I'm just going to quickly wrap up my talk.
456
00:55:40,520 --> 00:55:45,880
OK, so today I have been talking about
457
00:55:45,880 --> 00:55:55,730
this LIT-MH algorithm, which is kind of a simple but, we want to say, pretty efficient solution to the variable selection problem,
458
00:55:55,730 --> 00:55:59,030
and it can attain a provable dimension-free mixing rate.
459
00:55:59,030 --> 00:56:06,080
And I want to say that for these locally informed proposal schemes,
460
00:56:06,080 --> 00:56:10,760
the implementation of the proposal scheme can always be easily parallelised.
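To make the parallelisation claim concrete, here is a minimal sketch with assumed names (`log_post`, the single-flip neighbourhood, and the truncation interval are illustrative choices, not necessarily the exact scheme in the paper):

```python
import numpy as np

def informed_weights(gamma, log_post, lo=-5.0, hi=5.0):
    """Truncated locally informed proposal weights over all single-flip
    neighbours of the 0/1 inclusion vector gamma. Each neighbour's
    weight depends only on its own posterior value, so the loop is
    embarrassingly parallel (threads, vectorisation, or GPU)."""
    p = len(gamma)
    cur = log_post(gamma)
    log_w = np.empty(p)
    for j in range(p):                  # independent tasks
        nb = gamma.copy()
        nb[j] = 1 - nb[j]               # flip covariate j (add or delete)
        # truncate the log posterior ratio so no weight is too extreme
        log_w[j] = np.clip(log_post(nb) - cur, lo, hi)
    w = np.exp(log_w - log_w.max())     # stable normalisation
    return w / w.sum()

# One proposal draw, with rng = np.random.default_rng():
#   j = rng.choice(len(gamma), p=informed_weights(gamma, log_post))
```

All p posterior evaluations are mutually independent, which is why a parallel implementation can run much faster than serial timings would suggest.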
461
00:56:10,760 --> 00:56:18,120
So actually, in reality, it can be implemented in a much faster way than even in the
462
00:56:18,120 --> 00:56:24,260
simulation results that I have shown. And because of the simplicity of our algorithm,
463
00:56:24,260 --> 00:56:31,100
you can also combine our proposal schemes with other more advanced techniques such as blocking,
464
00:56:31,100 --> 00:56:38,690
tempering, lifting, etc. But whether these advanced techniques are going to further improve the mixing rate,
465
00:56:38,690 --> 00:56:44,720
or whether these techniques are going to further reduce the total complexity of the algorithm;
466
00:56:44,720 --> 00:56:52,330
well, I guess it's pretty hard to say, and we need to do simulations to verify.
467
00:56:52,330 --> 00:57:00,980
And this methodology can be generalised to other model selection problems, as long as we can verify the so-called unimodal
468
00:57:00,980 --> 00:57:05,000
condition. And, limited by time,
469
00:57:05,000 --> 00:57:11,480
I cannot talk about the other problems, but in another work, a joint work with my student,
470
00:57:11,480 --> 00:57:17,180
we have verified this unimodal condition for the structure learning problem.
471
00:57:17,180 --> 00:57:21,320
So for this structure learning problem, the graphical model problem,
472
00:57:21,320 --> 00:57:27,530
you can also devise such locally informed MCMC methods which will
473
00:57:27,530 --> 00:57:32,720
achieve a dimension-free mixing rate under some high-dimensional assumptions.
474
00:57:32,720 --> 00:57:37,160
So that's kind of all I want to talk about.
475
00:57:37,160 --> 00:57:44,030
That's all I want to talk about today. Thank you so much for your attention. I hope we still have time for questions.
476
00:57:44,030 --> 00:57:52,310
Thank you for the great talk. So we still have a minute, perhaps for one or two quick questions from the audience.
477
00:57:52,310 --> 00:57:59,020
Jeff? OK, so any questions from the audience?
478
00:57:59,020 --> 00:58:03,000
I was actually clapping, not putting my hand up, but I did have one.
479
00:58:03,000 --> 00:58:08,800
Thank you. That was a lovely talk, very nicely explained. So I did have a question. At first sight,
480
00:58:08,800 --> 00:58:14,010
it looks like I need to calculate the marginal likelihoods to calculate those weights.
481
00:58:14,010 --> 00:58:21,330
Right. So I'm thinking, do you have any suggestions: is there like a randomised version
482
00:58:21,330 --> 00:58:27,810
of this, or, like, is it enough to have unbiased estimates or something?
483
00:58:27,810 --> 00:58:35,670
So in our paper, we consider, kind of for simplicity, the setting where the marginal likelihoods can be evaluated in closed form.
484
00:58:35,670 --> 00:58:38,310
So I assume they are available.
485
00:58:38,310 --> 00:58:46,590
Essentially, this means that we are not considering a hierarchical Bayesian model; we assume that all the hyperparameters are given.
486
00:58:46,590 --> 00:58:51,840
So we can always evaluate this pretty easily.
487
00:58:51,840 --> 00:59:01,380
But if we have, say, unknown hyperparameters and the likelihoods are not tractable,
488
00:59:01,380 --> 00:59:07,170
then maybe you can use, say, those pseudo-marginal algorithms and then try to
489
00:59:07,170 --> 00:59:11,850
combine the pseudo-marginal MCMC with our locally informed proposals,
490
00:59:11,850 --> 00:59:14,970
and then you will still be able to obtain similar results.
491
00:59:14,970 --> 00:59:22,020
I guess the main idea is that, essentially, in this talk I was just focussing on how to make the proposals.
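For readers unfamiliar with the pseudo-marginal idea invoked in this answer, a minimal generic sketch (the function names are placeholders; this is the standard construction, not something specific to the paper): replace the intractable marginal likelihood with an unbiased estimate and recycle the current state's estimate.

```python
def pseudo_marginal_step(gamma, Lhat, propose, estimate_lik, prior, rng):
    """One pseudo-marginal MH step. estimate_lik(gamma) returns an
    unbiased estimate of the intractable marginal likelihood; Lhat is
    the estimate carried for the current state and must be reused on
    rejection for the chain to target the correct posterior."""
    gamma_new, q_ratio = propose(gamma, rng)   # q(gamma|new) / q(new|gamma)
    Lhat_new = estimate_lik(gamma_new)         # fresh estimate at the proposal
    accept = (Lhat_new * prior(gamma_new)) / (Lhat * prior(gamma)) * q_ratio
    if rng.random() < min(1.0, accept):        # rng: numpy.random.default_rng()
        return gamma_new, Lhat_new             # accept: adopt the new estimate
    return gamma, Lhat                         # reject: keep the old estimate
```

The locally informed weights would then be built from these estimates as well, which is roughly the combination alluded to above.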
492
00:59:22,020 --> 00:59:28,020
So I appreciate that; I think that what you're doing is very useful.
493
00:59:28,020 --> 00:59:41,100
And if I may, a second question would be: so that algorithm you wrote down with that truncation, you truncated the top and bottom weights.
494
00:59:41,100 --> 00:59:48,000
I mean, that kind of came out of nowhere. Can you do better?
495
00:59:48,000 --> 00:59:55,150
Is that like, do you have a feeling that's the best you can do, or what's the intuition?
496
00:59:55,150 --> 00:59:59,160
So it's a great question.
497
00:59:59,160 --> 01:00:07,050
I forgot to explain the intuition. So I guess the main difficulty in designing these locally informed proposals is
498
01:00:07,050 --> 01:00:12,840
that although it's pretty easy to find a proposal that, with high probability,
499
01:00:12,840 --> 01:00:21,030
is going to propose those states with larger posterior probabilities, the difficulty is: how can we make sure those proposals will get accepted?
500
01:00:21,030 --> 01:00:27,480
So that's the difficulty. And to make the proposals get accepted,
501
01:00:27,480 --> 01:00:38,640
you want to kind of make sure that you also take into account the proposal probability of the process moving back.
502
01:00:38,640 --> 01:00:42,720
Right. That proposal probability cannot be too small.
503
01:00:42,720 --> 01:00:49,260
So that's why eventually we found this truncation scheme, because if we use this truncation scheme,
504
01:00:49,260 --> 01:00:54,240
then the proposal probability of any state can be bounded from below.
505
01:00:54,240 --> 01:01:02,370
So theoretically, this is a great advantage, because it makes the proof possible.
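In symbols, one hedged reading of this answer: if each neighbour's weight is truncated to an interval $[\ell, u]$,

```latex
w(\gamma \to \gamma') \;=\; \min\Big\{ \max\Big\{ \frac{\pi(\gamma')}{\pi(\gamma)},\; \ell \Big\},\; u \Big\},
\qquad 0 < \ell \le u < \infty ,
```

then after normalisation every neighbour, including the reverse move, is proposed with probability at least $\ell / (u\,|\mathcal{N}(\gamma)|)$, and this lower bound is what keeps the acceptance ratio under control in the proof.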
506
01:01:02,370 --> 01:01:07,620
And yes.
507
01:01:07,620 --> 01:01:21,800
That's kind of the main issue. I hope it is helpful. Yeah. OK, so any other quick question from the audience?
508
01:01:21,800 --> 01:01:25,850
OK, I think we can stop today, so thank you, everyone, for coming.
509
01:01:25,850 --> 01:01:31,400
Also, thanks again to our speaker for the great talk. So this is the second-to-last seminar this term.
510
01:01:31,400 --> 01:01:37,430
So we'll see you next Friday for the last seminar this term.
511
01:01:37,430 --> 01:01:41,540
So have a great weekend, everyone. Thank you very much.
512
01:01:41,540 --> 01:01:47,870
Thank you so much for the invitation. Thank you very much for the great questions. And thank you for your attention.
513
01:01:47,870 --> 01:01:51,791
Let's thank the speaker again.