OK, as you heard, today is the second session of the Computational Statistics and Machine Learning Seminar in this Trinity term 2021, and today we are very happy to have Aki Vehtari. He will be speaking about practical pre-asymptotic diagnostics of Monte Carlo estimation in Bayesian inference and machine learning. I just wanted to tell you a bit more about who Aki is, if you don't know him. Aki is an associate professor in computational probabilistic modelling at Aalto University. His research interests are in Bayesian probability theory and methodology, probabilistic programming, inference methods such as expectation propagation, Bayesian model diagnostics, model assessment and selection, Gaussian processes, and hierarchical models. He has published many very relevant papers, both on the methodological side and on applications. He is also an author of, as I hope you know, very interesting books: one of them is Bayesian Data Analysis, which is a classic already, and a more recent one, Regression and Other Stories, which was published last year and is also very interesting because it addresses rigorous applied statistics very carefully using the Bayesian framework. So, given this introduction, you can start now.

Thanks. OK, thanks for the invitation and thanks for the introduction. I'm very happy to talk to you; in your groups you are doing a lot of interesting work. In addition to the university, I also wanted to advertise the Finnish Center for Artificial Intelligence, which is a group of AI, machine learning and probabilistic modelling researchers in Finland; we have maybe 100 people altogether, and we are advertising doctoral student and postdoc positions. I'm also part of a probabilistic programming framework development team and of diagnostics package development teams. This is joint work: my collaborators are listed here, and at the end I'll give some more complete references.

To put this talk in context: I'm interested in all parts of the Bayesian workflow, and this figure is from a recent arXiv paper on Bayesian workflow. Today we will talk about the parts that are related to, or useful for, convergence diagnostics, cross-validation, and some other pieces. So I'm interested in all of it, but this talk is focused on one piece.
In this talk I will first briefly remind you about Monte Carlo estimates, importance sampling, and the central limit theorem, which is often used to justify them. Then I'll go to a concrete example, leave-one-out cross-validation, which was the concrete reason I started to think more about this pre-asymptotic behaviour and diagnostics for it; that is the main part. Then three more examples: validating and improving the end result of variational inference, diagnosing stochastic optimisation, and finally a reminder that the same ideas are useful also for plain Monte Carlo and Markov chain Monte Carlo.

So, in Monte Carlo and Markov chain Monte Carlo we get some draws theta^(s) of the parameters, for example from the posterior distribution p(theta | y), and we may have some function h whose expectation over that distribution we want to know. We can estimate it with the empirical average, E[h(theta) | y] is approximately (1/S) sum_s h(theta^(s)). Here h could be the identity, so we get the posterior mean; it could be squaring, which is needed for the variance; or an indicator function, so we get probabilities, and so on: the usual quantities of interest in Bayesian inference. This estimate is consistent and unbiased (for MCMC, asymptotically), and it is often justified with the central limit theorem: with finite variance, the variance of the expectation estimate goes down as sigma^2 divided by the sample size S.

In importance sampling, instead of getting draws directly from the target distribution, we draw from some other distribution that is easier to sample from, and we can still compute the expectation estimate; we just need to weight the draws. The weights are the ratio of the target and the proposal densities, r^(s) = p(theta^(s) | y) / q(theta^(s)), and the estimate is (1/S) sum_s r^(s) h(theta^(s)). Again this is consistent and unbiased. If one or both of these distributions are known only up to a normalisation constant, we can use self-normalised importance sampling: we just divide by the sum of the weights, sum_s r^(s) h(theta^(s)) / sum_s r^(s), which is consistent and has a small bias that decreases with the sample size. For all of these, consistency, unbiasedness or small bias, and the central limit theorem have been used to justify that, OK, we can use them. For importance sampling, if h(theta) times the weights and the weights themselves have finite variance, then the estimate has finite variance, which implies a central limit theorem.
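[Sketch: a minimal NumPy version of the three estimators just described, plain Monte Carlo, importance sampling, and self-normalised importance sampling. The toy target, proposal, and function h are illustrative choices, not from the talk.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
S = 4000

# Toy setup: target p = N(0, 1), proposal q = Student-t(5), quantity of interest h = identity.
p = stats.norm(0.0, 1.0)
q = stats.t(df=5)
h = lambda theta: theta          # posterior mean as the expectation of interest

# 1) Plain Monte Carlo: draws directly from the target.
theta_p = p.rvs(S, random_state=rng)
plain_mc = np.mean(h(theta_p))

# 2) Importance sampling: draws from the proposal, weighted by r = p/q
#    (both densities are normalized here, so the plain weighted average works).
theta_q = q.rvs(S, random_state=rng)
r = p.pdf(theta_q) / q.pdf(theta_q)
is_est = np.mean(r * h(theta_q))

# 3) Self-normalized importance sampling: divide by the sum of the ratios,
#    so unnormalized densities would be enough; consistent, bias of order 1/S.
snis_est = np.sum(r * h(theta_q)) / np.sum(r)

print(plain_mc, is_est, snis_est)   # all three estimate E[theta] = 0
```

With normalised densities the two importance sampling estimates agree closely; the self-normalised version is the one that remains usable when only unnormalised densities are available.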
Now, instead of the error just going down as one over the sample size, variance in the weights decreases the efficiency, and there is an effective sample size estimate for this. Here I show just the approximation that ignores the function h; there are also function-specific versions. We get the effective sample size simply as one over the sum of the squared normalised weights, where the weights are normalised to sum to one. This is directly related to the variance of the normalised weights, and it is going to be smaller than the total sample size.

There is also an additional useful aspect: if it happens that the variance is infinite but the mean is still finite, these estimates still converge; their distribution converges according to a generalised central limit theorem, towards a stable distribution. That will be useful later in this talk. So it is often taken for granted that we have finite variance and the central limit theorem holds. Sometimes we can guarantee finite variance by construction; for example, in importance sampling, if we choose the proposal so that the weight ratios are bounded, we know that we have finite variance and the central limit theorem holds. But that is not generally trivial, and I will also show that the pre-asymptotic behaviour can be quite different from what the central limit theorem suggests.
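[Sketch: the function-free effective sample size approximation just mentioned, one over the sum of squared normalised weights; the example weight vectors are illustrative.]

```python
import numpy as np

def is_effective_sample_size(ratios):
    """Function-free effective sample size approximation for importance
    sampling: normalize the ratios to sum to one, then ESS = 1 / sum(w^2),
    which is at most the total number of draws."""
    w = np.asarray(ratios, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

# Equal weights give ESS = S; one dominating weight gives ESS close to 1.
print(is_effective_sample_size(np.ones(4000)))               # 4000.0
print(is_effective_sample_size(np.r_[1e6, np.ones(3999)]))   # about 1
```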
Here is the example which started my path to thinking about these diagnostics: leave-one-out cross-validation. The true data generating mechanism is a linear function: y depends on x linearly, plus normally distributed variation around the mean. This is one data realisation from that process, and we fit a linear model; the line here is the posterior mean, and because we have a finite data realisation there is uncertainty, which can be represented with posterior draws. From this we can compute the predictive distribution: this would be the predictive distribution if we knew the parameters, and since we don't know the parameters, we integrate over the posterior uncertainty in the parameters to get the posterior predictive distribution. We would then like to know how good these predictive distributions are for new data. If we don't have the luxury of waiting for new data to test on, we can instead choose an observation, remove it, and fit the model again. The green line shows what the fit would be without that observation, and we can also compute the leave-one-out predictive distribution: instead of using the full-data posterior, we use the leave-one-out posterior, where the notation indicates that we have left out the 18th observation, and you can see that this predictive distribution is different. Now we can use that left-out observation as a kind of proxy for new, future data.

Why do we need to do this? One indication is that if we evaluated the predictive performance just by evaluating the predictive density at the 18th observation while conditioning on all the data, we would get some predictive density, but the leave-one-out predictive densities are usually lower, because it is more difficult to predict something that you did not see. Then we make a summary: we repeat this process for all, in this case 20, observations, compute the predictive densities, and it is useful to take the logarithm and then the sum (or average) of these values. This is a summary of how well we think the model would predict future data, and it is an almost unbiased estimate of the log predictive density for new data. It can be used in model comparison and model selection, and of course we can compute other things using other utilities and cost functions than the log score. Again, as a reminder, if we did not use the leave-one-out predictive distributions we would get a higher value, like here; the difference is about three, which happens to be close to the number of parameters in the model (intercept, slope, and the residual standard deviation), reflecting that with these three parameters we are to some extent fitting to this specific data set. So the non-cross-validated value is a biased estimate of performance on future data, while the leave-one-out value is almost unbiased; there are papers discussing this in more detail.

One issue is that we would need to compute, in this case, 20 leave-one-out posteriors. If we used MCMC for the full posterior, we would then run MCMC 20 more times, and for more complex models that can take a considerable amount of time. So instead: if we have already sampled from the full posterior and we already have these draws, we can use the full posterior as the proposal distribution. The target distribution is the leave-one-out posterior, and in this case we have as many targets as we have observations. Then we can use importance sampling, with importance ratios given by the leave-one-out posterior divided by the full posterior. If you think about the case where the likelihood factorises,
the full posterior has one likelihood term more than the leave-one-out posterior, and if we ignore the normalisation terms, the ratio is proportional simply to one over that single likelihood term, one over p(y_i | theta^(s)), where the draws theta^(s) come from the full posterior. We cannot normalise this, because we don't know the normalisation constant, so we use self-normalised importance sampling. Again, for the full posterior we would use just the empirical average to get the predictive distribution, but now we choose one observation and leave it out. Notice the difference: I have exactly the same lines in this figure as in the previous one; I only changed the alpha channel based on the weights, so if a weight is close to zero the alpha is zero and you can't see that draw, and if the weight is larger the alpha is closer to one. That is the only difference, only the weighting of the draws, and the weighted draws now better reflect the leave-one-out posterior. Of course, if the leave-one-out posterior were very different, so that there was essentially no overlap, this would fail, and the diagnostics I talk about are related to exactly this.

So now we can compute the leave-one-out predictive density for the left-out data point: the predictive density given the parameter values, with the draws coming from the full posterior, and we simply weight the terms using the normalised weights. Easy. But the question is how reliable this estimate is. Is the variance finite? Is the central limit theorem kicking in? There is no general analytic solution; specifically for leave-one-out there are results for normal linear models and for one other specific model type, but no general solution. It is also relevant that the target can have thicker tails than the proposal, because when we leave out one observation there is more uncertainty, and then it is likely that the ratios are not bounded.

What we can do is look empirically at the distribution of the weights. Here, with 400 draws so that it is a bit easier to see, the largest weights are up here, and even the largest weight is not close to one (these are the normalised weights). This line shows what the size of the weights would be if they were all equal, and we have a lot of weights around there. So it seems that this is not that bad a case.
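[Sketch: the importance-sampling leave-one-out computation just described, for one left-out observation. A toy normal-mean model with a conjugate posterior stands in for MCMC draws from the full posterior; the data, prior, and index are illustrative.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy data and a conjugate posterior for the mean (known sigma, flat prior);
# these draws stand in for MCMC draws from the full posterior.
sigma = 1.0
y = rng.normal(0.5, sigma, size=20)
S = 4000
mu_draws = rng.normal(y.mean(), sigma / np.sqrt(len(y)), size=S)

i = 17                                                  # left-out observation (0-based index)
log_lik_i = stats.norm(mu_draws, sigma).logpdf(y[i])    # log p(y_i | theta^(s))

# Leave-one-out importance ratios: r_s proportional to 1 / p(y_i | theta^(s)).
log_r = -log_lik_i
log_r -= log_r.max()                                    # stabilize before exponentiating
w = np.exp(log_r)
w /= w.sum()                                            # self-normalized weights

# Self-normalized IS estimate of the leave-one-out predictive density p(y_i | y_-i).
p_loo_i = np.sum(w * np.exp(log_lik_i))
print(np.log(p_loo_i))                                  # contribution to the elpd_loo sum
```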
Here is the same with 4,000 draws, and now it is difficult to see, but this is the largest weight, and if I zoom in I can see that there are also some other large weights out there. I mentioned the effective sample size estimate: we had a total sample size of 4,000, and the effective sample size based on the variance approximation is around 400, so roughly ten percent efficiency, which is not that bad. But then the question is: is the variance finite? Can we trust this effective sample size estimate? With any finite sample size the empirical variance estimate is, of course, also finite, so looking directly at the variance estimate does not tell us whether the variance is finite.

So we use something else to assess whether the variance is finite: we fit a generalised Pareto distribution to the largest ratios. We choose some cut point, and from that cut point onwards the generalised Pareto distribution has its location at the cut point, is decreasing, has a scale parameter, and has a shape parameter k that controls how thick the tail is. We fit it to the largest ratios, and there is theory saying that if the cut point is chosen far enough in the tail, then for a very large class of distributions the tail part is well approximated by this generalised Pareto distribution. The nice thing about the generalised Pareto distribution is that the shape parameter tells us the number of finite moments: the floor of 1/k is the number of finite moments when k is positive. So if k is less than one half we have finite variance, and the central limit theorem holds.

[Question:] Can I ask a question? This sort of extreme value theory usually needs i.i.d. observations, but if you run MCMC you have a Markov chain, you don't have i.i.d. observations. Is that an issue? [Aki:] Usually in extreme value theory for dependent data that would be an issue if we were asking a question like: what is the probability that we would observe a weight larger than some value? But for just estimating the shape parameter it is not an issue. We would need to take the dependency into account only if we wanted to say something about how likely it is that we would see even more extreme weights. [Question:] OK, thank you.
So, to repeat: we are not stating anything about how likely it would be to see even more extreme weights; we just want to know the shape parameter. If this shape parameter is less than one half, the variance is finite and the central limit theorem holds; if it is less than one, the mean is finite and the generalised central limit theorem holds. In this case we estimate k-hat to be 0.52, which would say that, OK, the central limit theorem does not help here. But we will soon see that whether k is a bit below or a bit above one half, there is no sharp threshold in the behaviour.

Before going on with the diagnostic I'll add this, because the later examples show results based specifically on Pareto smoothed importance sampling: in addition to using the generalised Pareto distribution for diagnostics, we can also use it to replace the largest weights with order statistics of the fitted Pareto distribution. This is equivalent to using it to filter noise out of the largest weights. Here is an illustration. The blue line is from one repetition of a simulation; in each simulation we take 10,000 draws from a specific proposal distribution for a specific target distribution. We rank the weights, sorting them in increasing order, so you see the largest, the 10th largest, the 100th largest weight, and you can see that from simulation to simulation there is a lot of variation in the largest weights. We also compare to truncated importance sampling, which truncates the largest weights and of course reduces variability. With Pareto smoothing, instead of these jumpy lines we get smooth lines, because we replaced the largest ratios with ordered statistics from the fitted generalised Pareto distribution. This reduces the variability compared to plain importance sampling and reduces the bias compared to truncated importance sampling.

Here is another example, again 100 simulations but now with increasing S, estimating the normalisation constant in a case where k is close to 0.5. Plain importance sampling has these jumps, so it is unreliable even as we get more draws; truncated importance sampling does not have that issue, but you can see it staying below the truth, which is caused by its bias; and Pareto smoothed importance sampling reduces both the variability and the bias.
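[Sketch: a simplified version of the tail fit and smoothing just described. The paper uses the Zhang and Stephens empirical Bayes estimator for the generalised Pareto shape and a specific rule for the number of tail draws; here SciPy's maximum likelihood fit and a fixed 20 percent tail stand in for both, so treat the numbers as rough. Production implementations are available, for example in the loo R package and in ArviZ.]

```python
import numpy as np
from scipy import stats

def pareto_khat_and_smooth(ratios, tail_fraction=0.2):
    """Fit a generalized Pareto distribution to the largest importance ratios
    (exceedances over a cut point) and return the shape estimate k-hat together
    with Pareto-smoothed ratios, where the tail values are replaced by the
    expected order statistics of the fitted distribution, capped at the raw max."""
    r = np.asarray(ratios, dtype=float)
    order = np.argsort(r)
    n_tail = max(5, int(tail_fraction * len(r)))
    cut = r[order[-n_tail - 1]]                    # largest ratio below the tail
    exceedances = r[order[-n_tail:]] - cut         # tail, shifted to start at zero
    k_hat, _, sigma = stats.genpareto.fit(exceedances, floc=0)
    probs = (np.arange(1, n_tail + 1) - 0.5) / n_tail
    smoothed = r.copy()
    smoothed[order[-n_tail:]] = np.minimum(
        cut + stats.genpareto.ppf(probs, k_hat, scale=sigma), r[order[-1]]
    )
    return k_hat, smoothed

# Example: N(0,1) target with a slightly too narrow N(0, 0.8^2) proposal, so the
# ratios are unbounded and k-hat should come out clearly positive.
rng = np.random.default_rng(3)
theta = rng.normal(0.0, 0.8, size=4000)
ratios = np.exp(stats.norm(0, 1).logpdf(theta) - stats.norm(0, 0.8).logpdf(theta))
k_hat, smoothed = pareto_khat_and_smooth(ratios)
ess = lambda w: 1.0 / np.sum((w / w.sum()) ** 2)
print(f"k_hat = {k_hat:.2f}, ESS raw = {ess(ratios):.0f}, ESS smoothed = {ess(smoothed):.0f}")
```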
We also prove in the paper, thanks to Dan Simpson, that the Pareto smoothed importance sampling estimate is asymptotically consistent and has finite variance, under some mild but complicated-to-state conditions.

Another cool result is that we can connect this k to a minimum sample size that gives some mild guarantees on the error of self-normalised importance sampling, with a simple relation that looks like this. Small k, with k equal to zero, corresponds to the distribution of the weights being exponential, and k less than zero corresponds to bounded weights. In that regime, say we have 100 draws: if we had independent draws directly from the target distribution, 100 draws is often sufficient, for example for estimating an expectation. But if we instead have a proposal distribution such that k is estimated to be 0.7, then for the same accuracy we would need more than 100,000 draws. And you can see that there is no sharp change at k equal to 0.5. Previously there were also proposals to run a hypothesis test of whether the variance is finite or not, and to abandon hope if the test says the variance is infinite. Instead, we use k continuously, because between 0.5 and 0.7 there is still hope, although beyond about 0.7 the required sample size starts to blow up so that plain importance sampling is not sensible any more; in leave-one-out cross-validation, for example, it would then be better to use something else, maybe even just refit with MCMC.

That was the theoretical result; here is an empirical comparison to the theory, showing the root mean square error when we know the truth. Focus first on the black lines, which show the RMSE for different proposal distributions; the k values shown, which are the true values, tell how good these proposal distributions are, ranging from quite small up to around 0.7 and 0.9. The error goes down as we get more draws. And this also reflects the earlier statement: here we have 100 draws with a good proposal distribution, and if we draw a horizontal line at that error level we can see that with k around 0.7 it indeed matches that we would need more than 100,000 draws to get the same accuracy.
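[Sketch: a small helper encoding the rule of thumb described here. The thresholds 0.5 and 0.7 come from the talk and the associated paper; the wording of the messages is mine.]

```python
def interpret_khat(k_hat):
    """Rule-of-thumb reading of the Pareto shape diagnostic, following the
    thresholds discussed in the talk: nothing sharp happens at 0.5, but above
    about 0.7 the required sample size grows impractically fast."""
    if k_hat < 0.5:
        return "central limit theorem regime: fast, reliable convergence"
    if k_hat < 0.7:
        return ("variance may be infinite but the mean is finite: convergence "
                "is slower, yet the estimate is still practically useful")
    if k_hat < 1.0:
        return ("impractical: the sample size needed for acceptable accuracy "
                "is huge; improve the proposal or refit (e.g. with MCMC)")
    return "mean does not exist: the estimate cannot be trusted at all"

print(interpret_khat(0.52))
print(interpret_khat(0.74))
```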
The red dashed lines are Monte Carlo standard error estimates based on the effective sample size estimate I showed you before, and they match quite closely. Even here, where k is already close to 0.7, the red dashed line and the black line for the true error are close to each other. When k gets larger than that, our error estimates start to be too optimistic, but it is not that bad, because based on k-hat we know we cannot trust them any more. It is also interesting that the slope of these lines corresponds to the speed of convergence: in the good case the error goes down as sigma divided by the square root of S, but when k-hat is clearly larger than 0.5 the slope is different. We can summarise this as a convergence rate: usually, in the good case, the error (as mean squared error) decreases as S to the minus one, but we get worse convergence rates when k is larger; for example, around k equal to 0.7 the rate is only about S to the minus 0.6. The curve is smooth, which is partly just due to the limited number and length of the simulations, but it also reflects that there really is no sharp transition; otherwise, starting from k equal to 0.5, the behaviour would follow this dashed line. The blue line is for the second moment of the normalisation term (first moment for the mean, second moment here), so these are function-specific estimates, and they show similar behaviour. So given k we can say what the minimum required sample size would be, and also how much we can expect the error to decrease if we get more draws.

We do the empirical Pareto fit using the largest ratios. We have a rule for selecting how many of the largest ratios to use for estimating k, and the rule is such that it fulfils the asymptotic requirements: we always go further into the tail, but at the same time the number of draws used in the fit increases so that the estimate gets more accurate. The smallest amount in the experiments has been about the 20 largest ratios out of 100 draws, which is useful but of course has a lot of variation. We use the empirical Bayes, profile-likelihood-based estimate by Zhang and Stephens, which has excellent accuracy, so there seems to be no need to look for a more accurate estimator for this.
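[Sketch: one common way to compute such an effective-sample-size based Monte Carlo standard error for self-normalised importance sampling, using a delta-method style weighted variance. This is a generic formula, not necessarily exactly the one used in the paper, and the toy target and proposal are illustrative.]

```python
import numpy as np
from scipy import stats

def snis_estimate(h_draws, ratios):
    """Self-normalized importance sampling estimate of E[h] with a simple
    delta-method Monte Carlo standard error: with weights normalized to sum to
    one, Var-hat = sum_s w_s^2 (h_s - h_hat)^2, roughly sd(h) / sqrt(ESS)."""
    h = np.asarray(h_draws, dtype=float)
    w = np.asarray(ratios, dtype=float)
    w = w / w.sum()
    h_hat = np.sum(w * h)
    mcse = np.sqrt(np.sum(w ** 2 * (h - h_hat) ** 2))
    return h_hat, mcse, 1.0 / np.sum(w ** 2)     # estimate, its MCSE, and the ESS

# Toy check: E[theta^2] = 1 under the N(0,1) target, with a N(0, 0.8^2) proposal.
rng = np.random.default_rng(6)
theta = rng.normal(0.0, 0.8, size=4000)
ratios = np.exp(stats.norm(0, 1).logpdf(theta) - stats.norm(0, 0.8).logpdf(theta))
print(snis_estimate(theta ** 2, ratios))
```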
See the paper for more details. A useful question that has come up is: why bother with Pareto smoothing and with improving the estimate in cases where k is large and the variance may be infinite; why not just select the proposal so that the ratios are bounded? Yes, that would be nice, but it is not always trivial, as in importance sampling leave-one-out. An even bigger problem is that high-dimensional spaces are not intuitive, and they are really scary.

Here is an example where, by construction, we have bounded ratios and finite variance: the target distribution is a normal distribution and the proposal distribution is a Student-t, which has thicker tails than the normal, so the importance ratios are bounded. We use 100,000 draws, which is a lot, and vary the number of dimensions. Here is the effective sample size: in low dimensions we get quite good efficiency, close to 100 percent, but then the effective sample size starts to drop, and eventually it becomes very low. Why is that? If we also look at the convergence rate, that is, how much additional draws reduce the error, the convergence rate drops too. So even though, by construction, we have bounded ratios, finite variance, and asymptotically the central limit theorem holds, pre-asymptotically the practical convergence rate is lower, and eventually close even to zero. The good thing is that k-hat can detect and diagnose this pre-asymptotic behaviour: since the ratios are bounded, the true k is actually less than zero, and in low dimensions k-hat is also less than zero, but then it goes above zero, and then above 0.7 and beyond. The effective sample size drops correspondingly; as we said, at around k equal to 0.7 we would already need more than 100,000 draws, and k keeps growing, and the convergence rate collapses.

So why is this happening? Why don't we see the asymptotic, central limit theorem behaviour? Here I use a blue line for the normal target distribution and a red line for the proposal distribution, and I plot the marginal density of the distance from the mode, for a 500-dimensional distribution, using 100,000 draws. When we compute how far the proposal draws are from the mode, we get this kind of distribution for those 100,000 draws; this one is for the proposal and this one is for the target. Most of the proposal draws are closer to the mode than draws from the true target distribution would be.
Now, where are the large importance ratios? We can say that the importance ratios are bounded, but look at the scale of the y-axis: the largest ratios are around that size. Before we would actually see that these ratios are bounded, we would need to see some draws out here, and looking at this distribution it is quite unlikely that we would ever get draws out there. In this specific case it is possible to compute that the number of draws we would need to get something out here is much, much larger than the number of atoms in the universe. So there is no hope that we would reach the asymptotic regime, and that is exactly why it is so useful that we can diagnose the pre-asymptotic behaviour. If I rescale this so that the y-axis only goes up to about 10 to the 30, the importance ratio curve here behaves almost like a wall. So in the region where we actually get draws with reasonable sample sizes, it looks just like an infinite-variance case, and we can see that with this empirical k-hat approach.

In a less extreme version of this it is also possible that after, say, 10,000 draws it looks like the variance could be infinite, and only after something like a million draws we might finally start to see that the ratios are actually bounded. The diagnostic can see this too: when you get more draws, k-hat eventually also starts to recognise that the ratios could be bounded. So it is not saying what would happen with infinite sampling; it is saying that at this moment the performance looks really bad, and it does not look like you will get much better by getting more draws.
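[Sketch: a simplified version of this bounded-ratio experiment, using independent Student-t(5) components as the proposal for a standard normal target and fewer draws than the 100,000 used in the talk. The tail fit is the same rough SciPy-based stand-in as in the earlier sketch; the point is only the qualitative pattern of k-hat and the effective sample size as the dimension grows.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
S, df = 10_000, 5      # fewer draws than in the talk, to keep the sketch light

for d in (1, 10, 100, 500):
    x = stats.t(df).rvs(size=(S, d), random_state=rng)            # proposal draws
    log_r = stats.norm.logpdf(x).sum(axis=1) - stats.t(df).logpdf(x).sum(axis=1)
    r = np.sort(np.exp(log_r - log_r.max()))     # rescaling only; harmless for k-hat and ESS
    n_tail = int(0.2 * S)
    k_hat, _, _ = stats.genpareto.fit(r[-n_tail:] - r[-n_tail - 1], floc=0)
    ess = 1.0 / np.sum((r / r.sum()) ** 2)
    print(f"d = {d:3d}   k_hat = {k_hat:5.2f}   ESS = {ess:8.1f} of {S}")
```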
OK, so that was the main part, about the diagnostic, and now I will quickly go through a few applications: validating and improving the end result of variational inference, and diagnosing stochastic optimisation.

For variational inference and distributional approximations, often a normal distribution is used to approximate some unknown posterior distribution, and we can then use that approximation as the proposal distribution for importance sampling. That means we can use it both for diagnostics and for improving the result. In this case we used ADVI, automatic differentiation variational inference, which is a kind of black-box variational inference, and compared it to MCMC, dynamic HMC with the no-U-turn sampler. The timing looks like this: we can run ADVI in less time, but if we run it for too short a time, the stochastic optimisation has not converged yet and the normal approximation is bad. We can use k-hat to recognise when the optimisation result actually is good, and we can also see here that, in addition to diagnosing the result, instead of using just the normal approximation we can use it as a proposal distribution and get a lower root mean square error with Pareto smoothed importance sampling.

ADVI and many other variational inference algorithms use stochastic optimisation, and the papers often say that, OK, we reduce the step size, and when it fulfils the Robbins-Monro conditions it will eventually converge, and then they use the last iterate. In practice, though, the optimisation is stopped much earlier, and then the last iterate can be very noisy due to the stochastic optimisation. There is Polyak-Ruppert iterate averaging, which instead averages many of the iterates from the end. Here is an illustration of how we can use k-hat to show that, especially when the number of dimensions increases, the optimisation problem gets harder: using just the last iterate gives a noisier k-hat, meaning the approximation is worse, while with iterate averaging we stay near 0.7, much more stable.

Then, in addition to looking at the stochastic optimisation, there is the question of which divergence to use. The commonly used exclusive KL tends to underestimate the uncertainty, so people have proposed other, more mass-covering divergences, such as inclusive KL and chi-squared. Many of these common divergences can be written as expectations of the density ratio, the same density ratio as before, times some function, estimated with an empirical average, so it is the same thing we have been doing. What about these different ratio-based divergence objectives? There is earlier theory on how many moments of the ratio w we need: for exclusive KL, a delta that just needs to be larger than zero is enough, while inclusive KL requires two plus delta finite moments, with delta larger than zero. So you can already guess that it is more difficult to estimate inclusive KL than exclusive KL with these Monte Carlo estimates, and high-dimensional spaces again make it worse.
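[Sketch: the three divergences written as expectations of the density ratio w = p/q under q, estimated by simple Monte Carlo with draws from q, in a toy example with normalised Gaussian densities and an underdispersed approximation. At higher dimensions the inclusive KL and chi-squared estimates fall far below their true values even though the estimators are consistent, because the rare draws with huge ratios that carry most of those expectations are essentially never seen; that is the pre-asymptotic bias being discussed.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
S, s = 10_000, 0.8     # q = N(0, s^2 I) as an underdispersed approximation of p = N(0, I)

for d in (1, 10, 50):
    x = rng.normal(0.0, s, size=(S, d))                            # draws from q
    log_w = stats.norm(0, 1).logpdf(x).sum(axis=1) - stats.norm(0, s).logpdf(x).sum(axis=1)
    w = np.exp(log_w)
    excl_kl = -np.mean(log_w)          # KL(q||p) = E_q[-log w]  (exclusive KL)
    incl_kl = np.mean(w * log_w)       # KL(p||q) = E_q[w log w]  (inclusive KL)
    chi2 = np.mean(w ** 2) - 1.0       # chi^2(p||q) = E_q[w^2] - 1
    true_excl = d * (-np.log(s) + (s ** 2 - 1) / 2)
    true_incl = d * (np.log(s) + 1 / (2 * s ** 2) - 0.5)
    true_chi2 = (s / np.sqrt(2 - 1 / s ** 2)) ** d - 1
    print(f"d = {d:3d}")
    print(f"  exclusive KL   est {excl_kl:10.3f}   true {true_excl:10.3f}")
    print(f"  inclusive KL   est {incl_kl:10.3f}   true {true_incl:10.3f}")
    print(f"  chi-squared    est {chi2:10.1f}   true {true_chi2:10.1f}")
```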
Exclusive KL is known to underestimate, and here again, when we increase the dimension to 40 or 50, the red one, the approximation, is underestimating. Inclusive KL, here in low dimensions, is overestimating, and it gives bounded ratios, which is nice; but this overestimation is actually bad in high dimensions: the two distributions no longer overlap, and it is unlikely that from this inclusive-KL proposal distribution, which overestimates the scale, we would get draws in the region where the target mass is. And if we do not get draws there, the ratio-based estimates behave badly.

Here are some results. If we optimise exclusive KL, which tends to underestimate, we can look at how well we are able to estimate the relevant terms: this one was needed for exclusive KL, this one for inclusive KL, and this one for the normalisation constant. It is easy to estimate exclusive KL, and also the normalisation constant, although the k-hat goes up as the dimension grows, so it is not that good. If we instead optimise inclusive KL, it is good in low dimensions, but eventually the k-hat values grow much, much higher, so it is much worse both for the normalisation constant and for the divergence measure itself. And remember that from k-hat we can read off how many draws we would need: in this stochastic optimisation we would be using that many draws in each step, which is infeasible.

And what is the effect when these Monte Carlo estimates start to fail? The dashed line here is the true exclusive KL, and the continuous dark green line is the estimate: we are underestimating the divergence, and in this pre-asymptotic regime it looks like bias, simply because it is so unlikely that we would ever see the overestimates. So in practice we see this biased behaviour, which can explain why exclusive KL is still so much more common than the other divergences, and why people have had problems getting the other divergences to work reliably. With a too-small sample size we get this kind of pre-asymptotic bias, which can affect the results, and again we may need more draws than there are atoms in the universe to make these reliable. See the paper for more results on this.

And then the final example.
This is just a reminder that what I have been talking about is not only for importance sampling and variational inference; it can be used for plain Monte Carlo as well. In one paper we studied this leave-one-out cross-validation, and we wanted to compare importance sampling leave-one-out also to the case where we actually use MCMC to sample from each leave-one-out posterior, so that the draws come directly from the target distribution. But we got seemingly biased results. The model here had a normal distribution for y given some predicted mean and sigma, and when we have draws of these parameters and go far enough into the tail, the predictive densities for the left-out point are very close to zero, and then by chance one of them is far enough from zero that the distribution of these density values is very, very skewed. So we see the same behaviour as in the importance sampling case, and we can use the Pareto diagnostic directly on these density draws to see how long-tailed they are. We can see that, actually, MCMC is also failing here to give the exact result with a reasonable number of draws. OK, that's it. Here are the papers I mentioned throughout the talk, and the co-authors.

[Host:] Thanks a lot, that was very interesting. I was not aware of this Pareto smoothing and k-hat diagnostic, so it is good to know all the capabilities it has. You can ask questions if you want, anyone in the audience. Otherwise, I have a question, which is related to the stacking technique that you did not talk about. Your paper on it seems to be addressing the very pressing topic of how we average models, given that all models are wrong, and you also use this technique in that case, right? So I wanted to know: what is the usefulness of Pareto smoothed importance sampling in the context of stacking? Is it just because of the computational cost, or something else? [Aki:] Yes, that is really why we used it there, because of the computational efficiency. When we did the stacking with MCMC, it usually also works so well that we rarely get these high k-hats, or only a small number of them, so the reliability is good.
We also have an example in that paper with black-box variational inference, and we have another paper with more results for variational inference, in which case it is more likely that we get high k-hats. But even then, Pareto smoothed importance sampling can be such that the stacking helps, even if we do not get close-to-true estimates; even when the error starts to increase, it is still in the correct direction. And also in the last paper mentioned on this slide there are results showing that, depending on what we are interested in, for example if we are interested in the posterior mean, we do get estimates that are closer to the posterior mean even if the k-hat is quite large. [Host:] I see. [Aki:] We also have this other paper on implicitly adaptive importance sampling, which gives an algorithm that can improve on this. In this importance sampling leave-one-out cross-validation we do not have a closed-form proposal distribution; we just have draws from the proposal. Implicitly adaptive importance sampling shows how we can adapt that set of draws using affine transformations so that it matches the target distribution better, with the benefit that we do not need to know the exact shape of the proposal. This is also useful in the sense that when the basic Pareto smoothed importance sampling leave-one-out starts to fail, this can give better accuracy without the need to actually run MCMC. [Host:] I see, thank you. Since there are no other questions in the chat, I want to thank you again for coming. There is one hand up, or is that a clapping symbol? I don't know; maybe it is a clapping symbol. Yeah, probably. OK, thanks a lot. Thank you very much. Right.