OK, as you heard, today is the second session of the Computational Statistics and Machine Learning Seminar in this Trinity term 2021, and today we are very happy to have Aki Vehtari. He will be speaking about practical pre-asymptotic diagnostics of Monte Carlo estimation in Bayesian inference and machine learning. I just wanted to tell you a bit more about who Aki is, if you don't know him. Aki is an associate professor in computational probabilistic modelling at Aalto University. His research interests are in Bayesian probability theory and methodology, probabilistic programming, inference methods such as expectation propagation, Bayesian model diagnostics, model assessment and selection, Gaussian processes, and hierarchical models. He has published many very relevant papers, both on the methodological side and on applications. He is also an author of, as I hope you know, very interesting books: one of them is Bayesian Data Analysis, which is a classic already, and a more recent one, Regression and Other Stories, which was published last year and is also very interesting because it addresses rigorous applied statistics very carefully using the Bayesian framework. So, given this introduction, you can start now.

Thanks. OK, thanks for the invitation and thanks for the introduction. I'm very happy to talk to you; in your groups you are doing a lot of interesting work. In addition to the university, I also wanted to advertise the Finnish Center for Artificial Intelligence, which is a group of AI, machine learning and probabilistic modelling researchers in Finland; we have maybe 100 people altogether, and we are advertising doctoral student and postdoc positions. I'm also part of a probabilistic programming framework development team and of diagnostics package development teams. This is joint work: my collaborators are listed here, and at the end I'll give some more complete references.

To put this talk in context: I'm interested in all parts of the Bayesian workflow, and this figure is from a recent arXiv paper on Bayesian workflow. Today we will talk about the parts that are related to, or useful for, convergence diagnostics, cross-validation, and some other pieces. So I'm interested in all of it, but this talk is focused on one piece.
In this talk I will first briefly remind you about Monte Carlo estimates, importance sampling, and the central limit theorem, which is often used to justify them. Then I'll go to a concrete example, leave-one-out cross-validation, which was the concrete reason I started to think more about this pre-asymptotic behaviour and diagnostics for it; that is the main part. Then three more examples: validating and improving the end result of variational inference, diagnosing stochastic optimisation, and finally a reminder that the same ideas are useful also for plain Monte Carlo and Markov chain Monte Carlo.

So, in Monte Carlo and Markov chain Monte Carlo we get some draws theta^(s) of the parameters, for example from the posterior distribution p(theta | y), and we may have some function h whose expectation over that distribution we want to know. We can estimate it with the empirical average, E[h(theta) | y] is approximately (1/S) sum_s h(theta^(s)). Here h could be the identity, so we get the posterior mean; it could be squaring, which is needed for the variance; or an indicator function, so we get probabilities, and so on: the usual quantities of interest in Bayesian inference. This estimate is consistent and unbiased (for MCMC, asymptotically), and it is often justified with the central limit theorem: with finite variance, the variance of the expectation estimate goes down as sigma^2 divided by the sample size S.

In importance sampling, instead of getting draws directly from the target distribution, we draw from some other distribution that is easier to sample from, and we can still compute the expectation estimate; we just need to weight the draws. The weights are the ratio of the target and the proposal densities, r^(s) = p(theta^(s) | y) / q(theta^(s)), and the estimate is (1/S) sum_s r^(s) h(theta^(s)). Again this is consistent and unbiased. If one or both of these distributions are known only up to a normalisation constant, we can use self-normalised importance sampling: we just divide by the sum of the weights, sum_s r^(s) h(theta^(s)) / sum_s r^(s), which is consistent and has a small bias that decreases with the sample size. For all of these, consistency, unbiasedness or small bias, and the central limit theorem have been used to justify that, OK, we can use them. For importance sampling, if h(theta) times the weights and the weights themselves have finite variance, then the estimate has finite variance, which implies a central limit theorem.
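[Sketch: a minimal NumPy version of the three estimators just described, plain Monte Carlo, importance sampling, and self-normalised importance sampling. The toy target, proposal, and function h are illustrative choices, not from the talk.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
S = 4000

# Toy setup: target p = N(0, 1), proposal q = Student-t(5), quantity of interest h = identity.
p = stats.norm(0.0, 1.0)
q = stats.t(df=5)
h = lambda theta: theta          # posterior mean as the expectation of interest

# 1) Plain Monte Carlo: draws directly from the target.
theta_p = p.rvs(S, random_state=rng)
plain_mc = np.mean(h(theta_p))

# 2) Importance sampling: draws from the proposal, weighted by r = p/q
#    (both densities are normalized here, so the plain weighted average works).
theta_q = q.rvs(S, random_state=rng)
r = p.pdf(theta_q) / q.pdf(theta_q)
is_est = np.mean(r * h(theta_q))

# 3) Self-normalized importance sampling: divide by the sum of the ratios,
#    so unnormalized densities would be enough; consistent, bias of order 1/S.
snis_est = np.sum(r * h(theta_q)) / np.sum(r)

print(plain_mc, is_est, snis_est)   # all three estimate E[theta] = 0
```

With normalised densities the two importance sampling estimates agree closely; the self-normalised version is the one that remains usable when only unnormalised densities are available.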
Now, instead of the error just going down as one over the sample size, variance in the weights decreases the efficiency, and there is an effective sample size estimate for this. Here I show just the approximation that ignores the function h; there are also function-specific versions. We get the effective sample size simply as one over the sum of the squared normalised weights, where the weights are normalised to sum to one. This is directly related to the variance of the normalised weights, and it is going to be smaller than the total sample size.

There is also an additional useful aspect: if it happens that the variance is infinite but the mean is still finite, these estimates still converge; their distribution converges according to a generalised central limit theorem, towards a stable distribution. That will be useful later in this talk. So it is often taken for granted that we have finite variance and the central limit theorem holds. Sometimes we can guarantee finite variance by construction; for example, in importance sampling, if we choose the proposal so that the weight ratios are bounded, we know that we have finite variance and the central limit theorem holds. But that is not generally trivial, and I will also show that the pre-asymptotic behaviour can be quite different from what the central limit theorem suggests.
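[Sketch: the function-free effective sample size approximation just mentioned, one over the sum of squared normalised weights; the example weight vectors are illustrative.]

```python
import numpy as np

def is_effective_sample_size(ratios):
    """Function-free effective sample size approximation for importance
    sampling: normalize the ratios to sum to one, then ESS = 1 / sum(w^2),
    which is at most the total number of draws."""
    w = np.asarray(ratios, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

# Equal weights give ESS = S; one dominating weight gives ESS close to 1.
print(is_effective_sample_size(np.ones(4000)))               # 4000.0
print(is_effective_sample_size(np.r_[1e6, np.ones(3999)]))   # about 1
```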
Here is the example which started my path to thinking about these diagnostics: leave-one-out cross-validation. The true data generating mechanism is a linear function: y depends on x linearly, plus normally distributed variation around the mean. This is one data realisation from that process, and we fit a linear model; the line here is the posterior mean, and because we have a finite data realisation there is uncertainty, which can be represented with posterior draws. From this we can compute the predictive distribution: this would be the predictive distribution if we knew the parameters, and since we don't know the parameters, we integrate over the posterior uncertainty in the parameters to get the posterior predictive distribution. We would then like to know how good these predictive distributions are for new data. If we don't have the luxury of waiting for new data to test on, we can instead choose an observation, remove it, and fit the model again. The green line shows what the fit would be without that observation, and we can also compute the leave-one-out predictive distribution: instead of using the full-data posterior, we use the leave-one-out posterior, where the notation indicates that we have left out the 18th observation, and you can see that this predictive distribution is different. Now we can use that left-out observation as a kind of proxy for new, future data.

Why do we need to do this? One indication is that if we evaluated the predictive performance just by evaluating the predictive density at the 18th observation while conditioning on all the data, we would get some predictive density, but the leave-one-out predictive densities are usually lower, because it is more difficult to predict something that you did not see. Then we make a summary: we repeat this process for all, in this case 20, observations, compute the predictive densities, and it is useful to take the logarithm and then the sum (or average) of these values. This is a summary of how well we think the model would predict future data, and it is an almost unbiased estimate of the log predictive density for new data. It can be used in model comparison and model selection, and of course we can compute other things using other utilities and cost functions than the log score. Again, as a reminder, if we did not use the leave-one-out predictive distributions we would get a higher value, like here; the difference is about three, which happens to be close to the number of parameters in the model (intercept, slope, and the residual standard deviation), reflecting that with these three parameters we are to some extent fitting to this specific data set. So the non-cross-validated value is a biased estimate of performance on future data, while the leave-one-out value is almost unbiased; there are papers discussing this in more detail.

One issue is that we would need to compute, in this case, 20 leave-one-out posteriors. If we used MCMC for the full posterior, we would then run MCMC 20 more times, and for more complex models that can take a considerable amount of time. So instead: if we have already sampled from the full posterior and we already have these draws, we can use the full posterior as the proposal distribution. The target distribution is the leave-one-out posterior, and in this case we have as many targets as we have observations. Then we can use importance sampling, with importance ratios given by the leave-one-out posterior divided by the full posterior. If you think about the case where the likelihood factorises,
the full posterior has one likelihood term more than the leave-one-out posterior, and if we ignore the normalisation terms, the ratio is proportional simply to one over that single likelihood term, one over p(y_i | theta^(s)), where the draws theta^(s) come from the full posterior. We cannot normalise this, because we don't know the normalisation constant, so we use self-normalised importance sampling. Again, for the full posterior we would use just the empirical average to get the predictive distribution, but now we choose one observation and leave it out. Notice the difference: I have exactly the same lines in this figure as in the previous one; I only changed the alpha channel based on the weights, so if a weight is close to zero the alpha is zero and you can't see that draw, and if the weight is larger the alpha is closer to one. That is the only difference, only the weighting of the draws, and the weighted draws now better reflect the leave-one-out posterior. Of course, if the leave-one-out posterior were very different, so that there was essentially no overlap, this would fail, and the diagnostics I talk about are related to exactly this.

So now we can compute the leave-one-out predictive density for the left-out data point: the predictive density given the parameter values, with the draws coming from the full posterior, and we simply weight the terms using the normalised weights. Easy. But the question is how reliable this estimate is. Is the variance finite? Is the central limit theorem kicking in? There is no general analytic solution; specifically for leave-one-out there are results for normal linear models and for one other specific model type, but no general solution. It is also relevant that the target can have thicker tails than the proposal, because when we leave out one observation there is more uncertainty, and then it is likely that the ratios are not bounded.

What we can do is look empirically at the distribution of the weights. Here, with 400 draws so that it is a bit easier to see, the largest weights are up here, and even the largest weight is not close to one (these are the normalised weights). This line shows what the size of the weights would be if they were all equal, and we have a lot of weights around there. So it seems that this is not that bad a case.
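[Sketch: the importance-sampling leave-one-out computation just described, for one left-out observation. A toy normal-mean model with a conjugate posterior stands in for MCMC draws from the full posterior; the data, prior, and index are illustrative.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy data and a conjugate posterior for the mean (known sigma, flat prior);
# these draws stand in for MCMC draws from the full posterior.
sigma = 1.0
y = rng.normal(0.5, sigma, size=20)
S = 4000
mu_draws = rng.normal(y.mean(), sigma / np.sqrt(len(y)), size=S)

i = 17                                                  # left-out observation (0-based index)
log_lik_i = stats.norm(mu_draws, sigma).logpdf(y[i])    # log p(y_i | theta^(s))

# Leave-one-out importance ratios: r_s proportional to 1 / p(y_i | theta^(s)).
log_r = -log_lik_i
log_r -= log_r.max()                                    # stabilize before exponentiating
w = np.exp(log_r)
w /= w.sum()                                            # self-normalized weights

# Self-normalized IS estimate of the leave-one-out predictive density p(y_i | y_-i).
p_loo_i = np.sum(w * np.exp(log_lik_i))
print(np.log(p_loo_i))                                  # contribution to the elpd_loo sum
```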
Here is the same with 4,000 draws, and now it is difficult to see, but this is the largest weight, and if I zoom in I can see that there are also some other large weights out there. I mentioned the effective sample size estimate: we had a total sample size of 4,000, and the effective sample size based on the variance approximation is around 400, so roughly ten percent efficiency, which is not that bad. But then the question is: is the variance finite? Can we trust this effective sample size estimate? With any finite sample size the empirical variance estimate is, of course, also finite, so looking directly at the variance estimate does not tell us whether the variance is finite.

So we use something else to assess whether the variance is finite: we fit a generalised Pareto distribution to the largest ratios. We choose some cut point, and from that cut point onwards the generalised Pareto distribution has its location at the cut point, is decreasing, has a scale parameter, and has a shape parameter k that controls how thick the tail is. We fit it to the largest ratios, and there is theory saying that if the cut point is chosen far enough in the tail, then for a very large class of distributions the tail part is well approximated by this generalised Pareto distribution. The nice thing about the generalised Pareto distribution is that the shape parameter tells us the number of finite moments: the floor of 1/k is the number of finite moments when k is positive. So if k is less than one half we have finite variance, and the central limit theorem holds.

[Question:] Can I ask a question? This sort of extreme value theory usually needs i.i.d. observations, but if you run MCMC you have a Markov chain, you don't have i.i.d. observations. Is that an issue? [Aki:] Usually in extreme value theory for dependent data that would be an issue if we were asking a question like: what is the probability that we would observe a weight larger than some value? But for just estimating the shape parameter it is not an issue. We would need to take the dependency into account only if we wanted to say something about how likely it is that we would see even more extreme weights. [Question:] OK, thank you.
So, to repeat: we are not stating anything about how likely it would be to see even more extreme weights; we just want to know the shape parameter. If this shape parameter is less than one half, the variance is finite and the central limit theorem holds; if it is less than one, the mean is finite and the generalised central limit theorem holds. In this case we estimate k-hat to be 0.52, which would say that, OK, the central limit theorem does not help here. But we will soon see that whether k is a bit below or a bit above one half, there is no sharp threshold in the behaviour.

Before going on with the diagnostic I'll add this, because the later examples show results based specifically on Pareto smoothed importance sampling: in addition to using the generalised Pareto distribution for diagnostics, we can also use it to replace the largest weights with order statistics of the fitted Pareto distribution. This is equivalent to using it to filter noise out of the largest weights. Here is an illustration. The blue line is from one repetition of a simulation; in each simulation we take 10,000 draws from a specific proposal distribution for a specific target distribution. We rank the weights, sorting them in increasing order, so you see the largest, the 10th largest, the 100th largest weight, and you can see that from simulation to simulation there is a lot of variation in the largest weights. We also compare to truncated importance sampling, which truncates the largest weights and of course reduces variability. With Pareto smoothing, instead of these jumpy lines we get smooth lines, because we replaced the largest ratios with ordered statistics from the fitted generalised Pareto distribution. This reduces the variability compared to plain importance sampling and reduces the bias compared to truncated importance sampling.

Here is another example, again 100 simulations but now with increasing S, estimating the normalisation constant in a case where k is close to 0.5. Plain importance sampling has these jumps, so it is unreliable even as we get more draws; truncated importance sampling does not have that issue, but you can see it staying below the truth, which is caused by its bias; and Pareto smoothed importance sampling reduces both the variability and the bias.
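[Sketch: a simplified version of the tail fit and smoothing just described. The paper uses the Zhang and Stephens empirical Bayes estimator for the generalised Pareto shape and a specific rule for the number of tail draws; here SciPy's maximum likelihood fit and a fixed 20 percent tail stand in for both, so treat the numbers as rough. Production implementations are available, for example in the loo R package and in ArviZ.]

```python
import numpy as np
from scipy import stats

def pareto_khat_and_smooth(ratios, tail_fraction=0.2):
    """Fit a generalized Pareto distribution to the largest importance ratios
    (exceedances over a cut point) and return the shape estimate k-hat together
    with Pareto-smoothed ratios, where the tail values are replaced by the
    expected order statistics of the fitted distribution, capped at the raw max."""
    r = np.asarray(ratios, dtype=float)
    order = np.argsort(r)
    n_tail = max(5, int(tail_fraction * len(r)))
    cut = r[order[-n_tail - 1]]                    # largest ratio below the tail
    exceedances = r[order[-n_tail:]] - cut         # tail, shifted to start at zero
    k_hat, _, sigma = stats.genpareto.fit(exceedances, floc=0)
    probs = (np.arange(1, n_tail + 1) - 0.5) / n_tail
    smoothed = r.copy()
    smoothed[order[-n_tail:]] = np.minimum(
        cut + stats.genpareto.ppf(probs, k_hat, scale=sigma), r[order[-1]]
    )
    return k_hat, smoothed

# Example: N(0,1) target with a slightly too narrow N(0, 0.8^2) proposal, so the
# ratios are unbounded and k-hat should come out clearly positive.
rng = np.random.default_rng(3)
theta = rng.normal(0.0, 0.8, size=4000)
ratios = np.exp(stats.norm(0, 1).logpdf(theta) - stats.norm(0, 0.8).logpdf(theta))
k_hat, smoothed = pareto_khat_and_smooth(ratios)
ess = lambda w: 1.0 / np.sum((w / w.sum()) ** 2)
print(f"k_hat = {k_hat:.2f}, ESS raw = {ess(ratios):.0f}, ESS smoothed = {ess(smoothed):.0f}")
```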
We also prove in the paper, thanks to Dan Simpson, that the Pareto smoothed importance sampling estimate is asymptotically consistent and has finite variance, under some mild but complicated-to-state conditions.

Another cool result is that we can connect this k to a minimum sample size that gives some mild guarantees on the error of self-normalised importance sampling, with a simple relation that looks like this. Small k, with k equal to zero, corresponds to the distribution of the weights being exponential, and k less than zero corresponds to bounded weights. In that regime, say we have 100 draws: if we had independent draws directly from the target distribution, 100 draws is often sufficient, for example for estimating an expectation. But if we instead have a proposal distribution such that k is estimated to be 0.7, then for the same accuracy we would need more than 100,000 draws. And you can see that there is no sharp change at k equal to 0.5. Previously there were also proposals to run a hypothesis test of whether the variance is finite or not, and to abandon hope if the test says the variance is infinite. Instead, we use k continuously, because between 0.5 and 0.7 there is still hope, although beyond about 0.7 the required sample size starts to blow up so that plain importance sampling is not sensible any more; in leave-one-out cross-validation, for example, it would then be better to use something else, maybe even just refit with MCMC.

That was the theoretical result; here is an empirical comparison to the theory, showing the root mean square error when we know the truth. Focus first on the black lines, which show the RMSE for different proposal distributions; the k values shown, which are the true values, tell how good these proposal distributions are, ranging from quite small up to around 0.7 and 0.9. The error goes down as we get more draws. And this also reflects the earlier statement: here we have 100 draws with a good proposal distribution, and if we draw a horizontal line at that error level we can see that with k around 0.7 it indeed matches that we would need more than 100,000 draws to get the same accuracy.
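[Sketch: a small helper encoding the rule of thumb described here. The thresholds 0.5 and 0.7 come from the talk and the associated paper; the wording of the messages is mine.]

```python
def interpret_khat(k_hat):
    """Rule-of-thumb reading of the Pareto shape diagnostic, following the
    thresholds discussed in the talk: nothing sharp happens at 0.5, but above
    about 0.7 the required sample size grows impractically fast."""
    if k_hat < 0.5:
        return "central limit theorem regime: fast, reliable convergence"
    if k_hat < 0.7:
        return ("variance may be infinite but the mean is finite: convergence "
                "is slower, yet the estimate is still practically useful")
    if k_hat < 1.0:
        return ("impractical: the sample size needed for acceptable accuracy "
                "is huge; improve the proposal or refit (e.g. with MCMC)")
    return "mean does not exist: the estimate cannot be trusted at all"

print(interpret_khat(0.52))
print(interpret_khat(0.74))
```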
The red dashed lines are Monte Carlo standard error estimates based on the effective sample size estimate I showed you before, and they match quite closely. Even here, where k is already close to 0.7, the red dashed line and the black line for the true error are close to each other. When k gets larger than that, our error estimates start to be too optimistic, but it is not that bad, because based on k-hat we know we cannot trust them any more. It is also interesting that the slope of these lines corresponds to the speed of convergence: in the good case the error goes down as sigma divided by the square root of S, but when k-hat is clearly larger than 0.5 the slope is different. We can summarise this as a convergence rate: usually, in the good case, the error (as mean squared error) decreases as S to the minus one, but we get worse convergence rates when k is larger; for example, around k equal to 0.7 the rate is only about S to the minus 0.6. The curve is smooth, which is partly just due to the limited number and length of the simulations, but it also reflects that there really is no sharp transition; otherwise, starting from k equal to 0.5, the behaviour would follow this dashed line. The blue line is for the second moment of the normalisation term (first moment for the mean, second moment here), so these are function-specific estimates, and they show similar behaviour. So given k we can say what the minimum required sample size would be, and also how much we can expect the error to decrease if we get more draws.

We do the empirical Pareto fit using the largest ratios. We have a rule for selecting how many of the largest ratios to use for estimating k, and the rule is such that it fulfils the asymptotic requirements: we always go further into the tail, but at the same time the number of draws used in the fit increases so that the estimate gets more accurate. The smallest amount in the experiments has been about the 20 largest ratios out of 100 draws, which is useful but of course has a lot of variation. We use the empirical Bayes, profile-likelihood-based estimate by Zhang and Stephens, which has excellent accuracy, so there seems to be no need to look for a more accurate estimator for this.
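[Sketch: one common way to compute such an effective-sample-size based Monte Carlo standard error for self-normalised importance sampling, using a delta-method style weighted variance. This is a generic formula, not necessarily exactly the one used in the paper, and the toy target and proposal are illustrative.]

```python
import numpy as np
from scipy import stats

def snis_estimate(h_draws, ratios):
    """Self-normalized importance sampling estimate of E[h] with a simple
    delta-method Monte Carlo standard error: with weights normalized to sum to
    one, Var-hat = sum_s w_s^2 (h_s - h_hat)^2, roughly sd(h) / sqrt(ESS)."""
    h = np.asarray(h_draws, dtype=float)
    w = np.asarray(ratios, dtype=float)
    w = w / w.sum()
    h_hat = np.sum(w * h)
    mcse = np.sqrt(np.sum(w ** 2 * (h - h_hat) ** 2))
    return h_hat, mcse, 1.0 / np.sum(w ** 2)     # estimate, its MCSE, and the ESS

# Toy check: E[theta^2] = 1 under the N(0,1) target, with a N(0, 0.8^2) proposal.
rng = np.random.default_rng(6)
theta = rng.normal(0.0, 0.8, size=4000)
ratios = np.exp(stats.norm(0, 1).logpdf(theta) - stats.norm(0, 0.8).logpdf(theta))
print(snis_estimate(theta ** 2, ratios))
```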
See the paper for more details. A useful question that has come up is: why bother with Pareto smoothing and with improving the estimate in cases where k is large and the variance may be infinite; why not just select the proposal so that the ratios are bounded? Yes, that would be nice, but it is not always trivial, as in importance sampling leave-one-out. An even bigger problem is that high-dimensional spaces are not intuitive, and they are really scary.

Here is an example where, by construction, we have bounded ratios and finite variance: the target distribution is a normal distribution and the proposal distribution is a Student-t, which has thicker tails than the normal, so the importance ratios are bounded. We use 100,000 draws, which is a lot, and vary the number of dimensions. Here is the effective sample size: in low dimensions we get quite good efficiency, close to 100 percent, but then the effective sample size starts to drop, and eventually it becomes very low. Why is that? If we also look at the convergence rate, that is, how much additional draws reduce the error, the convergence rate drops too. So even though, by construction, we have bounded ratios, finite variance, and asymptotically the central limit theorem holds, pre-asymptotically the practical convergence rate is lower, and eventually close even to zero. The good thing is that k-hat can detect and diagnose this pre-asymptotic behaviour: since the ratios are bounded, the true k is actually less than zero, and in low dimensions k-hat is also less than zero, but then it goes above zero, and then above 0.7 and beyond. The effective sample size drops correspondingly; as we said, at around k equal to 0.7 we would already need more than 100,000 draws, and k keeps growing, and the convergence rate collapses.

So why is this happening? Why don't we see the asymptotic, central limit theorem behaviour? Here I use a blue line for the normal target distribution and a red line for the proposal distribution, and I plot the marginal density of the distance from the mode, for a 500-dimensional distribution, using 100,000 draws. When we compute how far the proposal draws are from the mode, we get this kind of distribution for those 100,000 draws; this one is for the proposal and this one is for the target. Most of the proposal draws are closer to the mode than draws from the true target distribution would be.
Now, where are the large importance ratios? We can say that the importance ratios are bounded, but look at the scale of the y-axis: the largest ratios are around that size. Before we would actually see that these ratios are bounded, we would need to see some draws out here, and looking at this distribution it is quite unlikely that we would ever get draws out there. In this specific case it is possible to compute that the number of draws we would need to get something out here is much, much larger than the number of atoms in the universe. So there is no hope that we would reach the asymptotic regime, and that is exactly why it is so useful that we can diagnose the pre-asymptotic behaviour. If I rescale this so that the y-axis only goes up to about 10 to the 30, the importance ratio curve here behaves almost like a wall. So in the region where we actually get draws with reasonable sample sizes, it looks just like an infinite-variance case, and we can see that with this empirical k-hat approach.

In a less extreme version of this it is also possible that after, say, 10,000 draws it looks like the variance could be infinite, and only after something like a million draws we might finally start to see that the ratios are actually bounded. The diagnostic can see this too: when you get more draws, k-hat eventually also starts to recognise that the ratios could be bounded. So it is not saying what would happen with infinite sampling; it is saying that at this moment the performance looks really bad, and it does not look like you will get much better by getting more draws.
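[Sketch: a simplified version of this bounded-ratio experiment, using independent Student-t(5) components as the proposal for a standard normal target and fewer draws than the 100,000 used in the talk. The tail fit is the same rough SciPy-based stand-in as in the earlier sketch; the point is only the qualitative pattern of k-hat and the effective sample size as the dimension grows.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
S, df = 10_000, 5      # fewer draws than in the talk, to keep the sketch light

for d in (1, 10, 100, 500):
    x = stats.t(df).rvs(size=(S, d), random_state=rng)            # proposal draws
    log_r = stats.norm.logpdf(x).sum(axis=1) - stats.t(df).logpdf(x).sum(axis=1)
    r = np.sort(np.exp(log_r - log_r.max()))     # rescaling only; harmless for k-hat and ESS
    n_tail = int(0.2 * S)
    k_hat, _, _ = stats.genpareto.fit(r[-n_tail:] - r[-n_tail - 1], floc=0)
    ess = 1.0 / np.sum((r / r.sum()) ** 2)
    print(f"d = {d:3d}   k_hat = {k_hat:5.2f}   ESS = {ess:8.1f} of {S}")
```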
OK, so that was the main part, about the diagnostic, and now I will quickly go through a few applications: validating and improving the end result of variational inference, and diagnosing stochastic optimisation.

For variational inference and distributional approximations, often a normal distribution is used to approximate some unknown posterior distribution, and we can then use that approximation as the proposal distribution for importance sampling. That means we can use it both for diagnostics and for improving the result. In this case we used ADVI, automatic differentiation variational inference, which is a kind of black-box variational inference, and compared it to MCMC, dynamic HMC with the no-U-turn sampler. The timing looks like this: we can run ADVI in less time, but if we run it for too short a time, the stochastic optimisation has not converged yet and the normal approximation is bad. We can use k-hat to recognise when the optimisation result actually is good, and we can also see here that, in addition to diagnosing the result, instead of using just the normal approximation we can use it as a proposal distribution and get a lower root mean square error with Pareto smoothed importance sampling.

ADVI and many other variational inference algorithms use stochastic optimisation, and the papers often say that, OK, we reduce the step size, and when it fulfils the Robbins-Monro conditions it will eventually converge, and then they use the last iterate. In practice, though, the optimisation is stopped much earlier, and then the last iterate can be very noisy due to the stochastic optimisation. There is Polyak-Ruppert iterate averaging, which instead averages many of the iterates from the end. Here is an illustration of how we can use k-hat to show that, especially when the number of dimensions increases, the optimisation problem gets harder: using just the last iterate gives a noisier k-hat, meaning the approximation is worse, while with iterate averaging we stay near 0.7, much more stable.

Then, in addition to looking at the stochastic optimisation, there is the question of which divergence to use. The commonly used exclusive KL tends to underestimate the uncertainty, so people have proposed other, more mass-covering divergences, such as inclusive KL and chi-squared. Many of these common divergences can be written as expectations of the density ratio, the same density ratio as before, times some function, estimated with an empirical average, so it is the same thing we have been doing. What about these different ratio-based divergence objectives? There is earlier theory on how many moments of the ratio w we need: for exclusive KL, a delta that just needs to be larger than zero is enough, while inclusive KL requires two plus delta finite moments, with delta larger than zero. So you can already guess that it is more difficult to estimate inclusive KL than exclusive KL with these Monte Carlo estimates, and high-dimensional spaces again make it worse.
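[Sketch: the three divergences written as expectations of the density ratio w = p/q under q, estimated by simple Monte Carlo with draws from q, in a toy example with normalised Gaussian densities and an underdispersed approximation. At higher dimensions the inclusive KL and chi-squared estimates fall far below their true values even though the estimators are consistent, because the rare draws with huge ratios that carry most of those expectations are essentially never seen; that is the pre-asymptotic bias being discussed.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
S, s = 10_000, 0.8     # q = N(0, s^2 I) as an underdispersed approximation of p = N(0, I)

for d in (1, 10, 50):
    x = rng.normal(0.0, s, size=(S, d))                            # draws from q
    log_w = stats.norm(0, 1).logpdf(x).sum(axis=1) - stats.norm(0, s).logpdf(x).sum(axis=1)
    w = np.exp(log_w)
    excl_kl = -np.mean(log_w)          # KL(q||p) = E_q[-log w]  (exclusive KL)
    incl_kl = np.mean(w * log_w)       # KL(p||q) = E_q[w log w]  (inclusive KL)
    chi2 = np.mean(w ** 2) - 1.0       # chi^2(p||q) = E_q[w^2] - 1
    true_excl = d * (-np.log(s) + (s ** 2 - 1) / 2)
    true_incl = d * (np.log(s) + 1 / (2 * s ** 2) - 0.5)
    true_chi2 = (s / np.sqrt(2 - 1 / s ** 2)) ** d - 1
    print(f"d = {d:3d}")
    print(f"  exclusive KL   est {excl_kl:10.3f}   true {true_excl:10.3f}")
    print(f"  inclusive KL   est {incl_kl:10.3f}   true {true_incl:10.3f}")
    print(f"  chi-squared    est {chi2:10.1f}   true {true_chi2:10.1f}")
```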
Exclusive KL is known to underestimate, and here again, when we increase the dimension to 40 or 50, the red one, the approximation, is underestimating. Inclusive KL, here in low dimensions, is overestimating, and it gives bounded ratios, which is nice; but this overestimation is actually bad in high dimensions: the two distributions no longer overlap, and it is unlikely that from this inclusive-KL proposal distribution, which overestimates the scale, we would get draws in the region where the target mass is. And if we do not get draws there, the ratio-based estimates behave badly.

Here are some results. If we optimise exclusive KL, which tends to underestimate, we can look at how well we are able to estimate the relevant terms: this one was needed for exclusive KL, this one for inclusive KL, and this one for the normalisation constant. It is easy to estimate exclusive KL, and also the normalisation constant, although the k-hat goes up as the dimension grows, so it is not that good. If we instead optimise inclusive KL, it is good in low dimensions, but eventually the k-hat values grow much, much higher, so it is much worse both for the normalisation constant and for the divergence measure itself. And remember that from k-hat we can read off how many draws we would need: in this stochastic optimisation we would be using that many draws in each step, which is infeasible.

And what is the effect when these Monte Carlo estimates start to fail? The dashed line here is the true exclusive KL, and the continuous dark green line is the estimate: we are underestimating the divergence, and in this pre-asymptotic regime it looks like bias, simply because it is so unlikely that we would ever see the overestimates. So in practice we see this biased behaviour, which can explain why exclusive KL is still so much more common than the other divergences, and why people have had problems getting the other divergences to work reliably. With a too-small sample size we get this kind of pre-asymptotic bias, which can affect the results, and again we may need more draws than there are atoms in the universe to make these reliable. See the paper for more results on this.

And then the final example.
This is just a reminder that what I have been talking about is not only for importance sampling and variational inference; it can be used for plain Monte Carlo as well. In one paper we studied this leave-one-out cross-validation, and we wanted to compare importance sampling leave-one-out also to the case where we actually use MCMC to sample from each leave-one-out posterior, so that the draws come directly from the target distribution. But we got seemingly biased results. The model here had a normal distribution for y given some predicted mean and sigma, and when we have draws of these parameters and go far enough into the tail, the predictive densities for the left-out point are very close to zero, and then by chance one of them is far enough from zero that the distribution of these density values is very, very skewed. So we see the same behaviour as in the importance sampling case, and we can use the Pareto diagnostic directly on these density draws to see how long-tailed they are. We can see that, actually, MCMC is also failing here to give the exact result with a reasonable number of draws. OK, that's it. Here are the papers I mentioned throughout the talk, and the co-authors.

[Host:] Thanks a lot, that was very interesting. I was not aware of this Pareto smoothing and k-hat diagnostic, so it is good to know all the capabilities it has. You can ask questions if you want, anyone in the audience. Otherwise, I have a question, which is related to the stacking technique that you did not talk about. Your paper on it seems to be addressing the very pressing topic of how we average models, given that all models are wrong, and you also use this technique in that case, right? So I wanted to know: what is the usefulness of Pareto smoothed importance sampling in the context of stacking? Is it just because of the computational cost, or something else? [Aki:] Yes, that is really why we used it there, because of the computational efficiency. When we did the stacking with MCMC, it usually also works so well that we rarely get these high k-hats, or only a small number of them, so the reliability is good.
We also have an example in that paper with black-box variational inference, and we have another paper with more results for variational inference, in which case it is more likely that we get high k-hats. But even then, Pareto smoothed importance sampling can be such that the stacking helps, even if we do not get close-to-true estimates; even when the error starts to increase, it is still in the correct direction. And also in the last paper mentioned on this slide there are results showing that, depending on what we are interested in, for example if we are interested in the posterior mean, we do get estimates that are closer to the posterior mean even if the k-hat is quite large. [Host:] I see. [Aki:] We also have this other paper on implicitly adaptive importance sampling, which gives an algorithm that can improve on this. In this importance sampling leave-one-out cross-validation we do not have a closed-form proposal distribution; we just have draws from the proposal. Implicitly adaptive importance sampling shows how we can adapt that set of draws using affine transformations so that it matches the target distribution better, with the benefit that we do not need to know the exact shape of the proposal. This is also useful in the sense that when the basic Pareto smoothed importance sampling leave-one-out starts to fail, this can give better accuracy without the need to actually run MCMC. [Host:] I see, thank you. Since there are no other questions in the chat, I want to thank you again for coming. There is one hand up, or is that a clapping symbol? I don't know; maybe it is a clapping symbol. Yeah, probably. OK, thanks a lot. Thank you very much. Right.