So welcome everybody to this workshop on Bayesian prediction and poststratification. I'm Roberto Cerina, a sociology Ph.D. student at Nuffield College, in my third year, and today I'm going to be talking to you about this technique, which is going to allow you to do a number of things. So why would you want to use this technique at all? You have two goals. One is to do small area estimation, which means that if you have a sample that is representative at the national level, but you would like to find out something about the state, or the municipality, or some other small-area level, then this technique will allow you to extrapolate information from the national level and make reasonable inferences at the area level. Something else this technique does is allow you to make inference at the national level for non-representative samples, and even at the state and area level, but as we will see, there are a few caveats if you start trying to use it for the latter.

The title of the workshop is Bayesian prediction and poststratification, and the reason for that is that the Bayesian approach allows you to do simulations, posterior simulations, and this has big advantages in terms of computing confidence intervals at the area level, because you can simply simulate. If you're working with elections, for example, you can simulate multiple elections; if you're working with disease, you can simulate multiple rounds of the disease. And this gives you confidence intervals without actually having to go through the trouble of calculating posterior variances and so on. You can just use simulations. That's the advantage of it.

And so there are three parts to the prediction and poststratification framework. One is sampling, which we will actually talk about at the end. Another is the prediction part, and the last is the poststratification part. This workshop focuses mainly on the prediction and poststratification parts, and the prediction in particular we're going to be doing, as I said, in a Bayesian way. So how many of you are familiar with the Bayesian approach? I mean, have you used the Bayesian approach in the past? One, two, three? OK, so not many of you. So I think a good idea is for us to go through a Bayesian primer first. And how many of you are familiar with the concept of a Gibbs sampler? OK, so roughly the same people. This is very good, actually, because I planned for 30 to 40 minutes of a primer on Bayesian modelling, and then we're going to get into the meat of the prediction and poststratification.
And this primer on Bayesian modelling is actually going to be very useful to you for understanding the basics of machine learning, because machine learning, even though we don't have the computational power to do it in a purely Bayesian way at the moment, is ultimately a Bayesian endeavour in its conceptualisation. So we will see how this links to Charles's workshop later today, I think.

And so let's start with the Bayesian modelling primer. First of all, this formula here is Bayes' theorem. Bayes' theorem is a way to update your belief about the occurrence of an event after you observe some data. And there are three parts to Bayes' theorem. Theta is the quantity of interest: it can be an event, it can be a parameter, any parameter. So, for example, the probability of voting for a Republican in the population of the United States. And P of theta is our prior on that parameter. It's a probability, and it's an expression of uncertainty. So we may have a prior on theta. We may say, OK, well, roughly between forty-five and fifty-five percent of people are going to vote for a Republican candidate, so our prior is going to be some sort of uniform distribution between forty-five and fifty-five. That's a simple flat prior. And then we have the likelihood, which is this P of Y given theta. So the likelihood is the probability of occurrence of the data that you observed, Y, given the parameter that we have a prior for, theta. OK. And as we will see in a second, theta generates Y. This is where the Bayesian approach comes in. So theta, this global parameter that exists with uncertainty, broadly generates data, and then as we observe data, we learn things about theta. So say Y is the number of individuals in a survey who say they would vote for Trump, and in a survey of N equals 1000 we observe Y equals nine hundred. This suggests that our prior between forty-five and fifty-five percent is wrong, or at least that there is a conflict between the data and the prior, and we should take that into account.

OK. And so this graph here illustrates the Bayesian paradigm. There are two ways of looking at this graph: first through the solid lines and then through the dotted lines. So first, for the solid lines: the parameter theta generates the distribution, the likelihood, of Y, and out of this distribution of Y you have observed values y. So theta is a random quantity, Y is a random quantity, and y is an observed quantity. OK, so this is realised.
This has happened: the y have happened. A y would be, for example, the specific response in a survey that says "I vote Republican", whereas Y is the distribution of the ys across the population. So if you were to ask the whole population, without sampling, it is what the probability distribution would be for people to express that preference. And theta is the probability distribution of the hyperparameter, the parameter that generates the Ys. Does that kind of make sense to everybody, this kind of structure? So you have a generation of data based on parameters.

Yes? So is theta something like the parameters that you've defined to generate a distribution, which has uncertainty about what y is, and Y is the actual distribution that results? That's exactly correct. So theta is what we call a hyperparameter. If this were a normal distribution, it would be the mean, for example, and in the Bayesian paradigm the mean has inherent uncertainty. This is where the key difference between the Bayesian and the frequentist approach, which is probably what you've been taught in your statistics courses, comes in. The frequentist approach says that theta doesn't have any uncertainty: it's a parameter that exists and you have to measure it, period. Whereas in the Bayesian paradigm, we think that everything has inherent uncertainty, and therefore, as you observe data, you just change your distribution around theta; you never quite get to a precise theta. There's a paradox, which is that if you could observe all the possible observations to inform your prior on theta, then the frequentist and the Bayesian approaches would converge. So if you observed every single value out there, you would be able to measure theta with zero uncertainty, and you can imagine, if this were a normal distribution, it would collapse onto the mean. That's the idea. OK, any more questions on this?

And so this y-star is the predictive distribution. Before we observe any data, Y and y-star are going to be quite similar. Now here's the trick: suppose that you haven't had access to the first part of this data generating process. This conceptualisation, by the way, is called a DAG, a directed acyclic graph, and it just serves to show the information flow within a data generating process.
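In symbols, the pieces described so far, the prior, the likelihood, the posterior and the predictive distribution for a new observation y-star, take the standard form below; the notation is assumed to match the slide rather than copied from it.

```latex
% Bayes' theorem: the posterior is proportional to the likelihood times the prior
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)

% Posterior predictive distribution for a new, unobserved y^{*}
p(y^{*} \mid y) = \int p(y^{*} \mid \theta)\, p(\theta \mid y)\, d\theta
```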
And so, if you didn't observe this first part, but you only observed y, and you just hypothesise that a data generating process of this kind would have had to exist in order to create y, then you can use the information you gained from y to update your priors on theta and then generate new predictive distributions. That means that if you were to try to predict the vote of a new person, an n-plus-first person, that prediction would be informed by the new data that you've observed. OK, everybody clear on this? Perfect.

OK, so we've spoken about these priors and posteriors. Let me go back to Bayes' theorem. The way Bayes' theorem is set up is that you have this prior, you have this likelihood, and then this updated probability distribution on the hyperparameter is called the posterior. And so the relationship between the prior and the likelihood determines the posterior, and there are some regularities in that relationship. For example, there are sets of priors and likelihoods that are married, let's say, or, as we say, that are conjugate, which means that they have an analytical solution. I have an example here: a famous one is the beta-binomial. Is anybody familiar with the beta distribution? Roughly. Who is not familiar with the beta distribution? Very good, and please let me know when I'm not being clear. A beta distribution is simply a distribution defined between zero and one, based on two parameters, alpha and beta. It's very flexible; it can take almost any shape between zero and one. OK. So it's a perfect distribution to use as a prior on probabilities, for instance the probability of voting for Donald Trump in a given election. And so theta, our prior on it, can be beta-distributed with two parameters, alpha and beta. And then we observe a poll, and this would be the likelihood. So the response of respondent i, y_i, conditional on the hyperparameter theta on which we have a prior, is distributed as a Bernoulli. How many of you are familiar with the Bernoulli distribution? OK, good. And if that is the case, then you don't have to do any maths. You don't have to do any integration, you don't have to do anything. All you have to do is update your parameters, as shown here. I don't have a highlighter, but that's OK. So you just update alpha and beta by adding the number of Trump respondents to the alpha and the number of non-Trump respondents to the beta.
And that will give you your new posterior, your new belief about theta. Does that make sense? Very good. OK, I'll talk slower, sorry. Very good. OK. Yes? That part is the equivalent of taking the actual observation, which is y, and integrating it with your prior? That's correct, yes. Exactly, exactly. So in reality, in order to compute this, you would need to multiply the probability distributions of your prior and your likelihood, and that can be quite hefty, and you'd have to divide by the marginal probability of Y. That usually requires a lot of integration and a lot of multiplication, and that's quite a problem. And the cool thing about conjugate priors is that somebody has already done that work for you, and they know that all you need to do is add the number of successes to the alpha and the number of failures to the beta.

Yes, Chris? [Question, partly INAUDIBLE, about how the updating works in practice.] So, for instance, in a poll. If my hyperparameter is the general probability in the population of voting for Donald Trump, then I do a representative poll of the US population, and I observe one hundred people, that is my n, of whom 30 say they vote for Donald Trump and 70 do not. Then I would just go ahead and change the values of my distribution for theta to account for this. So in the beta-binomial, or beta-Bernoulli, example, all I would do is add the number of Trump respondents to the original value of alpha that I had in mind, and add the number of respondents who did not choose Trump to the original value of beta. Yes.

Sorry, say that again? Of course. So, as you can see, probability distributions have inherent uncertainty, and the uncertainty around your observed value is going to depend on the sample size of your poll. If you have a poll with a very, very high sample size, and it estimates the true theta quite precisely, then your posterior is going to be quite precise. Although remember that Bayes' theorem is actually an averaging via variances. You have this likelihood and this prior, and what happens is that if your prior is really, really precise, it's not going to move much when you observe new data. But if your prior is quite loose and your data is very precise, your prior will quickly shift towards your data.
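Here is a minimal sketch of the conjugate beta-binomial update just described. The prior parameters and the poll numbers are made up for illustration; only the update rule itself is the point.

```python
import numpy as np
from scipy import stats

# Illustrative prior on theta, the population share voting Republican.
# Beta(50, 50) is fairly tight around 0.5; Beta(1, 1) would be flat (non-informative).
alpha_prior, beta_prior = 50.0, 50.0

# Made-up poll: n respondents, y of whom say they vote Republican.
n, y = 100, 30

# Conjugate update: add successes to alpha, failures (n - y) to beta.
alpha_post = alpha_prior + y
beta_post = beta_prior + (n - y)

posterior = stats.beta(alpha_post, beta_post)
print("posterior mean:", posterior.mean())            # pulled between prior mean 0.5 and data mean 0.3
print("95% credible interval:", posterior.interval(0.95))

# With a loose (flat) prior the posterior sits essentially on the data.
flat_post = stats.beta(1 + y, 1 + (n - y))
print("flat-prior posterior mean:", flat_post.mean())  # close to 0.3
```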
In fact, in situations where the prior is quite loose, we call it a non-informative prior, because it's not going to affect the estimation of theta much; the result is almost exactly the same as a maximum likelihood estimate. Does that make sense? OK, perfect. Yes? [Question, partly INAUDIBLE, about what the beta is a prior for and whether this works for other kinds of outcome.] So, the beta is a prior on theta. The distribution of y_i, which is zero-one, is going to be a Bernoulli. And if you wanted to observe, say, a continuous value for y, that would be, say, a normal distribution. There isn't a conjugate pairing between a beta and a normal, but there is one between a normal and a normal. So if you had, for example, height or weight, those are normally distributed, and your prior would be formulated not in terms of alpha and beta of a beta distribution, but in terms of mu and sigma of a normal distribution, and then you would update that way. Does that make sense? [INAUDIBLE.] Yes, that's right. Any more questions? OK, great.

And so, having familiarised yourselves with this, you might say: OK, but I heard that these Bayesian people are always doing stuff with computation. If there are always analytical solutions, which we can sum by hand, why do people bother with computation? And the answer is: well, in situations where your prior and your likelihood do not have conjugate form, and famous examples of this are logistic regression models or mixture models, which have no conjugate forms, you can't find a nice combination of prior and posterior that gives you an analytical solution. Then we rely on what are called Markov chain Monte Carlo methods, and the basic principle is that of a Markov chain. In fact, I'm not even sure you need to know what a Markov chain is, because I'm going to give you an example of one. So we're going to look at the Gibbs sampler. The Gibbs sampler is an algorithm that allows you to find your posterior in instances where there is no analytical solution linking your prior and your posterior. And the way it does this is as follows. Imagine you have a normal distribution, so you have two hyperparameters, mu and tau: the mean and the variance. The variance in Bayesian statistics is usually coded as the precision, which is one over the variance. And the way the sampler works is as follows.
First, you set some arbitrary initial values for mu and tau. You might say: OK, I'm going to start with tau equal to one and mu equal to zero. And then you say, OK, let's sample. These are now distributions, right? Remember, you have this normal distribution, and mu has a distribution, tau has a distribution. So you say: let's sample a new number from the distribution of mu, based on the initial value that I put on tau. And then let's do the same for tau, with the new value of mu that I just sampled. And then I iterate: I continue doing this process until I reach some sort of convergence.

A typical example of this: I have this sort of whiteboard here, maybe I can show you. This is very experimental, so bear with me; it should be called experimental social science. OK, can you see? Yeah. OK. So say I have tau, which is just the precision, or one over the variance, on the x-axis, and mu, which is the mean parameter of a normal distribution, on the other axis. I would initialise mu and tau to be, say, here: some low value for mu and some high value for tau. OK. Then I would sample a value of mu given this initial value of tau, and then I would do the same for a value of tau with respect to that value of mu. This moves me somewhere here, and then I would continue doing this. And if the properties of a Markov chain are respected, then eventually I'm going to get here, where this is the mean of the joint distribution and this is the variance around it, the uncertainty around it. So the idea is that, if the properties of a Markov chain are respected, it doesn't matter that you don't have an analytical solution between your prior and your posterior. What you can do is simply iterate by sampling. It's called iterative conditional sampling: you sample from the conditional distribution of the mean given the current value of the precision, and then you re-sample the precision from its conditional distribution given the new mean. Eventually this converges, and we have what I showed you here: the joint uncertainty around the two parameters, and the centre would be the point estimates for the two parameters. Does that kind of make sense? Yeah. OK. Yes? So this is like exploring the distribution of these two values? Yes, you're aiming to recover the distribution. And this converges when you've visited, sort of, the bulk of it? Absolutely, absolutely.
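A minimal sketch of that alternating scheme, for normally distributed data with unknown mean and precision. The conditional distributions used here come from the standard semi-conjugate normal-gamma setup, which is an assumption of this sketch rather than the exact model on the slide; the data and priors are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake data: pretend we don't know the true mean and precision.
y = rng.normal(loc=2.0, scale=1.5, size=50)
n, ybar = len(y), y.mean()

# Weak priors (assumed): mu ~ Normal(mu0, 1/tau0), tau ~ Gamma(a0, rate=b0).
mu0, tau0 = 0.0, 1e-4
a0, b0 = 0.01, 0.01

# Arbitrary starting values, as in the whiteboard example.
mu, tau = 0.0, 1.0
draws = []

for _ in range(5000):
    # 1) Sample mu from its conditional given the current tau and the data.
    prec = tau0 + n * tau
    mean = (tau0 * mu0 + tau * n * ybar) / prec
    mu = rng.normal(mean, 1.0 / np.sqrt(prec))

    # 2) Sample tau from its conditional given the new mu and the data.
    a = a0 + n / 2.0
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(a, 1.0 / b)   # numpy's gamma takes a scale parameter, so pass 1/rate

    draws.append((mu, tau))

draws = np.array(draws)[1000:]    # discard burn-in, before the chain has forgotten its start
print("posterior mean of mu:", draws[:, 0].mean())
print("posterior mean of sd:", (1.0 / np.sqrt(draws[:, 1])).mean())
```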
Yes. And that's why it takes so long when you have very few data, or when you're trying to explore a distribution that is maybe too wide? That's exactly right, yes. And if you've done machine learning before doing the Bayesian stuff, you might be familiar with stochastic search, or stochastic gradient descent; it's a very similar kind of algorithm in that sense.

OK, so having learnt what the Gibbs sampler does, you might be asking: well, when do we stop sampling? Because you could just keep sampling forever, but you don't know whether you've reached convergence. Convergence is the key here. If you don't reach convergence, then you may have problems with stability: it might mean that the next time you run the same model you'll get different estimates for your posteriors, and that's a problem. Yes? [Question, partly INAUDIBLE, about where the data comes into this.] So your data comes in through the conditional distributions. In principle, the mean has this prior, and then, given the observed data, when you sample from the conditional distribution of mu, you sample a new value of mu that has to be coherent with your data. That's the constraint. So there are two constraints here. Let me go back up, actually. When you sample the new value of mu, it has to be coherent with the current value of the precision and with your data. It's a coherence problem: every time you sample, it has to get more and more coherent with your data and your precision. The reason why it sometimes doesn't converge is that your data is too sparse. And then there are also, as we were saying before, all sorts of places where this algorithm could be: it could be over here, over there, over there. So there is an infinite, or a non-tractable, number of combinations of mu and sigma that are coherent with your data, because your data is so sparse. Does that make sense? OK. Yes? [INAUDIBLE.] That's correct. So the properties of the Markov chain that matter are that it is only dependent on its previous value, so it quickly forgets the initial values. That's a key property.
And then, eventually, if you sample enough, it converges to a coherent joint distribution. Those are the two properties. So it's not that you're not making any assumptions; you're still making your prior assumption. That comes in because these conditionals will get you to a posterior, which is the variance-weighted average of the prior and the likelihood. But what you're not doing is letting your initial value for the search algorithm affect the result. Does that make sense? OK. We're almost through the Bayesian primer.

Another way of looking at this blob that I drew on the board is like this. Say this was just the mean of a normal distribution and you were looking at its posterior. Here we ran two chains. So, the Markov chains we spoke about, right: we run this algorithm twice, starting from two different points, and then we wait some number of iterations until it has forgotten the initial values, which in this case takes around two hundred iterations for the two chains. Then you can see the two chains mix nicely, and this is called mixing. It suggests that at this point we are at convergence. Because what does this mean? It means that every time I'm sampling from the previous value, I'm getting stable coefficients. If we imagine this is mu, the variance here in each chain represents the actual variance of mu, the mean of the normal distribution. And if both chains have agreed on the variance of mu, then you have reached convergence. Does that make sense?

Yes? And when the chains look completely random and don't agree? Yeah, then it's a problem; it means that your algorithm hasn't converged. And, as we will see, there are all sorts of tricks you can use to try to get your algorithm to converge. But in principle, you cannot use your posterior values to make inference if the chains have not converged. Is that potentially a sort of disagreement between your data and your priors? No. If there is a disagreement between your data and your prior, your algorithm will still converge; it will just land on some sort of average in the middle, and it suggests that you should think about reformulating your prior, especially if you don't have much data, because your prior is then affecting your estimates a lot, and maybe you don't want that to happen. It's not a non-informative prior.
It's a very informative prior, and we try to stay away from those if you're trying to do predictive things, for example. Any more questions? Yeah? [Question, partly INAUDIBLE, about when you would want a non-informative prior.] As a prior, you have two choices. If you know your subject very, very well and your data is very small, you might want to pick something informative, because that's going to augment your data. But the downside is that it might be biased, because if it's your belief about a given parameter, you might have all sorts of biases yourself. For example, if I were to ask, I don't know, a non-minority person in America what proportion of police arrests involve minorities in the United States, my guess is they would probably underestimate that proportion, whereas if I asked a minority person, they might overestimate it, and somewhere in between would probably be more sensible. I guess what I'm trying to say is: if you know that your prior is going to add value to your prediction, put it in; if you don't, choose a non-informative prior and allow your data to speak for itself. Does that make sense? Fantastic. OK.

And so the question comes: how do we choose when to stop? Well, we can look at these plots and look at mixing, as we just saw. But there's also a formula we can use, and it's called the Gelman-Rubin statistic, this R-hat here. All it is is a measure of the within-chain and between-chain variance. As I was explaining before, you need to have reached a sort of agreement between the two chains, so the within-chain variance has to be stable, and the between-chain variance shouldn't be very large, because you want them to be converging on the same point. Does that make sense? Usually, values of the Gelman-Rubin statistic, this R-hat, that tell you your algorithm has converged are anywhere below 1.1. This is a heuristic, by the way; it was developed by Andrew Gelman and Donald Rubin. You should pick values that are below 1.1. There's also some discussion in the literature about 1.1 being too large, and some conflicts with frequentist statistics once you get there, but 1.1 is a good value for you guys to know.
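A rough sketch of the Gelman-Rubin idea for two chains of a single parameter. This is the simple between-versus-within-chain comparison; software such as Stan reports a more refined split-R-hat, so treat this as illustrative only.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic R-hat for a list of equal-length 1-D chains of one parameter."""
    chains = np.asarray(chains)            # shape: (m chains, n draws)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)

    W = chain_vars.mean()                  # within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

# Two toy "chains": if they explore the same distribution, R-hat is close to 1.
rng = np.random.default_rng(0)
good = [rng.normal(0, 1, 2000) for _ in range(2)]
bad = [rng.normal(0, 1, 2000), rng.normal(3, 1, 2000)]   # chains stuck in different places

print(gelman_rubin(good))   # roughly 1.00: looks converged
print(gelman_rubin(bad))    # well above 1.1: not converged
```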
And then the last problem you have is that, as you sample from each of these conditionals, you don't just have the problem of convergence, you also have the problem of autocorrelation, because clearly the value where you are today will affect the value where you are tomorrow, every time you sample. The way you can estimate how much that autocorrelation is affecting your Gibbs sampler is through this thing called the effective sample size. All it does is ask: if you filter out the autocorrelation, how much truly new information are you getting every time you sample from your conditional distribution? So if you do 100 runs of your Gibbs sampler and you really have reached convergence, your effective sample size should be about 100. But often what you find is that you do a thousand rounds of your Gibbs sampler and you think you've reached convergence, but then you look at the effective sample size and it's something like twenty-eight, which means that your draws are massively autocorrelated. The technique used to get rid of this autocorrelation is called thinning, which means that instead of keeping all 1000 draws, you keep every fifth, or every sixth, or every tenth, and so on. That should remove the autocorrelation, because if it's a Markov chain, today's value depends only on yesterday's value, not on the days before, and so if you pick every tenth, they're going to be close to independent. Does that make sense? Brilliant. OK.

So with this small Bayesian primer, I've tried to give you four years of study in about 20 minutes. The learning objectives: you should be familiar with what Bayes' theorem is; you should know what a likelihood and a posterior are; you should understand the Bayesian inferential procedure, that kind of data generation and then inference and updating that I showed you before; you should be able to use conjugate priors and know their limitations; you should understand what a Gibbs sampler is and why you use it when you don't have an analytical solution from conjugate priors; and you should be able to check whether your sampler is reliable, that is, that it has converged and is not autocorrelated. OK. Take a breather. That was Bayesian statistics. Very good. What's coming next is a lot more reasonable, in my mind. So now we're going to look at the prediction and poststratification part.
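Before the applied part, one last sketch tying together the diagnostics just listed: an effective sample size computed from the chain's autocorrelation, and the effect of thinning. The lag-sum estimator below is a deliberately simplified version of what packages such as coda or Stan actually report.

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """Crude ESS: n / (1 + 2 * sum of early positive autocorrelations)."""
    x = np.asarray(chain) - np.mean(chain)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.var(chain) * n)
    rho = acf[1:max_lag]
    if np.any(rho < 0):
        rho = rho[:np.argmax(rho < 0)]     # keep lags until the autocorrelation dies out
    return n / (1.0 + 2.0 * rho.sum())

rng = np.random.default_rng(0)

# An autocorrelated chain (AR(1)): each draw depends strongly on the previous one.
chain = np.zeros(5000)
for t in range(1, 5000):
    chain[t] = 0.95 * chain[t - 1] + rng.normal()

print("draws:", len(chain), "ESS:", round(effective_sample_size(chain)))

# Thinning: keep every 10th draw; the retained draws are far less autocorrelated,
# although thinning cannot create information that was never in the chain.
thinned = chain[::10]
print("thinned draws:", len(thinned), "ESS:", round(effective_sample_size(thinned)))
```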
The way I'm going to teach you this is through the example I'm most familiar with, because it's what I use in my research, and that is voting: in particular, trying to predict the percentage of people who vote for a given candidate from a non-representative sample. So, for instance, we may be interested in knowing, in 2020, if Joe Biden were the candidate for the Democratic Party, what proportion of people in the United States would vote for Donald Trump, either in the United States as a whole or at the state level. The state level matters because then you can calculate the Electoral College votes and find out who actually wins the election. OK.

The classic decomposition of the problem I just described is as follows. What you want is the probability distribution of people who vote for choice j and turn out, conditional on a set of characteristics. These characteristics could be the gender of the person, the race of the person, their income, their education, and so on. And what about turnout? People either turn out on election day or they stay home. So we are going to decompose this joint distribution of vote and turnout into two parts, just by the definition of a joint distribution: the distribution of voting conditional on turning out, and the distribution of turning out, and we multiply these together. OK, that's fairly straightforward.

So one way to estimate the probability distribution of individuals' turnout is this: we take a non-representative sample, say through digital trace data; we obtain characteristics on individuals through that digital trace; and we have some training data, so we know from a survey, let's say, whether these people turned out in 2016, or whether they say they will turn out in 2020. And what we do is fit a model to predict the turnout propensity of each individual within our non-representative sample. This model could be anything. It could be a linear regression, it could be a multilevel regression, as we're going to see; it could be a machine learning method like random forests or a convolutional neural network, and so on. What matters is that you pick the model that, given your limited data, because your data is non-representative, will allow you to make the best out-of-sample predictions.
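In notation (the symbols here are assumed, since the slide is not reproduced): writing V_i for the vote choice of person i, T_i for their turnout and X_i for their characteristics, the decomposition just described is simply

```latex
P(V_i = j,\; T_i = 1 \mid X_i) \;=\; P(V_i = j \mid T_i = 1,\, X_i)\; P(T_i = 1 \mid X_i)
```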
So this is a fairly common theme in machine learning, essentially. And what do we mean by out-of-sample predictions? Well, if we have a small sample, say eight thousand five hundred individuals, but we are interested in hundreds of thousands of categories of voters, then we can only learn about those categories of voters from the small sample that we have. So, for example, suppose the only race we have in our sample is whites, but we have whites of lower education, higher education and middle education, and then we are asked to make a guess as to how minorities in the United States are going to vote, based on the whites that we observed. Then we're going to have to find an algorithm that allows us to extrapolate from our limited, non-representative sample to that unobserved category. Does that make sense?

One way we have to do this is multilevel regression. Why do we choose multilevel regression? We could, as I said, have chosen any other algorithm. The answer is that it allows us to do this thing called shrinkage. How many of you are familiar with the term shrinkage? Yes, very good. So it does this thing called shrinkage, which is a form of regularisation. What shrinkage does is that, when you estimate a coefficient, a random-effect coefficient for instance, the coefficient is not exactly what you observed in your data for that particular category. For example, say one of our variables is education and we have four categories: low education, middle-lower education, middle-upper education and upper education. Then, instead of your estimate for the low-education effect being exactly the mean of the low-educated in your sample, it is going to be an average between the mean of the low-educated in your sample and the global mean across the four categories. And that average is going to be variance-weighted: it's a compromise between the between-category variance of the four categories and the within-category variance of the single category you have. So, for instance, if you observe only one individual with low education, then your estimate for the low-education effect is going to be very close to the average of the four effects. Why? Because that category's own estimate is very imprecise in your sample. Whereas if you have a lot of observations for low education but very few for the other three, the effects of the other three are going to be shrunk towards the overall mean, which in that case is driven mostly by the low-education group. That kind of makes sense.
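The usual partial-pooling formula makes the shrinkage idea concrete. For a category j with n_j observations, sample mean ybar_j, within-category variance sigma-squared_y and between-category variance sigma-squared_alpha (notation assumed here), the multilevel estimate is approximately the precision-weighted average

```latex
\hat{\alpha}_j \;\approx\;
\frac{\dfrac{n_j}{\sigma^2_y}\,\bar{y}_j \;+\; \dfrac{1}{\sigma^2_\alpha}\,\bar{y}_{\text{all}}}
     {\dfrac{n_j}{\sigma^2_y} \;+\; \dfrac{1}{\sigma^2_\alpha}}
```

so a cell with very small n_j is pulled almost entirely towards the grand mean, while a well-populated cell keeps something close to its own sample mean.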
So the effect that this has in practice is that it stops you from overfitting your data. What this means is that when you look at data you haven't seen before, your model has not been as affected by noise, and therefore you'll be able to make better out-of-sample predictions. Very good. OK.

What you see on the board is a hierarchical specification for a simple multilevel turnout model. Say this is a survey or a digital trace, and we've asked people: are you going to turn out in 2020? The answer, yes or no, turn out or not, one or zero, is our likelihood, and its distribution can be assumed to be a Bernoulli; the Bernoulli allows for a probability of switching on or off, and it has hyperparameter theta. And then we can just use a logistic regression to estimate the value of the parameter theta for each individual in our sample. Now, you see the subscript g[i]: that's because each individual in our sample is, in our mind, part of a group. This is going to be very important when you go to stratify, because you're going to find out how many people of this particular group there are in a particular state, or in your country, and we will see how to do that calculation later. The group is going to be defined by the specific combination of sex, age, race, education, household income and state in this particular model. The etas are all random effects, so they are estimated in the way I described previously. The betas are state-level predictors. So why would we put state-level predictors in an individual-level model? Well, because ultimately you're interested in a state-level effect, and a state-level predictor can actually be extremely predictive of individual-level choices. For instance, one of the common variables we put in as a state-level predictor, if we're trying to predict turnout, is turnout at the last election, and this ends up being quite helpful for the out-of-sample predictions for categories of voters that we have not observed in our sample: categories of voters that we're interested in but didn't manage to collect, because our sample was imperfect, or too small, and so on and so forth. What you see here are priors. Yes, Chris? [Question, partly INAUDIBLE, about whether the state-level predictor also helps for states with few respondents.] Yeah, you're right.
So the state-level predictor helps you curb some of the inaccuracy, or inconsistency, that your raw random-effect estimate would bring in. Yes, I think that's correct. Yeah. [Question, partly INAUDIBLE: in this standard framework, would you also want state-level estimates from another, similarly non-representative sample?] Well, you could put them in; if you have them, you would definitely put them in. So in the case of voting, you're saying, for example, if we have a representative sample of voters in the Midwest, would we use the point estimate from that representative sample as part of our state-level predictors? Is that right? That's a good question, I guess. There is a trade-off. I think you would want a compromise between the latest available data and the precision of the data. The nice thing about having the last election's results is that they are precise, they are the true values, whereas a sample generated via telephone calling, for example, has, as we saw yesterday with the total survey error framework, many steps at which things could go wrong, and so you might end up introducing noise. But in principle, I would say, for countries that haven't had an election in, say, eight years, and I was thinking of working on a project in Afghanistan, in that case you would put in survey data, because that would be a lot more accurate.

[Question, partly INAUDIBLE, about whether this model already gives results at the state level.] This is a model to get the best possible prediction we can have for each category, and that's it. It is not doing any poststratification work. The result of this model is not going to give you the result at the state level; we need to do extra work for that. [Partly INAUDIBLE follow-up.] Yes, that's correct, yes. Absolutely, yes. [Question about whether it works for representative samples.] It would work on representative samples as well. In fact, what people sometimes do is put representative samples in here, representative samples at the national level, in order to find out how categories vote, and then stratify those categories at the state level in order to obtain area estimates. So the fact that the sample is representative shouldn't stop you from doing multilevel regression. Yeah.
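For reference, the kind of hierarchical turnout specification being discussed looks roughly as follows. This is a reconstruction from the verbal description, with notation assumed rather than copied from the slide.

```latex
y_i \sim \mathrm{Bernoulli}(\theta_i)

\mathrm{logit}(\theta_i) = \alpha_0
  + \eta^{\text{sex}}_{s[i]} + \eta^{\text{age}}_{a[i]} + \eta^{\text{race}}_{r[i]}
  + \eta^{\text{edu}}_{e[i]} + \eta^{\text{inc}}_{h[i]} + \eta^{\text{state}}_{g[i]}
  + x_{g[i]}^{\top} \beta

\eta^{k}_{j} \sim \mathrm{N}\!\left(0, \sigma_k^2\right), \qquad
\sigma_k \sim \mathrm{Uniform}(0, 5) \quad \text{(non-informative on the logit scale)}
```

Here x_{g[i]} collects the state-level predictors, such as turnout at the last election.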
[Exchange with the audience, partly INAUDIBLE, about applying this to a sample recruited from a gaming platform, like the Xbox study we saw yesterday.] You couldn't use exactly the same model; you'd want to adjust for that selection effect somehow. Yes, of course, yes. So there were a couple of things there. The person who uses the PlayStation, you're saying we would want to upweight them, right? Because that would make the sample more representative overall. Well, then we should have PlayStation and Xbox use as part of our poststratification frame. Yes, that's what you were saying, yeah, for sure. And we should also have some kind of estimate of the population that contains these variables. That's exactly right, and for those we often don't have one. That's exactly what Scott was saying yesterday: one of the limitations of his work is that he has to rely on a census that was done in 2011. And as we're going to see in the next section, which is the poststratification part, you do need accurate cell sizes, as we call them, in the poststratification frame. You can't do this work if you don't have accurate cell sizes. Yes, that's correct. Yeah, go ahead.

Take the microphone, sorry, for the live stream; I doubt that anybody in America is awake. So, in the line of coefficients here, we have female, age, race, education, income, and this logit is the distribution you're assigning to each of these coefficients. Is the prior distribution the same for all of them, or where are the priors? Yeah, these are the priors, right. I've used cheap notation to save space, but the idea is that each eta has its own variance parameter. Does that make sense? Yeah, I wasn't looking at the definition of eta, but the definition of eta is down there. Yep, yeah, that's right.
So these are the priors on the etas, and they are shrinkage priors, because each eta, even though it has, say, five or ten categories within it (education, for example, has four categories, if I remember correctly), has a single variance parameter, which means the categories will be shrunk towards the mean of those four categories. Yes. And this way of parameterising it: I told you that in Bayesian statistics we use the precision, one over the variance, in the specification of the normal distribution, and parameterising it with a positive uniform distribution that is large enough on the logit scale is a way of including a non-informative prior on the variance. Does that make sense? Why Uniform(0, 5)? Because five on the logit scale is very, very large, so it allows essentially any plausible variance value. It's not informative, I would say. I'm sure there exist better suggestions. OK.

Yes, sorry? So you also have some information at the state level in this case. Does this estimate at the higher level have to be at the level you are interested in? That is, do you use state-level estimates because you're interested in calculating something at the state level, or could it also have been gender or something else? So there have been studies that have put in gender-level variables as well, for the reason Chris was mentioning, which is that introducing a group-level predictor helps curb some of the bias introduced by the shrinkage. So you could do that: you could have a state-level predictor, a gender-level predictor, an age-level predictor and so on. But studies have shown that it doesn't increase the predictive accuracy of these models, so it would just increase running time, and we don't want that, essentially. So basically, you should make sure that the estimates at the higher level are at the level you're interested in? That's correct. Yeah. Any more questions? OK. Good.

And so, how can we fit this model? Well, it's clear that this doesn't have a conjugate prior, right? It's a very complex model, with a lot of non-linearity, et cetera. So the way we do it is through a Markov chain, and in particular we use two programmes that are very famous for doing this.
457 00:45:44,450 --> 00:45:50,000 JAGS is the one I've been trained to use, because I come from an epidemiology, or medical statistics, background, 458 00:45:50,000 --> 00:45:52,520 and 459 00:45:52,520 --> 00:45:59,540 it uses two algorithms: the Gibbs sampler, and another one called Metropolis-Hastings, which you don't need to know for today. 460 00:45:59,540 --> 00:46:07,310 The other one, Stan, is a more recent development which comes out of Andrew Gelman's sort of political science background. 461 00:46:07,310 --> 00:46:10,790 But also, it's better. Stan is better. 462 00:46:10,790 --> 00:46:18,080 If you can learn Stan, learn Stan. And the thing about Stan is that it is better because it uses this thing called Hamiltonian Monte Carlo, 463 00:46:18,080 --> 00:46:26,670 which is somewhat similar to stochastic gradient descent, in that it's a way to create shortcuts between the iterations. 464 00:46:26,670 --> 00:46:33,920 So it's a way to send you closer to the convergence point faster, essentially. 465 00:46:33,920 --> 00:46:37,220 But you don't need to worry too much about that. Just learn one of these two languages. 466 00:46:37,220 --> 00:46:41,780 If you're starting from scratch, I recommend you learn Stan, but JAGS is a very, very good program. 467 00:46:41,780 --> 00:46:49,220 That's what we're going to use today. And actually, having spoken to some statisticians, they have a sense that if you have many, 468 00:46:49,220 --> 00:46:57,250 many, many sources of information, JAGS is a better way to aggregate them than Stan. 469 00:46:57,250 --> 00:47:02,200 The other part of our decomposition that we saw before, remember, 470 00:47:02,200 --> 00:47:09,910 where we're trying to find the joint distribution of voting and turning out, was the voting distribution. And in particular, 471 00:47:09,910 --> 00:47:16,840 what I'm showing you here is a very simple example of voting for a Republican. 472 00:47:16,840 --> 00:47:20,350 So we just have a dichotomous outcome in a poll: do you vote for the Democrat, 473 00:47:20,350 --> 00:47:28,440 do you vote for the Republican, yes or no? And so this looks almost identical to what you saw before, but there is 474 00:47:28,440 --> 00:47:36,600 one main difference, and that is that this distribution is conditional on turnout. 475 00:47:36,600 --> 00:47:44,250 There are many ways to make this distribution conditional on turnout. One way is to just add turnout weights to each observation 476 00:47:44,250 --> 00:47:50,490 and then obtain a posterior conditional on the turnout weights. Another way, 477 00:47:50,490 --> 00:47:53,640 which is what I prefer and what we did in the example I'm going to show you next, 478 00:47:53,640 --> 00:48:01,110 is to actually estimate a different alpha for people who turn out and people who do not turn out. 479 00:48:01,110 --> 00:48:07,590 And the way you do that is by fitting a joint turnout and vote choice model: 480 00:48:07,590 --> 00:48:13,760 you simulate who turns out, and then in every simulation you fit 481 00:48:13,760 --> 00:48:18,630 the vote choice model to the people who turn out in that particular simulation and to the people who don't. 482 00:48:18,630 --> 00:48:24,540 And then you repeat this every time. And the cool thing about this is that it has a quite nice, intuitive interpretation.
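(For concreteness, a rough sketch of how that conditioning can be written in JAGS, with illustrative variable names and most priors omitted; this is not the exact workshop model. Because turnout is itself a stochastic node, each iteration effectively fits the vote choice intercepts to that iteration's simulated voters and non-voters.)

  turnout_conditional_sketch <- "
    for (i in 1:N) {
      turnout[i] ~ dbern(p_turn[i])                  # observed or simulated turnout
      logit(p_turn[i]) <- a0 + a_state[state[i]]     # turnout sub-model (priors omitted)
      vote[i] ~ dbern(p_vote[i])
      logit(p_vote[i]) <- alpha[turnout[i] + 1] +    # separate intercept for non-voters (1) and voters (2)
                          b_state[state[i]]
    }
    for (t in 1:2) { alpha[t] ~ dnorm(0, 0.01) }
  "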
It's like, OK, every time I simulate, I get a set of people who turn out, 483 00:48:24,540 --> 00:48:30,960 and then I'm going to use them to make my estimation. 484 00:48:30,960 --> 00:48:36,990 And it also incorporates a lot of uncertainty into our model, which is one of the advantages of the Bayesian approach, 485 00:48:36,990 --> 00:48:39,540 because you incorporate uncertainty. 486 00:48:39,540 --> 00:48:44,610 If you just put the weights in, you're not going to be able to incorporate the uncertainty about those weights. 487 00:48:44,610 --> 00:48:50,700 But by sampling people who turn out according to those weights and fitting a model on them, 488 00:48:50,700 --> 00:48:57,510 your coefficients are going to incorporate the uncertainty about who turns out and who does not turn out. 489 00:48:57,510 --> 00:49:02,250 Does that kind of make sense? And so, yeah, and so this is it. 490 00:49:02,250 --> 00:49:08,660 You've seen this before. OK, so here we go to our example. 491 00:49:08,660 --> 00:49:16,280 The example is: thanks to the organising committee, we got some money to do a survey online. 492 00:49:16,280 --> 00:49:22,040 We did a survey on Amazon Mechanical Turk. Are you all familiar with what Amazon Mechanical Turk is? 493 00:49:22,040 --> 00:49:32,270 Yes. Well, for those who aren't: it's just a platform where you can post what are called human intelligence tasks. 494 00:49:32,270 --> 00:49:38,060 So, for instance, if you needed labels for a training set, say you had cat wars, for example, 495 00:49:38,060 --> 00:49:42,470 you could feed the Turkers a lot of cat photos, like we saw yesterday, 496 00:49:42,470 --> 00:49:46,850 and the Turkers would have to click on the cutest ones. Or you could do something else: 497 00:49:46,850 --> 00:49:50,360 you could ask them to transcribe an audio file, or you could do something else again, 498 00:49:50,360 --> 00:49:56,660 you could ask them to fill in a survey, like we did. And the thing that's kind of cool about the Turkers, 499 00:49:56,660 --> 00:50:03,740 if you're wondering about the name Mechanical Turk, is this small aside. 500 00:50:03,740 --> 00:50:12,350 Back in, I think it was the eighteen hundreds or something, there used to be a circus act: they used to carry around a chess player with a turban, 501 00:50:12,350 --> 00:50:20,330 and the chess player became known as the Mechanical Turk, because it was a wooden figure that would play chess and beat a lot of people. 502 00:50:20,330 --> 00:50:23,720 And everybody was thinking, how the [INAUDIBLE] do they do it? Do they have robotics or something? 503 00:50:23,720 --> 00:50:34,910 And it's like, no, actually, there was a little person inside a box underneath the chess board who played the chess, which is quite funny. 504 00:50:34,910 --> 00:50:40,040 But that's kind of why we call them Mechanical Turks: because these are tasks that, in principle, 505 00:50:40,040 --> 00:50:43,490 we would like to have an AI do, but we don't have a way to do that yet, 506 00:50:43,490 --> 00:50:50,750 so we give them to humans to do. OK, so a few details about the survey. 507 00:50:50,750 --> 00:50:54,680 So, the goal of this: we asked them a bunch of questions. 508 00:50:54,680 --> 00:50:59,240 Some of the questions involved their voting preferences 509 00:50:59,240 --> 00:51:03,770 in twenty twenty, and some of the questions involved their voting preferences in twenty sixteen.
510 00:51:03,770 --> 00:51:05,930 Why did we do twenty sixteen? Because we had a benchmark: 511 00:51:05,930 --> 00:51:12,620 we know the results from twenty sixteen, and I want to show you that even with a very non-representative sample, which the Amazon Mechanical Turk sample is, 512 00:51:12,620 --> 00:51:18,770 because think about how many selection effects there are among people who select into being a Mechanical Turk worker, right, 513 00:51:18,770 --> 00:51:22,310 like, you know, this is a completely non-representative sample, 514 00:51:22,310 --> 00:51:31,530 but, as I will show you, from this non-representative sample we were able to replicate the 2016 election results almost to the point. 515 00:51:31,530 --> 00:51:36,870 A few details on the survey. We surveyed one thousand five hundred workers on the 11th of June. 516 00:51:36,870 --> 00:51:39,750 Amazon MTurk takes twenty-five percent of the total fee, 517 00:51:39,750 --> 00:51:45,540 which is a bummer, because we asked for specific characteristics of the workers. 518 00:51:45,540 --> 00:51:50,010 So we asked for Turkers that had, on average, an approval accuracy of ninety-five percent, 519 00:51:50,010 --> 00:51:57,120 we asked that they live in the United States, and we asked that they had been approved for at least a thousand human intelligence tasks. 520 00:51:57,120 --> 00:52:05,520 Why did I do this? I wanted to screen out bots, because bots are a problem on all online platforms, and in particular on Mechanical Turk. 521 00:52:05,520 --> 00:52:11,460 There have been papers showing that there are survey bots that just randomly click through. 522 00:52:11,460 --> 00:52:19,530 Another thing we did to screen out bots was that we had a CAPTCHA in the survey, so a bot wouldn't have been able to complete the CAPTCHA task. 523 00:52:19,530 --> 00:52:28,170 OK, do we need to take a break at some point, by the way? At what time? 524 00:52:28,170 --> 00:52:40,530 OK, so we have time, OK, great. And so it cost us around a thousand dollars, and we obtained one thousand five hundred responses. 525 00:52:40,530 --> 00:52:49,950 And, yeah, that's it. So now we're going to get into... good. 526 00:52:49,950 --> 00:52:53,520 OK. This is what I figured, you know, I'll take this opportunity. 527 00:52:53,520 --> 00:52:58,860 I think Cenk tomorrow is going to speak more about Mechanical Turk and teach you how to gather data from Mechanical Turk, 528 00:52:58,860 --> 00:53:06,150 but I wanted to give you sort of an overview of what a screenshot of a human intelligence task looks like. 529 00:53:06,150 --> 00:53:11,640 So up here you have the percentage of human intelligence tasks completed in your batch. 530 00:53:11,640 --> 00:53:17,010 We wanted our survey to be taken one thousand five hundred times, 531 00:53:17,010 --> 00:53:24,240 and this collective of 1,500 human intelligence tasks is what's called a batch. 532 00:53:24,240 --> 00:53:29,190 And this was the description of the task. 533 00:53:29,190 --> 00:53:35,310 These are the qualifications that we asked for: HIT approval rate greater than or equal to 95 percent, location in the United States, 534 00:53:35,310 --> 00:53:41,940 and, sorry, it was five hundred: we asked that they had been approved for at least five hundred human intelligence tasks. 535 00:53:41,940 --> 00:53:48,720 And this was an example of what any given Mechanical Turk worker saw.
536 00:53:48,720 --> 00:53:52,470 So what it was, simply: where it says "the link will appear here", 537 00:53:52,470 --> 00:53:56,850 only if you accept the HIT does the link to our survey appear. They would click on it, 538 00:53:56,850 --> 00:54:03,930 they would complete the survey, and then at the end of the survey a unique code for each individual Turker would be shown, 539 00:54:03,930 --> 00:54:09,870 and they had to input that code back into MTurk so that we could match the responses to the Turkers. 540 00:54:09,870 --> 00:54:13,540 Does that make sense? OK, fantastic. Yeah. 541 00:54:13,540 --> 00:54:20,770 Yeah, yep. So. 542 00:54:20,770 --> 00:54:27,940 So that's just a clarification. Yeah. [Audience] What does an approval rate of greater than or equal to 95 mean? 543 00:54:27,940 --> 00:54:33,640 So it means that I only wanted people whose human intelligence task approval rate, 544 00:54:33,640 --> 00:54:39,400 so the proportion of approved tasks over submitted tasks, was ninety-five percent or above. 545 00:54:39,400 --> 00:54:46,540 So I wanted people who weren't bots, because bots would have a very low approval rate, because, sorry, let me explain: 546 00:54:46,540 --> 00:54:50,410 each requester has to go and approve each piece of work, 547 00:54:50,410 --> 00:54:55,960 so the person who posts a task is called the requester, and each requester has to approve each individual worker's submissions. 548 00:54:55,960 --> 00:55:01,300 That's the thing. Yeah. [Audience] And you could sort of look at the history of what they've done to decide whether to accept them. 549 00:55:01,300 --> 00:55:05,800 That's right. OK, that makes sense. Brilliant. 550 00:55:05,800 --> 00:55:09,470 OK. And so we conducted this survey. 551 00:55:09,470 --> 00:55:15,560 Now we go to, unless you guys can read sideways... OK. 552 00:55:15,560 --> 00:55:25,470 So we conduct the survey, and we ask for a bunch of categories. 553 00:55:25,470 --> 00:55:30,630 We ask for a bunch of information: we ask for, as I said, vote choice and turnout behaviour in 2016, 554 00:55:30,630 --> 00:55:34,830 but also some individual-level characteristics. 555 00:55:34,830 --> 00:55:38,910 So like, oh no, sorry, these are not the individual-level characteristics, 556 00:55:38,910 --> 00:55:42,320 this is the state. 557 00:55:42,320 --> 00:55:48,180 So this was an example of the question, by the way: did you vote in the 2016 presidential election? 558 00:55:48,180 --> 00:55:53,240 Yes; No; Can't remember; Don't know; Was not eligible. And I got emails from people: 559 00:55:53,240 --> 00:55:57,320 they were very happy that "was not eligible" was in there, because they had just been naturalised, 560 00:55:57,320 --> 00:56:00,860 and a lot of surveys only give them "No, I didn't vote", 561 00:56:00,860 --> 00:56:05,720 and then they feel bad about it, because it's almost like they didn't do their duty. But I was happy that they were happy. 562 00:56:05,720 --> 00:56:13,190 And we further asked: which candidate did you vote for for president in 2016, with potential answers 563 00:56:13,190 --> 00:56:16,640 Donald Trump, Hillary Clinton, third party, can't remember, don't know, 564 00:56:16,640 --> 00:56:17,510 did not vote. 565 00:56:17,510 --> 00:56:24,590 Notice that we introduced the third-party option here, whereas in the example of vote choice that I showed you before there were only two options, 566 00:56:24,590 --> 00:56:30,230 which means that now we're going to have to do some funky stuff with our model to allow for multiple options.
567 00:56:30,230 --> 00:56:38,160 Funky stuff, I mean, nothing too crazy. From multiple sources, including the census and so on, 568 00:56:38,160 --> 00:56:41,370 we also have some state-level characteristics that we're interested in. 569 00:56:41,370 --> 00:56:46,770 So for each state we have the percentage Hispanic, percentage Black, percentage Asian, percentage non-college whites, 570 00:56:46,770 --> 00:56:52,380 that one's important because of what's become important in the narrative because of Trump, 571 00:56:52,380 --> 00:56:59,490 percentage college graduates in general, because, as you know, Hillary Clinton was the first Democratic candidate to win, 572 00:56:59,490 --> 00:57:04,050 well, I don't know about the first, but the first in a long time to win, college graduates in the United States, 573 00:57:04,050 --> 00:57:09,750 the median income of the state, the percentage that the Republican won in 2016, 574 00:57:09,750 --> 00:57:18,840 the percentage that the Libertarian won in 2016, the percentage for the Greens, and the percentage of people who turned out in 2016. 575 00:57:18,840 --> 00:57:26,790 Now, before you ask: yes, we're predicting 2016 state-level results with 2016 covariates. 576 00:57:26,790 --> 00:57:30,690 So perhaps it's cheating? But the reason I'm showing you this is because, A, 577 00:57:30,690 --> 00:57:36,690 there is no direct circularity, because we're using these covariates to inform an individual-level model, 578 00:57:36,690 --> 00:57:44,070 and B, these are going to be the variables that you're going to use to predict the 2020 result in your exercise. 579 00:57:44,070 --> 00:57:53,840 And so, yes, part of it was that I didn't think it was correct to go back to 2012 to get these statistics, because, 580 00:57:53,840 --> 00:58:02,660 given that we asked the question today, this is going to be a better measure of whether the model is able to correctly 581 00:58:02,660 --> 00:58:07,160 adjust for bias than a model that would have used the 2012 covariates. 582 00:58:07,160 --> 00:58:12,890 So I think this is the closest thing we get to showing you that if it works for 2016, it's likely to work in 2020. 583 00:58:12,890 --> 00:58:24,400 OK. This is the turnout model; it's almost identical to the one you've seen before. 584 00:58:24,400 --> 00:58:32,200 This is the vote choice model. As I said, it's a little more funky, because now we have to use this categorical distribution, 585 00:58:32,200 --> 00:58:35,230 which is just a multinomial distribution with n equal to one. 586 00:58:35,230 --> 00:58:41,350 So this time, instead of everybody only being able to say "I'm either a Democrat or a Republican", 587 00:58:41,350 --> 00:58:48,310 they can say "I'm a Democrat, a Republican, or a third-party or other voter" in this election. 588 00:58:48,310 --> 00:58:53,920 Good point: missing values in Bayesian statistics. If you have missing values in your outcome variable, that doesn't matter. 589 00:58:53,920 --> 00:59:02,230 You can just feed them in, because the Bayesian model estimates their predictive distributions and then just imputes them. 590 00:59:02,230 --> 00:59:09,730 So it's an automatic imputation of the outcome values, which is pretty useful in a lot of scenarios.
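(As a rough illustration, and not the exact slide code: a JAGS-style sketch of a three-way categorical outcome, with illustrative names. Any NA entries in the outcome vector supplied in the data are treated as unknown nodes and sampled at every iteration, which is the automatic imputation just described.)

  vote_choice_sketch <- "
    for (i in 1:N) {
      vote[i] ~ dcat(p[i, 1:3])                              # 1 = Democrat, 2 = Republican, 3 = other
      for (c in 1:3) {
        p[i, c] <- exp(eta[i, c]) / sum(exp(eta[i, 1:3]))    # softmax over the three linear predictors
        eta[i, c] <- alpha[c] + a_state[state[i], c]         # linear predictors (priors omitted)
      }
    }
  "
  # NA values in 'vote' are imputed automatically from their predictive distribution.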
591 00:59:09,730 --> 00:59:11,560 And so the categorical distribution just says: well, 592 00:59:11,560 --> 00:59:16,660 instead of having one probability, either Democrat or Republican, you now have three probabilities. 593 00:59:16,660 --> 00:59:21,670 There's a probability of voting for the Democrat, a probability of voting for the Republican, and a probability of voting for another option. 594 00:59:21,670 --> 00:59:24,610 And these have to sum to one, of course. 595 00:59:24,610 --> 00:59:35,290 And the other funky thing that's happening is that, as you can see, the effects now all have this index nu, which represents the vote choice. 596 00:59:35,290 --> 00:59:40,840 So you can estimate a state effect for each vote choice: a state effect 597 00:59:40,840 --> 00:59:43,750 for the Democrats, a state effect for the Republicans, a state effect for the others. 598 00:59:43,750 --> 00:59:50,710 And the other funky thing is that there is this identifiability constraint, which is a sum-to-zero constraint. 599 00:59:50,710 --> 00:59:57,340 The idea is that when you have party-specific effects and multiple parties: in a binomial model 600 00:59:57,340 --> 01:00:04,000 there is already an implicit sum-to-zero constraint, because an effect in favour of one party is automatically against the other. 601 01:00:04,000 --> 01:00:10,120 Here we have to introduce that by hand. And so you just have this sum-to-zero constraint, which 602 01:00:10,120 --> 01:00:19,360 can also be expressed through the mean: this alpha bar is just the mean across the choices. 603 01:00:19,360 --> 01:00:23,500 Yes. Maybe there's a typo here, 604 01:00:23,500 --> 01:00:28,910 but in any case, I think you get the idea. OK. 605 01:00:28,910 --> 01:00:35,120 And so now there is this code. So. Yes. [Audience question, partly inaudible.] 606 01:00:35,120 --> 01:00:41,070 Yes, I need to get... that's it: 607 01:00:41,070 --> 01:00:47,580 we want the effects, yes, we want the effects to sum to zero. 608 01:00:47,580 --> 01:00:56,070 That's right. Because if the gender effect, say, 609 01:00:56,070 --> 01:01:02,110 is positive for the Democrats and positive for the Greens, it cannot be that it's also positive for the Republicans. 610 01:01:02,110 --> 01:01:07,060 Does that make sense? So we're looking at... [Audience] And why is that? 611 01:01:07,060 --> 01:01:15,170 Well, first of all, I think... well, because of interpretability. 612 01:01:15,170 --> 01:01:21,570 Yes, it's difficult to think about the full set of effects otherwise. 613 01:01:21,570 --> 01:01:27,960 So I don't know the answer beyond that right now. [Audience] Yeah, but they have to sum to zero, 614 01:01:27,960 --> 01:01:32,910 yeah. Okay, so one of them would be negative. Yes, that's correct. 615 01:01:32,910 --> 01:01:39,410 Yes, yes it is. Does that make sense? Yeah. OK. 616 01:01:39,410 --> 01:01:45,710 And then you have this model. So you're going to get to work on this on your own later on, 617 01:01:45,710 --> 01:01:51,800 but I'm just going to go through the motions of showing you how you fit a JAGS model. 618 01:01:51,800 --> 01:01:58,190 So this is a model to replicate the 2016 results. The first thing you do is create a list; 619 01:01:58,190 --> 01:02:05,930 this is in R, by the way. So you create a list with your model data, and we introduce the data. 620 01:02:05,930 --> 01:02:11,020 Notice that I introduce the vote choice as ones and zeroes.
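(A sketch of that step in R, with made-up object and column names rather than the actual workshop code: the three-level vote choice factor becomes three 0/1 indicator columns, which is the form the Poisson formulation discussed next expects.)

  # df: the survey data frame (illustrative); df$vote2016 has levels "Dem", "Rep", "Other"
  Y <- sapply(levels(df$vote2016), function(l) as.integer(df$vote2016 == l))
  # Y is now an N x 3 matrix of ones and zeroes, one column per choice; NA rows stay NA

  jags_data <- list(
    Y     = Y,
    turn  = as.integer(df$turnout2016),
    state = as.integer(df$state),
    age   = as.integer(df$age_group),
    N     = nrow(df)
  )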
621 01:02:11,020 --> 01:02:18,170 So instead of being a single vector with values one, two, three, it's three columns of ones and zeroes. 622 01:02:18,170 --> 01:02:25,460 The reason I do that is that fitting a categorical model in JAGS is quite expensive in computational time, 623 01:02:25,460 --> 01:02:28,850 and there's a well-known equivalency with Poisson models: 624 01:02:28,850 --> 01:02:33,270 you can approximate a categorical model with a Poisson model, and the Poisson 625 01:02:33,270 --> 01:02:38,330 version requires that you have the three different choices as separate columns in the outcome. 626 01:02:38,330 --> 01:02:46,520 Yes? [Audience, partly inaudible] So this is one of the... you're making a model based on 627 01:02:46,520 --> 01:02:56,240 one particular selection mechanism. If the selection had happened at midnight, or at 8 or 12, et cetera, we'd get a different selection rule. 628 01:02:56,240 --> 01:03:02,000 Yep, yep. And the moment you make a model to estimate 629 01:03:02,000 --> 01:03:08,300 things, you know... yep, yep. What is then the virtue of that model? 630 01:03:08,300 --> 01:03:12,170 So the virtue of the model is that we're going to find out how accurately we can get 631 01:03:12,170 --> 01:03:16,160 to 2016, and then we're going to use the same infrastructure to predict 2020. 632 01:03:16,160 --> 01:03:25,990 Yeah. [Audience] But then, to predict, it would need to be a similar selection mechanism under the model. 633 01:03:25,990 --> 01:03:33,670 Sorry, it's the same one thousand five hundred people: I have asked them who they voted for in 2016 and who they're going to vote for in 2020, right? 634 01:03:33,670 --> 01:03:40,390 Yeah. What do you mean, you want to create different models for different selection rules? 635 01:03:40,390 --> 01:03:46,810 [Audience] I'm just saying, the only reason to predict in this context is if you can obtain insights which generalise. I wouldn't want 636 01:03:46,810 --> 01:03:56,860 to say that if the coefficient for gender is positive, for instance, that the gender effect exists outside the sample itself. 637 01:03:56,860 --> 01:03:58,810 Yeah, for sure. For sure, yes. 638 01:03:58,810 --> 01:04:07,990 But the idea, but the idea is that through the shrinkage effects, you hope that you don't fit the data too closely, 639 01:04:07,990 --> 01:04:13,240 and so you can do better out-of-sample predictions. [Audience] Yeah, but the selection effects... 640 01:04:13,240 --> 01:04:18,580 Well, if the selection effect is super strong and you haven't accounted for it, so there is this assumption called ignorability, 641 01:04:18,580 --> 01:04:25,540 which is that I can ignore the remaining variables that I have not introduced in the model, as part of the residual. 642 01:04:25,540 --> 01:04:30,130 If that assumption is broken, as you suggest, through some very heavy selection effect, 643 01:04:30,130 --> 01:04:34,390 then I can't ignore it, so I would need to introduce it in the model and in the poststratification frame. 644 01:04:34,390 --> 01:04:41,300 [Audience, partly inaudible] So it seems that the Facebook example works because, within cells, 645 01:04:41,300 --> 01:04:47,080 the selection isn't related to how people vote. Yeah, that's right. 646 01:04:47,080 --> 01:04:56,650 Yeah. Google and Facebook, yeah. Well, they still had some selection, because they ended up overestimating, if I remember correctly, the Obama vote. 647 01:04:56,650 --> 01:05:00,430 But not by much. So they did well, but they were off a little bit. 648 01:05:00,430 --> 01:05:09,190 It wasn't perfect, right?
[Audience] So, therefore, it would seem that that's basically what you're assuming away, the selection, a little bit. 649 01:05:09,190 --> 01:05:16,900 Yeah. The assumption here is that the selection into the specific time and day on which we fielded the survey is 650 01:05:16,900 --> 01:05:23,830 not heavy, and that the selection into being an MTurk worker, after you control for all of these variables 651 01:05:23,830 --> 01:05:31,390 here, is essentially zero. That's obviously an assumption, and it's obviously false, but it allows us to make progress. 652 01:05:31,390 --> 01:05:36,490 And, as we will see, the results are pretty decent, so it suggests that it's not too far off. 653 01:05:36,490 --> 01:05:42,280 [Audience] Let's say we can agree, going back to the beginning, that this is only relevant in a predictive context and not for causal concerns. 654 01:05:42,280 --> 01:05:48,580 Oh, for sure. But I mean, that's true of the entire endeavour that I'm teaching you today: 655 01:05:48,580 --> 01:05:54,370 this is a predictive exercise, it is not an inferential tool. Yeah. Anybody else? 656 01:05:54,370 --> 01:06:01,750 No. OK. So we continue: we specify these variables, and then we specify, so the way JAGS works 657 01:06:01,750 --> 01:06:06,850 is almost exactly as I showed you with those hierarchical models up there. 658 01:06:06,850 --> 01:06:15,760 So literally you say: the outcome is distributed as a Bernoulli distribution with parameter pi at the individual level. 659 01:06:15,760 --> 01:06:18,830 We add this to the... oh, this is an interesting thing. 660 01:06:18,830 --> 01:06:24,520 So when people respond to the turnout question, there's a well-known phenomenon of over-reporting. 661 01:06:24,520 --> 01:06:34,420 And people have found that in the ANES, the American National Election Study, 662 01:06:34,420 --> 01:06:39,790 the over-reporting factor is roughly thirteen point five percent, and in Pew, 663 01:06:39,790 --> 01:06:45,100 so in online surveys, the over-reporting factor has been found to be 17 percent and above. 664 01:06:45,100 --> 01:06:54,160 So what we do here is a kind of rough correction: we estimate the model, and then we say, whatever 665 01:06:54,160 --> 01:07:03,400 turnout distribution you have for each individual, whatever distribution you have estimated thus far, subtract 17 percentage points from it. 666 01:07:03,400 --> 01:07:05,980 So the distribution is simply shifted: 667 01:07:05,980 --> 01:07:12,820 if the distribution was, imagine, like a normal with a mean of, say, a 70 percent probability, then now it's shifted 17 points down. 668 01:07:12,820 --> 01:07:19,330 And this actually ends up having a very positive effect on the estimates. Yeah, you should remember, if you do vote choice and turnout stuff, 669 01:07:19,330 --> 01:07:25,480 to account for turnout over-reporting, because it's really important. Obviously this assumption is not perfect. 670 01:07:25,480 --> 01:07:31,750 In principle, we would have built another model, from another information source, that would have told us who, out of 671 01:07:31,750 --> 01:07:36,820 these categories, is more likely to over-report, and then created a predicted value for each individual in the sample. 672 01:07:36,820 --> 01:07:44,290 But we didn't do that. We just assumed that people across the board are equally likely to over-report. 673 01:07:44,290 --> 01:07:52,860 So this is like a uniform over-reporting kind of model. Yeah.
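(A crude sketch of that correction in R, under the uniform over-reporting assumption he describes; p_turn_draws is an assumed, illustrative matrix of posterior draws of each respondent's turnout probability, not an object from the actual workshop code.)

  overreport <- 0.17                                  # assumed over-reporting factor for online panels
  p_turn_adj <- pmax(p_turn_draws - overreport, 0)    # shift every turnout probability down, truncated at zero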
[Audience] So in this context, what does doing this actually amount to? 674 01:07:52,860 --> 01:07:57,060 No, so it's simply: if I survey a thousand people 675 01:07:57,060 --> 01:08:01,320 and in reality only five hundred of them voted, in my sample 676 01:08:01,320 --> 01:08:07,440 it's going to look like seven hundred and fifty did. Just because people lie. 677 01:08:07,440 --> 01:08:11,480 Yes, well, they lie for all sorts of understandable motives, right? 678 01:08:11,480 --> 01:08:18,160 There's social desirability; they feel perhaps a little ashamed that they didn't turn out to vote; 679 01:08:18,160 --> 01:08:22,550 all sorts of things. Yeah. OK. 680 01:08:22,550 --> 01:08:29,720 And then we outline this turnout model. 681 01:08:29,720 --> 01:08:35,990 These are random effects at the state level, region effects, age effects, race effects and so on. 682 01:08:35,990 --> 01:08:41,440 This is the state-level predictor, with the priors that we specified above. 683 01:08:41,440 --> 01:08:50,030 Again, you'll get a chance to play with this code yourself, so you can start slow and then continue. 684 01:08:50,030 --> 01:08:57,950 There's some cheeky stuff happening here, which I will explain in a minute. You might be wondering what this auxiliary parameter beta is. 685 01:08:57,950 --> 01:09:02,180 What happens is, in sampling algorithms like the Gibbs sampler, 686 01:09:02,180 --> 01:09:06,410 there's this weird phenomenon that if you over-parameterise and then only monitor the parameter you care about, 687 01:09:06,410 --> 01:09:11,930 so if, say, you're interested in parameter beta and then you specify this weird bigger model for beta, 688 01:09:11,930 --> 01:09:18,350 where beta is actually equal to the product of a sub-parameter and an auxiliary variable, and so on and so forth, 689 01:09:18,350 --> 01:09:24,410 then what you find is that by specifying these sub-models and not monitoring them, just monitoring beta, 690 01:09:24,410 --> 01:09:29,570 your model converges faster. And the reason for this is a mathematical one; 691 01:09:29,570 --> 01:09:34,870 there have been papers showing this. It's just that 692 01:09:34,870 --> 01:09:43,510 the sampling algorithm finds the correct convergence point faster, essentially. I haven't explained that at all, 693 01:09:43,510 --> 01:09:46,000 but I'll tell you the paper and you can look at it. 694 01:09:46,000 --> 01:09:56,890 So here we over-parameterise our state-level predictor by multiplying it by a new variable, an auxiliary beta, 695 01:09:56,890 --> 01:10:00,370 which is given a normal distribution. But we don't monitor it, 696 01:10:00,370 --> 01:10:07,180 we don't care about it. We care about this beta here, which is the product of beta star and the auxiliary variable. 697 01:10:07,180 --> 01:10:16,360 OK. And then these are, as I stated before, all the random effects: you see a random effect with its random effect distribution 698 01:10:16,360 --> 01:10:22,220 and an auxiliary parameter multiplying the original random effect. 699 01:10:22,220 --> 01:10:27,310 [Audience] Yeah, do you see a difference? Yes. Massive, massive convergence gains 700 01:10:27,310 --> 01:10:37,840 when you take that off... yeah. So the model that I fit for this is run for seven thousand iterations and now converges almost perfectly. 701 01:10:37,840 --> 01:10:40,630 There are a few parameters that could do a tiny, tiny bit better, 702 01:10:40,630 --> 01:10:45,760 but before I was using the auxiliary parameter, I needed like fifteen thousand iterations to get there.
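(A minimal sketch of that over-parameterisation trick, again as JAGS syntax in an R string with illustrative names; the paper cited in the PDF gives the formal treatment.)

  param_expansion_sketch <- "
    beta_raw ~ dnorm(0, 0.01)      # base parameter, not monitored
    xi       ~ dnorm(0, 1)         # auxiliary variable, not monitored
    beta     <- beta_raw * xi      # the coefficient we actually monitor and report
  "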
703 01:10:45,760 --> 01:10:50,740 So actually, it's very useful to know these tricks. Yeah, very, very useful. 704 01:10:50,740 --> 01:10:56,080 Keep these tricks in mind. OK. 705 01:10:56,080 --> 01:11:05,500 The vote choice model: so notice, again, we're specifying essentially three different Poisson regressions that are bound together by a sum-to-zero constraint. 706 01:11:05,500 --> 01:11:14,320 This trick works so long as you introduce an individual-level effect that has an uninformative distribution. 707 01:11:14,320 --> 01:11:22,090 This is well known. If you look at the PDF, there is a citation for this, and you can go and look at that paper. 708 01:11:22,090 --> 01:11:27,520 And it's quite neat, because the Poisson sampler in JAGS, and in Stan, is very stable, 709 01:11:27,520 --> 01:11:32,220 whereas the categorical sampler is quite unstable, and so this converges a lot faster. 710 01:11:32,220 --> 01:11:43,660 OK, and then these are the effects, as I said before. Notice that the way we specify the conditionality is by estimating a 711 01:11:43,660 --> 01:11:49,950 different parameter depending on whether people are predicted to turn out or not. 712 01:11:49,950 --> 01:11:56,230 And so essentially we have two models here. One is a model of vote choice for people who do not turn out, 713 01:11:56,230 --> 01:11:59,410 and one is a model of vote choice for people who do turn out. It's kind of interesting: 714 01:11:59,410 --> 01:12:03,520 I don't bother monitoring the vote choice model for people who do not turn out, 715 01:12:03,520 --> 01:12:07,570 because it wasn't part of the inference project, or rather the prediction project, we were doing here. 716 01:12:07,570 --> 01:12:14,530 But it's kind of cool, because if you want to find out which party would benefit more from getting people to 717 01:12:14,530 --> 01:12:21,220 turn out, a model of vote choice for people who are predicted not to turn out will tell you that, which is kind of neat. 718 01:12:21,220 --> 01:12:33,790 OK. And the stuff below is almost exactly the same as before, with the sum-to-zero constraint that I described before. [Audience] And the Poisson with the zeros and ones? 719 01:12:33,790 --> 01:12:41,140 Yes, so the Poisson distribution is defined over non-negative counts, but the values we feed it are zeros and ones. 720 01:12:41,140 --> 01:12:45,040 And then we have to put in that random effect, sorry, 721 01:12:45,040 --> 01:12:50,560 that individual-level effect, in order for the random effect coefficients to be estimated at 722 01:12:50,560 --> 01:12:55,120 exactly the same values they would take under the categorical distribution rather than the Poisson distribution. 723 01:12:55,120 --> 01:13:00,760 [Audience] And therefore this only works if the distribution of the zeros and ones isn't too lopsided, or something? 724 01:13:00,760 --> 01:13:06,910 Yes, yeah, you're right: if some of your categories are very small, you run into problems. 725 01:13:06,910 --> 01:13:13,570 Yes. [Audience question, partly inaudible, about whether part of the data was held out to validate the predictions.] 726 01:13:13,570 --> 01:13:25,020 No, we didn't, because what we are interested in ultimately, 727 01:13:25,020 --> 01:13:30,030 which is the prediction of the 2020 results, hasn't happened yet: the 2020 results aren't in. 728 01:13:30,030 --> 01:13:33,770 And so we're happy with
729 01:13:33,770 --> 01:13:42,740 building the model with all the information that we have and then putting out the best possible prediction that we can right now. 730 01:13:42,740 --> 01:13:47,720 The other reason is that one thousand five hundred people, for this kind of exercise, is not much, 731 01:13:47,720 --> 01:13:52,130 so you don't really have the luxury to do an 80/20 or a 70/30 split, 732 01:13:52,130 --> 01:13:58,940 because even removing 30 percent of the individuals from this sample would mean that many categories would go empty, 733 01:13:58,940 --> 01:14:02,330 and that would massively decrease the predictive accuracy of your model. 734 01:14:02,330 --> 01:14:07,820 So, yeah, you're right: in an ideal world, we would take fifteen thousand people, 735 01:14:07,820 --> 01:14:12,050 leave five thousand out, fit the model, see how it did on 2016, 736 01:14:12,050 --> 01:14:16,880 do that again three or four times to see whether the model coefficients are stable, and then fit it to the whole thing. 737 01:14:16,880 --> 01:14:24,080 [Audience] But in terms of formal performance, you don't have an out-of-sample prediction metric? 738 01:14:24,080 --> 01:14:29,930 No. But we do have... yeah. 739 01:14:29,930 --> 01:14:38,480 So for the predictive model, you're right, we don't have a formal prediction mechanism, sorry, a formal cross-validation happening. 740 01:14:38,480 --> 01:14:53,910 But we do have the hard validation of whether it works or not with respect to the 2016 results. 741 01:14:53,910 --> 01:14:59,020 [Audience] You need to see the results. Yes. Right. 742 01:14:59,020 --> 01:15:04,450 Yeah. I think, for sure, you're right that we could do better on validation. 743 01:15:04,450 --> 01:15:08,380 So one thing that we could have done is fit it with 2012 data. 744 01:15:08,380 --> 01:15:14,260 But even that is weird, because the question that we asked them about their 2016 behaviour, we asked them today, 745 01:15:14,260 --> 01:15:18,520 so they responded to that survey knowing what the aggregate behaviour was as well. 746 01:15:18,520 --> 01:15:25,240 So it's a weird situation. In principle, the ideal case would have been: we would have done a survey in 2016, 747 01:15:25,240 --> 01:15:34,510 we would have done a survey today, and then we would have checked at each point an out-of-sample metric to test the exact model. 748 01:15:34,510 --> 01:15:36,790 I have to say this is something that the literature doesn't do at all. 749 01:15:36,790 --> 01:15:47,080 So, like, most models... I think because you don't want to run into the problem of having empty cells, and also because data are so scarce, 750 01:15:47,080 --> 01:15:50,230 like, a sample of ten thousand is considered to be massive in this literature, 751 01:15:50,230 --> 01:15:56,680 whereas in reality, for most samples... so Gelman et al., they don't report an out-of-sample metric, 752 01:15:56,680 --> 01:16:01,570 but they could have, because they had three hundred and fifty thousand people. So they could have done that. 753 01:16:01,570 --> 01:16:08,530 Yeah, but the point is well taken. [Audience] I think there is some criticism in general about the flexibility of these multilevel models. 754 01:16:08,530 --> 01:16:13,360 Yeah. That by adding random effects, right? Yeah. Just to put it in perspective. 755 01:16:13,360 --> 01:16:21,430 Yeah. You have so many degrees of freedom, basically fitting the data, that any goodness-of-fit measure within the data
756 01:16:21,430 --> 01:16:23,740 means you are basically overfitting right away. 757 01:16:23,740 --> 01:16:31,990 So my question would be: did you evaluate the variance of the random effects, and whether it becomes more like... 758 01:16:31,990 --> 01:16:36,520 We didn't evaluate that. Are you asking whether we did a level-specific R-squared? 759 01:16:36,520 --> 01:16:42,240 We didn't do that. No, because... yeah. 760 01:16:42,240 --> 01:16:46,040 Sorry. You're right, sorry. 761 01:16:46,040 --> 01:16:57,400 So, for the recording: the question has been about the predictive accuracy of this model outside of the sample the model has been trained on, and in particular 762 01:16:57,400 --> 01:17:03,550 whether we did any cross-validation. And so, I don't know if I take that point about... 763 01:17:03,550 --> 01:17:11,200 so, I agree that any model that is tested on its training set, let's say, is going to have overfitting problems. 764 01:17:11,200 --> 01:17:18,730 But I don't take the point that multilevel regression is particularly bad at dealing with that; it's better at dealing with that than linear regression, 765 01:17:18,730 --> 01:17:21,970 and it's worse at dealing with that than a random forest, for instance. 766 01:17:21,970 --> 01:17:28,840 Because the shrinkage effects are a form of regularisation: 767 01:17:28,840 --> 01:17:34,900 they're meant to be there in order to help you filter out the noise. [Audience] So the thing 768 01:17:34,900 --> 01:17:41,110 I'm referring to is basically that multilevel effects are basically the same as a random error 769 01:17:41,110 --> 01:17:46,910 in a normal linear regression, right? Yes, in a way; you can read them like an additional error term, 770 01:17:46,910 --> 01:17:49,390 yeah, if you think of it in the classical sense. Yeah. 771 01:17:49,390 --> 01:17:58,690 And therefore it would be quite strange to judge the performance of a model by its fixed-effect coefficient estimates plus the residuals, right? 772 01:17:58,690 --> 01:18:02,440 Because then by definition the fit is perfect. No, for sure. 773 01:18:02,440 --> 01:18:06,700 No, I get what you're saying. Yes, I get what you're saying. 774 01:18:06,700 --> 01:18:12,370 Yeah, we can discuss it more later, but I'm pretty sure your point is well taken: there is no cross-validation measure here. 775 01:18:12,370 --> 01:18:22,250 Yeah. OK. And so, yeah, we finish specifying the model, and you'll get to play around with this model and the code, 776 01:18:22,250 --> 01:18:34,550 so don't worry if it looks daunting at the moment. We tell JAGS which parameters we want to monitor, through this thing here, 777 01:18:34,550 --> 01:18:43,460 and we then tell JAGS to run four chains for seven thousand iterations, burning 778 01:18:43,460 --> 01:18:52,520 the first six thousand. Burn-in is, you remember when I showed you that image of the chains, and they converged after the first three hundred iterations? 779 01:18:52,520 --> 01:18:58,400 Burn-in means we just throw the first ones away, because we know that they are not independent and they have not converged, 780 01:18:58,400 --> 01:19:03,180 whereas the last one thousand are assumed to have converged. 781 01:19:03,180 --> 01:19:08,600 And so it's as if, with every retained sample, we are taking a new value from the joint distribution. 782 01:19:08,600 --> 01:19:15,290 So the effective sample size is a thousand, essentially. We thought that was very good.
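(For reference, a sketch of what that run looks like from R with the rjags package; the model file name, data object and monitored parameter names are illustrative assumptions, and the iteration settings follow the numbers he quotes.)

  library(rjags)

  model <- jags.model("turnout_vote_model.txt",        # JAGS model file (illustrative name)
                      data = jags_data,
                      n.chains = 4)

  update(model, n.iter = 6000)                         # burn-in: discard the first 6,000 iterations

  samples <- coda.samples(model,
                          variable.names = c("alpha", "a_state", "beta"),  # parameters to monitor
                          n.iter = 1000, thin = 4)     # keep 250 draws per chain, 1,000 in total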
783 01:19:15,290 --> 01:19:23,440 [Audience question, partly inaudible, about why four chains rather than two.] So, two answers to that. 784 01:19:23,440 --> 01:19:26,490 One: two is good, but you can never be too sure. 785 01:19:26,490 --> 01:19:35,730 And two: you can run JAGS in parallel in R, and the way the parallelisation works is that each chain is run in parallel. 786 01:19:35,730 --> 01:19:43,670 So instead of running a single chain with... oh yeah, sorry, this mike thing is a bit hard, sorry. 787 01:19:43,670 --> 01:19:47,490 The question was: why do we run four chains instead of two chains? 788 01:19:47,490 --> 01:19:54,090 And the answer is, A, because you can never be sure enough, and B, because of the way you can run chains in parallel in R, 789 01:19:54,090 --> 01:20:02,340 which means that you can run four chains with two hundred and fifty useful draws each. 790 01:20:02,340 --> 01:20:07,950 So if you think that your model is going to converge after six thousand five hundred iterations, 791 01:20:07,950 --> 01:20:12,840 then, if you want a thousand values at the end, you can either 792 01:20:12,840 --> 01:20:20,010 run two chains and keep five hundred draws from each, or you can run four chains and keep two hundred and fifty from each. 793 01:20:20,010 --> 01:20:26,770 And because you can run them in parallel, running four chains of 250 each is faster. Does that make sense? 794 01:20:26,770 --> 01:20:31,420 Yes. Yes. [Audience] We run four different chains, yeah, yeah, 795 01:20:31,420 --> 01:20:39,250 but each chain is itself a Markov chain, like its own distribution. 796 01:20:39,250 --> 01:20:43,540 So from which of these chains are we drawing our values? 797 01:20:43,540 --> 01:20:48,310 That's a good question. So what you do is, you have these four chains... yeah, the question was recorded, 798 01:20:48,310 --> 01:21:02,770 so I'll paraphrase it for the others. The four chains reach convergence after the six thousandth iteration, at which point from each chain we keep every fourth draw, 799 01:21:02,770 --> 01:21:04,510 so there's a thinning factor of four, 800 01:21:04,510 --> 01:21:12,100 which leaves two hundred and fifty observations for each chain, and we stack them up as if they were all from a single chain. 801 01:21:12,100 --> 01:21:18,760 And then we just use the stack of one thousand observations, because, think about it: given that they have converged, 802 01:21:18,760 --> 01:21:23,080 or at least we assume that they have converged, they should be from the same distribution, so we can just stack them up, right? 803 01:21:23,080 --> 01:21:27,400 Yeah, that's the idea. Any more questions? 804 01:21:27,400 --> 01:21:35,010 OK. This model takes about four hours to converge, so yeah, it's a bit painful. 805 01:21:35,010 --> 01:21:41,550 It used to take a lot longer; I stressed out about this. But what you're going to play with in class is going to be something a lot simpler, I think. 806 01:21:41,550 --> 01:21:46,230 So we're going to pick three or four categories, and we're going to let you 807 01:21:46,230 --> 01:21:50,010 fit a model only for vote choice and only for those three or four categories. 808 01:21:50,010 --> 01:22:00,210 So that should be quite fast. I would expect it to take about 10 to 15 minutes or so. And the Gelman-Rubin statistics are shown here.
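(A sketch of how those two steps, the convergence check and the stacking, look with the coda package in R; 'samples' is assumed to be the mcmc.list returned by coda.samples above.)

  library(coda)

  gelman.diag(samples, multivariate = FALSE)   # Gelman-Rubin statistics: values near 1 suggest convergence
  draws <- as.matrix(samples)                  # stack the converged chains into one matrix of draws
  nrow(draws)                                  # 4 chains x 250 retained iterations = 1,000 draws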
809 01:22:00,210 --> 01:22:07,080 So notice that some parameters, these guys here, are having a bit of a hard time converging, 810 01:22:07,080 --> 01:22:12,870 but they're actually very close to 1.1, which means that if we ran a few more iterations they would probably converge. 811 01:22:12,870 --> 01:22:18,840 Yes? [Audience] Just a quick question on your opinion: lately people have been using INLA. 812 01:22:18,840 --> 01:22:26,130 Yeah, yeah. Great, really, really good, because of the speed especially: JAGS is super slow and INLA is super fast. 813 01:22:26,130 --> 01:22:32,430 INLA, integrated nested Laplace approximation, is a newer piece of Bayesian software 814 01:22:32,430 --> 01:22:38,190 which, instead of using the Gibbs sampler, uses the approximation technique I just mentioned. 815 01:22:38,190 --> 01:22:44,220 And it has one big advantage, which is speed: massive, massive gains in speed. 816 01:22:44,220 --> 01:22:48,720 Its disadvantages are that it's hard to bring in data from multiple sources in INLA, 817 01:22:48,720 --> 01:22:58,410 whereas here I can literally have two models linked to each other which estimate things from two completely different data sources. 818 01:22:58,410 --> 01:23:02,190 I could have the British Election Study inform my turnout model, which is actually what is done, 819 01:23:02,190 --> 01:23:06,000 and then I could have my survey inform the vote choice model, which is what people do all the time. 820 01:23:06,000 --> 01:23:12,570 And you can stack those into the same model, which is super cool, and you can find nonlinear ways to join the two models. 821 01:23:12,570 --> 01:23:14,820 So that's really, really fun stuff you can do with JAGS. 822 01:23:14,820 --> 01:23:21,300 It's super flexible, and so is Stan; in INLA they're not quite there yet. So if you're like a genius, 823 01:23:21,300 --> 01:23:28,470 you can figure out a way to introduce this new information via priors and penalised complexity priors and so on and so forth, 824 01:23:28,470 --> 01:23:33,210 but at the moment it works more like a glm-type interface: 825 01:23:33,210 --> 01:23:39,900 it has a set of pretty standard models that you can run with it. And it also has the disadvantage 826 01:23:39,900 --> 01:23:46,830 that you will never be able to run mixture models on it, because they violate the assumption of a Gaussian latent field, 827 01:23:46,830 --> 01:23:54,390 which is the underlying assumption of the integrated nested Laplace approximation. So, yeah, but it's great. 828 01:23:54,390 --> 01:23:55,810 Yeah. Very good. 829 01:23:55,810 --> 01:24:05,100 INLA is a great option; I use it for a lot of stuff, but not for this, because I needed to bring in a lot of nonlinearities. 830 01:24:05,100 --> 01:24:10,660 But actually, you could use it for this. We can talk about it later if you like. 831 01:24:10,660 --> 01:24:20,020 OK, so we have this model and it has converged, and now it's time to show, well, the fact that it has converged means that we now have this predictive 832 01:24:20,020 --> 01:24:26,980 machine, and we can use this predictive machine to make predictions about individual categories, or categories of interest. 833 01:24:26,980 --> 01:24:31,210 Which means that, you know, we need to set up these categories of interest.
834 01:24:31,210 --> 01:24:37,360 We need to find out how many people from a specific category live in, say, the state of Texas, which we want to make the prediction for. 835 01:24:37,360 --> 01:24:45,130 And we take these numbers from the American Community Survey microdata, which are available online; 836 01:24:45,130 --> 01:24:48,710 at the end of this talk I will show you how to download them. Yes, sorry, 837 01:24:48,710 --> 01:24:56,560 I should have put that as a question. Sorry. And we break down the population into the following characteristics. 838 01:24:56,560 --> 01:25:02,050 So we break it down into gender, two categories; age, six categories; race, five categories; education, 839 01:25:02,050 --> 01:25:05,980 four categories; household income, three categories; and state, fifty-one categories. 840 01:25:05,980 --> 01:25:09,340 So this amounts to a total of thirty-six thousand seven hundred cells, 841 01:25:09,340 --> 01:25:15,160 of which only twenty-nine thousand are actually populated in the microdata. 842 01:25:15,160 --> 01:25:19,450 So that means that there are some cells that are so rare that a sample of, I think the 843 01:25:19,450 --> 01:25:23,800 microdata is about three and a half million Americans, doesn't contain any of them. 844 01:25:23,800 --> 01:25:27,370 So these are very rare cells. 845 01:25:27,370 --> 01:25:39,160 This is kind of a neat plot, which actually comes from my own research. So, on the bold line, you see, 846 01:25:39,160 --> 01:25:47,650 this is ordered: you take the cells, each cell being a voter category, and you order them from the largest cell to the smallest cell, 847 01:25:47,650 --> 01:25:52,390 where the largest represents the largest proportion of the population and the smallest the smallest proportion. 848 01:25:52,390 --> 01:25:58,030 So some of the cells around the 30,000 index don't even have one person in them, literally. 849 01:25:58,030 --> 01:26:03,190 In fact, around the 30,000th there's one person out of three and a half million who belongs to that particular voter category, 850 01:26:03,190 --> 01:26:09,970 whereas some of the cells at the top have like three hundred thousand people, and so on and so forth. 851 01:26:09,970 --> 01:26:18,980 On the y-axis is the cell probability in the population. So if you were to sample at random from the population represented in the microdata, 852 01:26:18,980 --> 01:26:23,560 the largest category, after you cut it up in the way that we described, 853 01:26:23,560 --> 01:26:28,630 has about a two-in-a-thousand probability of being sampled. 854 01:26:28,630 --> 01:26:35,260 OK, so these are very small categories. This is a very finely defined target stratification. 855 01:26:35,260 --> 01:26:39,010 The dotted line represents the cumulative distribution. 856 01:26:39,010 --> 01:26:47,230 So that means that as you sum the sizes of these cells, you eventually get to 100 percent of the population. 857 01:26:47,230 --> 01:26:52,810 The kind of neat thing about it is that even though we have thirty thousand cells, 858 01:26:52,810 --> 01:26:57,940 the largest five thousand make up about 80 percent of the population, 859 01:26:57,940 --> 01:27:03,610 which means that, actually, if you start thinking in terms of power, how powerful does your sample need to be?
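(As a rough sketch of how such a frame can be assembled in R, assuming an illustrative data frame 'acs' of ACS microdata with one row per person and made-up column names; a real frame would also use the ACS person weights rather than raw row counts.)

  library(dplyr)

  frame <- acs %>%
    count(female, age_group, race, education, income_group, state, name = "N") %>%  # people per cell
    mutate(share = N / sum(N))                                                       # cell shares of the population

  nrow(frame)   # only populated cells appear; empty combinations are simply absent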
860 01:27:03,610 --> 01:27:09,100 Well, if your sample is powerful enough to capture the dynamics of the top five thousand, 861 01:27:09,100 --> 01:27:15,100 then, so long as there is not too much heterogeneity in vote choice in the 862 01:27:15,100 --> 01:27:20,080 remaining 20 percent, you're actually fine. So that's why, 863 01:27:20,080 --> 01:27:25,570 even though, if you did the maths, in order to have a sample that's powered 864 01:27:25,570 --> 01:27:31,390 at the 90 percent level and have it work for an MRP, a prediction and 865 01:27:31,390 --> 01:27:34,270 poststratification, approach, you might need like a hundred thousand people, 866 01:27:34,270 --> 01:27:37,540 the reason why it works so well with smaller samples, like ten thousand or one thousand 867 01:27:37,540 --> 01:27:42,670 five hundred, is that you only need to get the top five thousand cells right. 868 01:27:42,670 --> 01:27:46,630 The rest you can kind of ignore, so long as, obviously, like, this is not true 869 01:27:46,630 --> 01:27:52,750 if you happen to live in a federation where each state is completely heterogeneous from the others; 870 01:27:52,750 --> 01:27:57,260 then, you know, you cannot do this kind of work there. Yeah. 871 01:27:57,260 --> 01:28:02,900 OK, so this is, sorry, this is a kind of descriptive picture of what the poststratification cells look like, 872 01:28:02,900 --> 01:28:09,860 essentially, if you order them and rank them, et cetera. And then 873 01:28:09,860 --> 01:28:14,300 this is what they actually look like in practice. 874 01:28:14,300 --> 01:28:18,380 So, on the left, 875 01:28:18,380 --> 01:28:22,910 these are the variables that define a given category. 876 01:28:22,910 --> 01:28:30,990 So the top category is females of Hispanic origin, between the ages of forty-four and fifty-four, 877 01:28:30,990 --> 01:28:40,590 who are college graduates, who earn between zero and fifty thousand dollars, and who live in the state of Florida. 878 01:28:40,590 --> 01:28:44,940 And there are two hundred and forty-one such people in the microdata sample. 879 01:28:44,940 --> 01:28:52,980 And if you look at the very bottom, you have females of another race category, 880 01:28:52,980 --> 01:28:57,780 college graduates, who earn the same amount, and there's only one of them in the whole sample. 881 01:28:57,780 --> 01:29:05,220 So these counts, these cell counts, are what you're going to use to stratify your predictions from the multilevel regression model. 882 01:29:05,220 --> 01:29:10,700 Does that make sense? And we're going to look at how exactly this poststratification happens. 883 01:29:10,700 --> 01:29:16,640 Before we do that, I want you to have a look at how unrepresentative the MTurk sample actually is. 884 01:29:16,640 --> 01:29:19,700 And if you look at this, on the x-axis 885 01:29:19,700 --> 01:29:29,900 you have the population proportions and on the y-axis you have the sample proportions, and you can see that our sample is quite far off. 886 01:29:29,900 --> 01:29:36,870 So our... sorry, 887 01:29:36,870 --> 01:29:46,280 I may have swapped the labels here; hold on, are they on the same scale? No... but it doesn't matter so much. 888 01:29:46,280 --> 01:29:51,890 Well, in any case, look: they're different, that's what matters. The age distribution is different, 889 01:29:51,890 --> 01:29:55,430 the presidential vote is different. Yeah. I think... sorry, guys.
890 01:29:55,430 --> 01:30:10,110 I think I swapped the labels, so... In reality, our sample over-represented Hillary Clinton voters, so our sample had more of them. 891 01:30:10,110 --> 01:30:17,990 We have more... yes, that's right. Do we have more young people here? 892 01:30:17,990 --> 01:30:23,090 In any case, that doesn't matter too much, I'm a bit fried, but the important thing is that the sample and the population are different. 893 01:30:23,090 --> 01:30:28,550 That's the main takeaway here. If they were not different, 894 01:30:28,550 --> 01:30:33,440 the points in these plots would lie exactly on the diagonal. 895 01:30:33,440 --> 01:30:39,500 Yes? [Audience] The issue 896 01:30:39,500 --> 01:30:45,680 sort of remains, though, that there are other selection effects that are not based on the stratification variables. 897 01:30:45,680 --> 01:30:47,940 Yes. [Audience] And normally you would say, OK, 898 01:30:47,940 --> 01:30:57,020 we do a random sample after stratification, and therefore anything else that's left, which might affect whatever we're interested in, is randomly distributed. 899 01:30:57,020 --> 01:31:00,440 Yes. [Audience] Which, obviously, this doesn't help with or doesn't do anything about. 900 01:31:00,440 --> 01:31:04,360 So I'm just wondering why the stratification attempt is made, 901 01:31:04,360 --> 01:31:10,720 when the reason stratification normally works is that within strata you have a random sample, 902 01:31:10,720 --> 01:31:17,080 so anything that we missed would then presumably be randomly distributed, right? 903 01:31:17,080 --> 01:31:17,590 Yeah. 904 01:31:17,590 --> 01:31:28,640 [Audience] So, therefore, it seems a bit difficult to make the step in between, right, rather than just going straight for the prediction, for instance. 905 01:31:28,640 --> 01:31:35,660 The inference is obviously invalid, right, because... you know. 906 01:31:35,660 --> 01:31:40,930 Yeah. Mm hmm. In this case? Well, but it works. 907 01:31:40,930 --> 01:31:46,840 So why... OK, I think I get your point. I think it's a matter of how much error you're willing to tolerate. 908 01:31:46,840 --> 01:31:55,270 So, like, if you think that your poststratified estimate is going to be of about the same error as a random sample... 909 01:31:55,270 --> 01:31:57,130 So there are two options, right? 910 01:31:57,130 --> 01:32:02,440 You can either do a random sample and just take the point estimate of the random sample, and that's it, 911 01:32:02,440 --> 01:32:07,840 or you can do a non-random sample, stratify, and take the poststratified point estimate. 912 01:32:07,840 --> 01:32:17,680 Both are going to be wrong. But if you can tolerate the level of error of the stratified non-random sample, you can do it a lot more cheaply. 913 01:32:17,680 --> 01:32:26,170 And so, for the same level of error, you need a lot less money to conduct the non-representative sample. [Audience] There's one follow-up on that. 914 01:32:26,170 --> 01:32:34,260 Yeah. Let's say your non-random sample comes from Trump rally supporters: you go there, 915 01:32:34,260 --> 01:32:38,720 and, well, yeah, at some point you find a white male, a female, 916 01:32:38,720 --> 01:32:42,360 yeah, fine, an immigrant, et cetera. 917 01:32:42,360 --> 01:32:47,370 Yeah, you can weight it however you want, but you shouldn't sample on the dependent variable. 918 01:32:47,370 --> 01:32:53,220 You should never sample
anything which is associated with the dependent variable.
919 01:32:53,220 --> 01:32:57,960 All right. Well, yeah, in principle, yes. But in practice,
920 01:32:57,960 --> 01:33:01,350 the residual selection effects are quite small. And the reason why —
921 01:33:01,350 --> 01:33:08,550 I mean the residual selection effects end up being quite small, and you know this because these
922 01:33:08,550 --> 01:33:12,450 post-stratification mechanisms do work.
923 01:33:12,450 --> 01:33:17,250 But you're right: in principle there is an issue with residual selection effects.
924 01:33:17,250 --> 01:33:23,350 There is an issue, yes. Yeah. And it's not just a matter of principle: if we go up,
925 01:33:23,350 --> 01:33:27,030 yeah, if we go up to this table, right,
926 01:33:27,030 --> 01:33:34,210 you find that in the last row there's one person, so we have to accept that effectively no selection effect can be estimated for that group.
927 01:33:34,210 --> 01:33:38,860 But this is a sample of the population, not of the MTurk respondents.
928 01:33:38,860 --> 01:33:43,030 All right. Yeah. So if in the end there was only one person,
929 01:33:43,030 --> 01:33:47,170 right, it's a problem. But what happens there is that shrinkage comes in.
930 01:33:47,170 --> 01:33:54,370 So if you have only one white person in your sample but you have one thousand five hundred non-white people,
931 01:33:54,370 --> 01:33:58,630 then the effect for the white person will be shrunk towards that of the others.
932 01:33:58,630 --> 01:34:02,320 So it's still a problem, because maybe there is a strong "white" effect,
933 01:34:02,320 --> 01:34:06,850 but it's a lot reduced, because the effect is shrunk — the white
934 01:34:06,850 --> 01:34:10,660 effect is estimated with very little precision, because there's only one person,
935 01:34:10,660 --> 01:34:18,640 right? But what if the selection effect operates on, let's say, the swing states? OK.
936 01:34:18,640 --> 01:34:23,850 So, for example, yes, if you only have one person from West Virginia, you're stuffed in that case,
937 01:34:23,850 --> 01:34:30,190 right, because the selection procedure within that state is not random.
938 01:34:30,190 --> 01:34:35,500 That's right. So if you were trying to do an area-level prediction at the state
939 01:34:35,500 --> 01:34:41,140 level, and you were lacking the individuals who make that state different from the others —
940 01:34:41,140 --> 01:34:46,090 individuals who are particular to that state — and you don't have them in your sample,
941 01:34:46,090 --> 01:34:50,680 you're not going to make good predictions. In this case it doesn't matter, because you're not making that kind of prediction.
942 01:34:50,680 --> 01:34:54,190 So for the District of Columbia, within the South region, within...? No, no, no.
943 01:34:54,190 --> 01:34:58,720 Right, exactly. But yes, exactly. So that's a key thing here.
944 01:34:58,720 --> 01:35:03,610 Post-stratification means you are doing weighted averaging,
945 01:35:03,610 --> 01:35:11,170 right? And when you average forecasts — well, there is a rule that if the forecasts are capturing different sources of noise,
946 01:35:11,170 --> 01:35:16,120 you are always going to get a better forecast in the end. And the weights improve that dramatically.
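To make the weighted-averaging point concrete, here is a minimal numerical sketch in Python; the cell counts and predicted shares are invented for illustration and this is not code from the workshop:

    # Post-stratified estimates are weighted averages of cell-level predictions,
    # weighted by census cell counts, so badly predicted cells with tiny counts
    # barely move the result. All numbers below are made up.
    big_cells = [(190_000, 0.50)] * 10   # (census count, predicted share): well-predicted cells
    small_cells = [(10_000, 0.20)] * 10  # badly mispredicted cells (true share also ~0.50)

    def poststratify(cells):
        """Census-count-weighted average of the cell-level predictions."""
        total = sum(n for n, _ in cells)
        return sum(n * p for n, p in cells) / total

    print(poststratify(big_cells + small_cells))  # ~0.485: only ~1.5 points off,
    # even though the small cells are 30 points off, because they carry ~5% of the weight.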
947 01:35:16,120 --> 01:35:25,840 So the idea is that even if you're averaging over ten thousand categories for a given state and a few of those categories are somewhat dodgily predicted,
948 01:35:25,840 --> 01:35:29,320 as long as the important top one thousand are properly predicted,
949 01:35:29,320 --> 01:35:33,700 you're actually going to get a pretty good estimate for that state, even though some of the categories were mispredicted.
950 01:35:33,700 --> 01:35:42,590 Does that make sense? Yes. I guess I'm not understanding how this table is ordered.
951 01:35:42,590 --> 01:35:49,390 They're kind of sorted, but then why does it start at 240?
952 01:35:49,390 --> 01:35:57,190 Oh, the ordering is just — I took the post-stratification frame as it was and sorted it.
953 01:35:57,190 --> 01:36:00,790 Oh, sorry, you mean in reference to the previous chart — it's not the same ordering?
954 01:36:00,790 --> 01:36:05,680 No, sorry. You're actually right: I should have made this slide the same as the other one.
955 01:36:05,680 --> 01:36:14,740 You're right. Sorry, apologies. Yes, Chris.
956 01:36:14,740 --> 01:36:25,210 Thanks. This method seems to be quite reliant on having pretty granular data like this for many millions of respondents.
957 01:36:25,210 --> 01:36:31,730 Yup. I mean, I work on the Middle East and North Africa, and this just doesn't exist.
958 01:36:31,730 --> 01:36:34,310 Yes. You can't get the individual-level data.
959 01:36:34,310 --> 01:36:39,860 So are there ways around this, or is this method just not appropriate for that kind of setting?
960 01:36:39,860 --> 01:36:45,650 So, I once briefly spoke to Jasmine about doing this in Afghanistan,
961 01:36:45,650 --> 01:36:53,750 and she said that the last census in Afghanistan — the last big survey, she said —
962 01:36:53,750 --> 01:36:58,430 had been collected in 2007. So the data are completely outdated;
963 01:36:58,430 --> 01:37:04,100 God knows what has happened in the specific provinces since then.
964 01:37:04,100 --> 01:37:13,970 You'd have to do extra work there: actually predict the cell counts and then apply the post-stratification to the predicted cell counts.
965 01:37:13,970 --> 01:37:22,670 Yeah, it's unlikely to work, I think, in those scenarios. Even if you aggregate a lot of surveys?
966 01:37:22,670 --> 01:37:27,560 Yeah... well, for sure you could try.
967 01:37:27,560 --> 01:37:33,590 So, but again — yeah, the follow-up question was: if you aggregate together a lot of surveys,
968 01:37:33,590 --> 01:37:38,210 could this potentially replace the missing census data?
969 01:37:38,210 --> 01:37:41,270 The answer is yes, to the extent that the surveys are reliable.
970 01:37:41,270 --> 01:37:45,590 So, for example, when I worked on the Indian election,
971 01:37:45,590 --> 01:37:53,210 we used the India Human Development Survey, because the census data was limited in its crosstabs.
972 01:37:53,210 --> 01:37:59,000 So again, another big problem is that you usually have to use microdata, because there are loads of crosstabs that you're interested in, right?
973 01:37:59,000 --> 01:38:02,930 And often published census crosstabs are limited to two or three interactions.
974 01:38:02,930 --> 01:38:10,070 So yes, surveys can be a solution, but usually they have to be augmented via some modelling.
975 01:38:10,070 --> 01:38:14,420 Yeah, that's what I would say. Yeah.
976 01:38:14,420 --> 01:38:20,850 So what do you do when, for example, let's say,
977 01:38:20,850 --> 01:38:29,610 you have a survey and the combination of these stratification variables leads to, as you were saying,
978 01:38:29,610 --> 01:38:31,920 something like two thousand cells being empty,
979 01:38:31,920 --> 01:38:38,340 in terms of calculating the true proportions from the census to be used with the much smaller dataset?
980 01:38:38,340 --> 01:38:42,250 What do you do — do you just exclude them?
981 01:38:42,250 --> 01:38:49,020 So, if that situation arises, there are two things that can happen.
982 01:38:49,020 --> 01:39:01,560 One is that those cells lie down here, in the tail — in which case, as long as those exact cells don't make up the bulk of a specific state that you're interested in, you're fine.
983 01:39:01,560 --> 01:39:02,880 And usually this isn't the case, by the way:
984 01:39:02,880 --> 01:39:08,550 usually these smaller categories are spread out more or less at random across states, so it doesn't really matter.
985 01:39:08,550 --> 01:39:18,180 So in the positive scenario, the cells that you don't have in your sample are down here in the tail, and you ignore them;
986 01:39:18,180 --> 01:39:24,300 it's fine to ignore them. In the negative scenario, they're up here among the big cells, and then you're absolutely screwed.
987 01:39:24,300 --> 01:39:31,330 There's no way to get around that. Yeah. OK.
988 01:39:31,330 --> 01:39:41,920 And so — yes, the sample is very unrepresentative in vote choice. Just going by memory, because I don't know if I can trust these plots,
989 01:39:41,920 --> 01:39:53,110 it over-represents Hillary Clinton quite dramatically, under-represents Donald Trump by a fair bit,
990 01:39:53,110 --> 01:40:02,220 and over-represents third parties. So, after we have estimated these cell-level quantities —
991 01:40:02,220 --> 01:40:09,390 the joint distribution of turnout and vote choice, and the conditional distribution of turnout —
992 01:40:09,390 --> 01:40:18,690 we can then estimate the area-level proportion of people who will vote for a given party and turn out, with this formula here.
993 01:40:18,690 --> 01:40:23,150 So the first quantity is just the number of people in a specific cell,
994 01:40:23,150 --> 01:40:32,940 the second is the joint probability of voting for the party and turning out in that cell, and then you sum over the cells up to S.
995 01:40:32,940 --> 01:40:34,470 Again, I should have paid a little more attention to the notation here,
996 01:40:34,470 --> 01:40:45,510 but you sum over the categories for the specific state and then you recover the area-level estimate for each party.
997 01:40:45,510 --> 01:40:49,620 And you can do this not just at the state level. So look here:
998 01:40:49,620 --> 01:40:54,480 this is conditional on the voter characteristics and the specific party —
999 01:40:54,480 --> 01:40:58,740 right — the specific state, sorry.
1000 01:40:58,740 --> 01:41:03,510 Whereas here we don't bother about the state, because we are interested in aggregating at the national level, and so
1001 01:41:03,510 --> 01:41:10,440 we can just sum over the groups by their weights and obtain the national-level results for each voter category.
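The formula being described is on the slide and not reproduced in the transcript; a plausible rendering, writing N_j for the census count of cell j and \hat{P}_j for the model's predicted joint probability of turning out and voting for the party in cell j, is

    \hat{\theta}_s = \frac{\sum_{j \in s} N_j \, \hat{P}_j}{\sum_{j \in s} N_j},

where the sum runs over the cells j that make up state s. The national-level estimate for a given voter category has the same form, with the sum taken over all cells belonging to that category rather than to a state.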
1002 01:41:10,440 --> 01:41:15,060 Does that make sense for everybody? Yeah. This is just a simple post-stratification weighted average,
1003 01:41:15,060 --> 01:41:18,780 that's all. And these are the results.
1004 01:41:18,780 --> 01:41:29,910 So at the bottom you see the raw Amazon Mechanical Turk estimates, and at the top you see the predicted estimates by state.
1005 01:41:29,910 --> 01:41:37,200 There are a few things to note. First of all, the correlations improve dramatically, and the mean absolute error is shaved down considerably:
1006 01:41:37,200 --> 01:41:46,650 about ten points off for Trump, five points for Hillary, five points for the third parties, and the turnout error is cut by about thirty-two points.
1007 01:41:46,650 --> 01:41:54,720 So we massively improve on the raw sample, and there is clearly this phenomenon called attenuation bias happening.
1008 01:41:54,720 --> 01:42:01,320 Even though the correlation is really high, the variance in the predicted state-level vote shares is very small,
1009 01:42:01,320 --> 01:42:08,280 which means that what you get is the correct ordering of the states, but very much shrunk towards the global mean.
1010 01:42:08,280 --> 01:42:18,420 And that's an effect of the shrinkage, of the general tendency to shrink effects towards the mean.
1011 01:42:18,420 --> 01:42:23,660 And what else should you notice? No, I think that's it.
1012 01:42:23,660 --> 01:42:29,150 So overall I think we can be pretty happy with the results,
1013 01:42:29,150 --> 01:42:36,950 given that at the state level, at least, we have shaved off a lot of error from the raw sample estimates.
1014 01:42:36,950 --> 01:42:46,840 Now the really cool stuff comes in when you look at the national level. So if we look at these distributions for Republicans,
1015 01:42:46,840 --> 01:42:53,610 Democrats and third parties: the solid lines represent the predictions,
1016 01:42:53,610 --> 01:42:57,580 the thick dotted lines represent the actual 2016 result,
1017 01:42:57,580 --> 01:43:05,680 and the small dots represent the raw national-level estimate from the Amazon Mechanical Turk sample.
1018 01:43:05,680 --> 01:43:12,730 And if we look at the errors: our error on Trump's vote share is less than a single percentage point at the national level,
1019 01:43:12,730 --> 01:43:16,600 about zero point five. Our error on Clinton is quite large:
1020 01:43:16,600 --> 01:43:21,730 it's about four percentage points, so we overestimate Clinton by four percentage points.
1021 01:43:21,730 --> 01:43:25,660 Our error on the others is that we underestimate them by about four percentage points.
1022 01:43:25,660 --> 01:43:30,700 So what is happening there is that we must have missed some selection effect that
1023 01:43:30,700 --> 01:43:37,360 would have told us about the substitution between Clinton and the third parties. So we have to go back and think about what that was all about.
1024 01:43:37,360 --> 01:43:42,010 But — sorry, the question was whether this is a state-level result.
1025 01:43:42,010 --> 01:43:48,160 No, this model is aggregating at the national level. There was also the question: does this model use the state-level results?
1026 01:43:48,160 --> 01:43:58,780 The answer is no. It just aggregates, using state as another group, and aggregating over the cells as you would otherwise.
1027 01:43:58,780 --> 01:44:07,150 Yeah.
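As a concrete illustration of the weighted averaging just described, and of how the posterior simulations carry uncertainty through it, here is a minimal Python sketch; the cell counts and the toy posterior draws are invented and are not taken from the workshop model:

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy post-stratification frame for one state: census count per cell.
    counts = np.array([120_000, 310_000, 90_000, 60_000])

    # Pretend posterior simulations of each cell's P(turn out and vote for the party):
    # 1,000 draws x 4 cells. In practice these come from the fitted multilevel model.
    draws = rng.beta(a=[21, 38, 44, 29], b=[79, 62, 56, 71], size=(1000, 4))

    # Post-stratify every draw: census-weighted average of the cell probabilities.
    state_estimates = draws @ counts / counts.sum()

    print(state_estimates.mean())                   # point estimate for the state
    print(np.percentile(state_estimates, [5, 95]))  # 90% interval, no variance formulas needed

Repeating the weighted average once per posterior draw is what gives the state-level intervals, and later the election-level probabilities, essentially for free.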
I mean, I'm pretty chuffed that the Trump vote is within about half a percentage point of the truth. The turnout —
1028 01:44:07,150 --> 01:44:15,130 again, we overestimate turnout, still, even though we had applied that uniform correction of seventeen percentage points.
1029 01:44:15,130 --> 01:44:22,180 But that's not news. People lie a lot about their turnout, and the survey data about turnout that we get from MTurk or elsewhere
1030 01:44:22,180 --> 01:44:23,920 is pretty crap.
1031 01:44:23,920 --> 01:44:33,070 So in our one-thousand-five-hundred-person sample, about 280 people said they didn't turn out, which makes for a very small proportion of people,
1032 01:44:33,070 --> 01:44:39,940 and therefore we probably missed a lot of cases: a lot of groups were predicted to vote at higher levels
1033 01:44:39,940 --> 01:44:43,600 than they actually do. And on the right there's just an Electoral College projection.
1034 01:44:43,600 --> 01:44:46,990 Now, there has been a lot going on with the Electoral College lately —
1035 01:44:46,990 --> 01:44:50,830 a lot of states have signed this pledge that they will give
1036 01:44:50,830 --> 01:44:55,390 their Electoral College votes to the person who wins the national-level result —
1037 01:44:55,390 --> 01:45:02,140 but I ignored that. I just assumed that each state is going to give its Electoral College votes as it usually does.
1038 01:45:02,140 --> 01:45:07,400 And so with this model we would have assigned a probability of Trump winning of about twenty-three percent, which is not great,
1039 01:45:07,400 --> 01:45:12,640 but it's also not terrible, considering that Hillary's vote share was initially overestimated —
1040 01:45:12,640 --> 01:45:19,000 so Trump's vote share was initially underestimated by ten percentage points in the Amazon Mechanical Turk sample.
1041 01:45:19,000 --> 01:45:26,060 Yes? Do you know how that compares to the predictions other people made?
1042 01:45:26,060 --> 01:45:32,620 Yes. So the question was: do I know how this prediction —
1043 01:45:32,620 --> 01:45:35,950 well, this is not a prediction, by the way, so it's kind of an unfair comparison —
1044 01:45:35,950 --> 01:45:40,360 how this estimate compares to some of the pre-election predictions?
1045 01:45:40,360 --> 01:45:46,210 And the answer is: a little better, compared to all of them apart from FiveThirtyEight. So FiveThirtyEight, if I remember correctly,
1046 01:45:46,210 --> 01:45:52,690 had Trump winning with probability 0.27; Wang and the Princeton Election Consortium
1047 01:45:52,690 --> 01:45:57,940 had about a zero percent chance of Trump winning; Drew Linzer had about a one percent chance of Trump winning.
1048 01:45:57,940 --> 01:46:04,840 So, I mean, you can't really compare, because this is post hoc and theirs were pre-election predictions from survey data.
1049 01:46:04,840 --> 01:46:11,080 But given that the survey data has the same chance of being bad this time around as it did last time around,
1050 01:46:11,080 --> 01:46:16,900 this should be encouraging, because it suggests that if you do enough modelling, you can get better results than your average predictor.
1051 01:46:16,900 --> 01:46:22,390 Yeah. Yeah. Hold on — the mike, the mike, please.
1052 01:46:22,390 --> 01:46:31,510 Sorry. So, just to go back one more time to the virtue of the model: if you give me the data of the actual results,
1053 01:46:31,510 --> 01:46:35,110 I can make a model which probably predicts better, right?
1054 01:46:35,110 --> 01:46:38,370 No, but the results are not coming in at the state level, right?
1055 01:46:38,370 --> 01:46:45,620 They're coming in at the individual level. But if you give me all the same data, I could probably do pretty well, right?
1056 01:46:45,620 --> 01:46:53,480 You looked at the past-vote effects, because we asked them who they voted for — yes — and you used the state-level results.
1057 01:46:53,480 --> 01:47:02,610 Yeah, in the individual-level model. Yeah. So it kind of seems like maybe a bit of a workaround. As I said to you before,
1058 01:47:02,610 --> 01:47:06,520 instead of the 2016 results you could have used the Republicans' 2012 results.
1059 01:47:06,520 --> 01:47:11,850 Yeah, but there were a number of conceptual reasons why I still think that's wrong, because, again,
1060 01:47:11,850 --> 01:47:15,300 we fielded the survey after the election, not before it.
1061 01:47:15,300 --> 01:47:18,650 So this is not a prediction exercise; it's a replication exercise.
1062 01:47:18,650 --> 01:47:24,090 But in principle, if you were to do a prediction exercise, as you will do in the workshop,
1063 01:47:24,090 --> 01:47:31,740 you are going to be using the 2016 results for predicting Biden's vote, or whoever else's.
1064 01:47:31,740 --> 01:47:38,220 But I agree that this is not a prediction effort; it's a re-estimation, a replication effort, for sure.
1065 01:47:38,220 --> 01:47:42,820 OK, yeah. But — and I want to make sure this point comes across —
1066 01:47:42,820 --> 01:47:48,630 understand that at this level, the post-stratification level, the state-level results from last time are not playing any role.
1067 01:47:48,630 --> 01:47:54,090 They played a role in the individual-level model, insofar as they were correlated with individual-level responses.
1068 01:47:54,090 --> 01:47:59,790 And actually, the 2016 results are probably more correlated with
1069 01:47:59,790 --> 01:48:06,270 individual-level responses today about 2016 behaviour than the 2012 results would be.
1070 01:48:06,270 --> 01:48:10,110 But it wouldn't have been that far off, because state-level behaviour is actually
1071 01:48:10,110 --> 01:48:14,130 pretty stable across time, and the correlation across states is pretty stable.
1072 01:48:14,130 --> 01:48:18,240 So obviously we should have rerun it using 2012 and seen what would have happened,
1073 01:48:18,240 --> 01:48:24,220 but you shouldn't think that it would have been that far off, because the state-level results weren't that different from 2012 to 2016.
1074 01:48:24,220 --> 01:48:26,940 But I'm just saying, as a hypothetical model:
1075 01:48:26,940 --> 01:48:34,110 I could use a very simple set of data — namely, I just take one person from every state and assign that person
1076 01:48:34,110 --> 01:48:38,640 the state-level result, 0.52
1077 01:48:38,640 --> 01:48:44,370 if it was fifty-two percent, and then I just blow it up by the number of people living in the state.
1078 01:48:44,370 --> 01:48:48,760 And then I get perfect results, right? Yeah, but that's not what we are doing right now.
1079 01:48:48,760 --> 01:48:53,340 I mean, you use it at the individual level and then you blow it up, right?
1080 01:48:53,340 --> 01:48:58,560 So in principle it's the same thing — each state as one huge individual.
1081 01:48:58,560 --> 01:49:02,160 So you're saying — well, yes: if you assign the state-level result perfectly,
1082 01:49:02,160 --> 01:49:04,950 I agree with you. But we're not assigning the state-level result perfectly;
1083 01:49:04,950 --> 01:49:09,060 we're assigning the correlation between the state-level result and the individual response.
1084 01:49:09,060 --> 01:49:15,300 But then again, from the virtue-of-the-model perspective — also from an application perspective — you could probably do better,
1085 01:49:15,300 --> 01:49:19,730 more easily, right? Yeah. Yes. Yes.
1086 01:49:19,730 --> 01:49:27,330 Yes. Of course you could. Of course using the 2016 results leads to a better estimate than the 2012 results, one hundred percent.
1087 01:49:27,330 --> 01:49:32,880 I don't dismiss that. The point is that we're trying to show that post-stratification matters here.
1088 01:49:32,880 --> 01:49:38,740 Yeah. Any other questions?
1089 01:49:38,740 --> 01:49:44,200 OK. And so —
1090 01:49:44,200 --> 01:49:51,140 we're almost there, guys. There are a few considerations here.
1091 01:49:51,140 --> 01:49:59,390 One is that we would have done a lot better if we had improved the sample, and we certainly could have increased the sample size.
1092 01:49:59,390 --> 01:50:05,360 Sample size is actually a very contentious issue at the moment in this literature.
1093 01:50:05,360 --> 01:50:11,810 One other name for this approach, by the way — prediction and post-stratification — is MRP, multilevel regression and post-stratification.
1094 01:50:11,810 --> 01:50:20,360 And some people have suggested that, for estimating area-level results from nationally representative surveys,
1095 01:50:20,360 --> 01:50:25,850 MRP does a lot better with higher sample sizes, so error reduces dramatically
1096 01:50:25,850 --> 01:50:31,700 if you go from a thousand people in your sample to ten thousand people. For Amazon Mechanical Turk specifically,
1097 01:50:31,700 --> 01:50:41,660 so for non-representative samples of this kind, some people — Goel et al. — have suggested otherwise; the name of the paper is something like "non-representative surveys
1098 01:50:41,660 --> 01:50:46,040 are cheap and reliable", or "fast, cheap and mostly accurate", something like that.
1099 01:50:46,040 --> 01:50:52,700 They suggest that there are actually decreasing returns with these kinds of samples, because, as they argue,
1100 01:50:52,700 --> 01:50:59,180 if you look at the total survey error, other sources of error start to crowd out the sampling error.
1101 01:50:59,180 --> 01:51:03,740 And so even if you had ten thousand people, you wouldn't do much better than we did here.
1102 01:51:03,740 --> 01:51:15,770 But the point is, people don't know. And as a potential project of yours, if you'd like to study this, you could come talk to me.
1103 01:51:15,770 --> 01:51:20,720 Sample size for non-representative surveys in a post-stratification context is very important,
1104 01:51:20,720 --> 01:51:27,200 and people don't really know what the [INAUDIBLE] is going on. So that's one of the things. And yeah, that's it.
1105 01:51:27,200 --> 01:51:35,570 So we can take a twenty-minute — or ten, fifteen, twenty minutes, you guys decide — break, and then we'll come to the workshop.
1106 01:51:35,570 --> 01:51:42,970 Yes? The mike, the mike.
1107 01:51:42,970 --> 01:51:46,930 OK, maybe a dummy question after the end of all of this, but there's no other way to ask it.
1108 01:51:46,930 --> 01:51:51,460 So if I understand correctly, this whole exercise is to estimate a proportion.
1109 01:51:51,460 --> 01:51:57,670 That's correct. Well, not just a proportion, by the way. You can also do it with, say, height, weight —
1110 01:51:57,670 --> 01:52:02,620 you can do BMI —
you can do it with any quantity that you want to be representative at the national level
1111 01:52:02,620 --> 01:52:11,740 but for which you only have non-representative estimates. So it is to estimate a single parameter.
1112 01:52:11,740 --> 01:52:21,190 Yeah. Could you also use this same approach, with a non-representative sample, if you wanted to estimate relationships between variables?
1113 01:52:21,190 --> 01:52:26,870 Maybe this is a silly question, but you know what I mean. So if my aim is not to estimate
1114 01:52:26,870 --> 01:52:35,020 the share of people who vote, but, let's say, I want to know whether more educated people are more likely to have children outside of marriage,
1115 01:52:35,020 --> 01:52:40,600 and then I do such a quick and cheap, non-representative survey —
1116 01:52:40,600 --> 01:52:46,080 would I then also be able to infer
1117 01:52:46,080 --> 01:52:54,780 that kind of relationship? I think that's an intriguing question: can we apply post-stratification to coefficients, let's say?
1118 01:52:54,780 --> 01:52:58,260 I need to think about that. Off the top of my head,
1119 01:52:58,260 --> 01:53:02,970 I don't see — maybe you need to do some extra work, but I don't see why you shouldn't.
1120 01:53:02,970 --> 01:53:09,220 Oh, sorry, I don't see why you couldn't. Maybe there are all sorts of reasons why you shouldn't. But yeah, I think we can.
1121 01:53:09,220 --> 01:53:12,510 But this is not something that I have seen in the literature yet,
1122 01:53:12,510 --> 01:53:21,000 partly because I come at it from a perspective where at the moment it's being used for, well, political science —
1123 01:53:21,000 --> 01:53:27,300 predicting who votes — but also disease mapping, like predicting the percentage of influenza cases in different states,
1124 01:53:27,300 --> 01:53:32,040 et cetera. So at the moment it has mostly been proportions, as you say.
1125 01:53:32,040 --> 01:53:39,180 Okay, thanks. Although you do have some prior knowledge — as you well know, rich people are more likely to...
1126 01:53:39,180 --> 01:53:45,430 Yes, things like that. So —
1127 01:53:45,430 --> 01:53:50,200 we can have a conversation about it afterwards. Thank you.
1128 01:53:50,200 --> 01:53:54,810 Yeah, I think it's a very good question to think about.
1129 01:53:54,810 --> 01:54:03,720 The existing literature on prediction and post-stratification has really been about forecasting elections or things like,
1130 01:54:03,720 --> 01:54:06,660 you know, just trying to replicate survey questions.
1131 01:54:06,660 --> 01:54:16,530 I mean, the general concept here is to go from a messy, dirty sample to a more reliable sample by essentially weighting, right?
1132 01:54:16,530 --> 01:54:21,600 This is just a weighting exercise. So it could actually be an interesting question to try.
1133 01:54:21,600 --> 01:54:29,760 Maybe it's a group project or something else to think about: maybe doing a survey with questions that you might be interested in, right?
1134 01:54:29,760 --> 01:54:39,450 For example, taking a sociodemographic survey, doing it on a non-representative sample, using a platform such as Mechanical Turk,
1135 01:54:39,450 --> 01:54:43,800 and then finding a relationship that you're interested in —
1136 01:54:43,800 --> 01:54:50,220 essentially some correlation that is well documented — and then seeing to what
1137 01:54:50,220 --> 01:54:55,290 extent that holds in the non-representative sample, and then comparing it with something that you trust more,
1138 01:54:55,290 --> 01:55:01,710 which is, of course, much more expensive. So: taking a correlation from something like the British Household Panel Study or Understanding Society,
1139 01:55:01,710 --> 01:55:09,930 estimating it on a non-representative sample, and seeing whether doing the reweighting gets you within the bounds of that.
1140 01:55:09,930 --> 01:55:18,390 Because, of course, the big sell here is that you have a thousand-person survey which is being done for a thousand dollars, which is very cheap.
1141 01:55:18,390 --> 01:55:21,870 And so it's potentially scalable, if it works like that.
1142 01:55:21,870 --> 01:55:32,410 So I think it's an interesting question that could even be worth exploring as a project.
1143 01:55:32,410 --> 01:55:47,720 So, yeah — I was recently looking at the survey literature on calibration and raking algorithms, but I don't often see it
1144 01:55:47,720 --> 01:55:52,400 linked with Bayesian statistics. Tell me a bit more:
1145 01:55:52,400 --> 01:56:00,310 the survey-calibration literature doesn't seem to be using the Bayesian approach — so what's the link there?
1146 01:56:00,310 --> 01:56:08,500 So — the MRP studies that... let me be precise.
1147 01:56:08,500 --> 01:56:11,680 You don't need Bayesian methods to do this sort of stuff. You can do it with frequentist methods.
1148 01:56:11,680 --> 01:56:19,300 There's absolutely no problem: instead of running the model through JAGS, feel free to use a frequentist multilevel model, something like lmer, and fit the model that way.
1149 01:56:19,300 --> 01:56:26,230 That's number one. The advantage of the Bayesian approach is that, because you obtain simulations from your posterior,
1150 01:56:26,230 --> 01:56:35,110 you don't need to rely on sampling assumptions in order to calculate a specific predictive distribution.
1151 01:56:35,110 --> 01:56:39,100 You get the predictive distribution directly from the model. You don't need to do any extra work.
1152 01:56:39,100 --> 01:56:43,870 And so — maybe it's me being lazy, who knows —
1153 01:56:43,870 --> 01:56:47,320 but for me, having these simulations really helps conceptualise the problem.
1154 01:56:47,320 --> 01:56:55,130 And also — yeah, how would that be different? Because it's not completely separate; it's quite linked.
1155 01:56:55,130 --> 01:57:01,720 Mm-hmm. We use, for instance, bootstrapping in France to derive calibration weights.
1156 01:57:01,720 --> 01:57:07,410 Yeah — how would that be different from a Bayesian simulation?
1157 01:57:07,410 --> 01:57:13,010 So that's a method of deriving the weights. Yeah, but also —
1158 01:57:13,010 --> 01:57:17,510 and the uncertainty. Yeah, and the uncertainty around the weights — so that's a fair point.
1159 01:57:17,510 --> 01:57:24,200 So, I'm not overly familiar with that specific method, but from my general understanding of bootstrapping and that kind of thing,
1160 01:57:24,200 --> 01:57:29,660 you would obtain some measure of uncertainty.
1161 01:57:29,660 --> 01:57:39,710 I think that the assumptions that lie behind the bootstrapping are different from the ones that come with the Bayesian models that I showed you here.
1162 01:57:39,710 --> 01:57:43,160 What I am familiar with is the Bayesian side: the assumptions behind the Bayesian models are very simple.
1163 01:57:43,160 --> 01:57:46,310 You can derive them from the handful of basic probability axioms,
1164 01:57:46,310 --> 01:57:52,460 so you don't need to do any extra work to specify new assumptions beyond what a probability is.
1165 01:57:52,460 --> 01:57:57,260 And so that, to me, is a very intuitive thing. But that doesn't mean you can't do it the other way:
1166 01:57:57,260 --> 01:58:01,400 feel free to replicate these exercises using the bootstrap.
1167 01:58:01,400 --> 01:58:07,880 The bottom line is that both get you there. The important thing for me is that we get it into our minds that you have to have uncertainty estimates —
1168 01:58:07,880 --> 01:58:13,970 you have to have uncertainty estimates — and a lot of the applications often don't.
1169 01:58:13,970 --> 01:58:22,910 And it's a problem, because then you can't do calculations like the probability of Trump winning, like we saw here.
1170 01:58:22,910 --> 01:58:30,110 This 0.23 comes about because, out of a thousand simulations, Trump wins about two hundred and thirty times.
1171 01:58:30,110 --> 01:58:34,290 But if we didn't have uncertainty over the state-level estimates, you couldn't do that calculation.
1172 01:58:34,290 --> 01:58:40,040 So it's really important that you do have the uncertainty. But you're right: you can do it in other ways.
1173 01:58:40,040 --> 01:58:47,840 There are other ways of deriving weights and finding distributions around the weights.
1174 01:58:47,840 --> 01:58:54,570 If that's all, maybe we'll take a break. Yeah. Thank you.
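The probability-of-winning calculation mentioned just above (roughly 230 Trump wins out of 1,000 simulations) can be sketched as follows; the states, electoral-vote counts and posterior summaries are invented stand-ins, and in practice you would feed in the actual post-stratified posterior draws rather than normal approximations:

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy subset of states with electoral votes and invented posterior summaries
    # of the two-party Democratic vote share in each state.
    electoral_votes = {"FL": 29, "PA": 20, "WI": 10}
    dem_mean = {"FL": 0.49, "PA": 0.50, "WI": 0.51}
    dem_sd = {"FL": 0.02, "PA": 0.02, "WI": 0.02}

    n_sims = 1_000
    rep_ev = np.zeros(n_sims)
    for state, ev in electoral_votes.items():
        dem_share = rng.normal(dem_mean[state], dem_sd[state], n_sims)
        rep_ev += ev * (dem_share < 0.5)   # Republican carries the state in that simulation

    threshold = sum(electoral_votes.values()) / 2
    print("P(Republican win) =", (rep_ev > threshold).mean())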