So welcome everybody to this workshop on Bayesian prediction and poststratification. I'm Roberto Cerina, a sociology Ph.D. student at Nuffield College, in my third year, and today I'm going to be talking to you about this technique, which is going to allow you to do a number of things. So why would you want to use this technique at all? You have two goals. One is to do small area estimation, which means that if you have a sample that is representative at the national level, but you would like to find out something about the state, or the municipality, or some other small-area level, then this technique will allow you to extrapolate information from the national level and make reasonable inferences at the area level. Something else this technique does is allow you to make inference at the national level for non-representative samples, and even at the state and area level, but as we will see, there are a few caveats if you start trying to use it for the latter.

The title of the workshop is Bayesian prediction and poststratification, and the reason for that is that the Bayesian approach allows you to do simulations, posterior simulations, and this has big advantages in terms of computing confidence intervals at the area level, because you can simply simulate. If you're working with elections, for example, you can simulate multiple elections; if you're working with disease, you can simulate multiple rounds of the disease. And this gives you confidence intervals without actually having to go through the trouble of calculating posterior variances and so on. You can just use simulations. That's the advantage of it.

And so there are three parts to the prediction and poststratification framework. One is sampling, which we will actually talk about at the end. Another is the prediction part, and the last is the poststratification part. This workshop focuses mainly on the prediction and poststratification parts, and the prediction in particular we're going to be doing, as I said, in a Bayesian way. So how many of you are familiar with the Bayesian approach? I mean, have you used the Bayesian approach in the past? One, two, three? OK, so not many of you. So I think a good idea is for us to go through a Bayesian primer first. And how many of you are familiar with the concept of a Gibbs sampler? OK, so roughly the same people. This is very good, actually, because I planned for 30 to 40 minutes of a primer on Bayesian modelling, and then we're going to get into the meat of the prediction and poststratification.
And this primer on Bayesian modelling is actually going to be very useful to you for understanding the basics of machine learning, because machine learning, even though we don't have the computational power to do it in a purely Bayesian way at the moment, is ultimately a Bayesian endeavour in its conceptualisation. So we will see how this links to Charles's workshop later today, I think.

And so let's start with the Bayesian modelling primer. First of all, this formula here is Bayes' theorem. Bayes' theorem is a way to update your belief about the occurrence of an event after you observe some data. And there are three parts to Bayes' theorem. Theta is the quantity of interest: it can be an event, it can be a parameter, any parameter. So, for example, the probability of voting for a Republican in the population of the United States. And P of theta is our prior on that parameter. It's a probability, and it's an expression of uncertainty. So we may have a prior on theta. We may say, OK, well, roughly between forty-five and fifty-five percent of people are going to vote for a Republican candidate, so our prior is going to be some sort of uniform distribution between forty-five and fifty-five. That's a simple flat prior. And then we have the likelihood, which is this P of Y given theta. So the likelihood is the probability of occurrence of the data that you observed, Y, given the parameter that we have a prior for, theta. OK. And as we will see in a second, theta generates Y. This is where the Bayesian approach comes in. So theta, this global parameter that exists with uncertainty, broadly generates data, and then as we observe data, we learn things about theta. So say Y is the number of individuals in a survey who say they would vote for Trump, and in a survey of N equals 1000 we observe Y equals nine hundred. This suggests that our prior between forty-five and fifty-five percent is wrong, or at least that there is a conflict between the data and the prior, and we should take that into account.

OK. And so this graph here illustrates the Bayesian paradigm. There are two ways of looking at this graph: first through the solid lines and then through the dotted lines. So first, for the solid lines: the parameter theta generates the distribution, the likelihood, of Y, and out of this distribution of Y you have observed values y. So theta is a random quantity, Y is a random quantity, and y is an observed quantity. OK, so this is realised.
This has happened: the y have happened. A y would be, for example, the specific response in a survey that says "I vote Republican", whereas Y is the distribution of the ys across the population. So if you were to ask the whole population, without sampling, it is what the probability distribution would be for people to express that preference. And theta is the probability distribution of the hyperparameter, the parameter that generates the Ys. Does that kind of make sense to everybody, this kind of structure? So you have a generation of data based on parameters.

Yes? So is theta something like the parameters that you've defined to generate a distribution, which has uncertainty about what y is, and Y is the actual distribution that results? That's exactly correct. So theta is what we call a hyperparameter. If this were a normal distribution, it would be the mean, for example, and in the Bayesian paradigm the mean has inherent uncertainty. This is where the key difference between the Bayesian and the frequentist approach, which is probably what you've been taught in your statistics courses, comes in. The frequentist approach says that theta doesn't have any uncertainty: it's a parameter that exists and you have to measure it, period. Whereas in the Bayesian paradigm, we think that everything has inherent uncertainty, and therefore, as you observe data, you just change your distribution around theta; you never quite get to a precise theta. There's a paradox, which is that if you could observe all the possible observations to inform your prior on theta, then the frequentist and the Bayesian approaches would converge. So if you observed every single value out there, you would be able to measure theta with zero uncertainty, and you can imagine, if this were a normal distribution, it would collapse onto the mean. That's the idea. OK, any more questions on this?

And so this y-star is the predictive distribution. Before we observe any data, Y and y-star are going to be quite similar. Now here's the trick: suppose that you haven't had access to the first part of this data generating process. This conceptualisation, by the way, is called a DAG, a directed acyclic graph, and it just serves to show the information flow within a data generating process.
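In symbols, the pieces described so far, the prior, the likelihood, the posterior and the predictive distribution for a new observation y-star, take the standard form below; the notation is assumed to match the slide rather than copied from it.

```latex
% Bayes' theorem: the posterior is proportional to the likelihood times the prior
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)

% Posterior predictive distribution for a new, unobserved y^{*}
p(y^{*} \mid y) = \int p(y^{*} \mid \theta)\, p(\theta \mid y)\, d\theta
```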
And so, if you didn't observe this first part, but you only observed y, and you just hypothesise that a data generating process of this kind would have had to exist in order to create y, then you can use the information you gained from y to update your priors on theta and then generate new predictive distributions. That means that if you were to try to predict the vote of a new person, an n-plus-first person, that prediction would be informed by the new data that you've observed. OK, everybody clear on this? Perfect.

OK, so we've spoken about these priors and posteriors. Let me go back to Bayes' theorem. The way Bayes' theorem is set up is that you have this prior, you have this likelihood, and then this updated probability distribution on the hyperparameter is called the posterior. And so the relationship between the prior and the likelihood determines the posterior, and there are some regularities in that relationship. For example, there are sets of priors and likelihoods that are married, let's say, or, as we say, that are conjugate, which means that they have an analytical solution. I have an example here: a famous one is the beta-binomial. Is anybody familiar with the beta distribution? Roughly. Who is not familiar with the beta distribution? Very good, and please let me know when I'm not being clear. A beta distribution is simply a distribution defined between zero and one, based on two parameters, alpha and beta. It's very flexible; it can take almost any shape between zero and one. OK. So it's a perfect distribution to use as a prior on probabilities, for instance the probability of voting for Donald Trump in a given election. And so theta, our prior on it, can be beta-distributed with two parameters, alpha and beta. And then we observe a poll, and this would be the likelihood. So the response of respondent i, y_i, conditional on the hyperparameter theta on which we have a prior, is distributed as a Bernoulli. How many of you are familiar with the Bernoulli distribution? OK, good. And if that is the case, then you don't have to do any maths. You don't have to do any integration, you don't have to do anything. All you have to do is update your parameters, as shown here. I don't have a highlighter, but that's OK. So you just update alpha and beta by adding the number of Trump respondents to the alpha and the number of non-Trump respondents to the beta.
And that will give you your new posterior, your new belief about theta. Does that make sense? Very good. OK, I'll talk slower, sorry. Very good. OK. Yes? That part is the equivalent of taking the actual observation, which is y, and integrating it with your prior? That's correct, yes. Exactly, exactly. So in reality, in order to compute this, you would need to multiply the probability distributions of your prior and your likelihood, and that can be quite hefty, and you'd have to divide by the marginal probability of Y. That usually requires a lot of integration and a lot of multiplication, and that's quite a problem. And the cool thing about conjugate priors is that somebody has already done that work for you, and they know that all you need to do is add the number of successes to the alpha and the number of failures to the beta.

Yes, Chris? [Question, partly INAUDIBLE, about how the updating works in practice.] So, for instance, in a poll. If my hyperparameter is the general probability in the population of voting for Donald Trump, then I do a representative poll of the US population, and I observe one hundred people, that is my n, of whom 30 say they vote for Donald Trump and 70 do not. Then I would just go ahead and change the values of my distribution for theta to account for this. So in the beta-binomial, or beta-Bernoulli, example, all I would do is add the number of Trump respondents to the original value of alpha that I had in mind, and add the number of respondents who did not choose Trump to the original value of beta. Yes.

Sorry, say that again? Of course. So, as you can see, probability distributions have inherent uncertainty, and the uncertainty around your observed value is going to depend on the sample size of your poll. If you have a poll with a very, very high sample size, and it estimates the true theta quite precisely, then your posterior is going to be quite precise. Although remember that Bayes' theorem is actually an averaging via variances. You have this likelihood and this prior, and what happens is that if your prior is really, really precise, it's not going to move much when you observe new data. But if your prior is quite loose and your data is very precise, your prior will quickly shift towards your data.
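Here is a minimal sketch of the conjugate beta-binomial update just described. The prior parameters and the poll numbers are made up for illustration; only the update rule itself is the point.

```python
import numpy as np
from scipy import stats

# Illustrative prior on theta, the population share voting Republican.
# Beta(50, 50) is fairly tight around 0.5; Beta(1, 1) would be flat (non-informative).
alpha_prior, beta_prior = 50.0, 50.0

# Made-up poll: n respondents, y of whom say they vote Republican.
n, y = 100, 30

# Conjugate update: add successes to alpha, failures (n - y) to beta.
alpha_post = alpha_prior + y
beta_post = beta_prior + (n - y)

posterior = stats.beta(alpha_post, beta_post)
print("posterior mean:", posterior.mean())            # pulled between prior mean 0.5 and data mean 0.3
print("95% credible interval:", posterior.interval(0.95))

# With a loose (flat) prior the posterior sits essentially on the data.
flat_post = stats.beta(1 + y, 1 + (n - y))
print("flat-prior posterior mean:", flat_post.mean())  # close to 0.3
```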
In fact, in situations where the prior is quite loose, we call it a non-informative prior, because it's not going to affect the estimation of theta much; the result is almost exactly the same as a maximum likelihood estimate. Does that make sense? OK, perfect. Yes? [Question, partly INAUDIBLE, about what the beta is a prior for and whether this works for other kinds of outcome.] So, the beta is a prior on theta. The distribution of y_i, which is zero-one, is going to be a Bernoulli. And if you wanted to observe, say, a continuous value for y, that would be, say, a normal distribution. There isn't a conjugate pairing between a beta and a normal, but there is one between a normal and a normal. So if you had, for example, height or weight, those are normally distributed, and your prior would be formulated not in terms of alpha and beta of a beta distribution, but in terms of mu and sigma of a normal distribution, and then you would update that way. Does that make sense? [INAUDIBLE.] Yes, that's right. Any more questions? OK, great.

And so, having familiarised yourselves with this, you might say: OK, but I heard that these Bayesian people are always doing stuff with computation. If there are always analytical solutions, which we can sum by hand, why do people bother with computation? And the answer is: well, in situations where your prior and your likelihood do not have conjugate form, and famous examples of this are logistic regression models or mixture models, which have no conjugate forms, you can't find a nice combination of prior and posterior that gives you an analytical solution. Then we rely on what are called Markov chain Monte Carlo methods, and the basic principle is that of a Markov chain. In fact, I'm not even sure you need to know what a Markov chain is, because I'm going to give you an example of one. So we're going to look at the Gibbs sampler. The Gibbs sampler is an algorithm that allows you to find your posterior in instances where there is no analytical solution linking your prior and your posterior. And the way it does this is as follows. Imagine you have a normal distribution, so you have two hyperparameters, mu and tau: the mean and the variance. The variance in Bayesian statistics is usually coded as the precision, which is one over the variance. And the way the sampler works is as follows.
First, you set some arbitrary initial values for mu and tau. You might say: OK, I'm going to start with tau equal to one and mu equal to zero. And then you say, OK, let's sample. These are now distributions, right? Remember, you have this normal distribution, and mu has a distribution, tau has a distribution. So you say: let's sample a new number from the distribution of mu, based on the initial value that I put on tau. And then let's do the same for tau, with the new value of mu that I just sampled. And then I iterate: I continue doing this process until I reach some sort of convergence.

A typical example of this: I have this sort of whiteboard here, maybe I can show you. This is very experimental, so bear with me; it should be called experimental social science. OK, can you see? Yeah. OK. So say I have tau, which is just the precision, or one over the variance, on the x-axis, and mu, which is the mean parameter of a normal distribution, on the other axis. I would initialise mu and tau to be, say, here: some low value for mu and some high value for tau. OK. Then I would sample a value of mu given this initial value of tau, and then I would do the same for a value of tau with respect to that value of mu. This moves me somewhere here, and then I would continue doing this. And if the properties of a Markov chain are respected, then eventually I'm going to get here, where this is the mean of the joint distribution and this is the variance around it, the uncertainty around it. So the idea is that, if the properties of a Markov chain are respected, it doesn't matter that you don't have an analytical solution between your prior and your posterior. What you can do is simply iterate by sampling. It's called iterative conditional sampling: you sample from the conditional distribution of the mean given the current value of the precision, and then you re-sample the precision from its conditional distribution given the new mean. Eventually this converges, and we have what I showed you here: the joint uncertainty around the two parameters, and the centre would be the point estimates for the two parameters. Does that kind of make sense? Yeah. OK. Yes? So this is like exploring the distribution of these two values? Yes, you're aiming to recover the distribution. And this converges when you've visited, sort of, the bulk of it? Absolutely, absolutely.
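A minimal sketch of that alternating scheme, for normally distributed data with unknown mean and precision. The conditional distributions used here come from the standard semi-conjugate normal-gamma setup, which is an assumption of this sketch rather than the exact model on the slide; the data and priors are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake data: pretend we don't know the true mean and precision.
y = rng.normal(loc=2.0, scale=1.5, size=50)
n, ybar = len(y), y.mean()

# Weak priors (assumed): mu ~ Normal(mu0, 1/tau0), tau ~ Gamma(a0, rate=b0).
mu0, tau0 = 0.0, 1e-4
a0, b0 = 0.01, 0.01

# Arbitrary starting values, as in the whiteboard example.
mu, tau = 0.0, 1.0
draws = []

for _ in range(5000):
    # 1) Sample mu from its conditional given the current tau and the data.
    prec = tau0 + n * tau
    mean = (tau0 * mu0 + tau * n * ybar) / prec
    mu = rng.normal(mean, 1.0 / np.sqrt(prec))

    # 2) Sample tau from its conditional given the new mu and the data.
    a = a0 + n / 2.0
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(a, 1.0 / b)   # numpy's gamma takes a scale parameter, so pass 1/rate

    draws.append((mu, tau))

draws = np.array(draws)[1000:]    # discard burn-in, before the chain has forgotten its start
print("posterior mean of mu:", draws[:, 0].mean())
print("posterior mean of sd:", (1.0 / np.sqrt(draws[:, 1])).mean())
```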
Yes. And that's why it takes so long when you have very few data, or when you're trying to explore a distribution that is maybe too wide? That's exactly right, yes. And if you've done machine learning before doing the Bayesian stuff, you might be familiar with stochastic search, or stochastic gradient descent; it's a very similar kind of algorithm in that sense.

OK, so having learnt what the Gibbs sampler does, you might be asking: well, when do we stop sampling? Because you could just keep sampling forever, but you don't know whether you've reached convergence. Convergence is the key here. If you don't reach convergence, then you may have problems with stability: it might mean that the next time you run the same model you'll get different estimates for your posteriors, and that's a problem. Yes? [Question, partly INAUDIBLE, about where the data comes into this.] So your data comes in through the conditional distributions. In principle, the mean has this prior, and then, given the observed data, when you sample from the conditional distribution of mu, you sample a new value of mu that has to be coherent with your data. That's the constraint. So there are two constraints here. Let me go back up, actually. When you sample the new value of mu, it has to be coherent with the current value of the precision and with your data. It's a coherence problem: every time you sample, it has to get more and more coherent with your data and your precision. The reason why it sometimes doesn't converge is that your data is too sparse. And then there are also, as we were saying before, all sorts of places where this algorithm could be: it could be over here, over there, over there. So there is an infinite, or a non-tractable, number of combinations of mu and sigma that are coherent with your data, because your data is so sparse. Does that make sense? OK. Yes? [INAUDIBLE.] That's correct. So the properties of the Markov chain that matter are that it is only dependent on its previous value, so it quickly forgets the initial values. That's a key property.
And then, eventually, if you sample enough, it converges to a coherent joint distribution. Those are the two properties. So it's not that you're not making any assumptions; you're still making your prior assumption. That comes in because these conditionals will get you to a posterior, which is the variance-weighted average of the prior and the likelihood. But what you're not doing is letting your initial value for the search algorithm affect the result. Does that make sense? OK. We're almost through the Bayesian primer.

Another way of looking at this blob that I drew on the board is like this. Say this was just the mean of a normal distribution and you were looking at its posterior. Here we ran two chains. So, the Markov chains we spoke about, right: we run this algorithm twice, starting from two different points, and then we wait some number of iterations until it has forgotten the initial values, which in this case takes around two hundred iterations for the two chains. Then you can see the two chains mix nicely, and this is called mixing. It suggests that at this point we are at convergence. Because what does this mean? It means that every time I'm sampling from the previous value, I'm getting stable coefficients. If we imagine this is mu, the variance here in each chain represents the actual variance of mu, the mean of the normal distribution. And if both chains have agreed on the variance of mu, then you have reached convergence. Does that make sense?

Yes? And when the chains look completely random and don't agree? Yeah, then it's a problem; it means that your algorithm hasn't converged. And, as we will see, there are all sorts of tricks you can use to try to get your algorithm to converge. But in principle, you cannot use your posterior values to make inference if the chains have not converged. Is that potentially a sort of disagreement between your data and your priors? No. If there is a disagreement between your data and your prior, your algorithm will still converge; it will just land on some sort of average in the middle, and it suggests that you should think about reformulating your prior, especially if you don't have much data, because your prior is then affecting your estimates a lot, and maybe you don't want that to happen. It's not a non-informative prior.
It's a very informative prior, and we try to stay away from those if you're trying to do predictive things, for example. Any more questions? Yeah? [Question, partly INAUDIBLE, about when you would want a non-informative prior.] As a prior, you have two choices. If you know your subject very, very well and your data is very small, you might want to pick something informative, because that's going to augment your data. But the downside is that it might be biased, because if it's your belief about a given parameter, you might have all sorts of biases yourself. For example, if I were to ask, I don't know, a non-minority person in America what proportion of police arrests involve minorities in the United States, my guess is they would probably underestimate that proportion, whereas if I asked a minority person, they might overestimate it, and somewhere in between would probably be more sensible. I guess what I'm trying to say is: if you know that your prior is going to add value to your prediction, put it in; if you don't, choose a non-informative prior and allow your data to speak for itself. Does that make sense? Fantastic. OK.

And so the question comes: how do we choose when to stop? Well, we can look at these plots and look at mixing, as we just saw. But there's also a formula we can use, and it's called the Gelman-Rubin statistic, this R-hat here. All it is is a measure of the within-chain and between-chain variance. As I was explaining before, you need to have reached a sort of agreement between the two chains, so the within-chain variance has to be stable, and the between-chain variance shouldn't be very large, because you want them to be converging on the same point. Does that make sense? Usually, values of the Gelman-Rubin statistic, this R-hat, that tell you your algorithm has converged are anywhere below 1.1. This is a heuristic, by the way; it was developed by Andrew Gelman and Donald Rubin. You should pick values that are below 1.1. There's also some discussion in the literature about 1.1 being too large, and some conflicts with frequentist statistics once you get there, but 1.1 is a good value for you guys to know.
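A rough sketch of the Gelman-Rubin idea for two chains of a single parameter. This is the simple between-versus-within-chain comparison; software such as Stan reports a more refined split-R-hat, so treat this as illustrative only.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic R-hat for a list of equal-length 1-D chains of one parameter."""
    chains = np.asarray(chains)            # shape: (m chains, n draws)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)

    W = chain_vars.mean()                  # within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

# Two toy "chains": if they explore the same distribution, R-hat is close to 1.
rng = np.random.default_rng(0)
good = [rng.normal(0, 1, 2000) for _ in range(2)]
bad = [rng.normal(0, 1, 2000), rng.normal(3, 1, 2000)]   # chains stuck in different places

print(gelman_rubin(good))   # roughly 1.00: looks converged
print(gelman_rubin(bad))    # well above 1.1: not converged
```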
And then the last problem you have is that, as you sample from each of these conditionals, you don't just have the problem of convergence, you also have the problem of autocorrelation, because clearly the value where you are today will affect the value where you are tomorrow, every time you sample. The way you can estimate how much that autocorrelation is affecting your Gibbs sampler is through this thing called the effective sample size. All it does is ask: if you filter out the autocorrelation, how much truly new information are you getting every time you sample from your conditional distribution? So if you do 100 runs of your Gibbs sampler and you really have reached convergence, your effective sample size should be about 100. But often what you find is that you do a thousand rounds of your Gibbs sampler and you think you've reached convergence, but then you look at the effective sample size and it's something like twenty-eight, which means that your draws are massively autocorrelated. The technique used to get rid of this autocorrelation is called thinning, which means that instead of keeping all 1000 draws, you keep every fifth, or every sixth, or every tenth, and so on. That should remove the autocorrelation, because if it's a Markov chain, today's value depends only on yesterday's value, not on the days before, and so if you pick every tenth, they're going to be close to independent. Does that make sense? Brilliant. OK.

So with this small Bayesian primer, I've tried to give you four years of study in about 20 minutes. The learning objectives: you should be familiar with what Bayes' theorem is; you should know what a likelihood and a posterior are; you should understand the Bayesian inferential procedure, that kind of data generation and then inference and updating that I showed you before; you should be able to use conjugate priors and know their limitations; you should understand what a Gibbs sampler is and why you use it when you don't have an analytical solution from conjugate priors; and you should be able to check whether your sampler is reliable, that is, that it has converged and is not autocorrelated. OK. Take a breather. That was Bayesian statistics. Very good. What's coming next is a lot more reasonable, in my mind. So now we're going to look at the prediction and poststratification part.
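Before the applied part, one last sketch tying together the diagnostics just listed: an effective sample size computed from the chain's autocorrelation, and the effect of thinning. The lag-sum estimator below is a deliberately simplified version of what packages such as coda or Stan actually report.

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """Crude ESS: n / (1 + 2 * sum of early positive autocorrelations)."""
    x = np.asarray(chain) - np.mean(chain)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.var(chain) * n)
    rho = acf[1:max_lag]
    if np.any(rho < 0):
        rho = rho[:np.argmax(rho < 0)]     # keep lags until the autocorrelation dies out
    return n / (1.0 + 2.0 * rho.sum())

rng = np.random.default_rng(0)

# An autocorrelated chain (AR(1)): each draw depends strongly on the previous one.
chain = np.zeros(5000)
for t in range(1, 5000):
    chain[t] = 0.95 * chain[t - 1] + rng.normal()

print("draws:", len(chain), "ESS:", round(effective_sample_size(chain)))

# Thinning: keep every 10th draw; the retained draws are far less autocorrelated,
# although thinning cannot create information that was never in the chain.
thinned = chain[::10]
print("thinned draws:", len(thinned), "ESS:", round(effective_sample_size(thinned)))
```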
The way I'm going to teach you this is through the example I'm most familiar with, because it's what I use in my research, and that is voting: in particular, trying to predict the percentage of people who vote for a given candidate from a non-representative sample. So, for instance, we may be interested in knowing, in 2020, if Joe Biden were the candidate for the Democratic Party, what proportion of people in the United States would vote for Donald Trump, either in the United States as a whole or at the state level. The state level matters because then you can calculate the Electoral College votes and find out who actually wins the election. OK.

The classic decomposition of the problem I just described is as follows. What you want is the probability distribution of people who vote for choice j and turn out, conditional on a set of characteristics. These characteristics could be the gender of the person, the race of the person, their income, their education, and so on. And what about turnout? People either turn out on election day or they stay home. So we are going to decompose this joint distribution of vote and turnout into two parts, just by the definition of a joint distribution: the distribution of voting conditional on turning out, and the distribution of turning out, and we multiply these together. OK, that's fairly straightforward.

So one way to estimate the probability distribution of individuals' turnout is this: we take a non-representative sample, say through digital trace data; we obtain characteristics on individuals through that digital trace; and we have some training data, so we know from a survey, let's say, whether these people turned out in 2016, or whether they say they will turn out in 2020. And what we do is fit a model to predict the turnout propensity of each individual within our non-representative sample. This model could be anything. It could be a linear regression, it could be a multilevel regression, as we're going to see; it could be a machine learning method like random forests or a convolutional neural network, and so on. What matters is that you pick the model that, given your limited data, because your data is non-representative, will allow you to make the best out-of-sample predictions.
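In notation (the symbols here are assumed, since the slide is not reproduced): writing V_i for the vote choice of person i, T_i for their turnout and X_i for their characteristics, the decomposition just described is simply

```latex
P(V_i = j,\; T_i = 1 \mid X_i) \;=\; P(V_i = j \mid T_i = 1,\, X_i)\; P(T_i = 1 \mid X_i)
```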
So this is a fairly common theme in machine learning, essentially. And what do we mean by out-of-sample predictions? Well, if we have a small sample, say eight thousand five hundred individuals, but we are interested in hundreds of thousands of categories of voters, then we can only learn about those categories of voters from the small sample that we have. So, for example, suppose the only race we have in our sample is whites, but we have whites of lower education, higher education and middle education, and then we are asked to make a guess as to how minorities in the United States are going to vote, based on the whites that we observed. Then we're going to have to find an algorithm that allows us to extrapolate from our limited, non-representative sample to that unobserved category. Does that make sense?

One way we have to do this is multilevel regression. Why do we choose multilevel regression? We could, as I said, have chosen any other algorithm. The answer is that it allows us to do this thing called shrinkage. How many of you are familiar with the term shrinkage? Yes, very good. So it does this thing called shrinkage, which is a form of regularisation. What shrinkage does is that, when you estimate a coefficient, a random-effect coefficient for instance, the coefficient is not exactly what you observed in your data for that particular category. For example, say one of our variables is education and we have four categories: low education, middle-lower education, middle-upper education and upper education. Then, instead of your estimate for the low-education effect being exactly the mean of the low-educated in your sample, it is going to be an average between the mean of the low-educated in your sample and the global mean across the four categories. And that average is going to be variance-weighted: it's a compromise between the between-category variance of the four categories and the within-category variance of the single category you have. So, for instance, if you observe only one individual with low education, then your estimate for the low-education effect is going to be very close to the average of the four effects. Why? Because that category's own estimate is very imprecise in your sample. Whereas if you have a lot of observations for low education but very few for the other three, the effects of the other three are going to be shrunk towards the overall mean, which in that case is driven mostly by the low-education group. That kind of makes sense.
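The usual partial-pooling formula makes the shrinkage idea concrete. For a category j with n_j observations, sample mean ybar_j, within-category variance sigma-squared_y and between-category variance sigma-squared_alpha (notation assumed here), the multilevel estimate is approximately the precision-weighted average

```latex
\hat{\alpha}_j \;\approx\;
\frac{\dfrac{n_j}{\sigma^2_y}\,\bar{y}_j \;+\; \dfrac{1}{\sigma^2_\alpha}\,\bar{y}_{\text{all}}}
     {\dfrac{n_j}{\sigma^2_y} \;+\; \dfrac{1}{\sigma^2_\alpha}}
```

so a cell with very small n_j is pulled almost entirely towards the grand mean, while a well-populated cell keeps something close to its own sample mean.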
So the effect that this has in practice is that it stops you from overfitting your data. What this means is that when you look at data you haven't seen before, your model has not been as affected by noise, and therefore you'll be able to make better out-of-sample predictions. Very good. OK.

What you see on the board is a hierarchical specification for a simple multilevel turnout model. Say this is a survey or a digital trace, and we've asked people: are you going to turn out in 2020? The answer, yes or no, turn out or not, one or zero, is our likelihood, and its distribution can be assumed to be a Bernoulli; the Bernoulli allows for a probability of switching on or off, and it has hyperparameter theta. And then we can just use a logistic regression to estimate the value of the parameter theta for each individual in our sample. Now, you see the subscript g[i]: that's because each individual in our sample is, in our mind, part of a group. This is going to be very important when you go to stratify, because you're going to find out how many people of this particular group there are in a particular state, or in your country, and we will see how to do that calculation later. The group is going to be defined by the specific combination of sex, age, race, education, household income and state in this particular model. The etas are all random effects, so they are estimated in the way I described previously. The betas are state-level predictors. So why would we put state-level predictors in an individual-level model? Well, because ultimately you're interested in a state-level effect, and a state-level predictor can actually be extremely predictive of individual-level choices. For instance, one of the common variables we put in as a state-level predictor, if we're trying to predict turnout, is turnout at the last election, and this ends up being quite helpful for the out-of-sample predictions for categories of voters that we have not observed in our sample: categories of voters that we're interested in but didn't manage to collect, because our sample was imperfect, or too small, and so on and so forth. What you see here are priors. Yes, Chris? [Question, partly INAUDIBLE, about whether the state-level predictor also helps for states with few respondents.] Yeah, you're right.
So the state-level predictor helps you curb some of the inaccuracy, or inconsistency, that your raw random-effect estimate would bring in. Yes, I think that's correct. Yeah. [Question, partly INAUDIBLE: in this standard framework, would you also want state-level estimates from another, similarly non-representative sample?] Well, you could put them in; if you have them, you would definitely put them in. So in the case of voting, you're saying, for example, if we have a representative sample of voters in the Midwest, would we use the point estimate from that representative sample as part of our state-level predictors? Is that right? That's a good question, I guess. There is a trade-off. I think you would want a compromise between the latest available data and the precision of the data. The nice thing about having the last election's results is that they are precise, they are the true values, whereas a sample generated via telephone calling, for example, has, as we saw yesterday with the total survey error framework, many steps at which things could go wrong, and so you might end up introducing noise. But in principle, I would say, for countries that haven't had an election in, say, eight years, and I was thinking of working on a project in Afghanistan, in that case you would put in survey data, because that would be a lot more accurate.

[Question, partly INAUDIBLE, about whether this model already gives results at the state level.] This is a model to get the best possible prediction we can have for each category, and that's it. It is not doing any poststratification work. The result of this model is not going to give you the result at the state level; we need to do extra work for that. [Partly INAUDIBLE follow-up.] Yes, that's correct, yes. Absolutely, yes. [Question about whether it works for representative samples.] It would work on representative samples as well. In fact, what people sometimes do is put representative samples in here, representative samples at the national level, in order to find out how categories vote, and then stratify those categories at the state level in order to obtain area estimates. So the fact that the sample is representative shouldn't stop you from doing multilevel regression. Yeah.
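For reference, the kind of hierarchical turnout specification being discussed looks roughly as follows. This is a reconstruction from the verbal description, with notation assumed rather than copied from the slide.

```latex
y_i \sim \mathrm{Bernoulli}(\theta_i)

\mathrm{logit}(\theta_i) = \alpha_0
  + \eta^{\text{sex}}_{s[i]} + \eta^{\text{age}}_{a[i]} + \eta^{\text{race}}_{r[i]}
  + \eta^{\text{edu}}_{e[i]} + \eta^{\text{inc}}_{h[i]} + \eta^{\text{state}}_{g[i]}
  + x_{g[i]}^{\top} \beta

\eta^{k}_{j} \sim \mathrm{N}\!\left(0, \sigma_k^2\right), \qquad
\sigma_k \sim \mathrm{Uniform}(0, 5) \quad \text{(non-informative on the logit scale)}
```

Here x_{g[i]} collects the state-level predictors, such as turnout at the last election.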
[Exchange with the audience, partly INAUDIBLE, about applying this to a sample recruited from a gaming platform, like the Xbox study we saw yesterday.] You couldn't use exactly the same model; you'd want to adjust for that selection effect somehow. Yes, of course, yes. So there were a couple of things there. The person who uses the PlayStation, you're saying we would want to upweight them, right? Because that would make the sample more representative overall. Well, then we should have PlayStation and Xbox use as part of our poststratification frame. Yes, that's what you were saying, yeah, for sure. And we should also have some kind of estimate of the population that contains these variables. That's exactly right, and for those we often don't have one. That's exactly what Scott was saying yesterday: one of the limitations of his work is that he has to rely on a census that was done in 2011. And as we're going to see in the next section, which is the poststratification part, you do need accurate cell sizes, as we call them, in the poststratification frame. You can't do this work if you don't have accurate cell sizes. Yes, that's correct. Yeah, go ahead.

Take the microphone, sorry, for the live stream; I doubt that anybody in America is awake. So, in the line of coefficients here, we have female, age, race, education, income, and this logit is the distribution you're assigning to each of these coefficients. Is the prior distribution the same for all of them, or where are the priors? Yeah, these are the priors, right. I've used cheap notation to save space, but the idea is that each eta has its own variance parameter. Does that make sense? Yeah, I wasn't looking at the definition of eta, but the definition of eta is down there. Yep, yeah, that's right.
So these are the priors on the etas, and they are shrinkage priors, because each eta, even though it has, say, five or ten categories within it (education, for example, has four categories, if I remember correctly), has a single variance parameter, which means the categories will be shrunk towards the mean of those four categories. Yes. And this way of parameterising it: I told you that in Bayesian statistics we use the precision, one over the variance, in the specification of the normal distribution, and parameterising it with a positive uniform distribution that is large enough on the logit scale is a way of including a non-informative prior on the variance. Does that make sense? Why Uniform(0, 5)? Because five on the logit scale is very, very large, so it allows essentially any plausible variance value. It's not informative, I would say. I'm sure there exist better suggestions. OK.

Yes, sorry? So you also have some information at the state level in this case. Does this estimate at the higher level have to be at the level you are interested in? That is, do you use state-level estimates because you're interested in calculating something at the state level, or could it also have been gender or something else? So there have been studies that have put in gender-level variables as well, for the reason Chris was mentioning, which is that introducing a group-level predictor helps curb some of the bias introduced by the shrinkage. So you could do that: you could have a state-level predictor, a gender-level predictor, an age-level predictor and so on. But studies have shown that it doesn't increase the predictive accuracy of these models, so it would just increase running time, and we don't want that, essentially. So basically, you should make sure that the estimates at the higher level are at the level you're interested in? That's correct. Yeah. Any more questions? OK. Good.

And so, how can we fit this model? Well, it's clear that this doesn't have a conjugate prior, right? It's a very complex model, with a lot of non-linearity, et cetera. So the way we do it is through a Markov chain, and in particular we use two programmes that are very famous for doing this.
457 00:45:44,450 --> 00:45:50,000 JAGS is the one I've been trained to use, because I come from an epidemiology, or medical statistics, background, 458 00:45:50,000 --> 00:45:52,520 and 459 00:45:52,520 --> 00:45:59,540 it uses two algorithms: the Gibbs sampler, and another one called Metropolis-Hastings, which you don't need to know for today. 460 00:45:59,540 --> 00:46:07,310 The other one, Stan, is a more recent development which comes out of Andrew Gelman's sort of political science background. 461 00:46:07,310 --> 00:46:10,790 But also, it's better. Stan is better. 462 00:46:10,790 --> 00:46:18,080 If you can learn Stan, learn Stan. And the thing about Stan is that it is better because it uses this thing called Hamiltonian Monte Carlo, 463 00:46:18,080 --> 00:46:26,670 which is somewhat similar to stochastic gradient descent, in that it's a way to create shortcuts between the iterations. 464 00:46:26,670 --> 00:46:33,920 So it's a way to send you closer to the convergence point faster, essentially. 465 00:46:33,920 --> 00:46:37,220 But you don't need to worry too much about that. Just learn one of these two languages. 466 00:46:37,220 --> 00:46:41,780 If you're starting from scratch, I recommend you learn Stan, but JAGS is a very, very good program. 467 00:46:41,780 --> 00:46:49,220 That's what we're going to use today. And actually, having spoken to some statisticians, they have a sense that if you have many, 468 00:46:49,220 --> 00:46:57,250 many, many sources of information, JAGS is a better way to aggregate them than Stan. 469 00:46:57,250 --> 00:47:02,200 The other part of our decomposition that we saw before, remember, 470 00:47:02,200 --> 00:47:09,910 where we're trying to find the joint distribution of voting and turning out, was the voting distribution. And in particular, 471 00:47:09,910 --> 00:47:16,840 what I'm showing you here is a very simple example of voting for a Republican. 472 00:47:16,840 --> 00:47:20,350 So we just have a dichotomous outcome in a poll: do you vote for the Democrat, 473 00:47:20,350 --> 00:47:28,440 do you vote for the Republican, yes or no? And so this looks almost identical to what you saw before, but there is 474 00:47:28,440 --> 00:47:36,600 one main difference, and that is that this distribution is conditional on turnout. 475 00:47:36,600 --> 00:47:44,250 There are many ways to make this distribution conditional on turnout. One way is to just add turnout weights to each observation 476 00:47:44,250 --> 00:47:50,490 and then obtain a posterior conditional on the turnout weights. Another way, 477 00:47:50,490 --> 00:47:53,640 which is what I prefer and what we did in the example I'm going to show you next, 478 00:47:53,640 --> 00:48:01,110 is to actually estimate a different alpha for people who turn out and people who do not turn out. 479 00:48:01,110 --> 00:48:07,590 And the way you do that is by fitting a joint turnout and vote choice model: 480 00:48:07,590 --> 00:48:13,760 you simulate who turns out, and then in every simulation you fit 481 00:48:13,760 --> 00:48:18,630 the vote choice model to the people who turn out in that particular simulation and to the people who don't. 482 00:48:18,630 --> 00:48:24,540 And then you repeat this every time. And the cool thing about this is that it has a quite nice, intuitive interpretation.
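(For concreteness, a rough sketch of how that conditioning can be written in JAGS, with illustrative variable names and most priors omitted; this is not the exact workshop model. Because turnout is itself a stochastic node, each iteration effectively fits the vote choice intercepts to that iteration's simulated voters and non-voters.)

  turnout_conditional_sketch <- "
    for (i in 1:N) {
      turnout[i] ~ dbern(p_turn[i])                  # observed or simulated turnout
      logit(p_turn[i]) <- a0 + a_state[state[i]]     # turnout sub-model (priors omitted)
      vote[i] ~ dbern(p_vote[i])
      logit(p_vote[i]) <- alpha[turnout[i] + 1] +    # separate intercept for non-voters (1) and voters (2)
                          b_state[state[i]]
    }
    for (t in 1:2) { alpha[t] ~ dnorm(0, 0.01) }
  "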
It's like, OK, every time I simulate, I get a set of people who turn out, 483 00:48:24,540 --> 00:48:30,960 and then I'm going to use them to make my estimation. 484 00:48:30,960 --> 00:48:36,990 And it also incorporates a lot of uncertainty into our model, which is one of the advantages of the Bayesian approach, 485 00:48:36,990 --> 00:48:39,540 because you incorporate uncertainty. 486 00:48:39,540 --> 00:48:44,610 If you just put the weights in, you're not going to be able to incorporate the uncertainty about those weights. 487 00:48:44,610 --> 00:48:50,700 But by sampling people who turn out according to those weights and fitting a model on them, 488 00:48:50,700 --> 00:48:57,510 your coefficients are going to incorporate the uncertainty about who turns out and who does not turn out. 489 00:48:57,510 --> 00:49:02,250 Does that kind of make sense? And so, yeah, and so this is it. 490 00:49:02,250 --> 00:49:08,660 You've seen this before. OK, so here we go to our example. 491 00:49:08,660 --> 00:49:16,280 The example is: thanks to the organising committee, we got some money to do a survey online. 492 00:49:16,280 --> 00:49:22,040 We did a survey on Amazon Mechanical Turk. Are you all familiar with what Amazon Mechanical Turk is? 493 00:49:22,040 --> 00:49:32,270 Yes. Well, for those who aren't: it's just a platform where you can post what are called human intelligence tasks. 494 00:49:32,270 --> 00:49:38,060 So, for instance, if you needed labels for a training set, say you had cat wars, for example, 495 00:49:38,060 --> 00:49:42,470 you could feed the Turkers a lot of cat photos, like we saw yesterday, 496 00:49:42,470 --> 00:49:46,850 and the Turkers would have to click on the cutest ones. Or you could do something else: 497 00:49:46,850 --> 00:49:50,360 you could ask them to transcribe an audio file, or you could do something else again, 498 00:49:50,360 --> 00:49:56,660 you could ask them to fill in a survey, like we did. And the thing that's kind of cool about the Turkers, 499 00:49:56,660 --> 00:50:03,740 if you're wondering about the name Mechanical Turk, is this small aside. 500 00:50:03,740 --> 00:50:12,350 Back in, I think it was the eighteen hundreds or something, there used to be a circus act: they used to carry around a chess player with a turban, 501 00:50:12,350 --> 00:50:20,330 and the chess player became known as the Mechanical Turk, because it was a wooden figure that would play chess and beat a lot of people. 502 00:50:20,330 --> 00:50:23,720 And everybody was thinking, how the [INAUDIBLE] do they do it? Do they have robotics or something? 503 00:50:23,720 --> 00:50:34,910 And it's like, no, actually, there was a little person inside a box underneath the chess board who played the chess, which is quite funny. 504 00:50:34,910 --> 00:50:40,040 But that's kind of why we call them Mechanical Turks: because these are tasks that, in principle, 505 00:50:40,040 --> 00:50:43,490 we would like to have an AI do, but we don't have a way to do that yet, 506 00:50:43,490 --> 00:50:50,750 so we give them to humans to do. OK, so a few details about the survey. 507 00:50:50,750 --> 00:50:54,680 So, the goal of this: we asked them a bunch of questions. 508 00:50:54,680 --> 00:50:59,240 Some of the questions involved their voting preferences 509 00:50:59,240 --> 00:51:03,770 in twenty twenty, and some of the questions involved their voting preferences in twenty sixteen.
510 00:51:03,770 --> 00:51:05,930 Why did we do twenty sixteen? Because we had a benchmark: 511 00:51:05,930 --> 00:51:12,620 we know the results from twenty sixteen, and I want to show you that even with a very non-representative sample, which the Amazon Mechanical Turk sample is, 512 00:51:12,620 --> 00:51:18,770 because think about how many selection effects there are among people who select into being a Mechanical Turk worker, right, 513 00:51:18,770 --> 00:51:22,310 like, you know, this is a completely non-representative sample, 514 00:51:22,310 --> 00:51:31,530 but, as I will show you, from this non-representative sample we were able to replicate the 2016 election results almost to the point. 515 00:51:31,530 --> 00:51:36,870 A few details on the survey. We surveyed one thousand five hundred workers on the 11th of June. 516 00:51:36,870 --> 00:51:39,750 Amazon MTurk takes twenty-five percent of the total fee, 517 00:51:39,750 --> 00:51:45,540 which is a bummer, because we asked for specific characteristics of the workers. 518 00:51:45,540 --> 00:51:50,010 So we asked for Turkers that had, on average, an approval accuracy of ninety-five percent, 519 00:51:50,010 --> 00:51:57,120 we asked that they live in the United States, and we asked that they had been approved for at least a thousand human intelligence tasks. 520 00:51:57,120 --> 00:52:05,520 Why did I do this? I wanted to screen out bots, because bots are a problem on all online platforms, and in particular on Mechanical Turk. 521 00:52:05,520 --> 00:52:11,460 There have been papers showing that there are survey bots that just randomly click through. 522 00:52:11,460 --> 00:52:19,530 Another thing we did to screen out bots was that we had a CAPTCHA in the survey, so a bot wouldn't have been able to complete the CAPTCHA task. 523 00:52:19,530 --> 00:52:28,170 OK, do we need to take a break at some point, by the way? At what time? 524 00:52:28,170 --> 00:52:40,530 OK, so we have time, OK, great. And so it cost us around a thousand dollars, and we obtained one thousand five hundred responses. 525 00:52:40,530 --> 00:52:49,950 And, yeah, that's it. So now we're going to get into... good. 526 00:52:49,950 --> 00:52:53,520 OK. This is what I figured, you know, I'll take this opportunity. 527 00:52:53,520 --> 00:52:58,860 I think Cenk tomorrow is going to speak more about Mechanical Turk and teach you how to gather data from Mechanical Turk, 528 00:52:58,860 --> 00:53:06,150 but I wanted to give you sort of an overview of what a screenshot of a human intelligence task looks like. 529 00:53:06,150 --> 00:53:11,640 So up here you have the percentage of human intelligence tasks completed in your batch. 530 00:53:11,640 --> 00:53:17,010 We wanted our survey to be taken one thousand five hundred times, 531 00:53:17,010 --> 00:53:24,240 and this collective of 1,500 human intelligence tasks is what's called a batch. 532 00:53:24,240 --> 00:53:29,190 And this was the description of the task. 533 00:53:29,190 --> 00:53:35,310 These are the qualifications that we asked for: HIT approval rate greater than or equal to 95 percent, location in the United States, 534 00:53:35,310 --> 00:53:41,940 and, sorry, it was five hundred: we asked that they had been approved for at least five hundred human intelligence tasks. 535 00:53:41,940 --> 00:53:48,720 And this was an example of what any given Mechanical Turk worker saw.
536 00:53:48,720 --> 00:53:52,470 So what it was, simply: where it says "the link will appear here", 537 00:53:52,470 --> 00:53:56,850 only if you accept the HIT does the link to our survey appear. They would click on it, 538 00:53:56,850 --> 00:54:03,930 they would complete the survey, and then at the end of the survey a unique code for each individual Turker would be shown, 539 00:54:03,930 --> 00:54:09,870 and they had to input that code back into MTurk so that we could match the responses to the Turkers. 540 00:54:09,870 --> 00:54:13,540 Does that make sense? OK, fantastic. Yeah. 541 00:54:13,540 --> 00:54:20,770 Yeah, yep. So. 542 00:54:20,770 --> 00:54:27,940 So that's just a clarification. Yeah. [Audience] What does an approval rate of greater than or equal to 95 mean? 543 00:54:27,940 --> 00:54:33,640 So it means that I only wanted people whose human intelligence task approval rate, 544 00:54:33,640 --> 00:54:39,400 so the proportion of approved tasks over submitted tasks, was ninety-five percent or above. 545 00:54:39,400 --> 00:54:46,540 So I wanted people who weren't bots, because bots would have a very low approval rate, because, sorry, let me explain: 546 00:54:46,540 --> 00:54:50,410 each requester has to go and approve each piece of work, 547 00:54:50,410 --> 00:54:55,960 so the person who posts a task is called the requester, and each requester has to approve each individual worker's submissions. 548 00:54:55,960 --> 00:55:01,300 That's the thing. Yeah. [Audience] And you could sort of look at the history of what they've done to decide whether to accept them. 549 00:55:01,300 --> 00:55:05,800 That's right. OK, that makes sense. Brilliant. 550 00:55:05,800 --> 00:55:09,470 OK. And so we conducted this survey. 551 00:55:09,470 --> 00:55:15,560 Now we go to, unless you guys can read sideways... OK. 552 00:55:15,560 --> 00:55:25,470 So we conduct the survey, and we ask for a bunch of categories. 553 00:55:25,470 --> 00:55:30,630 We ask for a bunch of information: we ask for, as I said, vote choice and turnout behaviour in 2016, 554 00:55:30,630 --> 00:55:34,830 but also some individual-level characteristics. 555 00:55:34,830 --> 00:55:38,910 So like, oh no, sorry, these are not the individual-level characteristics, 556 00:55:38,910 --> 00:55:42,320 this is the state. 557 00:55:42,320 --> 00:55:48,180 So this was an example of the question, by the way: did you vote in the 2016 presidential election? 558 00:55:48,180 --> 00:55:53,240 Yes; No; Can't remember; Don't know; Was not eligible. And I got emails from people: 559 00:55:53,240 --> 00:55:57,320 they were very happy that "was not eligible" was in there, because they had just been naturalised, 560 00:55:57,320 --> 00:56:00,860 and a lot of surveys only give them "No, I didn't vote", 561 00:56:00,860 --> 00:56:05,720 and then they feel bad about it, because it's almost like they didn't do their duty. But I was happy that they were happy. 562 00:56:05,720 --> 00:56:13,190 And we further asked: which candidate did you vote for for president in 2016, with potential answers 563 00:56:13,190 --> 00:56:16,640 Donald Trump, Hillary Clinton, third party, can't remember, don't know, 564 00:56:16,640 --> 00:56:17,510 did not vote. 565 00:56:17,510 --> 00:56:24,590 Notice that we introduced the third-party option here, whereas in the example of vote choice that I showed you before there were only two options, 566 00:56:24,590 --> 00:56:30,230 which means that now we're going to have to do some funky stuff with our model to allow for multiple options.
567 00:56:30,230 --> 00:56:38,160 Funky stuff, I mean, nothing too crazy. From multiple sources, including the census and so on, 568 00:56:38,160 --> 00:56:41,370 we also have some state-level characteristics that we're interested in. 569 00:56:41,370 --> 00:56:46,770 So for each state we have the percentage Hispanic, percentage Black, percentage Asian, percentage non-college whites, 570 00:56:46,770 --> 00:56:52,380 that one's important because of what's become important in the narrative because of Trump, 571 00:56:52,380 --> 00:56:59,490 percentage college graduates in general, because, as you know, Hillary Clinton was the first Democratic candidate to win, 572 00:56:59,490 --> 00:57:04,050 well, I don't know about the first, but the first in a long time to win, college graduates in the United States, 573 00:57:04,050 --> 00:57:09,750 the median income of the state, the percentage that the Republican won in 2016, 574 00:57:09,750 --> 00:57:18,840 the percentage that the Libertarian won in 2016, the percentage for the Greens, and the percentage of people who turned out in 2016. 575 00:57:18,840 --> 00:57:26,790 Now, before you ask: yes, we're predicting 2016 state-level results with 2016 covariates. 576 00:57:26,790 --> 00:57:30,690 So perhaps it's cheating? But the reason I'm showing you this is because, A, 577 00:57:30,690 --> 00:57:36,690 there is no direct circularity, because we're using these covariates to inform an individual-level model, 578 00:57:36,690 --> 00:57:44,070 and B, these are going to be the variables that you're going to use to predict the 2020 result in your exercise. 579 00:57:44,070 --> 00:57:53,840 And so, yes, part of it was that I didn't think it was correct to go back to 2012 to get these statistics, because, 580 00:57:53,840 --> 00:58:02,660 given that we asked the question today, this is going to be a better measure of whether the model is able to correctly 581 00:58:02,660 --> 00:58:07,160 adjust for bias than a model that would have used the 2012 covariates. 582 00:58:07,160 --> 00:58:12,890 So I think this is the closest thing we get to showing you that if it works for 2016, it's likely to work in 2020. 583 00:58:12,890 --> 00:58:24,400 OK. This is the turnout model; it's almost identical to the one you've seen before. 584 00:58:24,400 --> 00:58:32,200 This is the vote choice model. As I said, it's a little more funky, because now we have to use this categorical distribution, 585 00:58:32,200 --> 00:58:35,230 which is just a multinomial distribution with n equal to one. 586 00:58:35,230 --> 00:58:41,350 So this time, instead of everybody only being able to say "I'm either a Democrat or a Republican", 587 00:58:41,350 --> 00:58:48,310 they can say "I'm a Democrat, a Republican, or a third-party or other voter" in this election. 588 00:58:48,310 --> 00:58:53,920 Good point: missing values in Bayesian statistics. If you have missing values in your outcome variable, that doesn't matter. 589 00:58:53,920 --> 00:59:02,230 You can just feed them in, because the Bayesian model estimates their predictive distributions and then just imputes them. 590 00:59:02,230 --> 00:59:09,730 So it's an automatic imputation of the outcome values, which is pretty useful in a lot of scenarios.
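(As a rough illustration, and not the exact slide code: a JAGS-style sketch of a three-way categorical outcome, with illustrative names. Any NA entries in the outcome vector supplied in the data are treated as unknown nodes and sampled at every iteration, which is the automatic imputation just described.)

  vote_choice_sketch <- "
    for (i in 1:N) {
      vote[i] ~ dcat(p[i, 1:3])                              # 1 = Democrat, 2 = Republican, 3 = other
      for (c in 1:3) {
        p[i, c] <- exp(eta[i, c]) / sum(exp(eta[i, 1:3]))    # softmax over the three linear predictors
        eta[i, c] <- alpha[c] + a_state[state[i], c]         # linear predictors (priors omitted)
      }
    }
  "
  # NA values in 'vote' are imputed automatically from their predictive distribution.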
591 00:59:09,730 --> 00:59:11,560 And so the categorical distribution just says: well, 592 00:59:11,560 --> 00:59:16,660 instead of having one probability, either Democrat or Republican, you now have three probabilities. 593 00:59:16,660 --> 00:59:21,670 There's a probability of voting for the Democrat, a probability of voting for the Republican, and a probability of voting for another option. 594 00:59:21,670 --> 00:59:24,610 And these have to sum to one, of course. 595 00:59:24,610 --> 00:59:35,290 And the other funky thing that's happening is that, as you can see, the effects now all have this index nu, which represents the vote choice. 596 00:59:35,290 --> 00:59:40,840 So you can estimate a state effect for each vote choice: a state effect 597 00:59:40,840 --> 00:59:43,750 for the Democrats, a state effect for the Republicans, a state effect for the others. 598 00:59:43,750 --> 00:59:50,710 And the other funky thing is that there is this identifiability constraint, which is a sum-to-zero constraint. 599 00:59:50,710 --> 00:59:57,340 The idea is that when you have party-specific effects and multiple parties: in a binomial model 600 00:59:57,340 --> 01:00:04,000 there is already an implicit sum-to-zero constraint, because an effect in favour of one party is automatically against the other. 601 01:00:04,000 --> 01:00:10,120 Here we have to introduce that by hand. And so you just have this sum-to-zero constraint, which 602 01:00:10,120 --> 01:00:19,360 can also be expressed through the mean: this alpha bar is just the mean across the choices. 603 01:00:19,360 --> 01:00:23,500 Yes. Maybe there's a typo here, 604 01:00:23,500 --> 01:00:28,910 but in any case, I think you get the idea. OK. 605 01:00:28,910 --> 01:00:35,120 And so now there is this code. So. Yes. [Audience question, partly inaudible.] 606 01:00:35,120 --> 01:00:41,070 Yes, I need to get... that's it: 607 01:00:41,070 --> 01:00:47,580 we want the effects, yes, we want the effects to sum to zero. 608 01:00:47,580 --> 01:00:56,070 That's right. Because if the gender effect, say, 609 01:00:56,070 --> 01:01:02,110 is positive for the Democrats and positive for the Greens, it cannot be that it's also positive for the Republicans. 610 01:01:02,110 --> 01:01:07,060 Does that make sense? So we're looking at... [Audience] And why is that? 611 01:01:07,060 --> 01:01:15,170 Well, first of all, I think... well, because of interpretability. 612 01:01:15,170 --> 01:01:21,570 Yes, it's difficult to think about the full set of effects otherwise. 613 01:01:21,570 --> 01:01:27,960 So I don't know the answer beyond that right now. [Audience] Yeah, but they have to sum to zero, 614 01:01:27,960 --> 01:01:32,910 yeah. Okay, so one of them would be negative. Yes, that's correct. 615 01:01:32,910 --> 01:01:39,410 Yes, yes it is. Does that make sense? Yeah. OK. 616 01:01:39,410 --> 01:01:45,710 And then you have this model. So you're going to get to work on this on your own later on, 617 01:01:45,710 --> 01:01:51,800 but I'm just going to go through the motions of showing you how you fit a JAGS model. 618 01:01:51,800 --> 01:01:58,190 So this is a model to replicate the 2016 results. The first thing you do is create a list; 619 01:01:58,190 --> 01:02:05,930 this is in R, by the way. So you create a list with your model data, and we introduce the data. 620 01:02:05,930 --> 01:02:11,020 Notice that I introduce the vote choice as ones and zeroes.
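(A sketch of that step in R, with made-up object and column names rather than the actual workshop code: the three-level vote choice factor becomes three 0/1 indicator columns, which is the form the Poisson formulation discussed next expects.)

  # df: the survey data frame (illustrative); df$vote2016 has levels "Dem", "Rep", "Other"
  Y <- sapply(levels(df$vote2016), function(l) as.integer(df$vote2016 == l))
  # Y is now an N x 3 matrix of ones and zeroes, one column per choice; NA rows stay NA

  jags_data <- list(
    Y     = Y,
    turn  = as.integer(df$turnout2016),
    state = as.integer(df$state),
    age   = as.integer(df$age_group),
    N     = nrow(df)
  )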
621 01:02:11,020 --> 01:02:18,170 So instead of being a single vector with values one, two, three, it's three columns of ones and zeroes. 622 01:02:18,170 --> 01:02:25,460 The reason I do that is that fitting a categorical model in JAGS is quite expensive in computational time, 623 01:02:25,460 --> 01:02:28,850 and there's a well-known equivalency with Poisson models: 624 01:02:28,850 --> 01:02:33,270 you can approximate a categorical model with a Poisson model, and the Poisson 625 01:02:33,270 --> 01:02:38,330 version requires that you have the three different choices as separate columns in the outcome. 626 01:02:38,330 --> 01:02:46,520 Yes? [Audience, partly inaudible] So this is one of the... you're making a model based on 627 01:02:46,520 --> 01:02:56,240 one particular selection mechanism. If the selection had happened at midnight, or at 8 or 12, et cetera, we'd get a different selection rule. 628 01:02:56,240 --> 01:03:02,000 Yep, yep. And the moment you make a model to estimate 629 01:03:02,000 --> 01:03:08,300 things, you know... yep, yep. What is then the virtue of that model? 630 01:03:08,300 --> 01:03:12,170 So the virtue of the model is that we're going to find out how accurately we can get 631 01:03:12,170 --> 01:03:16,160 to 2016, and then we're going to use the same infrastructure to predict 2020. 632 01:03:16,160 --> 01:03:25,990 Yeah. [Audience] But then, to predict, it would need to be a similar selection mechanism under the model. 633 01:03:25,990 --> 01:03:33,670 Sorry, it's the same one thousand five hundred people: I have asked them who they voted for in 2016 and who they're going to vote for in 2020, right? 634 01:03:33,670 --> 01:03:40,390 Yeah. What do you mean, you want to create different models for different selection rules? 635 01:03:40,390 --> 01:03:46,810 [Audience] I'm just saying, the only reason to predict in this context is if you can obtain insights which generalise. I wouldn't want 636 01:03:46,810 --> 01:03:56,860 to say that if the coefficient for gender is positive, for instance, that the gender effect exists outside the sample itself. 637 01:03:56,860 --> 01:03:58,810 Yeah, for sure. For sure, yes. 638 01:03:58,810 --> 01:04:07,990 But the idea, but the idea is that through the shrinkage effects, you hope that you don't fit the data too closely, 639 01:04:07,990 --> 01:04:13,240 and so you can do better out-of-sample predictions. [Audience] Yeah, but the selection effects... 640 01:04:13,240 --> 01:04:18,580 Well, if the selection effect is super strong and you haven't accounted for it, so there is this assumption called ignorability, 641 01:04:18,580 --> 01:04:25,540 which is that I can ignore the remaining variables that I have not introduced in the model, as part of the residual. 642 01:04:25,540 --> 01:04:30,130 If that assumption is broken, as you suggest, through some very heavy selection effect, 643 01:04:30,130 --> 01:04:34,390 then I can't ignore it, so I would need to introduce it in the model and in the poststratification frame. 644 01:04:34,390 --> 01:04:41,300 [Audience, partly inaudible] So it seems that the Facebook example works because, within cells, 645 01:04:41,300 --> 01:04:47,080 the selection isn't related to how people vote. Yeah, that's right. 646 01:04:47,080 --> 01:04:56,650 Yeah. Google and Facebook, yeah. Well, they still had some selection, because they ended up overestimating, if I remember correctly, the Obama vote. 647 01:04:56,650 --> 01:05:00,430 But not by much. So they did well, but they were off a little bit. 648 01:05:00,430 --> 01:05:09,190 It wasn't perfect, right?
[Audience] So, therefore, it would seem that that's basically what you're assuming away, the selection, a little bit. 649 01:05:09,190 --> 01:05:16,900 Yeah. The assumption here is that the selection into the specific time and day on which we fielded the survey is 650 01:05:16,900 --> 01:05:23,830 not heavy, and that the selection into being an MTurk worker, after you control for all of these variables 651 01:05:23,830 --> 01:05:31,390 here, is essentially zero. That's obviously an assumption, and it's obviously false, but it allows us to make progress. 652 01:05:31,390 --> 01:05:36,490 And, as we will see, the results are pretty decent, so it suggests that it's not too far off. 653 01:05:36,490 --> 01:05:42,280 [Audience] Let's say we can agree, going back to the beginning, that this is only relevant in a predictive context and not for causal concerns. 654 01:05:42,280 --> 01:05:48,580 Oh, for sure. But I mean, that's true of the entire endeavour that I'm teaching you today: 655 01:05:48,580 --> 01:05:54,370 this is a predictive exercise, it is not an inferential tool. Yeah. Anybody else? 656 01:05:54,370 --> 01:06:01,750 No. OK. So we continue: we specify these variables, and then we specify, so the way JAGS works 657 01:06:01,750 --> 01:06:06,850 is almost exactly as I showed you with those hierarchical models up there. 658 01:06:06,850 --> 01:06:15,760 So literally you say: the outcome is distributed as a Bernoulli distribution with parameter pi at the individual level. 659 01:06:15,760 --> 01:06:18,830 We add this to the... oh, this is an interesting thing. 660 01:06:18,830 --> 01:06:24,520 So when people respond to the turnout question, there's a well-known phenomenon of over-reporting. 661 01:06:24,520 --> 01:06:34,420 And people have found that in the ANES, the American National Election Study, 662 01:06:34,420 --> 01:06:39,790 the over-reporting factor is roughly thirteen point five percent, and in Pew, 663 01:06:39,790 --> 01:06:45,100 so in online surveys, the over-reporting factor has been found to be 17 percent and above. 664 01:06:45,100 --> 01:06:54,160 So what we do here is a kind of rough correction: we estimate the model, and then we say, whatever 665 01:06:54,160 --> 01:07:03,400 turnout distribution you have for each individual, whatever distribution you have estimated thus far, subtract 17 percentage points from it. 666 01:07:03,400 --> 01:07:05,980 So the distribution is simply shifted: 667 01:07:05,980 --> 01:07:12,820 if the distribution was, imagine, like a normal with a mean of, say, a 70 percent probability, then now it's shifted 17 points down. 668 01:07:12,820 --> 01:07:19,330 And this actually ends up having a very positive effect on the estimates. Yeah, you should remember, if you do vote choice and turnout stuff, 669 01:07:19,330 --> 01:07:25,480 to account for turnout over-reporting, because it's really important. Obviously this assumption is not perfect. 670 01:07:25,480 --> 01:07:31,750 In principle, we would have built another model, from another information source, that would have told us who, out of 671 01:07:31,750 --> 01:07:36,820 these categories, is more likely to over-report, and then created a predicted value for each individual in the sample. 672 01:07:36,820 --> 01:07:44,290 But we didn't do that. We just assumed that people across the board are equally likely to over-report. 673 01:07:44,290 --> 01:07:52,860 So this is like a uniform over-reporting kind of model. Yeah.
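(A crude sketch of that correction in R, under the uniform over-reporting assumption he describes; p_turn_draws is an assumed, illustrative matrix of posterior draws of each respondent's turnout probability, not an object from the actual workshop code.)

  overreport <- 0.17                                  # assumed over-reporting factor for online panels
  p_turn_adj <- pmax(p_turn_draws - overreport, 0)    # shift every turnout probability down, truncated at zero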
[Audience] So in this context, what does doing this actually amount to? 674 01:07:52,860 --> 01:07:57,060 No, so it's simply: if I survey a thousand people 675 01:07:57,060 --> 01:08:01,320 and in reality only five hundred of them voted, in my sample 676 01:08:01,320 --> 01:08:07,440 it's going to look like seven hundred and fifty did. Just because people lie. 677 01:08:07,440 --> 01:08:11,480 Yes, well, they lie for all sorts of understandable motives, right? 678 01:08:11,480 --> 01:08:18,160 There's social desirability; they feel perhaps a little ashamed that they didn't turn out to vote; 679 01:08:18,160 --> 01:08:22,550 all sorts of things. Yeah. OK. 680 01:08:22,550 --> 01:08:29,720 And then we outline this turnout model. 681 01:08:29,720 --> 01:08:35,990 These are random effects at the state level, region effects, age effects, race effects and so on. 682 01:08:35,990 --> 01:08:41,440 This is the state-level predictor, with the priors that we specified above. 683 01:08:41,440 --> 01:08:50,030 Again, you'll get a chance to play with this code yourself, so you can start slow and then continue. 684 01:08:50,030 --> 01:08:57,950 There's some cheeky stuff happening here, which I will explain in a minute. You might be wondering what this auxiliary parameter beta is. 685 01:08:57,950 --> 01:09:02,180 What happens is, in sampling algorithms like the Gibbs sampler, 686 01:09:02,180 --> 01:09:06,410 there's this weird phenomenon that if you over-parameterise and then only monitor the parameter you care about, 687 01:09:06,410 --> 01:09:11,930 so if, say, you're interested in parameter beta and then you specify this weird bigger model for beta, 688 01:09:11,930 --> 01:09:18,350 where beta is actually equal to the product of a sub-parameter and an auxiliary variable, and so on and so forth, 689 01:09:18,350 --> 01:09:24,410 then what you find is that by specifying these sub-models and not monitoring them, just monitoring beta, 690 01:09:24,410 --> 01:09:29,570 your model converges faster. And the reason for this is a mathematical one; 691 01:09:29,570 --> 01:09:34,870 there have been papers showing this. It's just that 692 01:09:34,870 --> 01:09:43,510 the sampling algorithm finds the correct convergence point faster, essentially. I haven't explained that at all, 693 01:09:43,510 --> 01:09:46,000 but I'll tell you the paper and you can look at it. 694 01:09:46,000 --> 01:09:56,890 So here we over-parameterise our state-level predictor by multiplying it by a new variable, an auxiliary beta, 695 01:09:56,890 --> 01:10:00,370 which is given a normal distribution. But we don't monitor it, 696 01:10:00,370 --> 01:10:07,180 we don't care about it. We care about this beta here, which is the product of beta star and the auxiliary variable. 697 01:10:07,180 --> 01:10:16,360 OK. And then these are, as I stated before, all the random effects: you see a random effect with its random effect distribution 698 01:10:16,360 --> 01:10:22,220 and an auxiliary parameter multiplying the original random effect. 699 01:10:22,220 --> 01:10:27,310 [Audience] Yeah, do you see a difference? Yes. Massive, massive convergence gains 700 01:10:27,310 --> 01:10:37,840 when you take that off... yeah. So the model that I fit for this is run for seven thousand iterations and now converges almost perfectly. 701 01:10:37,840 --> 01:10:40,630 There are a few parameters that could do a tiny, tiny bit better, 702 01:10:40,630 --> 01:10:45,760 but before I was using the auxiliary parameter, I needed like fifteen thousand iterations to get there.
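(A minimal sketch of that over-parameterisation trick, again as JAGS syntax in an R string with illustrative names; the paper cited in the PDF gives the formal treatment.)

  param_expansion_sketch <- "
    beta_raw ~ dnorm(0, 0.01)      # base parameter, not monitored
    xi       ~ dnorm(0, 1)         # auxiliary variable, not monitored
    beta     <- beta_raw * xi      # the coefficient we actually monitor and report
  "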
703 01:10:45,760 --> 01:10:50,740 So actually, it's very useful to know these tricks. Yeah, very, very useful. 704 01:10:50,740 --> 01:10:56,080 Keep these tricks in mind. OK. 705 01:10:56,080 --> 01:11:05,500 The vote choice model: so notice, again, we're specifying essentially three different Poisson regressions that are bound together by a sum-to-zero constraint. 706 01:11:05,500 --> 01:11:14,320 This trick works so long as you introduce an individual-level effect that has an uninformative distribution. 707 01:11:14,320 --> 01:11:22,090 This is well known. If you look at the PDF, there is a citation for this, and you can go and look at that paper. 708 01:11:22,090 --> 01:11:27,520 And it's quite neat, because the Poisson sampler in JAGS, and in Stan, is very stable, 709 01:11:27,520 --> 01:11:32,220 whereas the categorical sampler is quite unstable, and so this converges a lot faster. 710 01:11:32,220 --> 01:11:43,660 OK, and then these are the effects, as I said before. Notice that the way we specify the conditionality is by estimating a 711 01:11:43,660 --> 01:11:49,950 different parameter depending on whether people are predicted to turn out or not. 712 01:11:49,950 --> 01:11:56,230 And so essentially we have two models here. One is a model of vote choice for people who do not turn out, 713 01:11:56,230 --> 01:11:59,410 and one is a model of vote choice for people who do turn out. It's kind of interesting: 714 01:11:59,410 --> 01:12:03,520 I don't bother monitoring the vote choice model for people who do not turn out, 715 01:12:03,520 --> 01:12:07,570 because it wasn't part of the inference project, or rather the prediction project, we were doing here. 716 01:12:07,570 --> 01:12:14,530 But it's kind of cool, because if you want to find out which party would benefit more from getting people to 717 01:12:14,530 --> 01:12:21,220 turn out, a model of vote choice for people who are predicted not to turn out will tell you that, which is kind of neat. 718 01:12:21,220 --> 01:12:33,790 OK. And the stuff below is almost exactly the same as before, with the sum-to-zero constraint that I described before. [Audience] And the Poisson with the zeros and ones? 719 01:12:33,790 --> 01:12:41,140 Yes, so the Poisson distribution is defined over non-negative counts, but the values we feed it are zeros and ones. 720 01:12:41,140 --> 01:12:45,040 And then we have to put in that random effect, sorry, 721 01:12:45,040 --> 01:12:50,560 that individual-level effect, in order for the random effect coefficients to be estimated at 722 01:12:50,560 --> 01:12:55,120 exactly the same values they would take under the categorical distribution rather than the Poisson distribution. 723 01:12:55,120 --> 01:13:00,760 [Audience] And therefore this only works if the distribution of the zeros and ones isn't too lopsided, or something? 724 01:13:00,760 --> 01:13:06,910 Yes, yeah, you're right: if some of your categories are very small, you run into problems. 725 01:13:06,910 --> 01:13:13,570 Yes. [Audience question, partly inaudible, about whether part of the data was held out to validate the predictions.] 726 01:13:13,570 --> 01:13:25,020 No, we didn't, because what we are interested in ultimately, 727 01:13:25,020 --> 01:13:30,030 which is the prediction of the 2020 results, hasn't happened yet: the 2020 results aren't in. 728 01:13:30,030 --> 01:13:33,770 And so we're happy with
729 01:13:33,770 --> 01:13:42,740 building the model with all the information that we have and then putting out the best possible prediction that we can right now. 730 01:13:42,740 --> 01:13:47,720 The other reason is that one thousand five hundred people, for this kind of exercise, is not much, 731 01:13:47,720 --> 01:13:52,130 so you don't really have the luxury to do an 80/20 or a 70/30 split, 732 01:13:52,130 --> 01:13:58,940 because even removing 30 percent of the individuals from this sample would mean that many categories would go empty, 733 01:13:58,940 --> 01:14:02,330 and that would massively decrease the predictive accuracy of your model. 734 01:14:02,330 --> 01:14:07,820 So, yeah, you're right: in an ideal world, we would take fifteen thousand people, 735 01:14:07,820 --> 01:14:12,050 leave five thousand out, fit the model, see how it did on 2016, 736 01:14:12,050 --> 01:14:16,880 do that again three or four times to see whether the model coefficients are stable, and then fit it to the whole thing. 737 01:14:16,880 --> 01:14:24,080 [Audience] But in terms of formal performance, you don't have an out-of-sample prediction metric? 738 01:14:24,080 --> 01:14:29,930 No. But we do have... yeah. 739 01:14:29,930 --> 01:14:38,480 So for the predictive model, you're right, we don't have a formal prediction mechanism, sorry, a formal cross-validation happening. 740 01:14:38,480 --> 01:14:53,910 But we do have the hard validation of whether it works or not with respect to the 2016 results. 741 01:14:53,910 --> 01:14:59,020 [Audience] You need to see the results. Yes. Right. 742 01:14:59,020 --> 01:15:04,450 Yeah. I think, for sure, you're right that we could do better on validation. 743 01:15:04,450 --> 01:15:08,380 So one thing that we could have done is fit it with 2012 data. 744 01:15:08,380 --> 01:15:14,260 But even that is weird, because the question that we asked them about their 2016 behaviour, we asked them today, 745 01:15:14,260 --> 01:15:18,520 so they responded to that survey knowing what the aggregate behaviour was as well. 746 01:15:18,520 --> 01:15:25,240 So it's a weird situation. In principle, the ideal case would have been: we would have done a survey in 2016, 747 01:15:25,240 --> 01:15:34,510 we would have done a survey today, and then we would have checked at each point an out-of-sample metric to test the exact model. 748 01:15:34,510 --> 01:15:36,790 I have to say this is something that the literature doesn't do at all. 749 01:15:36,790 --> 01:15:47,080 So, like, most models... I think because you don't want to run into the problem of having empty cells, and also because data are so scarce, 750 01:15:47,080 --> 01:15:50,230 like, a sample of ten thousand is considered to be massive in this literature, 751 01:15:50,230 --> 01:15:56,680 whereas in reality, for most samples... so Gelman et al., they don't report an out-of-sample metric, 752 01:15:56,680 --> 01:16:01,570 but they could have, because they had three hundred and fifty thousand people. So they could have done that. 753 01:16:01,570 --> 01:16:08,530 Yeah, but the point is well taken. [Audience] I think there is some criticism in general about the flexibility of these multilevel models. 754 01:16:08,530 --> 01:16:13,360 Yeah. That by adding random effects, right? Yeah. Just to put it in perspective. 755 01:16:13,360 --> 01:16:21,430 Yeah. You have so many degrees of freedom, basically fitting the data, that any goodness-of-fit measure within the data
756 01:16:21,430 --> 01:16:23,740 means you are basically overfitting right away. 757 01:16:23,740 --> 01:16:31,990 So my question would be: did you evaluate the variance of the random effects, and whether it becomes more like... 758 01:16:31,990 --> 01:16:36,520 We didn't evaluate that. Are you asking whether we did a level-specific R-squared? 759 01:16:36,520 --> 01:16:42,240 We didn't do that. No, because... yeah. 760 01:16:42,240 --> 01:16:46,040 Sorry. You're right, sorry. 761 01:16:46,040 --> 01:16:57,400 So, for the recording: the question has been about the predictive accuracy of this model outside of the sample the model has been trained on, and in particular 762 01:16:57,400 --> 01:17:03,550 whether we did any cross-validation. And so, I don't know if I take that point about... 763 01:17:03,550 --> 01:17:11,200 so, I agree that any model that is tested on its training set, let's say, is going to have overfitting problems. 764 01:17:11,200 --> 01:17:18,730 But I don't take the point that multilevel regression is particularly bad at dealing with that; it's better at dealing with that than linear regression, 765 01:17:18,730 --> 01:17:21,970 and it's worse at dealing with that than a random forest, for instance. 766 01:17:21,970 --> 01:17:28,840 Because the shrinkage effects are a form of regularisation: 767 01:17:28,840 --> 01:17:34,900 they're meant to be there in order to help you filter out the noise. [Audience] So the thing 768 01:17:34,900 --> 01:17:41,110 I'm referring to is basically that multilevel effects are basically the same as a random error 769 01:17:41,110 --> 01:17:46,910 in a normal linear regression, right? Yes, in a way; you can read them like an additional error term, 770 01:17:46,910 --> 01:17:49,390 yeah, if you think of it in the classical sense. Yeah. 771 01:17:49,390 --> 01:17:58,690 And therefore it would be quite strange to judge the performance of a model by its fixed-effect coefficient estimates plus the residuals, right? 772 01:17:58,690 --> 01:18:02,440 Because then by definition the fit is perfect. No, for sure. 773 01:18:02,440 --> 01:18:06,700 No, I get what you're saying. Yes, I get what you're saying. 774 01:18:06,700 --> 01:18:12,370 Yeah, we can discuss it more later, but I'm pretty sure your point is well taken: there is no cross-validation measure here. 775 01:18:12,370 --> 01:18:22,250 Yeah. OK. And so, yeah, we finish specifying the model, and you'll get to play around with this model and the code, 776 01:18:22,250 --> 01:18:34,550 so don't worry if it looks daunting at the moment. We tell JAGS which parameters we want to monitor, through this thing here, 777 01:18:34,550 --> 01:18:43,460 and we then tell JAGS to run four chains for seven thousand iterations, burning 778 01:18:43,460 --> 01:18:52,520 the first six thousand. Burn-in is, you remember when I showed you that image of the chains, and they converged after the first three hundred iterations? 779 01:18:52,520 --> 01:18:58,400 Burn-in means we just throw the first ones away, because we know that they are not independent and they have not converged, 780 01:18:58,400 --> 01:19:03,180 whereas the last one thousand are assumed to have converged. 781 01:19:03,180 --> 01:19:08,600 And so it's as if, with every retained sample, we are taking a new value from the joint distribution. 782 01:19:08,600 --> 01:19:15,290 So the effective sample size is a thousand, essentially. We thought that was very good.
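(For reference, a sketch of what that run looks like from R with the rjags package; the model file name, data object and monitored parameter names are illustrative assumptions, and the iteration settings follow the numbers he quotes.)

  library(rjags)

  model <- jags.model("turnout_vote_model.txt",        # JAGS model file (illustrative name)
                      data = jags_data,
                      n.chains = 4)

  update(model, n.iter = 6000)                         # burn-in: discard the first 6,000 iterations

  samples <- coda.samples(model,
                          variable.names = c("alpha", "a_state", "beta"),  # parameters to monitor
                          n.iter = 1000, thin = 4)     # keep 250 draws per chain, 1,000 in total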
783 01:19:15,290 --> 01:19:23,440 [Audience question, partly inaudible, about why four chains rather than two.] So, two answers to that. 784 01:19:23,440 --> 01:19:26,490 One: two is good, but you can never be too sure. 785 01:19:26,490 --> 01:19:35,730 And two: you can run JAGS in parallel in R, and the way the parallelisation works is that each chain is run in parallel. 786 01:19:35,730 --> 01:19:43,670 So instead of running a single chain with... oh yeah, sorry, this mike thing is a bit hard, sorry. 787 01:19:43,670 --> 01:19:47,490 The question was: why do we run four chains instead of two chains? 788 01:19:47,490 --> 01:19:54,090 And the answer is, A, because you can never be sure enough, and B, because of the way you can run chains in parallel in R, 789 01:19:54,090 --> 01:20:02,340 which means that you can run four chains with two hundred and fifty useful draws each. 790 01:20:02,340 --> 01:20:07,950 So if you think that your model is going to converge after six thousand five hundred iterations, 791 01:20:07,950 --> 01:20:12,840 then, if you want a thousand values at the end, you can either 792 01:20:12,840 --> 01:20:20,010 run two chains and keep five hundred draws from each, or you can run four chains and keep two hundred and fifty from each. 793 01:20:20,010 --> 01:20:26,770 And because you can run them in parallel, running four chains of 250 each is faster. Does that make sense? 794 01:20:26,770 --> 01:20:31,420 Yes. Yes. [Audience] We run four different chains, yeah, yeah, 795 01:20:31,420 --> 01:20:39,250 but each chain is itself a Markov chain, like its own distribution. 796 01:20:39,250 --> 01:20:43,540 So from which of these chains are we drawing our values? 797 01:20:43,540 --> 01:20:48,310 That's a good question. So what you do is, you have these four chains... yeah, the question was recorded, 798 01:20:48,310 --> 01:21:02,770 so I'll paraphrase it for the others. The four chains reach convergence after the six thousandth iteration, at which point from each chain we keep every fourth draw, 799 01:21:02,770 --> 01:21:04,510 so there's a thinning factor of four, 800 01:21:04,510 --> 01:21:12,100 which leaves two hundred and fifty observations for each chain, and we stack them up as if they were all from a single chain. 801 01:21:12,100 --> 01:21:18,760 And then we just use the stack of one thousand observations, because, think about it: given that they have converged, 802 01:21:18,760 --> 01:21:23,080 or at least we assume that they have converged, they should be from the same distribution, so we can just stack them up, right? 803 01:21:23,080 --> 01:21:27,400 Yeah, that's the idea. Any more questions? 804 01:21:27,400 --> 01:21:35,010 OK. This model takes about four hours to converge, so yeah, it's a bit painful. 805 01:21:35,010 --> 01:21:41,550 It used to take a lot longer; I stressed out about this. But what you're going to play with in class is going to be something a lot simpler, I think. 806 01:21:41,550 --> 01:21:46,230 So we're going to pick three or four categories, and we're going to let you 807 01:21:46,230 --> 01:21:50,010 fit a model only for vote choice and only for those three or four categories. 808 01:21:50,010 --> 01:22:00,210 So that should be quite fast. I would expect it to take about 10 to 15 minutes or so. And the Gelman-Rubin statistics are shown here.
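(A sketch of how those two steps, the convergence check and the stacking, look with the coda package in R; 'samples' is assumed to be the mcmc.list returned by coda.samples above.)

  library(coda)

  gelman.diag(samples, multivariate = FALSE)   # Gelman-Rubin statistics: values near 1 suggest convergence
  draws <- as.matrix(samples)                  # stack the converged chains into one matrix of draws
  nrow(draws)                                  # 4 chains x 250 retained iterations = 1,000 draws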
809 01:22:00,210 --> 01:22:07,080 So notice that some parameters, these guys here, are having a bit of a hard time converging, 810 01:22:07,080 --> 01:22:12,870 but they're actually very close to 1.1, which means that if we ran a few more iterations they would probably converge. 811 01:22:12,870 --> 01:22:18,840 Yes? [Audience] Just a quick question on your opinion: lately people have been using INLA. 812 01:22:18,840 --> 01:22:26,130 Yeah, yeah. Great, really, really good, because of the speed especially: JAGS is super slow and INLA is super fast. 813 01:22:26,130 --> 01:22:32,430 INLA, integrated nested Laplace approximation, is a newer piece of Bayesian software 814 01:22:32,430 --> 01:22:38,190 which, instead of using the Gibbs sampler, uses the approximation technique I just mentioned. 815 01:22:38,190 --> 01:22:44,220 And it has one big advantage, which is speed: massive, massive gains in speed. 816 01:22:44,220 --> 01:22:48,720 Its disadvantages are that it's hard to bring in data from multiple sources in INLA, 817 01:22:48,720 --> 01:22:58,410 whereas here I can literally have two models linked to each other which estimate things from two completely different data sources. 818 01:22:58,410 --> 01:23:02,190 I could have the British Election Study inform my turnout model, which is actually what is done, 819 01:23:02,190 --> 01:23:06,000 and then I could have my survey inform the vote choice model, which is what people do all the time. 820 01:23:06,000 --> 01:23:12,570 And you can stack those into the same model, which is super cool, and you can find nonlinear ways to join the two models. 821 01:23:12,570 --> 01:23:14,820 So that's really, really fun stuff you can do with JAGS. 822 01:23:14,820 --> 01:23:21,300 It's super flexible, and so is Stan; in INLA they're not quite there yet. So if you're like a genius, 823 01:23:21,300 --> 01:23:28,470 you can figure out a way to introduce this new information via priors and penalised complexity priors and so on and so forth, 824 01:23:28,470 --> 01:23:33,210 but at the moment it works more like a glm-type interface: 825 01:23:33,210 --> 01:23:39,900 it has a set of pretty standard models that you can run with it. And it also has the disadvantage 826 01:23:39,900 --> 01:23:46,830 that you will never be able to run mixture models on it, because they violate the assumption of a Gaussian latent field, 827 01:23:46,830 --> 01:23:54,390 which is the underlying assumption of the integrated nested Laplace approximation. So, yeah, but it's great. 828 01:23:54,390 --> 01:23:55,810 Yeah. Very good. 829 01:23:55,810 --> 01:24:05,100 INLA is a great option; I use it for a lot of stuff, but not for this, because I needed to bring in a lot of nonlinearities. 830 01:24:05,100 --> 01:24:10,660 But actually, you could use it for this. We can talk about it later if you like. 831 01:24:10,660 --> 01:24:20,020 OK, so we have this model and it has converged, and now it's time to show, well, the fact that it has converged means that we now have this predictive 832 01:24:20,020 --> 01:24:26,980 machine, and we can use this predictive machine to make predictions about individual categories, or categories of interest. 833 01:24:26,980 --> 01:24:31,210 Which means that, you know, we need to set up these categories of interest.
834 01:24:31,210 --> 01:24:37,360 We need to find out how many people from a specific category live in, say, the state of Texas, which we want to make the prediction for. 835 01:24:37,360 --> 01:24:45,130 And we take these numbers from the American Community Survey microdata, which are available online; 836 01:24:45,130 --> 01:24:48,710 at the end of this talk I will show you how to download them. Yes, sorry, 837 01:24:48,710 --> 01:24:56,560 I should have put that as a question. Sorry. And we break down the population into the following characteristics. 838 01:24:56,560 --> 01:25:02,050 So we break it down into gender, two categories; age, six categories; race, five categories; education, 839 01:25:02,050 --> 01:25:05,980 four categories; household income, three categories; and state, fifty-one categories. 840 01:25:05,980 --> 01:25:09,340 So this amounts to a total of thirty-six thousand seven hundred cells, 841 01:25:09,340 --> 01:25:15,160 of which only twenty-nine thousand are actually populated in the microdata. 842 01:25:15,160 --> 01:25:19,450 So that means that there are some cells that are so rare that a sample of, I think the 843 01:25:19,450 --> 01:25:23,800 microdata is about three and a half million Americans, doesn't contain any of them. 844 01:25:23,800 --> 01:25:27,370 So these are very rare cells. 845 01:25:27,370 --> 01:25:39,160 This is kind of a neat plot, which actually comes from my own research. So, on the bold line, you see, 846 01:25:39,160 --> 01:25:47,650 this is ordered: you take the cells, each cell being a voter category, and you order them from the largest cell to the smallest cell, 847 01:25:47,650 --> 01:25:52,390 where the largest represents the largest proportion of the population and the smallest the smallest proportion. 848 01:25:52,390 --> 01:25:58,030 So some of the cells around the 30,000 index don't even have one person in them, literally. 849 01:25:58,030 --> 01:26:03,190 In fact, around the 30,000th there's one person out of three and a half million who belongs to that particular voter category, 850 01:26:03,190 --> 01:26:09,970 whereas some of the cells at the top have like three hundred thousand people, and so on and so forth. 851 01:26:09,970 --> 01:26:18,980 On the y-axis is the cell probability in the population. So if you were to sample at random from the population represented in the microdata, 852 01:26:18,980 --> 01:26:23,560 the largest category, after you cut it up in the way that we described, 853 01:26:23,560 --> 01:26:28,630 has about a two-in-a-thousand probability of being sampled. 854 01:26:28,630 --> 01:26:35,260 OK, so these are very small categories. This is a very finely defined target stratification. 855 01:26:35,260 --> 01:26:39,010 The dotted line represents the cumulative distribution. 856 01:26:39,010 --> 01:26:47,230 So that means that as you sum the sizes of these cells, you eventually get to 100 percent of the population. 857 01:26:47,230 --> 01:26:52,810 The kind of neat thing about it is that even though we have thirty thousand cells, 858 01:26:52,810 --> 01:26:57,940 the largest five thousand make up about 80 percent of the population, 859 01:26:57,940 --> 01:27:03,610 which means that, actually, if you start thinking in terms of power, how powerful does your sample need to be?
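(As a rough sketch of how such a frame can be assembled in R, assuming an illustrative data frame 'acs' of ACS microdata with one row per person and made-up column names; a real frame would also use the ACS person weights rather than raw row counts.)

  library(dplyr)

  frame <- acs %>%
    count(female, age_group, race, education, income_group, state, name = "N") %>%  # people per cell
    mutate(share = N / sum(N))                                                       # cell shares of the population

  nrow(frame)   # only populated cells appear; empty combinations are simply absent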
860 01:27:03,610 --> 01:27:09,100 Well, if your sample is powerful enough to capture the dynamics of the top five thousand, 861 01:27:09,100 --> 01:27:15,100 then, so long as there is not too much heterogeneity in vote choice in the 862 01:27:15,100 --> 01:27:20,080 remaining 20 percent, you're actually fine. So that's why, 863 01:27:20,080 --> 01:27:25,570 even though, if you did the maths, in order to have a sample that's powered 864 01:27:25,570 --> 01:27:31,390 at the 90 percent level and have it work for an MRP, a prediction and 865 01:27:31,390 --> 01:27:34,270 poststratification, approach, you might need like a hundred thousand people, 866 01:27:34,270 --> 01:27:37,540 the reason why it works so well with smaller samples, like ten thousand or one thousand 867 01:27:37,540 --> 01:27:42,670 five hundred, is that you only need to get the top five thousand cells right. 868 01:27:42,670 --> 01:27:46,630 The rest you can kind of ignore, so long as, obviously, like, this is not true 869 01:27:46,630 --> 01:27:52,750 if you happen to live in a federation where each state is completely heterogeneous from the others; 870 01:27:52,750 --> 01:27:57,260 then, you know, you cannot do this kind of work there. Yeah. 871 01:27:57,260 --> 01:28:02,900 OK, so this is, sorry, this is a kind of descriptive picture of what the poststratification cells look like, 872 01:28:02,900 --> 01:28:09,860 essentially, if you order them and rank them, et cetera. And then 873 01:28:09,860 --> 01:28:14,300 this is what they actually look like in practice. 874 01:28:14,300 --> 01:28:18,380 So, on the left, 875 01:28:18,380 --> 01:28:22,910 these are the variables that define a given category. 876 01:28:22,910 --> 01:28:30,990 So the top category is females of Hispanic origin, between the ages of forty-four and fifty-four, 877 01:28:30,990 --> 01:28:40,590 who are college graduates, who earn between zero and fifty thousand dollars, and who live in the state of Florida. 878 01:28:40,590 --> 01:28:44,940 And there are two hundred and forty-one such people in the microdata sample. 879 01:28:44,940 --> 01:28:52,980 And if you look at the very bottom, you have females of another race category, 880 01:28:52,980 --> 01:28:57,780 college graduates, who earn the same amount, and there's only one of them in the whole sample. 881 01:28:57,780 --> 01:29:05,220 So these counts, these cell counts, are what you're going to use to stratify your predictions from the multilevel regression model. 882 01:29:05,220 --> 01:29:10,700 Does that make sense? And we're going to look at how exactly this poststratification happens. 883 01:29:10,700 --> 01:29:16,640 Before we do that, I want you to have a look at how unrepresentative the MTurk sample actually is. 884 01:29:16,640 --> 01:29:19,700 And if you look at this, on the x-axis 885 01:29:19,700 --> 01:29:29,900 you have the population proportions and on the y-axis you have the sample proportions, and you can see that our sample is quite far off. 886 01:29:29,900 --> 01:29:36,870 So our... sorry, 887 01:29:36,870 --> 01:29:46,280 I may have swapped the labels here; hold on, are they on the same scale? No... but it doesn't matter so much. 888 01:29:46,280 --> 01:29:51,890 Well, in any case, look: they're different, that's what matters. The age distribution is different, 889 01:29:51,890 --> 01:29:55,430 the presidential vote is different. Yeah. I think... sorry, guys.
890 01:29:55,430 --> 01:30:10,110 I think I swapped the labels, so... In reality, our sample over-represented Hillary Clinton voters, so our sample had more of them. 891 01:30:10,110 --> 01:30:17,990 We have more... yes, that's right. Do we have more young people here? 892 01:30:17,990 --> 01:30:23,090 In any case, that doesn't matter too much, I'm a bit fried, but the important thing is that the sample and the population are different. 893 01:30:23,090 --> 01:30:28,550 That's the main takeaway here. If they were not different, 894 01:30:28,550 --> 01:30:33,440 the points in these plots would lie exactly on the diagonal. 895 01:30:33,440 --> 01:30:39,500 Yes? [Audience] The issue 896 01:30:39,500 --> 01:30:45,680 sort of remains, though, that there are other selection effects that are not based on the stratification variables. 897 01:30:45,680 --> 01:30:47,940 Yes. [Audience] And normally you would say, OK, 898 01:30:47,940 --> 01:30:57,020 we do a random sample after stratification, and therefore anything else that's left, which might affect whatever we're interested in, is randomly distributed. 899 01:30:57,020 --> 01:31:00,440 Yes. [Audience] Which, obviously, this doesn't help with or doesn't do anything about. 900 01:31:00,440 --> 01:31:04,360 So I'm just wondering why the stratification attempt is made, 901 01:31:04,360 --> 01:31:10,720 when the reason stratification normally works is that within strata you have a random sample, 902 01:31:10,720 --> 01:31:17,080 so anything that we missed would then presumably be randomly distributed, right? 903 01:31:17,080 --> 01:31:17,590 Yeah. 904 01:31:17,590 --> 01:31:28,640 [Audience] So, therefore, it seems a bit difficult to make the step in between, right, rather than just going straight for the prediction, for instance. 905 01:31:28,640 --> 01:31:35,660 The inference is obviously invalid, right, because... you know. 906 01:31:35,660 --> 01:31:40,930 Yeah. Mm hmm. In this case? Well, but it works. 907 01:31:40,930 --> 01:31:46,840 So why... OK, I think I get your point. I think it's a matter of how much error you're willing to tolerate. 908 01:31:46,840 --> 01:31:55,270 So, like, if you think that your poststratified estimate is going to be of about the same error as a random sample... 909 01:31:55,270 --> 01:31:57,130 So there are two options, right? 910 01:31:57,130 --> 01:32:02,440 You can either do a random sample and just take the point estimate of the random sample, and that's it, 911 01:32:02,440 --> 01:32:07,840 or you can do a non-random sample, stratify, and take the poststratified point estimate. 912 01:32:07,840 --> 01:32:17,680 Both are going to be wrong. But if you can tolerate the level of error of the stratified non-random sample, you can do it a lot more cheaply. 913 01:32:17,680 --> 01:32:26,170 And so, for the same level of error, you need a lot less money to conduct the non-representative sample. [Audience] There's one follow-up on that. 914 01:32:26,170 --> 01:32:34,260 Yeah. Let's say your non-random sample comes from Trump rally supporters: you go there, 915 01:32:34,260 --> 01:32:38,720 and, well, yeah, at some point you find a white male, a female, 916 01:32:38,720 --> 01:32:42,360 yeah, fine, an immigrant, et cetera. 917 01:32:42,360 --> 01:32:47,370 Yeah, you can weight it however you want, but you shouldn't sample on the dependent variable. 918 01:32:47,370 --> 01:32:53,220 You should never sample
anything which is associated with the dependent variable.
919 01:32:53,220 --> 01:32:57,960 All right. Well, yeah, in principle, yes. But in practice,
920 01:32:57,960 --> 01:33:01,350 the residual selection effects are quite small. And the reason why —
921 01:33:01,350 --> 01:33:08,550 I mean the residual selection effects end up being quite small, and you know this because these
922 01:33:08,550 --> 01:33:12,450 post-stratification mechanisms do work.
923 01:33:12,450 --> 01:33:17,250 But you're right: in principle there is an issue with residual selection effects.
924 01:33:17,250 --> 01:33:23,350 There is an issue, yes. Yeah. And it's not just a matter of principle: if we go up,
925 01:33:23,350 --> 01:33:27,030 yeah, if we go up to this table, right,
926 01:33:27,030 --> 01:33:34,210 you find that in the last row there's one person, so we have to accept that effectively no selection effect can be estimated for that group.
927 01:33:34,210 --> 01:33:38,860 But this is a sample of the population, not of the MTurk respondents.
928 01:33:38,860 --> 01:33:43,030 All right. Yeah. So if in the end there was only one person,
929 01:33:43,030 --> 01:33:47,170 right, it's a problem. But what happens there is that shrinkage comes in.
930 01:33:47,170 --> 01:33:54,370 So if you have only one white person in your sample but you have one thousand five hundred non-white people,
931 01:33:54,370 --> 01:33:58,630 then the effect for the white person will be shrunk towards that of the others.
932 01:33:58,630 --> 01:34:02,320 So it's still a problem, because maybe there is a strong "white" effect,
933 01:34:02,320 --> 01:34:06,850 but it's a lot reduced, because the effect is shrunk — the white
934 01:34:06,850 --> 01:34:10,660 effect is estimated with very little precision, because there's only one person,
935 01:34:10,660 --> 01:34:18,640 right? But what if the selection effect operates on, let's say, the swing states? OK.
936 01:34:18,640 --> 01:34:23,850 So, for example, yes, if you only have one person from West Virginia, you're stuffed in that case,
937 01:34:23,850 --> 01:34:30,190 right, because the selection procedure within that state is not random.
938 01:34:30,190 --> 01:34:35,500 That's right. So if you were trying to do an area-level prediction at the state
939 01:34:35,500 --> 01:34:41,140 level, and you were lacking the individuals who make that state different from the others —
940 01:34:41,140 --> 01:34:46,090 individuals who are particular to that state — and you don't have them in your sample,
941 01:34:46,090 --> 01:34:50,680 you're not going to make good predictions. In this case it doesn't matter, because you're not making that kind of prediction.
942 01:34:50,680 --> 01:34:54,190 So for the District of Columbia, within the South region, within...? No, no, no.
943 01:34:54,190 --> 01:34:58,720 Right, exactly. But yes, exactly. So that's a key thing here.
944 01:34:58,720 --> 01:35:03,610 Post-stratification means you are doing weighted averaging,
945 01:35:03,610 --> 01:35:11,170 right? And when you average forecasts — well, there is a rule that if the forecasts are capturing different sources of noise,
946 01:35:11,170 --> 01:35:16,120 you are always going to get a better forecast in the end. And the weights improve that dramatically.
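To make the weighted-averaging point concrete, here is a minimal numerical sketch in Python; the cell counts and predicted shares are invented for illustration and this is not code from the workshop:

    # Post-stratified estimates are weighted averages of cell-level predictions,
    # weighted by census cell counts, so badly predicted cells with tiny counts
    # barely move the result. All numbers below are made up.
    big_cells = [(190_000, 0.50)] * 10   # (census count, predicted share): well-predicted cells
    small_cells = [(10_000, 0.20)] * 10  # badly mispredicted cells (true share also ~0.50)

    def poststratify(cells):
        """Census-count-weighted average of the cell-level predictions."""
        total = sum(n for n, _ in cells)
        return sum(n * p for n, p in cells) / total

    print(poststratify(big_cells + small_cells))  # ~0.485: only ~1.5 points off,
    # even though the small cells are 30 points off, because they carry ~5% of the weight.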
947 01:35:16,120 --> 01:35:25,840 So the idea is that even if you're averaging over ten thousand categories for a given state and a few of those categories are somewhat dodgily predicted,
948 01:35:25,840 --> 01:35:29,320 as long as the important top one thousand are properly predicted,
949 01:35:29,320 --> 01:35:33,700 you're actually going to get a pretty good estimate for that state, even though some of the categories were mispredicted.
950 01:35:33,700 --> 01:35:42,590 Does that make sense? Yes. I guess I'm not understanding how this table is ordered.
951 01:35:42,590 --> 01:35:49,390 They're kind of sorted, but then why does it start at 240?
952 01:35:49,390 --> 01:35:57,190 Oh, the ordering is just — I took the post-stratification frame as it was and sorted it.
953 01:35:57,190 --> 01:36:00,790 Oh, sorry, you mean in reference to the previous chart — it's not the same ordering?
954 01:36:00,790 --> 01:36:05,680 No, sorry. You're actually right: I should have made this slide the same as the other one.
955 01:36:05,680 --> 01:36:14,740 You're right. Sorry, apologies. Yes, Chris.
956 01:36:14,740 --> 01:36:25,210 Thanks. This method seems to be quite reliant on having pretty granular data like this for many millions of respondents.
957 01:36:25,210 --> 01:36:31,730 Yup. I mean, I work on the Middle East and North Africa, and this just doesn't exist.
958 01:36:31,730 --> 01:36:34,310 Yes. You can't get the individual-level data.
959 01:36:34,310 --> 01:36:39,860 So are there ways around this, or is this method just not appropriate for that kind of setting?
960 01:36:39,860 --> 01:36:45,650 So, I once briefly spoke to Jasmine about doing this in Afghanistan,
961 01:36:45,650 --> 01:36:53,750 and she said that the last census in Afghanistan — the last big survey, she said —
962 01:36:53,750 --> 01:36:58,430 had been collected in 2007. So the data are completely outdated;
963 01:36:58,430 --> 01:37:04,100 God knows what has happened in the specific provinces since then.
964 01:37:04,100 --> 01:37:13,970 You'd have to do extra work there: actually predict the cell counts and then apply the post-stratification to the predicted cell counts.
965 01:37:13,970 --> 01:37:22,670 Yeah, it's unlikely to work, I think, in those scenarios. Even if you aggregate a lot of surveys?
966 01:37:22,670 --> 01:37:27,560 Yeah... well, for sure you could try.
967 01:37:27,560 --> 01:37:33,590 So, but again — yeah, the follow-up question was: if you aggregate together a lot of surveys,
968 01:37:33,590 --> 01:37:38,210 could this potentially replace the missing census data?
969 01:37:38,210 --> 01:37:41,270 The answer is yes, to the extent that the surveys are reliable.
970 01:37:41,270 --> 01:37:45,590 So, for example, when I worked on the Indian election,
971 01:37:45,590 --> 01:37:53,210 we used the India Human Development Survey, because the census data was limited in its crosstabs.
972 01:37:53,210 --> 01:37:59,000 So again, another big problem is that you usually have to use microdata, because there are loads of crosstabs that you're interested in, right?
973 01:37:59,000 --> 01:38:02,930 And often published census crosstabs are limited to two or three interactions.
974 01:38:02,930 --> 01:38:10,070 So yes, surveys can be a solution, but usually they have to be augmented via some modelling.
975 01:38:10,070 --> 01:38:14,420 Yeah, that's what I would say. Yeah.
976 01:38:14,420 --> 01:38:20,850 So what do you do when, for example, let's say,
977 01:38:20,850 --> 01:38:29,610 you have a survey and the combination of these stratification variables leads to, as you were saying,
978 01:38:29,610 --> 01:38:31,920 something like two thousand cells being empty,
979 01:38:31,920 --> 01:38:38,340 in terms of calculating the true proportions from the census to be used with the much smaller dataset?
980 01:38:38,340 --> 01:38:42,250 What do you do — do you just exclude them?
981 01:38:42,250 --> 01:38:49,020 So, if that situation arises, there are two things that can happen.
982 01:38:49,020 --> 01:39:01,560 One is that those cells lie down here, in the tail — in which case, as long as those exact cells don't make up the bulk of a specific state that you're interested in, you're fine.
983 01:39:01,560 --> 01:39:02,880 And usually this isn't the case, by the way:
984 01:39:02,880 --> 01:39:08,550 usually these smaller categories are spread out more or less at random across states, so it doesn't really matter.
985 01:39:08,550 --> 01:39:18,180 So in the positive scenario, the cells that you don't have in your sample are down here in the tail, and you ignore them;
986 01:39:18,180 --> 01:39:24,300 it's fine to ignore them. In the negative scenario, they're up here among the big cells, and then you're absolutely screwed.
987 01:39:24,300 --> 01:39:31,330 There's no way to get around that. Yeah. OK.
988 01:39:31,330 --> 01:39:41,920 And so — yes, the sample is very unrepresentative in vote choice. Just going by memory, because I don't know if I can trust these plots,
989 01:39:41,920 --> 01:39:53,110 it over-represents Hillary Clinton quite dramatically, under-represents Donald Trump by a fair bit,
990 01:39:53,110 --> 01:40:02,220 and over-represents third parties. So, after we have estimated these cell-level quantities —
991 01:40:02,220 --> 01:40:09,390 the joint distribution of turnout and vote choice, and the conditional distribution of turnout —
992 01:40:09,390 --> 01:40:18,690 we can then estimate the area-level proportion of people who will vote for a given party and turn out, with this formula here.
993 01:40:18,690 --> 01:40:23,150 So the first quantity is just the number of people in a specific cell,
994 01:40:23,150 --> 01:40:32,940 the second is the joint probability of voting for the party and turning out in that cell, and then you sum over the cells up to S.
995 01:40:32,940 --> 01:40:34,470 Again, I should have paid a little more attention to the notation here,
996 01:40:34,470 --> 01:40:45,510 but you sum over the categories for the specific state and then you recover the area-level estimate for each party.
997 01:40:45,510 --> 01:40:49,620 And you can do this not just at the state level. So look here:
998 01:40:49,620 --> 01:40:54,480 this is conditional on the voter characteristics and the specific party —
999 01:40:54,480 --> 01:40:58,740 right — the specific state, sorry.
1000 01:40:58,740 --> 01:41:03,510 Whereas here we don't bother about the state, because we are interested in aggregating at the national level, and so
1001 01:41:03,510 --> 01:41:10,440 we can just sum over the groups by their weights and obtain the national-level results for each voter category.
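The formula being described is on the slide and not reproduced in the transcript; a plausible rendering, writing N_j for the census count of cell j and \hat{P}_j for the model's predicted joint probability of turning out and voting for the party in cell j, is

    \hat{\theta}_s = \frac{\sum_{j \in s} N_j \, \hat{P}_j}{\sum_{j \in s} N_j},

where the sum runs over the cells j that make up state s. The national-level estimate for a given voter category has the same form, with the sum taken over all cells belonging to that category rather than to a state.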
1002 01:41:10,440 --> 01:41:15,060 Does that make sense for everybody? Yeah. This is just a simple post-stratification weighted average,
1003 01:41:15,060 --> 01:41:18,780 that's all. And these are the results.
1004 01:41:18,780 --> 01:41:29,910 So at the bottom you see the raw Amazon Mechanical Turk estimates, and at the top you see the predicted estimates by state.
1005 01:41:29,910 --> 01:41:37,200 There are a few things to note. First of all, the correlations improve dramatically, and the mean absolute error is shaved down considerably:
1006 01:41:37,200 --> 01:41:46,650 about ten points off for Trump, five points for Hillary, five points for the third parties, and the turnout error is cut by about thirty-two points.
1007 01:41:46,650 --> 01:41:54,720 So we massively improve on the raw sample, and there is clearly this phenomenon called attenuation bias happening.
1008 01:41:54,720 --> 01:42:01,320 Even though the correlation is really high, the variance in the predicted state-level vote shares is very small,
1009 01:42:01,320 --> 01:42:08,280 which means that what you get is the correct ordering of the states, but very much shrunk towards the global mean.
1010 01:42:08,280 --> 01:42:18,420 And that's an effect of the shrinkage, of the general tendency to shrink effects towards the mean.
1011 01:42:18,420 --> 01:42:23,660 And what else should you notice? No, I think that's it.
1012 01:42:23,660 --> 01:42:29,150 So overall I think we can be pretty happy with the results,
1013 01:42:29,150 --> 01:42:36,950 given that at the state level, at least, we have shaved off a lot of error from the raw sample estimates.
1014 01:42:36,950 --> 01:42:46,840 Now the really cool stuff comes in when you look at the national level. So if we look at these distributions for Republicans,
1015 01:42:46,840 --> 01:42:53,610 Democrats and third parties: the solid lines represent the predictions,
1016 01:42:53,610 --> 01:42:57,580 the thick dotted lines represent the actual 2016 result,
1017 01:42:57,580 --> 01:43:05,680 and the small dots represent the raw national-level estimate from the Amazon Mechanical Turk sample.
1018 01:43:05,680 --> 01:43:12,730 And if we look at the errors: our error on Trump's vote share is less than a single percentage point at the national level,
1019 01:43:12,730 --> 01:43:16,600 about zero point five. Our error on Clinton is quite large:
1020 01:43:16,600 --> 01:43:21,730 it's about four percentage points, so we overestimate Clinton by four percentage points.
1021 01:43:21,730 --> 01:43:25,660 Our error on the others is that we underestimate them by about four percentage points.
1022 01:43:25,660 --> 01:43:30,700 So what is happening there is that we must have missed some selection effect that
1023 01:43:30,700 --> 01:43:37,360 would have told us about the substitution between Clinton and the third parties. So we have to go back and think about what that was all about.
1024 01:43:37,360 --> 01:43:42,010 But — sorry, the question was whether this is a state-level result.
1025 01:43:42,010 --> 01:43:48,160 No, this model is aggregating at the national level. There was also the question: does this model use the state-level results?
1026 01:43:48,160 --> 01:43:58,780 The answer is no. It just aggregates, using state as another group, and aggregating over the cells as you would otherwise.
1027 01:43:58,780 --> 01:44:07,150 Yeah.
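As a concrete illustration of the weighted averaging just described, and of how the posterior simulations carry uncertainty through it, here is a minimal Python sketch; the cell counts and the toy posterior draws are invented and are not taken from the workshop model:

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy post-stratification frame for one state: census count per cell.
    counts = np.array([120_000, 310_000, 90_000, 60_000])

    # Pretend posterior simulations of each cell's P(turn out and vote for the party):
    # 1,000 draws x 4 cells. In practice these come from the fitted multilevel model.
    draws = rng.beta(a=[21, 38, 44, 29], b=[79, 62, 56, 71], size=(1000, 4))

    # Post-stratify every draw: census-weighted average of the cell probabilities.
    state_estimates = draws @ counts / counts.sum()

    print(state_estimates.mean())                   # point estimate for the state
    print(np.percentile(state_estimates, [5, 95]))  # 90% interval, no variance formulas needed

Repeating the weighted average once per posterior draw is what gives the state-level intervals, and later the election-level probabilities, essentially for free.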
I mean, I'm pretty chuffed that the Trump vote is within about half a percentage point of the truth. The turnout —
1028 01:44:07,150 --> 01:44:15,130 again, we overestimate turnout, still, even though we had applied that uniform correction of seventeen percentage points.
1029 01:44:15,130 --> 01:44:22,180 But that's not news. People lie a lot about their turnout, and the survey data about turnout that we get from MTurk or elsewhere
1030 01:44:22,180 --> 01:44:23,920 is pretty crap.
1031 01:44:23,920 --> 01:44:33,070 So in our one-thousand-five-hundred-person sample, about 280 people said they didn't turn out, which makes for a very small proportion of people,
1032 01:44:33,070 --> 01:44:39,940 and therefore we probably missed a lot of cases: a lot of groups were predicted to vote at higher levels
1033 01:44:39,940 --> 01:44:43,600 than they actually do. And on the right there's just an Electoral College projection.
1034 01:44:43,600 --> 01:44:46,990 Now, there has been a lot going on with the Electoral College lately —
1035 01:44:46,990 --> 01:44:50,830 a lot of states have signed this pledge that they will give
1036 01:44:50,830 --> 01:44:55,390 their Electoral College votes to the person who wins the national-level result —
1037 01:44:55,390 --> 01:45:02,140 but I ignored that. I just assumed that each state is going to give its Electoral College votes as it usually does.
1038 01:45:02,140 --> 01:45:07,400 And so with this model we would have assigned a probability of Trump winning of about twenty-three percent, which is not great,
1039 01:45:07,400 --> 01:45:12,640 but it's also not terrible, considering that Hillary's vote share was initially overestimated —
1040 01:45:12,640 --> 01:45:19,000 so Trump's vote share was initially underestimated by ten percentage points in the Amazon Mechanical Turk sample.
1041 01:45:19,000 --> 01:45:26,060 Yes? Do you know how that compares to the predictions other people made?
1042 01:45:26,060 --> 01:45:32,620 Yes. So the question was: do I know how this prediction —
1043 01:45:32,620 --> 01:45:35,950 well, this is not a prediction, by the way, so it's kind of an unfair comparison —
1044 01:45:35,950 --> 01:45:40,360 how this estimate compares to some of the pre-election predictions?
1045 01:45:40,360 --> 01:45:46,210 And the answer is: a little better, compared to all of them apart from FiveThirtyEight. So FiveThirtyEight, if I remember correctly,
1046 01:45:46,210 --> 01:45:52,690 had Trump winning with probability 0.27; Wang and the Princeton Election Consortium
1047 01:45:52,690 --> 01:45:57,940 had about a zero percent chance of Trump winning; Drew Linzer had about a one percent chance of Trump winning.
1048 01:45:57,940 --> 01:46:04,840 So, I mean, you can't really compare, because this is post hoc and theirs were pre-election predictions from survey data.
1049 01:46:04,840 --> 01:46:11,080 But given that the survey data has the same chance of being bad this time around as it did last time around,
1050 01:46:11,080 --> 01:46:16,900 this should be encouraging, because it suggests that if you do enough modelling, you can get better results than your average predictor.
1051 01:46:16,900 --> 01:46:22,390 Yeah. Yeah. Hold on — the mike, the mike, please.
1052 01:46:22,390 --> 01:46:31,510 Sorry. So, just to go back one more time to the virtue of the model: if you give me the data of the actual results,
1053 01:46:31,510 --> 01:46:35,110 I can make a model which probably predicts better, right?
1054 01:46:35,110 --> 01:46:38,370 No, but the results are not coming in at the state level, right?
1055 01:46:38,370 --> 01:46:45,620 They're coming in at the individual level. But if you give me all the same data, I could probably do pretty well, right?
1056 01:46:45,620 --> 01:46:53,480 You looked at the past-vote effects, because we asked them who they voted for — yes — and you used the state-level results.
1057 01:46:53,480 --> 01:47:02,610 Yeah, in the individual-level model. Yeah. So it kind of seems like maybe a bit of a workaround. As I said to you before,
1058 01:47:02,610 --> 01:47:06,520 instead of the 2016 results you could have used the Republicans' 2012 results.
1059 01:47:06,520 --> 01:47:11,850 Yeah, but there were a number of conceptual reasons why I still think that's wrong, because, again,
1060 01:47:11,850 --> 01:47:15,300 we fielded the survey after the election, not before it.
1061 01:47:15,300 --> 01:47:18,650 So this is not a prediction exercise; it's a replication exercise.
1062 01:47:18,650 --> 01:47:24,090 But in principle, if you were to do a prediction exercise, as you will do in the workshop,
1063 01:47:24,090 --> 01:47:31,740 you are going to be using the 2016 results for predicting Biden's vote, or whoever else's.
1064 01:47:31,740 --> 01:47:38,220 But I agree that this is not a prediction effort; it's a re-estimation, a replication effort, for sure.
1065 01:47:38,220 --> 01:47:42,820 OK, yeah. But — and I want to make sure this point comes across —
1066 01:47:42,820 --> 01:47:48,630 understand that at this level, the post-stratification level, the state-level results from last time are not playing any role.
1067 01:47:48,630 --> 01:47:54,090 They played a role in the individual-level model, insofar as they were correlated with individual-level responses.
1068 01:47:54,090 --> 01:47:59,790 And actually, the 2016 results are probably more correlated with
1069 01:47:59,790 --> 01:48:06,270 individual-level responses today about 2016 behaviour than the 2012 results would be.
1070 01:48:06,270 --> 01:48:10,110 But it wouldn't have been that far off, because state-level behaviour is actually
1071 01:48:10,110 --> 01:48:14,130 pretty stable across time, and the correlation across states is pretty stable.
1072 01:48:14,130 --> 01:48:18,240 So obviously we should have rerun it using 2012 and seen what would have happened,
1073 01:48:18,240 --> 01:48:24,220 but you shouldn't think that it would have been that far off, because the state-level results weren't that different from 2012 to 2016.
1074 01:48:24,220 --> 01:48:26,940 But I'm just saying, as a hypothetical model:
1075 01:48:26,940 --> 01:48:34,110 I could use a very simple set of data — namely, I just take one person from every state and assign that person
1076 01:48:34,110 --> 01:48:38,640 the state-level result, 0.52
1077 01:48:38,640 --> 01:48:44,370 if it was fifty-two percent, and then I just blow it up by the number of people living in the state.
1078 01:48:44,370 --> 01:48:48,760 And then I get perfect results, right? Yeah, but that's not what we are doing right now.
1079 01:48:48,760 --> 01:48:53,340 I mean, you use it at the individual level and then you blow it up, right?
1080 01:48:53,340 --> 01:48:58,560 So in principle it's the same thing — each state as one huge individual.
1081 01:48:58,560 --> 01:49:02,160 So you're saying — well, yes: if you assign the state-level result perfectly,
1082 01:49:02,160 --> 01:49:04,950 I agree with you. But we're not assigning the state-level result perfectly;
1083 01:49:04,950 --> 01:49:09,060 we're assigning the correlation between the state-level result and the individual response.
1084 01:49:09,060 --> 01:49:15,300 But then again, from the virtue-of-the-model perspective — also from an application perspective — you could probably do better,
1085 01:49:15,300 --> 01:49:19,730 more easily, right? Yeah. Yes. Yes.
1086 01:49:19,730 --> 01:49:27,330 Yes. Of course you could. Of course using the 2016 results leads to a better estimate than the 2012 results, one hundred percent.
1087 01:49:27,330 --> 01:49:32,880 I don't dismiss that. The point is that we're trying to show that post-stratification matters here.
1088 01:49:32,880 --> 01:49:38,740 Yeah. Any other questions?
1089 01:49:38,740 --> 01:49:44,200 OK. And so —
1090 01:49:44,200 --> 01:49:51,140 we're almost there, guys. There are a few considerations here.
1091 01:49:51,140 --> 01:49:59,390 One is that we would have done a lot better if we had improved the sample, and we certainly could have increased the sample size.
1092 01:49:59,390 --> 01:50:05,360 Sample size is actually a very contentious issue at the moment in this literature.
1093 01:50:05,360 --> 01:50:11,810 One other name for this approach, by the way — prediction and post-stratification — is MRP, multilevel regression and post-stratification.
1094 01:50:11,810 --> 01:50:20,360 And some people have suggested that, for estimating area-level results from nationally representative surveys,
1095 01:50:20,360 --> 01:50:25,850 MRP does a lot better with higher sample sizes, so error reduces dramatically
1096 01:50:25,850 --> 01:50:31,700 if you go from a thousand people in your sample to ten thousand people. For Amazon Mechanical Turk specifically,
1097 01:50:31,700 --> 01:50:41,660 so for non-representative samples of this kind, some people — Goel et al. — have suggested otherwise; the name of the paper is something like "non-representative surveys
1098 01:50:41,660 --> 01:50:46,040 are cheap and reliable", or "fast, cheap and mostly accurate", something like that.
1099 01:50:46,040 --> 01:50:52,700 They suggest that there are actually decreasing returns with these kinds of samples, because, as they argue,
1100 01:50:52,700 --> 01:50:59,180 if you look at the total survey error, other sources of error start to crowd out the sampling error.
1101 01:50:59,180 --> 01:51:03,740 And so even if you had ten thousand people, you wouldn't do much better than we did here.
1102 01:51:03,740 --> 01:51:15,770 But the point is, people don't know. And as a potential project of yours, if you'd like to study this, you could come talk to me.
1103 01:51:15,770 --> 01:51:20,720 Sample size for non-representative surveys in a post-stratification context is very important,
1104 01:51:20,720 --> 01:51:27,200 and people don't really know what the [INAUDIBLE] is going on. So that's one of the things. And yeah, that's it.
1105 01:51:27,200 --> 01:51:35,570 So we can take a twenty-minute — or ten, fifteen, twenty minutes, you guys decide — break, and then we'll come to the workshop.
1106 01:51:35,570 --> 01:51:42,970 Yes? The mike, the mike.
1107 01:51:42,970 --> 01:51:46,930 OK, maybe a dummy question after the end of all of this, but there's no other way to ask it.
1108 01:51:46,930 --> 01:51:51,460 So if I understand correctly, this whole exercise is to estimate a proportion.
1109 01:51:51,460 --> 01:51:57,670 That's correct. Well, not just a proportion, by the way. You can also do it with, say, height, weight —
1110 01:51:57,670 --> 01:52:02,620 you can do BMI —
you can do it with any quantity that you want to be representative at the national level
1111 01:52:02,620 --> 01:52:11,740 but for which you only have non-representative estimates. So it is to estimate a single parameter.
1112 01:52:11,740 --> 01:52:21,190 Yeah. Could you also use this same approach, with a non-representative sample, if you wanted to estimate relationships between variables?
1113 01:52:21,190 --> 01:52:26,870 Maybe this is a silly question, but you know what I mean. So if my aim is not to estimate
1114 01:52:26,870 --> 01:52:35,020 the share of people who vote, but, let's say, I want to know whether more educated people are more likely to have children outside of marriage,
1115 01:52:35,020 --> 01:52:40,600 and then I do such a quick and cheap, non-representative survey —
1116 01:52:40,600 --> 01:52:46,080 would I then also be able to infer
1117 01:52:46,080 --> 01:52:54,780 that kind of relationship? I think that's an intriguing question: can we apply post-stratification to coefficients, let's say?
1118 01:52:54,780 --> 01:52:58,260 I need to think about that. Off the top of my head,
1119 01:52:58,260 --> 01:53:02,970 I don't see — maybe you need to do some extra work, but I don't see why you shouldn't.
1120 01:53:02,970 --> 01:53:09,220 Oh, sorry, I don't see why you couldn't. Maybe there are all sorts of reasons why you shouldn't. But yeah, I think we can.
1121 01:53:09,220 --> 01:53:12,510 But this is not something that I have seen in the literature yet,
1122 01:53:12,510 --> 01:53:21,000 partly because I come at it from a perspective where at the moment it's being used for, well, political science —
1123 01:53:21,000 --> 01:53:27,300 predicting who votes — but also disease mapping, like predicting the percentage of influenza cases in different states,
1124 01:53:27,300 --> 01:53:32,040 et cetera. So at the moment it has mostly been proportions, as you say.
1125 01:53:32,040 --> 01:53:39,180 Okay, thanks. Although you do have some prior knowledge — as you well know, rich people are more likely to...
1126 01:53:39,180 --> 01:53:45,430 Yes, things like that. So —
1127 01:53:45,430 --> 01:53:50,200 we can have a conversation about it afterwards. Thank you.
1128 01:53:50,200 --> 01:53:54,810 Yeah, I think it's a very good question to think about.
1129 01:53:54,810 --> 01:54:03,720 The existing literature on prediction and post-stratification has really been about forecasting elections or things like,
1130 01:54:03,720 --> 01:54:06,660 you know, just trying to replicate survey questions.
1131 01:54:06,660 --> 01:54:16,530 I mean, the general concept here is to go from a messy, dirty sample to a more reliable sample by essentially weighting, right?
1132 01:54:16,530 --> 01:54:21,600 This is just a weighting exercise. So it could actually be an interesting question to try.
1133 01:54:21,600 --> 01:54:29,760 Maybe it's a group project or something else to think about: maybe doing a survey with questions that you might be interested in, right?
1134 01:54:29,760 --> 01:54:39,450 For example, taking a sociodemographic survey, doing it on a non-representative sample, using a platform such as Mechanical Turk,
1135 01:54:39,450 --> 01:54:43,800 and then finding a relationship that you're interested in —
1136 01:54:43,800 --> 01:54:50,220 essentially some correlation that is well documented — and then seeing to what
1137 01:54:50,220 --> 01:54:55,290 extent that holds in the non-representative sample, and then comparing it with something that you trust more,
1138 01:54:55,290 --> 01:55:01,710 which is, of course, much more expensive. So: taking a correlation from something like the British Household Panel Study or Understanding Society,
1139 01:55:01,710 --> 01:55:09,930 estimating it on a non-representative sample, and seeing whether doing the reweighting gets you within the bounds of that.
1140 01:55:09,930 --> 01:55:18,390 Because, of course, the big sell here is that you have a thousand-person survey which is being done for a thousand dollars, which is very cheap.
1141 01:55:18,390 --> 01:55:21,870 And so it's potentially scalable, if it works like that.
1142 01:55:21,870 --> 01:55:32,410 So I think it's an interesting question that could even be worth exploring as a project.
1143 01:55:32,410 --> 01:55:47,720 So, yeah — I was recently looking at the survey literature on calibration and raking algorithms, but I don't often see it
1144 01:55:47,720 --> 01:55:52,400 linked with Bayesian statistics. Tell me a bit more:
1145 01:55:52,400 --> 01:56:00,310 the survey-calibration literature doesn't seem to be using the Bayesian approach — so what's the link there?
1146 01:56:00,310 --> 01:56:08,500 So — the MRP studies that... let me be precise.
1147 01:56:08,500 --> 01:56:11,680 You don't need Bayesian methods to do this sort of stuff. You can do it with frequentist methods.
1148 01:56:11,680 --> 01:56:19,300 There's absolutely no problem: instead of running the model through JAGS, feel free to use a frequentist multilevel model, something like lmer, and fit the model that way.
1149 01:56:19,300 --> 01:56:26,230 That's number one. The advantage of the Bayesian approach is that, because you obtain simulations from your posterior,
1150 01:56:26,230 --> 01:56:35,110 you don't need to rely on sampling assumptions in order to calculate a specific predictive distribution.
1151 01:56:35,110 --> 01:56:39,100 You get the predictive distribution directly from the model. You don't need to do any extra work.
1152 01:56:39,100 --> 01:56:43,870 And so — maybe it's me being lazy, who knows —
1153 01:56:43,870 --> 01:56:47,320 but for me, having these simulations really helps conceptualise the problem.
1154 01:56:47,320 --> 01:56:55,130 And also — yeah, how would that be different? Because it's not completely separate; it's quite linked.
1155 01:56:55,130 --> 01:57:01,720 Mm-hmm. We use, for instance, bootstrapping in France to derive calibration weights.
1156 01:57:01,720 --> 01:57:07,410 Yeah — how would that be different from a Bayesian simulation?
1157 01:57:07,410 --> 01:57:13,010 So that's a method of deriving the weights. Yeah, but also —
1158 01:57:13,010 --> 01:57:17,510 and the uncertainty. Yeah, and the uncertainty around the weights — so that's a fair point.
1159 01:57:17,510 --> 01:57:24,200 So, I'm not overly familiar with that specific method, but from my general understanding of bootstrapping and that kind of thing,
1160 01:57:24,200 --> 01:57:29,660 you would obtain some measure of uncertainty.
1161 01:57:29,660 --> 01:57:39,710 I think that the assumptions that lie behind the bootstrapping are different from the ones that come with the Bayesian models that I showed you here.
1162 01:57:39,710 --> 01:57:43,160 What I am familiar with is the Bayesian side: the assumptions behind the Bayesian models are very simple.
1163 01:57:43,160 --> 01:57:46,310 You can derive them from the handful of basic probability axioms,
1164 01:57:46,310 --> 01:57:52,460 so you don't need to do any extra work to specify new assumptions beyond what a probability is.
1165 01:57:52,460 --> 01:57:57,260 And so that, to me, is a very intuitive thing. But that doesn't mean you can't do it the other way:
1166 01:57:57,260 --> 01:58:01,400 feel free to replicate these exercises using the bootstrap.
1167 01:58:01,400 --> 01:58:07,880 The bottom line is that both get you there. The important thing for me is that we get it into our minds that you have to have uncertainty estimates —
1168 01:58:07,880 --> 01:58:13,970 you have to have uncertainty estimates — and a lot of the applications often don't.
1169 01:58:13,970 --> 01:58:22,910 And it's a problem, because then you can't do calculations like the probability of Trump winning, like we saw here.
1170 01:58:22,910 --> 01:58:30,110 This 0.23 comes about because, out of a thousand simulations, Trump wins about two hundred and thirty times.
1171 01:58:30,110 --> 01:58:34,290 But if we didn't have uncertainty over the state-level estimates, you couldn't do that calculation.
1172 01:58:34,290 --> 01:58:40,040 So it's really important that you do have the uncertainty. But you're right: you can do it in other ways.
1173 01:58:40,040 --> 01:58:47,840 There are other ways of deriving weights and finding distributions around the weights.
1174 01:58:47,840 --> 01:58:54,570 If that's all, maybe we'll take a break. Yeah. Thank you.
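The probability-of-winning calculation mentioned just above (roughly 230 Trump wins out of 1,000 simulations) can be sketched as follows; the states, electoral-vote counts and posterior summaries are invented stand-ins, and in practice you would feed in the actual post-stratified posterior draws rather than normal approximations:

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy subset of states with electoral votes and invented posterior summaries
    # of the two-party Democratic vote share in each state.
    electoral_votes = {"FL": 29, "PA": 20, "WI": 10}
    dem_mean = {"FL": 0.49, "PA": 0.50, "WI": 0.51}
    dem_sd = {"FL": 0.02, "PA": 0.02, "WI": 0.02}

    n_sims = 1_000
    rep_ev = np.zeros(n_sims)
    for state, ev in electoral_votes.items():
        dem_share = rng.normal(dem_mean[state], dem_sd[state], n_sims)
        rep_ev += ev * (dem_share < 0.5)   # Republican carries the state in that simulation

    threshold = sum(electoral_votes.values()) / 2
    print("P(Republican win) =", (rep_ev > threshold).mean())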