Well, welcome today. I have Rafael Perera with me, who is the University Lecturer in Statistics at the Department of Primary Care Health Sciences.

Good day to you. Good morning.

So Rafael's expertise is in statistics; he runs the monitoring group here in Oxford and has been working in primary care health statistics. And he's been a good friend of mine for some 15 years now. So we're going to look today at three different papers while we have this discussion. The first of them is "Absence of evidence is not evidence of absence" by Doug Altman. The second is a really interesting paper which does date back quite a bit, to 1763: "An Essay towards solving a Problem in the Doctrine of Chances" by the late Reverend Mr. Bayes. And the final paper is "Generalised Linear Models". So we're really going to see how we get on with that. So let me start.

Let's say I'm an up-and-coming student. I'm doing a PhD and I've come to see a statistician. What do you think are the best ways to prepare? What do you think people should really know about statistics before they, say, come to you? They may be an epidemiologist, maybe a clinician, but what are the sort of problems you see where you think, well, you could have actually solved that yourself?

I think the main issue that we deal with day to day is that people believe that statistics is mainly about a series of techniques that you employ once you have your data, that will give you an answer, or the answer that you want. And more and more, fortunately, people are beginning to realise that it's not about that. Statistics is really a way of thinking. It's more about trying to understand how the data are generated and, for the same reason, what sort of data would be the most appropriate to answer a particular question. So really people need to be thinking about statistics, or study design, whatever you want to call it, from the very early phase of a study, and you want to think about the reasons why you want to collect information, and what's the best possible information you possibly can collect to answer the question that you have in mind. And once you have that, if you're thinking adequately early on, then the methods you're going to use could be very simple methods. Sometimes they have to be slightly more complex, sometimes slightly more technical. But generally, if you have your question set up right and gather the data right, then the methods that you're going to employ will be relatively simple ones.
So let me come in on that point, because I think over the years I've seen this problem a lot. I'd say it's one of the major problems. People say, I have a bit of data, can I see a statistician? And actually the issue is: you have some data, and most of the time, as we try to say, well, what question do you want to ask of this?

That's correct. So the onus is almost reciprocal. If you start early on thinking, okay, what data am I going to be collecting to match the right question, then the methods you're going to be employing are going to be very simple. Where it becomes really complex is when people suddenly come around and say, I've been collecting data for the last 20 years, this is my large dataset, I want to do something with it. Well, what do you want to do? What are the questions that you want to address? And this was very, very good advice that I got from Patil, who was the statistician here when I first arrived in this department. She said, well, the first thing you need to ask is: what's your question? And very often you find that when people come in with data, they don't have a question. And that's crucial, because as a statistician, sometimes I have my own questions, but they may not necessarily match the questions that the clinician or epidemiologist would want to answer.

So let me just relate this to the sort of old trials, and making old trial data available. This is a real concern of people who say, well, if we make data available, people are just going to do what they call dredging. Which, I guess, is the real concern when you have lots of data available to people out there. What does that mean?

So this term you see often bandied about, data dredging. It's seen as a very pejorative term, but I'll come back to why nowadays it's not seen as quite so bad, or has been rebranded into something that is now being called data mining. Data dredging is mainly about exploring information and exploring data without a clear idea of what you want to look at. It's mainly doing loads of different comparisons: it could be, for example, looking at different subgroups, or just looking at any kind of structure that the data might happen to have, and then coming up and saying, this is an answer, this is the answer to a question. So it is almost reverse engineering: ending up with an answer and then going back and asking, oh, what was the question that was going to be answered by that?

Okay. Yeah, that's interesting. But that's a really interesting start. Let's move on to some of these papers.
What is it, a tautology? "Absence of evidence is not evidence of absence." This paper talks a little bit about throwing the term "negative" into the bin. And I'll say it was published in the BMJ by Doug Altman and Martin Bland, whom many people will know from the Bland-Altman plot, which, if we have time, we might come back to. So: by convention, a P value greater than 5%, P greater than 0.05, is called not significant. Randomised controlled clinical trials that do not show a significant difference between the treatments being compared are often called negative.

Yeah.

So, I mean, this artificial P of 0.05: where does that come from, and why do we get into this bind of saying it doesn't work, or it's negative?

The threshold, if we want to call it that, of P 0.05 is a remnant from before we had computers. Before then, what had happened is people had looked at specific, what we call statistical distributions. People might know the normal distribution, which is the main one, a bell-shaped distribution, but there are many other types of distributions, not only that one. And they calculated the specific value for each distribution at different probabilities. So it would be, for example: at a probability of 1%, what would be the value of that distribution? At a probability of 5%, of 10%? And because they calculated these numbers, they became the thresholds that people could then use in their calculations. So they were just very specific values, and the same thing applies for the normal distribution as for other distributions that are very commonly used in other statistical methods, for example the F distribution or the t distribution, which other people may have heard of. Now, the arrival of the computer allows us to calculate the reverse. So for whatever value of x, under the normal distribution for example, what would be the probability from that number onwards, that is, the p value. And with that approach, a threshold of 0.05 became a little bit outdated. What it boils down to is: if you have a certain amount of information, do your calculation and end up with a P value of 0.051, that quantitatively and qualitatively is not different from 0.05.

Hmm. Yeah.

So having a predefined threshold of 0.05 is really a line in the sand which says you'll get it wrong one in 20 times, roughly speaking. But people are moving away from that threshold and thinking more about what the actual p value is. Is it very, very small? Is it relatively large? And making more, if you want, qualitative decisions based around that.
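[Editor's note: a minimal Python sketch of the two directions described above, using scipy; the observed statistic z = 1.93 is an invented illustration, not from the conversation.]

```python
# Before computers: tables gave the critical value of a distribution at a
# fixed probability (1%, 5%, ...). Software lets us go the other way and
# attach an exact p-value to any observed statistic.
from scipy import stats

# Old direction: the critical value of the standard normal at 5% (two-sided)
critical = stats.norm.ppf(1 - 0.05 / 2)
print(f"critical value at 5%: {critical:.3f}")  # ~1.960, the tabulated number

# New direction: the exact two-sided p-value for an observed statistic
z = 1.93
p = 2 * stats.norm.sf(abs(z))
print(f"p-value for z = {z}: {p:.4f}")  # ~0.0536, not meaningfully different from 0.05
```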
Okay, so that's interesting. So one of the things that we often find people doing, and that this paper is about, is more around the sort of sample size issues, and it talks about it here: Freiman and colleagues found that only 30% of a sample of 71 trials published in the New England Journal of Medicine with a P greater than 0.1 were large enough to have a 90% chance of detecting even a 50% difference in the effectiveness of the treatments being compared.

Yeah.

So when we talk about that, there's this sort of terminology, alpha and beta, in the sample size. You see this "80% power to detect a certain difference", with an amount of confidence around that. Could you elaborate, try to explain it? I guess somebody said it's maths with letters, which for us clinicians is really difficult to understand. Just try to explain the terms a bit, so people can understand how you arrive at a sample size calculation.

Okay. So let me give you some brief terminology of what we use, and set the scene of how we calculate these sample sizes. Most of the calculations that we do, or most of the comparisons that we do, are based around means. So, means of something. It could be, for example, a proportion of individuals; a proportion can be thought of as a similar type of mean. So the proportion of individuals that recover in one group, compared to the proportion of people that recover if you give them a different type of treatment. So proportions can be thought of as a kind of mean, if you want. Other types of means would be, for example, the mean blood pressure of individuals in one group: this is the mean blood pressure if you don't give them treatment, this is the mean blood pressure if you give them treatment. So most of these comparisons are around means. Now, means are a really interesting statistic, because the more data you have, the more information you'll have about the overall mean of the population. And that translates into having more precise information around the mean. So if we think about that idea of precision: as we extend the sample, we gather more precision, more information about where the actual mean of that group happens to be. And what happens is, when you compare two groups, if you have bigger numbers in each one of these groups, with that increased precision you increase the certainty about a separation, if a separation exists between these two groups. And that's where the term sample size comes in.
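[Editor's note: a quick simulation of the precision point, with made-up numbers: the spread of a sample mean shrinks as the sample grows, roughly as the standard deviation divided by the square root of n.]

```python
import numpy as np

rng = np.random.default_rng(7)
# Assume systolic blood pressure ~ Normal(140, 20), purely for illustration
for n in (10, 100, 1000):
    sample_means = [rng.normal(140, 20, n).mean() for _ in range(2000)]
    print(f"n = {n:4d}: spread (SD) of the sample mean = {np.std(sample_means):.2f}")
# The spread falls from about 6.3 to 2.0 to 0.63 as n grows
```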
The separation that exists between these two groups is a hypothetical separation, because before we start the experiment, if you want, we don't know that a separation exists. So it comes down to comparing two ideas, two hypotheses. One idea, which we usually call the null hypothesis, the null idea, is that there's no difference between the two groups, in which case it doesn't matter how big the sample size is going to be; in fact, the bigger the sample is, the more precisely we will know that there's no difference between these two things. Versus another hypothesis, another idea, that there is some degree of separation. And it is this degree of separation, which could be ten millimetres of mercury, for example, or 20 millimetres of mercury if we're talking about blood pressure: if we have a clear separation, a very big separation, we need less data, a smaller sample size. If there's a smaller separation between these two samples, we need more data, because we need to increase the precision. And that's where alpha comes in: how much information we need to collect in order to decide whether we think we have evidence to reject one hypothesis or not. And beta is our capacity, in a given experiment, to collect enough data to identify that this separation is actually happening.

Yeah. And so why do some people, like here, go for 90% power, and some people say 80% power? And you put some grant applications in and people say, increase the power. Why do people do that?

If we think about it in terms of repetitions of the same experiment, because it's probably the easiest way to think about it theoretically: if you were to do the randomised controlled trial 100 times with these numbers, let's say you find that the sample size you need is 200 in each group; with those numbers, 200 in each group, if you ran it a hundred times, a power of 80%, or a power of 90%, will tell you how many times, 80 out of 100, or 90 out of 100, you would find a significant result again.

Okay, that makes sense. That's simple, even for me. An interesting one, just to finish this one point about this paper: it was published in 1995 in the BMJ, so it's not that old a paper; it is new compared to the other two. Yeah.
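[Editor's note: a rough simulation of the "run the trial 100 times" view of power described above; the group size, separation and spread are invented. With 200 per group, an SD of 20 mmHg and a true separation of 5.7 mmHg, roughly 80 out of 100 simulated trials come out significant.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, true_diff, sd = 200, 5.7, 20.0  # assumed group size, separation (mmHg), spread
reps = 1000

significant = 0
for _ in range(reps):
    control = rng.normal(140.0, sd, n)              # untreated group
    treated = rng.normal(140.0 - true_diff, sd, n)  # treated group, lower BP
    _, p = stats.ttest_ind(control, treated)
    significant += p < 0.05

print(f"estimated power: {significant / reps:.0%}")  # about 80%
# A bigger separation, or a bigger n, pushes this towards 90% and beyond
```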
And I guess there's something here about probably why systematic reviews came about. It talks about fibrinolytic treatments, mainly streptokinase, in acute myocardial infarction, and an overview of randomised controlled trials, which is what we now know as a systematic review, found a modest but clinically worthwhile and highly significant reduction in mortality of 22%. But only five of the 24 trials had shown a statistically significant effect, with P less than 0.05. And I just wonder, looking across wider research, is power still a problem? Do we see today that many trials are still underpowered and we need to do more, or are we actually moving into an age now where we see much larger trials?

I think there's a combination of things there. For one thing, probably nowadays people are clearer on the idea of power. And if you have a primary outcome, funders would normally be adamant that you need enough information, enough patients, a large enough sample size, to establish whether the primary outcome is going to show a significant finding or not. So the problem of powering on a primary outcome, I'm not saying it's solved, but it's not as big as it was maybe 20 or 30 years ago. However, you have to think that each study will be looking at one particular primary outcome, and there might be a whole host of other secondary outcomes that might be even more clinically relevant. But simply because those events are less likely to be reported, the event rate is smaller, it means that you need much bigger sample sizes to identify an effect in those types of outcomes. Which means that a person doing the study may be focusing on a primary outcome, perhaps a proxy, which is not as clinically relevant as a secondary outcome; and by definition that means that particular study would be underpowered for that secondary outcome. And that means that you would still need to take all these different studies that look at relatively similar populations with similar questions, and combine them all in systematic reviews, hopefully, if not to find the real answer, then to gather enough evidence to say, well, we think this is the right answer.

Yeah. Okay, good. And it's interesting. Let's just finish this paper with this sort of bold statement, which I think is probably worth imprinting on your mind: when we are told that there is no evidence that A causes B, we should first ask whether absence of evidence means simply that there is no information at all. If there are data, we should look for quantification of the association rather than just a P value.
What does that mean? I can understand the first part: if we haven't got the evidence, you shouldn't say this is negative, you should just say, look, we don't know the answer. But this quantification of the association rather than the p value?

So that's, for example: if you have information from randomised controlled trials, and it's still uncertain what the real answer is because we don't have enough information, you might go back and look at observational studies. The classic example might be serious adverse events, where from randomised controlled trials you might get some idea of what the serious adverse events are, but you don't have enough information, because there just aren't enough numbers, and the event rates are very, very small. So you might want to revert back and look at what other sources of information might provide you with more relevant data to explore whether there's an association or not. And that's what that's referring to: you might want to use other forms of information to quantify whether an association might be present.

Okay. Well, that's good. Let's move back now to 1763; I notice this small jump. This is "An Essay towards solving a Problem in the Doctrine of Chances". I notice it was submitted on the 23rd of December 1763, just in time for Christmas. Now, this is interesting. I read this; I actually went over it again yesterday. Some bits of it are complex, but there are some very interesting bits in there. But I guess the Reverend Mr. Bayes is somebody whom most people coming into statistics or epidemiology will have heard of, through the concept of Bayesian reasoning.

That's correct. That's in a way the reason why I chose this paper.

Okay, so give me the reason why you chose it, and tell me, if possible at all, what Mr. Bayes's role in Bayesian reasoning is.

Well, actually, this paper is not about Bayesian reasoning, which is probably one thing that people might find surprising. It's not about the usual thing that people associate with it. Well, it is and it isn't: the main focus of this paper is not Bayesian reasoning. It's about trying to solve a slightly different problem, which you can then regard as a Bayesian problem or a frequentist problem. That separation between the two camps went on for many years, and probably now it's converging, because new statisticians, at least in the last decade or so, don't necessarily pick a side: they use Bayesian or frequentist methods depending on what's more useful.
Anyway, the focus of this paper is trying to come up with a series of proofs of propositions to answer the following, relatively simple, question. If you have a series of independent experiments, and you observe the outcomes of these experiments, let's say tossing a coin: you toss a coin which is an unfair coin, so we don't know the probability exactly. If it were a fair coin, we would know it's a 50-50 chance. But it is an unfair coin, so we don't know what the probability of landing heads or tails is, but we observe a series of heads or a series of tails. And the question that we're trying to answer is: if we have observed all these outcomes of the experiments, can we quantify the chances of this probability of heads, let's say, being between one number and another?

And that's in a way really visionary, because although he doesn't think about it this way, it can be thought of in a frequentist way. It switches the problem around. Most of the time we think: given that we know this coin is fair, 50-50, or given that we know this coin has a probability of, say, 0.3 of being heads and 0.7 of being tails, what is the chance of winning, say, eight times? And this is the other way around. It says: okay, given that we observe all this evidence, how can we bound and calculate, or determine, the probability of heads or the probability of tails?

Now, within this paper, I think it's Propositions 4 and 5, you find the basis of what is called Bayesian reasoning, which is all about, in a way, conditional probability. So: if you have two events, and only if both events happen will you receive a prize, and you observe one event and not the other, how do you update your probability, or calculate the probability, of the first event? And all this conditional probability, if we then relate it to parameters, like for example this probability of heads or probability of tails, gets us back directly to Bayesian reasoning.

So it's interesting. Yeah, I mean, you look at it and it says here: to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times.

That's right.
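[Editor's note: a small Python sketch of the question as Bayes poses it, in modern terms, assuming, as he in effect did, a uniform prior over the unknown probability; the counts and bounds are invented for illustration.]

```python
from scipy import stats

# After h heads and t tails from a coin of unknown bias, with a uniform prior,
# the posterior for the chance of heads is a Beta(h + 1, t + 1) distribution.
h, t = 7, 3
posterior = stats.beta(h + 1, t + 1)

# Bayes's question: the chance that the unknown probability lies between two values
lo, hi = 0.5, 0.9
print(f"P({lo} < p_heads < {hi} | data) = {posterior.cdf(hi) - posterior.cdf(lo):.3f}")
```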
But I guess it goes back to large numbers, because there's a bit here, on page 472, that says the larger the number of experiments we have to support a conclusion, so much the more reason we have to take it for granted.

Yes. And this is at the heart of statistics, if you want. With statistics, we can't prove anything. And that's the major thing that needs to be ingrained in everyone that uses statistics or works in statistics: we can't prove anything. And it's also part of the scientific method. Science allows for uncertainty. We think we know, but we should always be ready to readjust if the information, the evidence, suddenly switches or changes our ideas. So what he's saying here is: the more information we have, the more certain we should be, which makes sense. But it doesn't say we should be certain, which is, I think, a clear distinction.

That's interesting, because I noted yesterday when I was reading it that there's a particular bit here on page 409 where he talks about a person newly born, and a second appearance, or return, of the sun, and the expectation that would be raised in him of a second return; he would know that there were odds of 3 to 1 on some probability of that. So it talks about the fact that, for the sun coming up tomorrow, the probability would increase, but it would never be 100%.

That's correct. Yes.

So basically, statisticians can never be certain of anything.

Yes, that's correct. We can never be certain. We can attach a probability to an answer: I'm 90% sure.

Okay. So I think this bit was really interesting to me. Some of it is a bit complicated for me, but there seems to be a lot in here, even around normal distributions and everything.

And another important thing to highlight about this paper is how much the writing in statistics has changed. I have to be honest, I also had trouble following some of his proofs and some of his arguments. And this is mainly because, although he's using logical argument to present his justification for the different propositions that he's making in the paper, it doesn't use what would now be the main tool that we use, which is algebra. With algebraic terms, for people that have studied maths particularly, things become a lot simpler to follow.
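[Editor's note: returning to the sunrise example a moment ago, a tiny sketch of the same idea via the rule of succession, Laplace's later formalisation of Bayes's result: under a uniform prior, after seeing the sun return n times, the chance it returns again is (n + 1) / (n + 2), which grows with n but never reaches 1.]

```python
# The probability climbs towards certainty but never gets there
for n in (1, 10, 100, 1_000_000):
    print(f"after {n} returns: P(next return) = {(n + 1) / (n + 2):.6f}")
```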
And the second paper that we have does use quite complex algebra, but it's still relatively easy to follow compared to the 50 pages or so the Reverend Bayes uses.

Well, let's try this out. Some of this is on the last page. He says: "Since this was written I have found out a method of considerably improving the approximation in the 2nd and 3rd rules by demonstrating that the expression..." and I have no chance of saying that expression. So I want to test your ability to communicate this equation to our audience today, without taking any breaths or pauses in between.

That's going to be quite tricky, but let me have a go. So the equation says something like: it's a ratio, and it's two sigma, divided by one plus two E, a to the power of p, b to the power of q, plus two E, a to the power of p, b to the power of q, divided by... And it sounds, it looks, very complex. And the main thing is, we use these approximations, we use these values, many times. They come from fairly large calculations, very large sums, that are then simplified with assumptions and arguments that say, for example: well, this particular value we think is very, very small, so we can get rid of it; this particular one cancels with another thing, so let's get rid of it. And you end up with something that will be relatively simple to compute, usually, at this time, the 17th or 18th century, by hand, which would then allow them to create an approximation to the calculation that they wanted to obtain. Nowadays, with computers, we end up with really, really large equations. So here they talk about an expression that is an approximation; nowadays we have expressions several lines long, which we can then compute, using a computer, to get a value that is much closer to the real value we want to calculate. It's usually still an approximation, but it's much closer to what we want.

Well, there's no chance I could say that. So if you can say that without coming up for breath, that means you're pretty qualified in statistics. Let's look at page 376. There are some definitions here which he says are fundamental to probability. And I guess probability is a fundamental principle in health care; we're always trying to estimate the risk of A causing B. Let's just read point 1.
"Several events are inconsistent when, if one of them happens, none of the rest can." And then point 2: "Two events are contrary when one or other of them must, and both together cannot, happen."

Yes. And we touched on it, I mentioned it before, with these terms; there are also the terms around "independent". Can you just explain them? They seem to be important concepts.

Yeah. So these first two terms are mainly about what we sometimes call disjoint events. So there is a series of events, and one of them might happen or not happen. And in those cases we say that these events are not independent, because if one happens, the other will not happen, for example. And they are particularly useful for defining the totality of events that might happen; that also defines everything that might happen, if you want. And that's a really crucial aspect of probability, because then you say: well, if you have only one type of event, death, let's say, the contrary of that would be surviving. If you are not dead, you're alive, and the probabilities of those two things sum to one.

Now, the flip side of that is when you have events that are independent of each other, and that normally means that if one of those events occurs, it does not affect the chance of the other event. And it's a little more difficult to come up with examples of what these might be, but let's think about, for example, whether it's going to be sunny or rainy today. One event, let's say, is that it's sunny today, for a change, here in Oxford. And then there's the type of breakfast I'm going to have: I normally have toast, I might have eggs. I haven't seen the weather report, and it might be exactly the same temperature either way, so the weather might not necessarily affect my decision about what I'm going to have for breakfast. So the probability of me having eggs, or having a piece of toast, and the probability of it being sunny or raining: the events are thought to be independent, because one will not affect the other. What tends to happen is that independence between two events is not very easy to show, particularly in medicine, because many things are associated, and the concepts of association and correlation then become very, very important.

Yeah, okay, great.
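[Editor's note: a toy simulation of the definitions just given, with invented probabilities: for independent events the chance of both happening equals the product of the individual chances, whereas contrary events cannot happen together and their probabilities sum to one.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sunny = rng.random(n) < 0.3  # assume a 30% chance of sun in Oxford
eggs = rng.random(n) < 0.5   # breakfast drawn independently of the weather

p_both = np.mean(sunny & eggs)
print(f"P(sunny and eggs) = {p_both:.3f} vs P(sunny) * P(eggs) = {0.3 * 0.5:.3f}")

# Contrary events: dead vs alive. Both together cannot happen,
# so P(dead and alive) = 0 and P(dead) + P(alive) = 1.
```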
I mean, that's been a real education, this paper. We could have probably stayed here for about an hour going over it, and there are about ten propositions to look at, many things in there. But let's move on now. We're on to a paper in the Journal of the Royal Statistical Society, and you are a member of the Statistical Society. Worth joining, if you're interested?

Definitely, definitely. There are lots of different chapters: medical statistics, primary care. There are several that people may be interested in, and they're welcome to join; not only statisticians, but also people who are interested in statistics, of course.

And then we've got this paper by J. A. Nelder and R. W. M. Wedderburn, and it's about generalised linear models, from 1972. So I guess the first thing to ask is: what is a generalised linear model?

Mm hmm. I chose this paper because it really presents a series of methods that underpin most of the modelling we use, or we do, in medical statistics. Not all of it, but most of it. Think of a model as a way of saying what will happen to, let's say, my health: what are my chances of survival, or let's say of having a cardiovascular event, a heart attack, given that I am a smoker, I'm overweight, and I'm an adult of 55? What is my probability of having a heart attack in the next year? That association between the different factors of the individual and something else is a model. And what this paper describes is a type of model that links up pretty much any kind of modelling, or most kinds of modelling, that we do in medical statistics. The reason we call them linear is because each one of these factors has a linear relationship, if you want, to some shape, let's say in this case a transformation, of the probability. It might be a lot easier to explain in terms of, let's say, weight and height. So if we want to explain height in relation to weight, we can fit a line: the heavier you are, the taller you are, plotting along a line, so y and x as height against weight. But what these models do is extend that line, which would be useful only in certain circumstances, only when you have a linear relationship between X and Y, to other types of outcomes: like, for example, a probability; like, for example, a risk; like, for example, other ones.
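[Editor's note: a minimal sketch of a generalised linear model in the sense just described, a binary outcome linked to risk factors through a transformation (the logit) of the probability. The data are simulated; the statsmodels library and all the numbers are the editor's assumptions, not from the paper.]

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
smoker = rng.integers(0, 2, n)  # smoking status, 0/1
bmi = rng.normal(27, 4, n)      # body mass index

# Simulate heart-attack outcomes from an assumed "true" model, for demonstration only
log_odds = -6 + 1.0 * smoker + 0.15 * bmi
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(float)

X = sm.add_constant(np.column_stack([smoker, bmi]))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link by default
print(fit.params)  # coefficients on the log-odds scale
```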
So he talks about a Poisson distribution as well. But let me come in on the introduction, and this is the thing, for us simple clinicians, in almost the first two sentences. Just listen to this: "Linear models customarily embody both systematic and random (error) components, with the errors usually assumed to have normal distributions. The associated analytic technique is least squares theory." Now, I've got, one, two, three: systematic and random error components, I need to know about normal distributions too, and then I've got least squares theory. No wonder I'm in trouble. So how would you go about thinking this through, and how would a student get to understand these concepts before moving on?

I think the best way to approach this type of paper is to think, first of all, in terms of what we call simple linear regression. Simple linear regression is a special case, and by special case I mean a simple case, of the type of models this paper presents. In simple linear regression, what we have is one variable, the outcome, whatever you want to call it, which we call the dependent variable: something that we want to say something about. And something that is an independent variable: something that we either have control over, or that we know. So again, probably the best way to think about it: let's say BMI and blood pressure. As BMI goes up, as your weight goes up, maybe your blood pressure is also going up, and we might then want to fit a line. Think of the variables in a graphical shape, like a scatterplot: if you plot individuals, each with a particular BMI and a particular blood pressure, you would start seeing a pattern that might follow a line; as BMI goes up, blood pressure goes up. And with linear regression, you can't just eyeball the line and say, okay, we draw a line through those points. Because there's going to be variation, and this is where the idea comes in of what you might call random variability or systematic variability: there's going to be variability between individuals. That variability might be because some individuals naturally have lower blood pressure, or higher blood pressure; but it also might be because of measurement error. As you take the blood pressure, there might be some error in the measurements. There's going to be scatter around that particular line.
So, and this is where weighted least squares comes in, we use different methods to come up with the best possible line. And the best possible line, in the case of simple linear regression, is the one that minimises the differences, if you want, from each point to the line; that's where the least squares comes in.

Let me see if I'm understanding this. So we've got weight, and say blood pressure, going up in a line, but some people will vary from that line. Right. And given their variation, you'll use the least squares method to come up with the best possible line. And there are two sources of variability: the random variability that occurs in blood pressure, but maybe also some systematic variability, because the blood pressure cuff's not working, for example. And then look, if I can get this, I'll be happy: "then obtain the approximation to the weights...". So you have a line, which is almost a regression line, and then you talk about either the betas or the weights of each component. There are, like, two things.

So the weights that they talk about, in terms of weighted least squares, are in relation to how much weight you're going to give to each one of these points. If you're talking about simple linear regression, you give equal weight to each one of the points. If, for example, you're talking about something slightly more complex, where you think some points carry more information than others, you give more weight to those points; similar to what happens in meta-analysis, where you might give more weight to studies that are bigger. That's one thing. The betas you're talking about are more the quantification of how much things change with a change of one unit in your independent variable. Let me be clear: when we're talking about this line, a line is completely defined by an intercept, that is, the point where, if you had zero BMI, and this is completely theoretical, what would your blood pressure be? Say it might be 20; that's hypothetical. So that, where the line crosses, in a way tells you where you start. And then the other bit that completely defines your line is the slope: for a change of one unit in BMI, how much is your blood pressure going to change?
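[Editor's note: a bare-bones sketch of ordinary least squares for the BMI and blood-pressure line being discussed, with made-up numbers: choose the intercept and slope that minimise the sum of squared vertical distances from each point to the line, giving every point equal weight.]

```python
import numpy as np

bmi = np.array([21.0, 24.0, 26.0, 29.0, 31.0, 34.0])
sbp = np.array([112.0, 118.0, 121.0, 128.0, 131.0, 140.0])  # systolic BP, mmHg

X = np.column_stack([np.ones_like(bmi), bmi])   # column of ones for the intercept
beta, *_ = np.linalg.lstsq(X, sbp, rcond=None)  # least squares, equal weights
intercept, slope = beta
print(f"intercept = {intercept:.1f} mmHg, slope = {slope:.2f} mmHg per BMI unit")
```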
And it might be that when you move from a BMI of 25 to a BMI of 26, blood pressure increases by two millimetres of mercury, so that the change of one unit of BMI is an increase of two units of systolic blood pressure. That two is exactly your slope.

Okay. Now, the slope and the betas are exactly the same thing?

Okay, so what you do with least squares, or whatever you want to call it, to determine your betas, is basically to determine the slope: the increase in the dependent variable that you obtain by increasing your independent variable by one unit.

So basically, like we said, you may have blood pressure affected by weight, and you may add another factor in, like age, which would come into your model.

That's right.

And that would increase the slope or decrease it?

What happens is that each one of these characteristics has its own slope. So BMI might have a slope of two, but once you bring in age, the slope of BMI might decrease, as you were saying, to one, or 1.2, and age has its own slope, of three, or 1.2, as well.

And so the only other factor, as we're running out of time, is this sort of idea in models of how much can be explained by the model and how much can't: what that value is, and how to test for it.

So we're talking about the scatter that you obtain. Yes. And because there's going to be scatter, I mean, it could be natural variability or whatever it is, that scatter is precisely what you can't explain through your model.

Yeah. So your line is brilliant if you only have two dots: you can explain everything. But once you have three dots or more, three individuals or more, there's going to be some variability you won't be able to explain. And the variability that you aren't able to explain tells you how good your model is.

Okay.

And there are different ways of quantifying how good your model is, in terms of how much of the total variability of the data you manage to explain. And that could be, for example, just looking at the differences from the line to each one of these dots, or other types of measurements.
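[Editor's note: a short follow-on to the sketch above, with the same made-up numbers, showing one common way to quantify "how much the model explains": R-squared compares the scatter left around the fitted line with the total scatter around the plain mean.]

```python
import numpy as np

bmi = np.array([21.0, 24.0, 26.0, 29.0, 31.0, 34.0])
sbp = np.array([112.0, 118.0, 121.0, 128.0, 131.0, 140.0])

X = np.column_stack([np.ones_like(bmi), bmi])
beta, *_ = np.linalg.lstsq(X, sbp, rcond=None)
fitted = X @ beta

ss_residual = np.sum((sbp - fitted) ** 2)   # variability the line cannot explain
ss_total = np.sum((sbp - sbp.mean()) ** 2)  # total variability in the outcome
print(f"R-squared = {1 - ss_residual / ss_total:.3f}")  # share of variability explained
```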
Gosh. So that's been an amazing, very quick journey: 40 minutes on probability, sample size and linear models. Now let's just finish with a bit of advice. What would you say is the best advice if you're embarking on a course in epidemiology or statistics? What sort of resources, what would be the best way, to just keep improving your statistical knowledge, improving existing knowledge?

I think what has worked for me is getting your hands dirty. So, doing the things: having some idea of what you want to achieve, or what you can possibly achieve, through lectures or through going to courses, but more than anything, applying these things. So if, for example, you have data and you want to use a particular model, say a generalised linear model of the kind we've discussed, then go ahead and do it. Ask for advice; many times there's advice in the statistical packages, in the texts, and also from individuals in your own organisation or elsewhere. And the earlier you go for advice, the better that will be. The other thing that I would say...

Okay, well, on that note, let me just finish here. Quick fire: you've got to answer one of the two. Okay, mean or the median?

Median.

Like a true statistician, he has to think about it. Stata or R?

R, all the time.

Okay, there you go, going with speed. Well, on that note, I'd like to say thank you very much to Rafael Perera for what's been a very interesting discussion on statistics and many issues around that. Thank you.

You're welcome. Thanks. You're welcome.