1 00:00:00,150 --> 00:00:05,580 So I'm going to talk about a piece of work that I have just recently published. 2 00:00:05,970 --> 00:00:15,390 As I was saying to a few people earlier, it describes a slightly different approach to modelling diseases than you'll be learning in this course. 3 00:00:15,750 --> 00:00:23,130 And hopefully, during this talk, I will introduce you to some of the things that you're going to learn later on. 4 00:00:23,160 --> 00:00:29,820 So if it looks a bit mathematical, just ignore it and regard it as a sort of cultural experience. 5 00:00:31,390 --> 00:00:39,330 But these will be things that you will be learning about, even if not in exactly this sort of way. 6 00:00:40,290 --> 00:00:44,580 So my starting point: I started in the Cancer Epidemiology Unit. 7 00:00:45,480 --> 00:00:48,690 Now, does anybody know who this person is? Yes, it is Richard Doll. 8 00:00:48,900 --> 00:00:57,420 And what was Richard Doll famous for? So he's most famous for having established the link between smoking and lung cancer. 9 00:00:58,410 --> 00:01:03,900 And he's also famous for something else, which is he's got his name associated with a paper, 10 00:01:04,590 --> 00:01:08,550 "The Age Distribution of Cancer and a Multi-stage Theory of Carcinogenesis". 11 00:01:09,300 --> 00:01:16,710 Now, going back to the 1950s, they came up with a very, very simple idea for how cancer occurs. 12 00:01:17,160 --> 00:01:23,730 And the thinking was, well, cancer occurs through a sequence of steps, one thing after the other. 13 00:01:24,270 --> 00:01:29,999 So genetic mutations, for example, and until you've had every single one of those mutations, 14 00:01:30,000 --> 00:01:34,590 you're healthy, but when that final mutation hits, then, you know, you've got problems.
15 00:01:35,850 --> 00:01:42,720 And what they realised was that if each of these mutations occurs at a constant rate, 16 00:01:43,290 --> 00:01:52,830 then the incidence of cancer would be expected to go like age to some power, or time to some power. 17 00:01:53,400 --> 00:01:59,160 And when they looked at the data, sure enough, that was roughly what they found. 18 00:01:59,430 --> 00:02:06,840 So when you take the log of the incidence and plot it against the log of age, you see a straight line, and that's what they found. 19 00:02:07,350 --> 00:02:12,120 And the slope of that line will tell you the number of stages that are involved in the model. 20 00:02:12,630 --> 00:02:22,860 So that all seemed to be pretty encouraging at the time, but it's not been used very much really since then, and there are several good reasons for that. 21 00:02:23,220 --> 00:02:27,390 So it certainly used to be difficult to formulate and understand. 22 00:02:28,370 --> 00:02:34,049 It was certainly difficult to fit to the data, and in some cases there are questions about whether the model was identifiable. 23 00:02:34,050 --> 00:02:39,750 So that's whether you could really determine what these parameters were representing. 24 00:02:40,500 --> 00:02:48,210 And perhaps most importantly of all, alternative methods came along, such as logistic regression and Cox proportional hazards, 25 00:02:48,570 --> 00:02:54,270 which are actually more useful for most of the things that people were interested in. 26 00:02:55,290 --> 00:03:00,360 Now, George Box famously said that all models are wrong, but some are useful. 27 00:03:01,050 --> 00:03:05,880 And what he meant by that was that a model can help you to describe and understand processes. 28 00:03:06,090 --> 00:03:10,739 And this is very much the approach I'm taking in the talk today.
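The log-log relationship described in cues 16 to 19 can be sketched in a few lines. This is a minimal illustration, not the speaker's analysis: it assumes the standard multi-stage result that a process with k constant-rate steps gives incidence proportional to age to the power k - 1, so the fitted slope is the number of stages minus one. The function name `loglog_slope`, the ages, and the constant `1e-10` are all hypothetical.

```python
import math


def loglog_slope(ages, incidence):
    """Least-squares slope of log(incidence) plotted against log(age)."""
    xs = [math.log(a) for a in ages]
    ys = [math.log(i) for i in incidence]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical example: k = 5 constant-rate steps, so incidence ~ age**(k - 1).
k = 5
ages = [40, 50, 60, 70, 80]
incidence = [1e-10 * a ** (k - 1) for a in ages]  # arbitrary constant

slope = loglog_slope(ages, incidence)
estimated_stages = slope + 1  # slope is (number of stages - 1) under this model
print(slope, estimated_stages)
```

With real incidence data the points would scatter around the line, and the slope would only suggest a number of stages rather than establish it.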
29 00:03:10,740 --> 00:03:17,549 And in my approach to this work. Despite not having been used that much, when it has been used, 30 00:03:17,550 --> 00:03:20,820 this model has tended to be associated with important pieces of work. 31 00:03:20,820 --> 00:03:24,930 And I've just listed a handful of examples here, and the list needs to be updated. 32 00:03:25,710 --> 00:03:31,290 But what I would argue is that many of the reasons for not using this model are no longer present. 33 00:03:31,290 --> 00:03:36,390 So I would argue that it is now comparatively easy to formulate these things and to understand what's going on. 34 00:03:36,840 --> 00:03:40,380 It's easy to fit to the data. These issues of identifiability, 35 00:03:40,390 --> 00:03:43,320 they are not really necessarily that important in practice. 36 00:03:44,070 --> 00:03:50,370 And although there are alternative methods, the value of these multi-stage models is different. 37 00:03:50,370 --> 00:03:52,320 So there is a value to them. 38 00:03:53,220 --> 00:04:02,250 And there's one other thing that really got me interested in them, and this is that they're designed to model sequences of somatic mutations. 39 00:04:02,910 --> 00:04:14,399 And in the last five or six years, one of the really big discoveries, in my opinion, is the realisation that, as we age, 40 00:04:14,400 --> 00:04:24,990 our cells accumulate somatic mutations and we get these clones of cells that are mostly harmless but accumulate over time. 41 00:04:24,990 --> 00:04:28,290 So moles in your skin are a very obvious visible example. 42 00:04:29,550 --> 00:04:35,400 So as we age, we gain these mutations in our stem cells. 43 00:04:35,400 --> 00:04:43,380 And most of them are harmless.
But by the time we start to get into old age, it turns out that if you look at your blood supply, 44 00:04:43,440 --> 00:04:52,200 quite a substantial proportion of your blood can contain these clones that are distinct from healthy blood cells. 45 00:04:53,430 --> 00:04:59,520 And it turns out that almost everywhere that people have looked, in your blood and your throat and 46 00:04:59,850 --> 00:05:01,980 even in single cells, 47 00:05:02,880 --> 00:05:10,320 this is found to be the case, and it has also been found that these are almost certainly involved in cardiovascular diseases. 48 00:05:11,340 --> 00:05:15,540 My thinking was, what other diseases might these things be involved with? 49 00:05:15,540 --> 00:05:21,030 And is this a reason to look again at using multi-stage models to study disease? 50 00:05:21,660 --> 00:05:25,110 Another benefit of these models that distinguishes them from, say, 51 00:05:25,110 --> 00:05:29,370 proportional hazards or logistic regression is that they are what is called a parametric model. 52 00:05:29,580 --> 00:05:35,340 So by that, we specifically model the incidence of disease as a function of age or a function of time, 53 00:05:36,090 --> 00:05:39,809 which is different to, say, logistic regression or proportional hazards, 54 00:05:39,810 --> 00:05:44,010 where either you get the disease or you don't get the disease and you're just estimating how, 55 00:05:44,100 --> 00:05:48,000 say, smoking or body mass index, for example, modifies your risks. 56 00:05:49,680 --> 00:05:53,940 Now, David Cox, I don't know whether any of you have heard of him. 57 00:05:53,970 --> 00:06:01,290 He unfortunately died this year, in his nineties. He devised the proportional hazards and the logistic regression models.
58 00:06:02,460 --> 00:06:04,860 However, in a 1994 interview, 59 00:06:05,430 --> 00:06:13,620 he remarked that he would normally want to tackle problems parametrically, with an actual specific description of how the incidence of disease occurs, 60 00:06:13,620 --> 00:06:14,220 for example. 61 00:06:14,880 --> 00:06:22,740 And the reason why, he said, was because various people have shown that the answers are very insensitive to the exact formulation of the model. 62 00:06:22,800 --> 00:06:28,230 So in other words, it doesn't really matter if the model is imperfect, it still gets about the right answer. 63 00:06:28,710 --> 00:06:35,250 But you have all the other benefits that go with having it, which are that it tends to give you a better understanding of what's going on. 64 00:06:35,580 --> 00:06:39,150 They're often more informative and easier to interpret, and I'll give you examples of that. 65 00:06:39,570 --> 00:06:46,500 And in addition, they allow you to extrapolate. So your data may lie with people aged between 40 and 60, 66 00:06:46,650 --> 00:06:52,110 but you can extrapolate a bit forward or back to get an idea of how the incidence will change with age. 67 00:06:53,490 --> 00:06:58,080 But there are very good reasons why most people don't use these sorts of models. 68 00:06:58,800 --> 00:07:03,840 And the first is that you don't need to go to all this extra effort of describing 69 00:07:03,840 --> 00:07:07,620 the incidence of disease to be able to estimate the factors that you're interested in, 70 00:07:07,830 --> 00:07:11,700 for example, how smoking modifies your risk of disease. 71 00:07:12,300 --> 00:07:16,200 And in addition to this, the proportional hazards model, for example, 72 00:07:16,440 --> 00:07:24,300 allows you to model quite complex situations, such as where you might have two different groups with very complex varying incidence patterns.
73 00:07:24,600 --> 00:07:28,200 And the analysis would still in principle give you the correct answer. 74 00:07:28,830 --> 00:07:35,340 So if your interest is in estimating associations between, say, smoking and lung cancer, 75 00:07:35,850 --> 00:07:39,480 then a non-parametric model is likely to be better. 76 00:07:39,900 --> 00:07:46,680 And that's no doubt why most of your syllabus is aimed at just that, and rightly so. 77 00:07:47,550 --> 00:07:55,140 As a result of this, the age-dependent incidence of disease and some of the statistical methods that 78 00:07:55,140 --> 00:07:59,640 are needed to study these things correctly have tended to be ignored. 79 00:08:00,210 --> 00:08:07,860 And as a result, in my opinion, some potentially important patterns of disease incidence have been overlooked. 80 00:08:07,860 --> 00:08:11,370 And I'm going to show some of those later on. 81 00:08:12,560 --> 00:08:16,250 And I should emphasise that most of the results here you couldn't have got with 82 00:08:16,250 --> 00:08:20,150 a conventional logistic regression or proportional hazards model. 83 00:08:21,170 --> 00:08:29,500 And what we're going to find is that a large class of diseases are, statistically at least, potentially avoidable over a lifetime. 84 00:08:29,510 --> 00:08:35,120 That's what I'm going to argue. And then there's your effective age at risk of these diseases. 85 00:08:35,120 --> 00:08:41,899 So that's the equivalent age of somebody with no risk factors who has the same risk as you at your age, 86 00:08:41,900 --> 00:08:45,110 if, say, you smoke and you're overweight, for example. 87 00:08:45,650 --> 00:08:50,000 And I'm going to argue that this effective age at risk, for these more sporadic diseases, 88 00:08:50,450 --> 00:08:56,540 is more sensitive to risk factors than for diseases that tend to be more late onset.
89 00:08:57,910 --> 00:09:06,910 I'm also going to argue that about 60% of the diseases in the UK Biobank are consistent with a multi-stage disease process. 90 00:09:07,690 --> 00:09:09,370 But unfortunately, 91 00:09:10,000 --> 00:09:17,650 it also appears that you cannot conclude that the disease processes themselves have this multi-stage underlying aetiology. 92 00:09:18,040 --> 00:09:20,650 And I now have evidence for all of this. 93 00:09:21,260 --> 00:09:28,060 I'm going to introduce the effective age and a relative ageing rate, and a handful of other minor results will crop up along the way. 94 00:09:29,770 --> 00:09:34,860 So this was the idea of the original multi-stage model of carcinogenesis. 95 00:09:34,900 --> 00:09:40,480 Imagine disease, cancer in particular, occurring through a sequence of steps. 96 00:09:41,620 --> 00:09:43,780 Now, in its most general form, 97 00:09:44,470 --> 00:09:53,740 a multi-stage model of disease is one in which you go from a healthy to a diseased state by one of any number of independent paths. 98 00:09:54,130 --> 00:09:56,410 The only important thing is they must be independent of each other. 99 00:09:56,410 --> 00:10:02,049 They can't interact, and they go through a series of sequential or non-sequential states. 100 00:10:02,050 --> 00:10:07,959 So in its most general form you can take any model of that type and you can write down an equation for it. 101 00:10:07,960 --> 00:10:11,140 Don't worry about the equations. That's just to show that you can do that. 102 00:10:12,340 --> 00:10:18,100 However, in practice, even if you have a situation where this is actually how a disease may occur, 103 00:10:18,550 --> 00:10:22,450 you're only going to see the process that happens most rapidly.
104 00:10:23,320 --> 00:10:30,100 So even if a disease can occur in several different ways, if most of the cases are occurring in the same way, 105 00:10:30,100 --> 00:10:37,540 that's the sequence of steps you're going to see. And so in practice, the model is going to be a lot simpler. 106 00:10:38,980 --> 00:10:48,010 And when the hazard of a disease is small, which will turn out to be the case for most diseases, 107 00:10:48,460 --> 00:10:52,820 then you can approximate the situation with what's called a Weibull distribution. 108 00:10:52,820 --> 00:11:03,280 And I'll explain what that is in just a moment. So if you don't know what these three things are yet, you will do by the end of the course, I presume. 109 00:11:04,210 --> 00:11:10,720 On the left is your survival function. So this is the probability of surviving without the disease to a certain age. 110 00:11:10,730 --> 00:11:17,410 So this is, say, age 100, and your probability of surviving to that age is about a half, 111 00:11:17,410 --> 00:11:24,370 say, for example. Then there is the probability density, or probability distribution as it is often referred to. 112 00:11:24,970 --> 00:11:30,700 So if you look at this interval here and you calculate the area under the curve, 113 00:11:30,700 --> 00:11:35,830 that's the probability of getting a disease within that age range. 114 00:11:37,090 --> 00:11:44,020 Okay. And then the third thing is a little bit less familiar, and is not taught that often. 115 00:11:44,020 --> 00:11:49,540 And it probably ought to be taught more generally than just in statistics classes. 116 00:11:49,870 --> 00:11:56,319 And this is the hazard function. And what the hazard function is, is it says, okay, if you came to me now and said, 117 00:11:56,320 --> 00:12:02,050 okay, Anthony, what's the chance I'm going to die of lung cancer in the next day? 118 00:12:02,620 --> 00:12:04,840 Well, that's what the hazard function tells you. Okay.
119 00:12:05,380 --> 00:12:12,460 So the hazard function is the probability of getting a disease in the next short interval, given that you haven't got it until now. 120 00:12:12,730 --> 00:12:17,230 Okay. And all three of these things are related. 121 00:12:17,240 --> 00:12:22,660 You can calculate each one in terms of the others. It requires a certain amount of ingenuity. 122 00:12:23,120 --> 00:12:34,809 But you can do that. And one of the things that most people are unaware of is that for most diseases that occur, if you're otherwise healthy, 123 00:12:34,810 --> 00:12:40,930 you don't have any underlying conditions, for almost all diseases that you look at and almost all people, 124 00:12:41,380 --> 00:12:46,450 your risk of a disease up until 80, 90 is very, very low. 125 00:12:47,290 --> 00:12:51,730 It's just the fact that there are so many diseases that makes getting one of them almost inevitable. 126 00:12:52,330 --> 00:12:58,060 And for that reason, you can approximate this probability density by the hazard function, 127 00:12:58,570 --> 00:13:03,010 which makes a lot of the maths far simpler, although you won't see any of that. 128 00:13:04,840 --> 00:13:12,220 So this is actually data from UK Biobank where I've looked at the probability of a disease having occurred by a given age, 129 00:13:12,790 --> 00:13:19,330 given that you haven't yet had any underlying diseases. And so this is age 58, 60, 70, 80, 90. 130 00:13:19,540 --> 00:13:23,109 So even by age 100, of course the data at that age is less good, 131 00:13:23,110 --> 00:13:32,410 but the likelihood of getting that, so along the bottom here we've got the probability of having had disease by that age, 132 00:13:32,680 --> 00:13:36,220 is comparatively small. It's really surprisingly small. 133 00:13:36,520 --> 00:13:40,540 And I'm not the only person to observe this, but it's not commonly known.
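The relationship between the survival function, the probability density, and the hazard function mentioned in cues 120 to 127 can be sketched as follows. This is an illustration under stated assumptions rather than the speaker's code: it uses a hazard that is a power of age, and the `shape` and `scale` values are hypothetical, chosen only so that the risk stays low up to age 90. The standard identities used are S(t) = exp(-H(t)) and f(t) = h(t) S(t), so when S(t) is close to 1 the density and the hazard are nearly equal.

```python
import math


def cumulative_hazard(age, shape, scale):
    """Area under the hazard curve from 0 to age."""
    return (age / scale) ** shape


def hazard(age, shape, scale):
    """Instantaneous rate of disease, given no disease so far."""
    return (shape / scale) * (age / scale) ** (shape - 1)


def survival(age, shape, scale):
    """Probability of reaching this age without the disease."""
    return math.exp(-cumulative_hazard(age, shape, scale))


def density(age, shape, scale):
    """Probability density of onset at this age: hazard times survival."""
    return hazard(age, shape, scale) * survival(age, shape, scale)


# Hypothetical parameters chosen so that disease is rare before age 90.
shape, scale = 5.0, 300.0
for age in (50, 70, 90):
    ratio = density(age, shape, scale) / hazard(age, shape, scale)
    print(age, ratio)  # ratio equals the survival probability, close to 1
```

The closer the printed ratio is to 1, the better the hazard approximates the density, which is the simplification the speaker relies on for rare diseases.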
134 00:13:40,540 --> 00:13:46,510 So it has also cropped up in a statistical genetics paper, an important statistical genetics paper, just recently. 135 00:13:48,770 --> 00:13:53,720 Just bear with me with the maths for a little bit longer. As I said, you can relate these three things to each other. 136 00:13:53,990 --> 00:13:59,210 So if you look at the area underneath the hazard function, that gives you what's called the cumulative hazard function. 137 00:14:00,080 --> 00:14:03,230 For those of you who know a bit of calculus, it's the integral of it. 138 00:14:03,890 --> 00:14:12,530 And that allows you to express the survival function in terms of the hazard function, as the exponential over there. 139 00:14:13,760 --> 00:14:17,780 So where I'm going with this is to show you what the Weibull distribution is. 140 00:14:18,740 --> 00:14:23,750 This is our survival function. And we write it in terms of the hazard function. 141 00:14:25,160 --> 00:14:29,150 And if we have a proportional hazards model, which you'll learn about later, 142 00:14:30,320 --> 00:14:36,530 then what that assumes is you can write the hazard function as a product of a term that's the same 143 00:14:37,640 --> 00:14:42,440 for everybody, and that has all the age dependence in it and the time dependence in it, 144 00:14:43,310 --> 00:14:52,220 and a factor which is the relative risk. So for example, X might be one if you're a smoker and zero if you're not. 145 00:14:52,670 --> 00:14:59,960 So the relative risk would just be one if you're a non-smoker and it would be e to the beta if you were a smoker. 146 00:15:00,260 --> 00:15:02,570 Okay. And that's the proportional hazards assumption. 147 00:15:04,220 --> 00:15:12,410 And if you are assuming a Weibull distribution, then all you're doing is assuming that this, which is called the baseline
148 00:15:12,410 --> 00:15:18,470 hazard function, is a power of time, or a power of age. I'll tend to use time and age interchangeably. 149 00:15:19,230 --> 00:15:20,960 And so that's what a Weibull distribution is. 150 00:15:21,890 --> 00:15:30,560 Or, if you prefer words, in a Weibull distribution the hazard function is a power of time or of age, depending on how you express it. 151 00:15:31,130 --> 00:15:40,650 Okay. And as I said, for diseases where you know that the incidence rate is low, which is most of what you'd want to look at 152 00:15:40,650 --> 00:15:47,129 anyway, the Weibull distribution is a good approximation to this very simple sequential model of disease, 153 00:15:47,130 --> 00:15:51,120 the very original Armitage-Doll model of cancer. 154 00:15:52,500 --> 00:15:57,180 Okay. So if we take the log of both sides, 155 00:15:57,180 --> 00:16:01,890 we should see a straight line if this is the case, and we can start to estimate these parameters. 156 00:16:02,520 --> 00:16:06,270 And for roughly 60% of the diseases I looked at, 157 00:16:06,660 --> 00:16:10,650 and all I'm trying to do here is show that there are lots of these things, 158 00:16:11,040 --> 00:16:18,690 I looked at hundreds of them, literally, you get a nice straight line and it all looks pretty consistent. 159 00:16:18,810 --> 00:16:22,350 And you can do statistical tests to check that it's consistent with the model. 160 00:16:23,070 --> 00:16:26,190 And it turned out 161 00:16:26,190 --> 00:16:29,700 this is very prevalent. However, 162 00:16:29,700 --> 00:16:37,110 you might say, so why has this not been noticed before? And there is a reason for that: the data. 163 00:16:38,340 --> 00:16:43,650 So let's say this is your chance of surviving a disease up to age 100. 164 00:16:44,310 --> 00:16:48,570 Now, most of the data you get will be of people within a certain age range.
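The straight-line check described in cues 154 and 155 can be sketched for a Weibull proportional hazards model. This is a hedged illustration, not the fitting procedure used in the talk: it assumes the cumulative hazard has the form exp(beta * x) * (age / scale) ** shape, so that log(-log S) is linear in log(age) with slope equal to the exponent, and a binary risk factor shifts the line up by beta. The parameter values and the helper name `line_through` are hypothetical.

```python
import math


def weibull_ph_survival(age, shape, scale, beta, x):
    """Survival under a Weibull baseline with a proportional hazards factor."""
    cumulative_hazard = math.exp(beta * x) * (age / scale) ** shape
    return math.exp(-cumulative_hazard)


def log_cumulative_hazard(survival_probability):
    """Taking logs twice turns the Weibull survival curve into a straight line."""
    return math.log(-math.log(survival_probability))


# Hypothetical parameters: exponent 4, relative risk of 2 for x = 1.
shape, scale, beta = 4.0, 250.0, math.log(2.0)
age_lo, age_hi = 40.0, 60.0


def line_through(x):
    """Slope and starting height of the transformed curve for covariate x."""
    y_lo = log_cumulative_hazard(weibull_ph_survival(age_lo, shape, scale, beta, x))
    y_hi = log_cumulative_hazard(weibull_ph_survival(age_hi, shape, scale, beta, x))
    slope = (y_hi - y_lo) / (math.log(age_hi) - math.log(age_lo))
    return slope, y_lo


slope_0, height_0 = line_through(0)  # e.g. non-smokers
slope_1, height_1 = line_through(1)  # e.g. smokers
print(slope_0, slope_1, height_1 - height_0)
```

Both groups give the same slope, and the vertical gap between the two lines is beta, the log of the relative risk. Real data would need an estimate of the survival curve first, and an adjustment for the limited age range of the cohort.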
165 00:16:48,570 --> 00:16:55,050 So for UK Biobank, data will be collected on people, say, between roughly age 40 and 60, for example. 166 00:16:55,890 --> 00:17:02,459 What that means is you're not seeing this part of the curve. 167 00:17:02,460 --> 00:17:03,660 You're not seeing data from here. 168 00:17:04,710 --> 00:17:12,690 All you're seeing is this. Now, the most advanced traditional methods for analysing this sort of data, 169 00:17:13,140 --> 00:17:18,630 they will correctly adapt for the fact that this is not flat here. 170 00:17:18,870 --> 00:17:23,219 And they will get that right. But what they don't do, because they are not parametric, 171 00:17:23,220 --> 00:17:31,170 because they're not modelling the overall age dependence, is estimate how far this has dropped down. 172 00:17:32,820 --> 00:17:39,990 So how much disease would already have occurred in your underlying population. 173 00:17:41,400 --> 00:17:52,260 And if you don't account for that, then instead of seeing this nice straight line, you see something that's very strongly curved. 174 00:17:52,680 --> 00:18:00,690 And so you might just quite naively plot your data and reach the conclusion that this model fits very, very few cases. 175 00:18:01,590 --> 00:18:10,890 But if you adjust for the fact that your data does not cover the entire human lifetime, then you get a nice straight line again. 176 00:18:12,300 --> 00:18:16,020 And this is why I think this has been overlooked so far. 177 00:18:17,770 --> 00:18:21,280 Okay. I think this is the last equation. Just ignore this one if you like. 178 00:18:22,690 --> 00:18:26,140 So I thought, well, you know, there are other models you can look at. 179 00:18:26,440 --> 00:18:30,130 So I thought, well, let's also look at a model where disease can occur differently. 180 00:18:30,190 --> 00:18:34,069 The steps could go in any order.
So this one could occur, then this one, 181 00:18:34,070 --> 00:18:38,350 then this one, as opposed to needing to occur in sequence. And I also fitted that to the data. 182 00:18:39,370 --> 00:18:46,660 And what I found is that it also fit the data pretty well. But it also fit the same types of diseases pretty well. 183 00:18:47,170 --> 00:18:53,110 So my conclusion is, yes, it's conceivable, based on these results, 184 00:18:53,110 --> 00:18:59,140 that many diseases may have this sort of multi-stage underlying aetiology or underlying process, 185 00:18:59,890 --> 00:19:04,720 but we can't actually conclude from this data that any of them actually do, 186 00:19:05,110 --> 00:19:10,270 because these two models have very different biological interpretations. 187 00:19:11,020 --> 00:19:16,300 So that's ongoing work, to determine whether it's the case or not. 188 00:19:17,320 --> 00:19:25,480 Now, as I said, there are, however, other advantages to having this nice parametric model that describes the incidence of disease with age. 189 00:19:26,350 --> 00:19:31,960 And one is extrapolation. So you can quantify differences in the patterns of disease onset. 190 00:19:32,590 --> 00:19:41,620 So what I'm showing here along the bottom is the probability of having experienced a disease by age 50. 191 00:19:42,320 --> 00:19:50,920 Okay, so on the right-hand side are diseases that are more likely to be observed by age 50. Going up vertically 192 00:19:51,460 --> 00:19:55,300 is the relative increase in risk by age 100. 193 00:19:55,750 --> 00:20:00,730 So at the top here, for these diseases, your risk is rapidly increasing with age. 194 00:20:02,320 --> 00:20:09,250 Now, it turns out what you find is that, overall, these are all the diseases and they're spread broadly along the diagonal here.
195 00:20:09,760 --> 00:20:15,400 So down here, these diseases are, you know, there's a good chance you might get them in middle age, 196 00:20:15,880 --> 00:20:21,040 but your risk increases comparatively slowly as you age. 197 00:20:21,580 --> 00:20:26,830 In contrast, these ones, you're very, very unlikely to get until you're quite old. 198 00:20:27,370 --> 00:20:35,370 But the risk rapidly increases. And if you look at the types of diseases which these are, they are the sort of things you might have guessed. 199 00:20:35,380 --> 00:20:41,410 So these are more, you know, cancers and cardiovascular diseases, and these are your digestive diseases, 200 00:20:41,650 --> 00:20:46,330 musculoskeletal diseases and things of unknown origin or poorly defined. 201 00:20:48,550 --> 00:20:52,230 So this is another way of looking at it. So this is the hazard of your disease. 202 00:20:52,240 --> 00:20:57,340 So if you live to age 80, what's the chance of getting a disease in the next year? 203 00:20:58,090 --> 00:21:05,170 And you can see that for the sporadic diseases, you know, there's always a risk throughout your life, but it increases really quite slowly. 204 00:21:05,920 --> 00:21:10,120 Whereas these late-onset diseases, you know, you're very unlikely to see them until you're older. 205 00:21:10,540 --> 00:21:14,199 But when you're older, you know, they are going to get us. 206 00:21:14,200 --> 00:21:19,150 You know, statistically at least, it doesn't look good. 207 00:21:19,510 --> 00:21:23,650 So the next question is, well, how modifiable are the risks of these things? 208 00:21:24,490 --> 00:21:28,870 So these are now looking at relative risks for established risk factors. 209 00:21:29,650 --> 00:21:37,270 So there's education, deprivation, BMI, height, drinking, diabetes and smoking. 210 00:21:37,810 --> 00:21:40,510 And what I've done is I've done box plots.
211 00:21:40,510 --> 00:21:46,720 So I've made estimates of all these things for all the sporadic diseases and all the late-onset diseases. 212 00:21:47,350 --> 00:21:56,290 And the box plot shows the median, and the extent of the box, I think it's the upper quartile, 213 00:21:56,290 --> 00:22:01,690 gives you an idea of the spread of the estimates. And the risks, 214 00:22:01,990 --> 00:22:08,170 the relative risks, are pretty similar, perhaps a bit higher for diabetes and smoking in the late-onset diseases. 215 00:22:09,710 --> 00:22:12,560 Now, I think this is my last equation. 216 00:22:13,400 --> 00:22:19,910 Now, if you go back to the Weibull distribution, you can take this relative risk and you can pop it inside the brackets here. 217 00:22:20,450 --> 00:22:24,920 And then what you get is something that's equivalent to an effective age at risk of disease. 218 00:22:25,310 --> 00:22:30,290 So if you had no risk factors, that would give you your age-dependent risk of disease. 219 00:22:30,620 --> 00:22:36,080 If you have some risk factors that increase your risk, that makes this term higher. 220 00:22:36,110 --> 00:22:42,469 So, for example, it gives you an effective age that could be either higher or lower depending on whether you increase your risk through, 221 00:22:42,470 --> 00:22:48,680 say, smoking, or reduce it through some other method, such as a drug to lower your blood pressure, for example. 222 00:22:51,120 --> 00:22:56,189 You could alternatively regard this factor here as like a relative ageing rate, because you just 223 00:22:56,190 --> 00:23:02,280 multiply that by your actual age and it gives you your effective age at risk of any particular disease for 224 00:23:02,280 --> 00:23:04,350 your particular set of risk factors.
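The idea in cues 216 to 224 of moving the relative risk inside the brackets can be sketched as follows. The slide's equation is not in the transcript, so this is an assumption: it takes the cumulative hazard to be relative_risk * (age / scale) ** shape, in which case the relative risk can be absorbed into the age as a multiplier relative_risk ** (1 / shape). The function names and all numerical values are hypothetical.

```python
import math


def relative_ageing_rate(relative_risk, shape):
    """Factor that multiplies actual age to give effective age (assumed form)."""
    return relative_risk ** (1.0 / shape)


def effective_age(age, relative_risk, shape):
    """Age at which someone with no risk factors would have the same risk."""
    return age * relative_ageing_rate(relative_risk, shape)


def cumulative_hazard(age, shape, scale, relative_risk=1.0):
    """Weibull cumulative hazard with a proportional hazards factor."""
    return relative_risk * (age / scale) ** shape


# Hypothetical example: a 50-year-old with a relative risk of 2.
age, relative_risk, shape, scale = 50.0, 2.0, 5.0, 300.0
equivalent = effective_age(age, relative_risk, shape)

# The exposed person at their actual age matches an unexposed person
# at the effective age.
print(cumulative_hazard(age, shape, scale, relative_risk))
print(cumulative_hazard(equivalent, shape, scale))

# Same relative risk, different exponents: a smaller exponent gives a
# larger relative ageing rate.
print(relative_ageing_rate(2.0, 2.0), relative_ageing_rate(2.0, 8.0))
```

Under this assumed form, a disease whose hazard rises slowly with age (a small exponent) has an effective age that responds more strongly to the same relative risk, which may be one way to read the speaker's contrast between sporadic and late-onset diseases.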
225 00:23:06,610 --> 00:23:12,160 And when you look at the difference between sporadic and late-onset disease in terms of this relative ageing rate, 226 00:23:12,820 --> 00:23:18,640 then we see a very different pattern, where the sporadic diseases are actually quite modifiable in terms of risk. 227 00:23:19,300 --> 00:23:27,129 So not only are these diseases potentially avoidable over your lifetime, but they're also very modifiable in terms of their risks, 228 00:23:27,130 --> 00:23:31,570 in terms of well-known risk factors that we all have some choice over. 229 00:23:33,870 --> 00:23:41,780 Okay. So there's one last topic. Now, one of the things that you can do is you can take your data and you can split it up into pieces. 230 00:23:41,790 --> 00:23:48,060 This is referred to as stratification. So, for example, you can split up your data into the smokers and the non-smokers. 231 00:23:49,320 --> 00:23:55,920 And as we well know now, if we plot this data appropriately, if it's consistent with this model, we should see a straight line. 232 00:23:57,210 --> 00:24:04,560 So what we're seeing here, we've got the smokers in red, the non-smokers in blue, and I've picked three diseases because they're nice examples. 233 00:24:05,730 --> 00:24:12,300 And the fact that the smokers are higher up corresponds to a higher incidence rate. 234 00:24:12,910 --> 00:24:16,560 This is observed data with a line fitted through it. 235 00:24:17,910 --> 00:24:21,450 Now you can start to interpret what these things mean. 236 00:24:22,410 --> 00:24:25,620 Let's say we've got a disease process where you're going through a serial 237 00:24:25,800 --> 00:24:31,140 sequence of steps and you increase the rates of all of those processes. 238 00:24:31,350 --> 00:24:39,060 So you increase all of these things. Then what you'll see is a rigid displacement of this line upwards.
239 00:24:39,120 --> 00:24:45,570 Okay. And that's what you would get in a proportional hazards model, which you'll be learning about. 240 00:24:45,930 --> 00:24:51,240 So in a proportional hazards model, this is what you're modelling. You're able to model these things moving up and down. 241 00:24:52,740 --> 00:24:57,870 And this is what was found for benign neoplasms of the colon, rectum and anus. 242 00:24:58,020 --> 00:25:00,960 Okay. And that's consistent with that sort of picture. 243 00:25:02,100 --> 00:25:09,270 Another thing that might happen is that you might increase the rate of one of these steps sufficiently rapidly, or enough, 244 00:25:09,270 --> 00:25:14,790 rather, that you just don't see it anymore. So instead of seeing three stages, you only see two. 245 00:25:16,000 --> 00:25:22,540 And then what you'll see is the line would move upwards and the slope would become less steep. 246 00:25:23,860 --> 00:25:30,520 Now, I should stress, if you see data like this, then you shouldn't be modelling it with a proportional hazards model. 247 00:25:31,030 --> 00:25:35,230 Your statistical test may not pick that out. But 248 00:25:35,410 --> 00:25:39,190 you shouldn't in principle be doing that. So there's one other thing that might happen. 249 00:25:39,880 --> 00:25:47,470 What happens if your smoking, or whatever you're doing, is modifying the risk of disease by some 250 00:25:47,470 --> 00:25:52,780 other process, so that the dominant disease process now becomes something different? 251 00:25:53,740 --> 00:25:59,920 Well, then you might have a situation where not only are you increasing the rate, but you might have more steps. In this situation 252 00:26:00,190 --> 00:26:06,880 you could even see an increase in the slope.
And I think that's what I'm hypothesising is happening with the lung cancer, 253 00:26:07,300 --> 00:26:15,700 because it turns out that the type of cancer you get in smokers is probably different to the type of cancer you get in non-smokers. 254 00:26:16,450 --> 00:26:20,050 But this is all hypothesising, really. 255 00:26:20,380 --> 00:26:25,390 But it gives you a very different sort of perspective on what might be going on in terms of disease. 256 00:26:26,580 --> 00:26:35,120 Now, these are the same results for women. You could stratify by diabetes, and there are results for women again, just like the other results here. 257 00:26:35,450 --> 00:26:41,960 I note that diabetic men seem to be at lower risk than non-diabetic men of pulmonary embolism. 258 00:26:42,140 --> 00:26:49,400 I don't know why. It's an odd result. So these are my conclusions. The late onset diseases, 259 00:26:49,580 --> 00:26:57,950 they seem to be inevitable. You know, the rate at which the risk of those increases is dramatic once you get into old age. 260 00:26:58,880 --> 00:27:08,510 But these sporadic diseases, at least from a statistical point of view, it looks like the risk, despite existing throughout our lives, 261 00:27:08,510 --> 00:27:15,320 doesn't increase particularly rapidly. And so you might hope that we could either prevent or avoid them throughout our lifetime. 262 00:27:16,100 --> 00:27:21,140 In addition, these diseases are strongly influenced by well known risk factors. 263 00:27:21,830 --> 00:27:32,660 So it would appear that a substantial burden of hospital admissions is in principle avoidable by just changing our lifestyle habits. 264 00:27:33,680 --> 00:27:42,440 And I hope I have also presented a slightly different statistical approach that is, at the moment, largely unexplored. 265 00:27:43,370 --> 00:27:47,090 So, the opportunities as I see them going forward. So for the methodology.
266 00:27:47,930 --> 00:27:51,380 Risk stratification using effective age is a very intuitive way. 267 00:27:51,590 --> 00:27:59,360 Also for public communication, effective age is easier for many people to understand. If somebody says, 268 00:27:59,360 --> 00:28:06,200 you know, if you smoke and you're 50 years old, you're at equivalent risk of heart disease to somebody who's 60 or 65, 269 00:28:06,470 --> 00:28:10,880 that's much easier to understand than somebody saying you've got a relative risk of 20 or whatever it might be. 270 00:28:11,990 --> 00:28:18,350 The modelling is improved for some things, and we can get some new insights into the causes of diseases. 271 00:28:19,250 --> 00:28:24,160 And of course the big opportunity, in principle, 272 00:28:24,170 --> 00:28:31,970 is that it does suggest that a large proportion of diseases might be avoidable, or premature hospital admissions may be avoidable. 273 00:28:32,660 --> 00:28:37,490 But again, we kind of know this already. It's just quantifying it a bit more carefully. 274 00:28:38,540 --> 00:28:46,730 So to finish off, what might you realistically hope? Richard Doll said that death in old age is inevitable, but death before old age is not. 275 00:28:47,300 --> 00:28:52,970 And I think I would certainly agree with the first part of that. It does look like death in old age is inevitable. 276 00:28:53,390 --> 00:28:58,420 However, I think it's time to start asking whether perhaps ill health is not. 277 00:28:58,430 --> 00:29:05,330 Perhaps we can get right through to a ripe old age, but avoid most of these conditions before finally 278 00:29:05,780 --> 00:29:08,900 something gets us. So at that point, I think I'll stop.