Today, I'm extraordinarily delighted to welcome and introduce Sir David Cox as our speaker. When we first started planning this Summer Institute for Computational Social Science, we were really keen to bring a statistician's perspective to our list of speakers. In particular, we felt it was really important to have someone who would talk not just about the potential of big data and the proliferation of data sources for social science research, but also about some of the limitations, and who would provide a salient critique of it. So then I thought: who better to do this than one of the world's most prominent statisticians, Sir David Cox, who could come and talk to us about how we should think of statistical analysis in a world of big data. Sir David is familiar to most of you already, but his biggest claim to fame is perhaps the proportional hazards model, which he published in his 1972 paper, one of the 100 most-cited papers of all time. And there are many other fields of statistics that Sir David has contributed to; he has worked not just in academia but also in government-funded research organisations. So he has a vast experience that spans, in a sense, industry and research organisations as well as academia.
Sir David has, of course, held a number of very important fellowships, and he has served as president of a number of prestigious academic and professional organisations, such as the Bernoulli Society, the Royal Statistical Society and the International Statistical Institute. In 2010 he received the Copley Medal, the Royal Society's highest award, and in 2016 the International Prize in Statistics. He has had a great number of accolades, too many for me to list in this very short introduction. So all that remains is to welcome him, and to thank him again for coming to talk with us today.

Well, thank you very much indeed. I'm going to talk about variability, and variability can mean various things. It can mean systematic variability: if we change this, or make a policy alteration to that, what are the consequences? Very often there's a chain of consequences, and these days that needs investigation and careful description. Then there's variability that is not predictable. The most obvious example is survival time. We have a group of homogeneous, in some sense socially homogeneous, individuals with a particular age and background and so forth. How long before they die? This is not a predictable variable. How can we describe the properties of such times? How can we study them? How can we say what makes people survive a bit longer or a bit shorter, and so forth?
How can these things be thought about in a systematic way? And how can they be studied quantitatively in a way that advances understanding and aids communication of the results? Now, I'm going to describe the techniques and ideas that are involved in doing that, under a number of headings which are not to be taken too seriously, and not necessarily to be taken in the order given. I mean, everything's to be taken seriously, obviously; there are no jokes in this lecture, as yet. The order is a planned one, but not to be taken too rigidly, as I said. And the first idea is that of formulating the research question that you want to investigate, and doing that precisely enough that it is feasible to study. Now, this may sound obvious, but it's a fact of life indeed that people, even serious, experienced investigators, can formulate their research questions in a way that's not essentially answerable. You need to formulate your research questions, not too rigidly, but in a form that is capable in principle of being answered by the data that you have, or the data that you are likely to be able to obtain. So the formulation of questions is a critical issue. And I'm not suggesting this is a rigid thing done once at the beginning; the questions may evolve during the course of an investigation.
And this, incidentally, applies as much in pure mathematics as in more empirical studies: asking the right question. A pure mathematician may do a beautiful piece of work resolving an issue of no conceivable interest to anybody, or may do something focused on a question that has very strong consequences. The choice of a research question is a critical one and may need revision from time to time. And then, closely associated with that, obviously, is the issue of design. At one extreme we have the design of experiments, interventions in which the investigator has virtually total control over what's going on. The control is never total, of course, but it's substantial, and the randomised clinical trial is an example of that. It has a long history coming from other fields, in particular from agriculture and to some extent from physics, where very, very carefully and meticulously designed experiments are crucial. But some of the basic principles of the design of investigations come out of agriculture, and in particular one key question, which I think arises in many contexts, is this: is it a good idea for your current investigation to take one sharply focused question and try to answer that? Or is it better to ask a battery of questions that are interrelated with one another?
And the great statistician R. A. Fisher, who began his work in the nineteen-twenties in an agricultural context, although his name is often prominent as a geneticist, set out in the mid-1920s the principles of experimental design under four headings. And one of them was the notion of asking nature an interrelated series of questions. Now, that's in an experimental or interventionist context, but the same principle applies more generally. If you're using observational data, are you going to concentrate your efforts on one sharply focused question? Or are you going to, excuse me, try to study an interrelated series of things? And then the third item in my list is what's called metrology, and that's a term used mainly by physicists to mean measurement. How do we measure things? Are our measurements measuring what we think they are, and if not, what are you going to do about it? The most obvious example of that, in a context I guess many of you are likely to be interested in, and which I certainly am personally, is the issue of measuring quality of life, perhaps in a medical context, but not only. The key issue there, or at least a key issue, is how multidimensional it is. Are you going to say: we must have a single number that measures the quality of an individual's life, perceived as it is at the moment, or as it might be in the future, so that we can make suggestions?
This is perhaps particularly in the clinical context, but more broadly too, so that suggestions and policies can be introduced that will in some sense maximise the quality of life. But it's a difficult issue: is it one-dimensional or multi-dimensional, this quality of life? And here I'm being slightly unkind to economists, not all economists, but some economists. How can I put it? If your view of life is that it's a series of optimisations, then to optimise something, the something has to be one-dimensional; you can't optimise two things at the same time, because in general, at least, they'd have to be collapsed into one. So that forces the thinking about the quality of a person's life, that's to say now, or perhaps estimated into the future, into a single number, often expressed as a so-called quality-adjusted life year, on a scale from nought to one, where one means perfect health, with intermediate scores in between. And then, having got a number like that, you can think about trying to make rational choices about it. The counter to that is to say that the quality of life is a multidimensional thing, and to measure it by some elaborate questionnaire; a standard 50-item questionnaire attempts to separate the quality of life into four dimensions. And the downside of that, of course, is how well it can work in practice with patients.
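The one-number view described above can be made concrete. The following is a toy illustration, not from the lecture; the trajectories, the quality weights and the function name `qalys` are entirely hypothetical:

```python
# Hypothetical illustration of the one-dimensional QALY summary described
# above: each period of life gets a quality weight on the 0-1 scale
# (1 = perfect health), and the weighted years are summed into one number.

def qalys(periods):
    """periods: list of (years, quality_weight) pairs, weights in [0, 1]."""
    return sum(years * weight for years, weight in periods)

# Two hypothetical health trajectories that collapse to the same number:
a = qalys([(10, 1.0)])   # 10 years in perfect health
b = qalys([(20, 0.5)])   # 20 years at quality weight 0.5
print(a, b)  # 10.0 10.0
```

The point of the sketch is exactly the difficulty raised in the talk: very different lives can collapse to the same single number, so the multidimensional detail is lost.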
Maybe, if I were a patient, I'd feel sorry for the clients: maybe they fill out the 50 items once, but you can't keep on doing that, whereas asking them a simple question may be repeatable. At an extreme, there is a study, though I might be wrong about where it took place, of the quality of life of patients suffering from difficulties in walking, where the single question was: could you put your socks on today unaided, yes or no? That gave, for the patients, suffering I think from rheumatoid arthritis, a one-zero binary observation, an indication of the relevant aspect of the quality of this person's life. And of course that's a natural question to ask, and one is likely to get a broadly truthful answer, which with many of these self-assessed questionnaires is a bit doubtful; one is likely to get an answer that is precise, or at least repeatable. So that's an example of metrology: the principle of measuring things in a careful and appropriate way. And of course many of the developments, particularly in the medical sciences, come from physics: the ability to measure, through scanning devices, various kinds of things of great subtlety, starting really with X-rays a hundred years ago and developing from that to amazing things that were previously not measurable at all. So metrology, the ability to measure things in an appropriate and suitable way, is a key item. Now, where am I? So, those topics: question formulation, crucial.
Then the design of the study: whether it's experimental, interventionist, whether it's observational, or whether it's a matter of collecting appropriate data that already exist, and deciding which, if any, of the data that might be available are appropriate for your study. Let me move on to the various stages of analysis. One could divide the analysis of the data into all sorts of phases, depending on whether it's data that have already been fairly thoroughly analysed from different points of view and are well understood, or whether it's the outcome of some totally new investigation in a field in which you have very little experience, in which case you have to proceed much more cautiously, and the multi-stage aspect becomes much more relevant. The idea is that in collecting the data, whether you collected the data in response to your question or chose data that already existed, you must have had some idea of how you were going to analyse them when you made the choice; but to hold to that too rigidly can be dangerous. And then, when the analysis is completed, what on earth does it mean? What does it suggest as a basis for action? What new study should be set up, and so forth? And so we go through those phases. All of them, you might say, have nothing to do with statistics; they're subject-matter questions. The point is that the ideas and principles involved span
A large range of subjects, from the purest of physics to the most extreme of applications in the social sciences and everyday life. But there are common features, and in a sense the theoretical statistician's job (I count myself as a theoretical statistician, although my training is in mathematics; I'm not talking about mathematics, I'm talking about theory), the object as a theoretical statistician, is to try to see some sort of unity of approach to all these problems, wherever that is felt to be helpful and constructive, as well as getting involved in particular applications. So that's the broad view of what I want to talk about. Now, going on to analysis: among the basic principles of analysis there's one overriding principle that always applies, and that's the first one: keep things as simple as you possibly can, but no simpler. There's a real difficulty there, and it's got to be faced somehow. One aspect that one tends to think of as very much the statistician's province is the whole apparatus of significance tests, confidence intervals, and so on, et cetera. And there's 200 years of discussion of the quasi-philosophical basis of all this; it's very interesting, and it's an important part of what we do and what scientists do with these techniques, which are basically principles of analysis and interpretation. How secure is your interpretation of the analysis that you proposed?
How is that security to be expressed? There's endless discussion and misunderstanding about that, and it's an important part of what we do. But the more important aspect, I think, in many respects, is the next one, the third one on my list: the lucid description of complex relationships, by regression equations and other techniques that describe the systematic component of the variability in an enlightening way. And enlightening means various things: understandable, obviously, but also linking the data that you have with other data, and with underlying theory and so forth. The depth and subtlety with which you can do that depends strongly on the field in which you are working and the nature of the data. And there's an endless literature on this. I'm referring in particular to a recent collection where Christiana Kartsonaki, whom some of you may know, who works in Oxford, has edited an international series of papers on modern statistical developments. In about 50 pages, I think it's less than 50 pages, it gives a very excellent summary of some of the main developments in statistical methodology, primarily in a biomedical context, but certainly by no means only in that field. So I recommend that work. Now, how long have I spoken for? Fifteen minutes, and I've managed to avoid saying 'big data'.
Are you going to allow me to go on avoiding it? Apparently not, seems to be the answer. Well, the thing about big data is, of course, that in a sense big data have been around a long time; you just couldn't analyse them. To give a personal example, from one of the first pieces of applied work I was ever involved with: suppose you're interested in the thickness of a textile yarn that's being woven. Textile yarn is produced in batches, and each single production run would be perhaps 50 kilometres of yarn, whose thickness can be measured, let's say, every five millimetres. No problem: you can produce paper traces of data like this endlessly. And that's one batch of yarn; then you have lots of others. Well, in those days you could look at the paper traces that showed these patterns, or it might have been stress on an aircraft in flight, which is actually a somewhat similar set of data. You could look at it, but the facilities to do any sort of numerical analysis were totally absent. You could only take a sample, and it's not at all clear that the sample tells you what you really want to know. It's by no means clear that just because the data are big, they are necessarily insightful. Does big data mean big amounts of information? Well, sometimes, no doubt, but certainly not always.
And if the answer is no, then the data being big doesn't make them reliable or good; that's a warning sign. So the key issue in many of these big-data contexts is the quality of the data, plus the question I put: when is enough enough? Because there's no particular virtue, though perhaps no great harm, in taking 10 or 100 times as much data as you really need; in practice you're likely to confuse yourself a bit, but then perhaps that doesn't matter. Then there's the more technical issue. If you have an enormous amount of data, most measures of precision that you calculate from the data, the ones that tell you how well you've done in estimating some feature you're interested in, decrease like one over the square root of the amount of data. Not all, but most do. And if you have an enormous amount of data, one over the square root of that enormous number is something very, very small, so the apparent precision of the conclusions can be very high. But there are, I think, quite powerful reasons, and this is a largely unexplored scene, for thinking that estimates of precision calculated from large amounts of data are often, well, perhaps almost always, a bit overoptimistic, and perhaps sometimes wildly optimistic.
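As a minimal illustration of this overoptimism (my own sketch, not from the lecture): the naive standard error s/√n assumes independent observations, and for positively dependent data, here an AR(1) series with correlation 0.9, it can be several times smaller than the true sampling variability of the mean:

```python
# Simulation: compare the naive standard error s/sqrt(n), which assumes
# independence, with the actual spread of the sample mean when the
# observations are positively autocorrelated (AR(1) with rho = 0.9).
import math
import random
import statistics

random.seed(1)

def ar1_series(n, rho=0.9):
    """Generate an AR(1) series: strong positive dependence when rho is near 1."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + random.gauss(0.0, 1.0)
        out.append(x)
    return out

n, reps = 1000, 200
means, naive_ses = [], []
for _ in range(reps):
    xs = ar1_series(n)
    means.append(statistics.fmean(xs))
    naive_ses.append(statistics.stdev(xs) / math.sqrt(n))

actual_sd = statistics.stdev(means)        # true sampling variability of the mean
typical_naive = statistics.fmean(naive_ses)
print(f"naive SE ~ {typical_naive:.3f}, actual SD of the mean ~ {actual_sd:.3f}")
# With rho = 0.9 the actual spread is several times the naive figure.
```

The bigger n gets, the smaller the naive figure looks, while the dependence structure keeps the real uncertainty much larger; that is the sense in which apparent precision from big data can be wildly optimistic.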
They ignore features of the big data: by assuming, to be more technical, independent random variables and all that sort of stuff, they get a one-over-the-square-root-of-the-sample-size effect, and that can be misleading. So the correct calculation of standard errors is, I think, a largely open question, to be tackled, of course, in the context of a particular case, at least to begin with, although there are some general principles involved. Now, I've just used the term big data; I should also mention machine learning, data science, and so on and so forth. Each day that goes by there seems to be a new branch, coming largely from the computer science lexicon, though not entirely; to some extent it comes from mathematical physics. A new name comes in; informatics, for example, and I've just seen a book about infometrics, if I'm getting it right: a new science of variability and data. Much of it is driven by engineers coming to it from a computer science background, and some of it is undoubtedly very interesting and important; I'm not being dismissive about it. I'm saying it has somehow to fit into a broader picture that looks at uncertainty in the correct way and doesn't produce unstable and overoptimistic conclusions. Now, just a very brief word about one example.
A situation where we have individuals, and we're really interested in how long they live. Now, this is a situation where both sorts of variability are present, and we can't really escape this. There is going to be some systematic variability, as between social class, medical background, genetics and so forth, which will need capturing and explaining as concisely and adequately as we can. But inevitably there's going to be a large unexplained component, and we've got to be able to describe that unexplained component graphically, numerically, perhaps by formulae. And that leads into various highly developed parts of statistical analysis, going back into the 19th century, certainly to the early work of actuaries; obviously actuaries have much interest invested in people's duration of life. But it leads into physics and the biological sciences and so forth as well. And so we have to be able to describe the distribution of lifetimes, to describe what might influence them systematically, while realising there'll be a large unexplained component left over; we've got to be able to describe that too, and summarise it in nice graphs or plots, or summary statistics, or whatever.
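One simple way to describe a distribution of lifetimes, sketched here with made-up numbers rather than anything from the lecture, is the empirical survival function S(t), the proportion of the group still alive at time t:

```python
# A small sketch of describing a distribution of lifetimes: the empirical
# survival function S(t) is the fraction of the group surviving beyond t.

def empirical_survival(lifetimes, t):
    """Fraction of individuals whose lifetime exceeds t."""
    return sum(1 for x in lifetimes if x > t) / len(lifetimes)

# Hypothetical lifetimes (in years) of a small homogeneous group:
lifetimes = [2.1, 3.5, 4.0, 5.2, 6.6, 7.3, 8.8, 9.1, 11.4, 12.0]

for t in (0, 5, 10):
    print(f"S({t}) = {empirical_survival(lifetimes, t):.1f}")
# S(0) = 1.0, S(5) = 0.7, S(10) = 0.2
```

A plot of S(t) against t is exactly the kind of "nice graph" summary mentioned above; systematic effects then show up as differences between the survival curves of different groups.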
And so that's a relatively rich area that started in the actuarial sciences, but also developed in physics and in engineering; my own background, in fact, came from just this kind of thing, being interested in problems in engineering. Now, in a bit more detail: many of the investigations I've talked about essentially come down to asking, how does this depend upon that? The pattern of dependence, of course, might be quite subtle and complex, but that's what it's going to come down to. And I've listed three levels at which this problem can be studied. The first is the textbook study of regression in some form, including logistic regression and survival regression and so on and so forth: how does one variable depend upon a modest number of explanatory variables, with a modest number of individuals to study? I'm not going to say what 'modest' is, but you know what I mean: not so enormously large that we can't absorb the data as a whole. And that's essentially what's studied in statistical textbooks under the headings of regression, logistic regression, survival regression and so on. How does this depend upon that? I say that with caution because, of course, it's always easy to overinterpret any dependencies you find. Then the next step, which started to get prominence perhaps 40 or 50 years ago, arose when you have essentially the same situation.
We've got large numbers: not the large numbers we might talk about today, but large relative to the computational facilities that were available at the time, so perhaps 50 variables and 500 individuals, things of that sort. Multiple regression techniques are still available, but there's an issue: we typically don't want to end up with an equation that describes an outcome we're interested in with 30 things on the other side of the equation. It may be an aid to prediction in some contexts, but it's not an aid to understanding. So we have to try to simplify that. And there were various techniques at the time, of which ridge regression was one of the most popular, in which you essentially take a least-squares or other type of regression equation and penalise it by the magnitude of the coefficients: if you penalise by the sum of squares, each regression coefficient is penalised in proportion to its square. What that does is shrink the small coefficients towards zero: it leaves the big coefficients roughly where they are and pushes the small ones towards zero, but not exactly to zero.
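A toy sketch of the ridge idea (my own example, not the speaker's): for a single centred predictor, penalising the sum of squared coefficients has a simple closed-form solution, and increasing the penalty shrinks the least-squares coefficient towards zero without ever making it exactly zero:

```python
# Ridge regression in the simplest possible case: one centred predictor,
# minimising  sum (y - beta*x)^2 + lam * beta^2, which has the closed form
# beta = sum(x*y) / (sum(x*x) + lam).  lam = 0 recovers least squares.

def ridge_coef(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_big = [-4.1, -2.0, 0.1, 1.9, 4.0]    # strong relationship, beta near 2
y_small = [-0.2, -0.1, 0.0, 0.1, 0.2]  # weak relationship, beta near 0.1

for lam in (0.0, 5.0, 50.0):
    print(lam, round(ridge_coef(x, y_big, lam), 3), round(ridge_coef(x, y_small, lam), 4))
# In this one-coefficient case every estimate is shrunk by the same factor
# sxx / (sxx + lam); no coefficient is ever pushed exactly to zero.
```

So however hard you penalise, ridge keeps every variable in the equation with a small non-zero coefficient, which is exactly the limitation the next idea addresses.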
Then, in 1996, Tibshirani at Stanford had the ingenious idea that if you penalised your regression not by the sum of squares of the coefficients, but by the sum of their absolute values, his so-called L1 norm, then some of the smaller coefficients were actually pushed right to zero: not just greatly decreased yet still non-zero, but pushed absolutely to zero. And that's because if you penalise the regression by a function with a sharp corner at zero, as compared with a smooth quadratic one, then the very small coefficients are heavily hit. Putting it loosely, putting a variable into the equation at all incurs a definite cost; that would be one way of expressing it. And Tibshirani's method, the lasso, has been quite widely used in some genetic studies recently, where there may be a very large number of explanatory variables, and the lasso can reduce these to a manageable number. Now, my friend and co-worker Heather Battey at Imperial College and I have been looking at this from a slightly different point of view. Tibshirani's method gives one answer: it says, although there are perhaps 10,000 possible explanatory variables, take these five.
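The contrast with the L1 penalty can be sketched as follows. This is an illustration under the simplifying assumption of an orthonormal design, in which case the lasso solution is exactly soft-thresholding of the least-squares estimates; the numbers are hypothetical:

```python
# Why the L1 penalty zeroes small coefficients: with an orthonormal design,
# each lasso coefficient is the soft-threshold of the least-squares estimate,
# so anything below the threshold lam is set exactly to zero.

def soft_threshold(beta, lam):
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0   # entering the equation at all costs at least lam

least_squares = [2.3, -1.7, 0.4, 0.05, -0.02]   # hypothetical estimates
lam = 0.5
lasso = [soft_threshold(b, lam) for b in least_squares]
print([round(b, 1) for b in lasso])  # [1.8, -1.2, 0.0, 0.0, 0.0]
```

The large coefficients survive (reduced by lam), while the three small ones are pushed absolutely to zero, which is what makes the lasso a variable-selection method and not just a shrinkage method.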
These five fit as well as it's possible to fit; but there may be a totally different five that fit imperceptibly less well, slightly less well but imperceptibly so. And what Heather and I have tried to do is to find a method that gives not just the one model that fits best, but a set of simple models, so we can say that the explanation might be this or might be that. Because there's an important point here. If the object of the analysis is just to predict, to take the input variables, genetic variables, say, and predict the outcome, then it doesn't matter if some other predictor would do just as well; you've got the best predictor you can, and good luck to it. But for understanding, what it means is that somebody else, taking a slightly different set of data, or a very different set of related data, might perfectly well get a totally and completely different answer, one that would also have fitted the first set of data almost as well. So what Heather and I have tried to do, since it's too ambitious a job to list all the possible models that fit equally well, is to end up listing a modest number. And then it's an interpretive job to say which of these might make sense, which of these is capable of being reproduced. I think I'm getting on for time; what time do you want me to stop? Well, I was going to say something about... no, I'll move on to something else. This is now a shift back to general remarks. There are two key ideas connected with the analysis of data.
263 00:36:59,420 --> 00:37:08,680 Not the only ideas, of course, but two key ones, of which the first is: are the conclusions generalisable? 264 00:37:08,680 --> 00:37:15,400 Have we found something that's totally specific to this set of data, 265 00:37:15,400 --> 00:37:28,380 or are we generalising, at least finding something that is in some sense of broader interest, more generalisable? 266 00:37:28,380 --> 00:37:36,290 There's that issue; and then, and this may not concern 267 00:37:36,290 --> 00:37:42,590 you working here so much, at the opposite extreme of that is specificity. 268 00:37:42,590 --> 00:37:51,620 To illustrate that, think of a doctor, a clinician treating patients. 269 00:37:51,620 --> 00:37:58,650 Here's a patient in front of him or her, and this patient has certain features. 270 00:37:58,650 --> 00:38:03,830 And there are two possible things, A and B, that the doctor may recommend. 271 00:38:03,830 --> 00:38:11,740 Now, let's suppose there's been a beautifully organised, randomised trial. 272 00:38:11,740 --> 00:38:18,110 It's shown that A is significantly better than B. 273 00:38:18,110 --> 00:38:35,690 It might say that of a group of patients, 70 percent do well on A and fewer do well on B. Does that mean that every patient does better on A? 274 00:38:35,690 --> 00:38:41,220 No. And the question facing the clinician 275 00:38:41,220 --> 00:38:45,800 is: what is best for my patient? 276 00:38:45,800 --> 00:38:58,990 What is the evidence that, although A in the randomised trial clearly wins over B, my patient will do better on A than on B? 277 00:38:58,990 --> 00:39:06,550 Or, in a quite different context: if this policy change is introduced, having been shown in a randomised trial, 278 00:39:06,550 --> 00:39:17,900 or essentially in a large study, to be on average beneficial, will it be beneficial in this particular case that I'm interested in?
279 00:39:17,900 --> 00:39:27,750 And this is the issue of specificity. In a certain sense, generalisability and specificity are at opposite ends of a pole. 280 00:39:27,750 --> 00:39:34,080 Generalisability is taking the data outwards into a new context; 281 00:39:34,080 --> 00:39:39,960 specificity is narrowing it down, not generalising it, but saying: 282 00:39:39,960 --> 00:39:48,080 does it apply here, to my particular policy choice or my particular patient? 283 00:39:48,080 --> 00:39:57,270 Now, I put up the conditions that 284 00:39:57,270 --> 00:40:04,440 enable you to make a generalisation from a particular study, and I think these are fair. 285 00:40:04,440 --> 00:40:16,290 I have to admit I'm thinking primarily of randomised clinical trials, but I think the remarks apply much more broadly than that. 286 00:40:16,290 --> 00:40:28,400 These are the issues. Subject-matter understanding: that you understand the psychological, social, 287 00:40:28,400 --> 00:40:35,090 biological, physical processes that are going on underneath what you see; 288 00:40:35,090 --> 00:40:39,060 you have some deeper understanding. 289 00:40:39,060 --> 00:40:49,260 The size of the effect: big effects, particularly as expressed as ratios, are much more likely to be stable than smaller effects. 290 00:40:49,260 --> 00:40:54,810 And this was shown in the reference I've given there, to Cornfield. 291 00:40:54,810 --> 00:41:00,450 It was published in 1959 and reprinted in 2009; 292 00:41:00,450 --> 00:41:09,670 that's why it was given two dates. Cornfield was writing in the period when, 293 00:41:09,670 --> 00:41:17,060 surprisingly, Doll and Bradford Hill in London had found that 294 00:41:17,060 --> 00:41:21,320 smoking seemed to be a cause of lung cancer.
295 00:41:21,320 --> 00:41:27,200 They had expected to find, when they did this study, that air pollution, 296 00:41:27,200 --> 00:41:33,410 which was horrific in London in those days, was the cause of lung cancer. 297 00:41:33,410 --> 00:41:38,960 But they found that smoking seemed to be the only cause. 298 00:41:38,960 --> 00:41:45,910 And over the next few years, a number of other studies tended to confirm that. 299 00:41:45,910 --> 00:41:51,370 But the three leading statisticians of that time were all sceptical. 300 00:41:51,370 --> 00:42:05,730 R. A. Fisher considered that the effect had a genetic basis; I mentioned previously his keen interest in genetics. 301 00:42:05,730 --> 00:42:18,580 He thought that what was going on was a genetic cause, which of course has a certain modern nuance to it. 302 00:42:18,580 --> 00:42:29,320 Then Neyman in Berkeley. He and Fisher had had a long and highly acrimonious and destructive relationship 303 00:42:29,320 --> 00:42:39,790 with one another. Neyman was sceptical because the finding wasn't verified by a randomised trial. 304 00:42:39,790 --> 00:42:46,300 And the leading medical statistician of the time, the clinician Joseph Berkson, 305 00:42:46,300 --> 00:42:57,880 who worked at the Mayo Clinic, somewhere in the centre of the United States, Berkson also was sceptical, because he made the point 306 00:42:57,880 --> 00:43:10,270 that the effect of smoking on lung cancer is the difference between one small rate and a very, very small rate. 307 00:43:10,270 --> 00:43:15,820 But if you look at the effect of smoking on heart disease, the effect is superficially smaller: 308 00:43:15,820 --> 00:43:28,750 it's the difference between two big numbers, and the number of people apparently affected, as a difference of rates, was much higher. 309 00:43:28,750 --> 00:43:38,480 Now.
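[Editorial sketch.] Berkson's objection and Cornfield's reply turn on two different summaries of the same rates. With purely hypothetical death rates, chosen by us only to mimic the orders of magnitude involved and not the historical figures, the two summaries rank the associations in opposite ways:

```python
# Hypothetical annual death rates per 100,000, chosen only to illustrate the
# arithmetic of the argument; these are NOT the historical figures.
rates = {
    "lung cancer":   {"smokers": 140.0, "non_smokers": 10.0},
    "heart disease": {"smokers": 600.0, "non_smokers": 400.0},
}

for disease, r in rates.items():
    ratio = r["smokers"] / r["non_smokers"]   # Cornfield's summary: rate ratio
    diff = r["smokers"] - r["non_smokers"]    # Berkson's summary: excess deaths
    print(f"{disease}: rate ratio {ratio:.1f}, excess deaths per 100,000 {diff:.0f}")
```

With these numbers the ratio is far larger for lung cancer (14 versus 1.5), while the absolute excess is larger for heart disease (200 versus 130); part of Cornfield's mathematical argument was that the ratio is the more stable summary.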
What Cornfield did was to analyse all the work that had been done up to that point. 310 00:43:38,480 --> 00:43:42,930 And I'm not going to go into the details of what he showed, 311 00:43:42,930 --> 00:43:52,110 but the essence of his argument was that there were four totally different lines of argument, all pointing in the same direction: 312 00:43:52,110 --> 00:44:05,090 laboratory studies on animals, laboratory studies on humans, certain types of trial, and ultimately clinical trials. 313 00:44:05,090 --> 00:44:12,920 And his argument was essentially that while one might be able to argue one of these away, 314 00:44:12,920 --> 00:44:22,370 the fact that coming at the problem from four different directions you got to the same answer was overwhelming; 315 00:44:22,370 --> 00:44:26,050 and it had some more mathematical arguments behind it as well, 316 00:44:26,050 --> 00:44:34,580 to do with rates, ratios of rates rather than differences. 317 00:44:34,580 --> 00:44:39,590 And that really won the day, and Cornfield's 318 00:44:39,590 --> 00:44:45,990 paper was accepted by the US 319 00:44:45,990 --> 00:45:01,610 Surgeon General, I suppose, as a definitive statement about the causal relation of smoking and lung cancer. 320 00:45:01,610 --> 00:45:08,290 But an essence of it was that he was working with 321 00:45:08,290 --> 00:45:21,620 the size of the effect; that was important. If you can show that what you're looking for is repeatable in many environments, 322 00:45:21,620 --> 00:45:30,590 multiple studies alike, that's another route to the same conclusion. 323 00:45:30,590 --> 00:45:35,990 In principle, random sampling of your target population would 324 00:45:35,990 --> 00:45:44,000 give you the security you want, but such security is in my experience virtually always missing.
325 00:45:44,000 --> 00:45:59,620 Very few serious studies, certainly on the biomedical side, could possibly be justified as a random sample of the target population. 326 00:45:59,620 --> 00:46:05,110 Specificity: I'm not going to say too much about it, because it's possibly not of so much interest here, 327 00:46:05,110 --> 00:46:10,470 but it's to do with the circumstances, as I said, in which 328 00:46:10,470 --> 00:46:26,430 you can say of the conclusion from some broad study: yes, this really applies here to my particular decision or policy recommendation; or it might not. 329 00:46:26,430 --> 00:46:34,520 And in many ways the considerations are the same as for generalisability. 330 00:46:34,520 --> 00:46:44,460 Now, there's a dreaded word I have very carefully, I think, managed to avoid mentioning, and that's causality. 331 00:46:44,460 --> 00:46:57,600 And I think I might be about to run out of time. 332 00:46:57,600 --> 00:47:01,440 When I started statistical work, causality was a forbidden word; 333 00:47:01,440 --> 00:47:10,310 it was philosophy, and serious scientists, mathematicians and physicists didn't want to go near it. 334 00:47:10,310 --> 00:47:17,500 Nobody said it quite as crudely as that. And then, 335 00:47:17,500 --> 00:47:29,910 I was going to say quite recently, but it was actually about 30 or 40 years ago, that fairly suddenly changed, when 336 00:47:29,910 --> 00:47:36,900 an important paper was read to the Royal Statistical Society in which it 337 00:47:36,900 --> 00:47:43,950 was pointed out that a massive effort had gone into randomised clinical trials; 338 00:47:43,950 --> 00:47:51,470 it was time to turn attention to issues that could only be addressed observationally, 339 00:47:51,470 --> 00:47:57,530 by observational methods and data. And that led to a change of emphasis, 340 00:47:57,530 --> 00:48:09,990 the formulation of guidelines that would lead to causality, and to a massive current literature.
341 00:48:09,990 --> 00:48:18,420 Some of it coming into the field from computer science more than statistics. 342 00:48:18,420 --> 00:48:30,020 I'm not going to try to discuss that, because it needs a lecture, more than a lecture, on it by somebody else, not by me. 343 00:48:30,020 --> 00:48:36,100 I'll just state three possible definitions. 344 00:48:36,100 --> 00:48:47,770 The two that really matter begin with this: you have evidence that if this had been different from how it is, other things being equal, 345 00:48:47,770 --> 00:48:57,540 then the outcome would have been different. Then you can say this, at the beginning, is a cause of that. 346 00:48:57,540 --> 00:49:02,910 That is, if you have solid evidence that had this been different, 347 00:49:02,910 --> 00:49:08,940 other things remaining the same, and you have to be very careful what that means, 348 00:49:08,940 --> 00:49:12,230 then the outcome would have been different. 349 00:49:12,230 --> 00:49:23,180 That's the second definition of causality; the third definition is that one, plus some understanding of process. 350 00:49:23,180 --> 00:49:39,150 Why is this going on? What biological, social or economic mechanism is in operation that produces the change we're talking about? 351 00:49:39,150 --> 00:49:52,300 And so I conclude with what I suppose I have to confess are the limitations of the statistical attitude I've tried to present. 352 00:49:52,300 --> 00:49:57,100 I've said very little about big data, machine learning and so forth, 353 00:49:57,100 --> 00:50:11,390 not to be dismissive of them, but because I wanted to emphasise the aspects of what I see as specifically statistical. 354 00:50:11,390 --> 00:50:21,280 The theoretical literature is overwhelmingly about probabilistic models and their analysis, and it is fascinating stuff if 355 00:50:21,280 --> 00:50:31,040 you want to go into that.
It's a tiny proportion of what statisticians do, even theoretical statisticians like me. 356 00:50:31,040 --> 00:50:35,690 The formal theory is fairly important; 357 00:50:35,690 --> 00:50:46,280 it's important, obviously, as the intellectual underpinning, but it's not what one primarily does. 358 00:50:46,280 --> 00:50:59,820 What it does do is put an emphasis on trying to get individual studies whose interpretation is as secure as can be. 359 00:50:59,820 --> 00:51:12,200 But that is, in a way, a limitation, because although it's clearly a good thing to have studies with secure individual interpretations, 360 00:51:12,200 --> 00:51:15,200 nobody could not want that, 361 00:51:15,200 --> 00:51:25,730 to see that as the sole objective is much too narrow, and very often the biggest advances come from saying: there's this 362 00:51:25,730 --> 00:51:33,450 evidence from here, there's this evidence, and there is this evidence, and putting those together in some way 363 00:51:33,450 --> 00:51:43,760 may give an enlightened view. Cornfield's work, which I mentioned a few moments ago, is in a sense an example of that, where he took very, 364 00:51:43,760 --> 00:51:52,010 very different types of study that could never have been analysed collectively, because they were of totally different kinds, with 365 00:51:52,010 --> 00:52:00,110 different outcomes and so forth. But it was the combined impact of putting these things all together that made 366 00:52:00,110 --> 00:52:10,210 for the strong interpretation. Well, I've put some references up in 367 00:52:10,210 --> 00:52:16,720 the notes, but I won't go through them. Well, thank you very much. 368 00:52:16,720 --> 00:52:25,260 Thank you. Thanks, Ken. 369 00:52:25,260 --> 00:52:31,680 Thank you very much, David, for these general principles on just how to do research, I think,
370 00:52:31,680 --> 00:52:39,300 not just statistical research; any project could benefit, I think, from having the kind of outline that you've presented. 371 00:52:39,300 --> 00:52:49,070 We have time now for some questions. So just raise your hands and I can bring the microphone, so that we have the voice recorded as well. 372 00:52:49,070 --> 00:52:55,470 Oh, yes. Hi, thank you very much for your lecture, it was very interesting. 373 00:52:55,470 --> 00:53:01,560 So I had one question, about a concern that I actually discussed with others a few days ago, 374 00:53:01,560 --> 00:53:06,690 which is that I feel there is an increasing divide between machine learning and 375 00:53:06,690 --> 00:53:12,120 statistics: machine learning is being developed as a separate branch, 376 00:53:12,120 --> 00:53:19,080 completely independently. And now people who are being taught machine learning are not necessarily being taught statistics in the traditional way, 377 00:53:19,080 --> 00:53:26,220 but are going directly to machine learning. And so I wanted to know your thoughts on whether we should be striving for a 378 00:53:26,220 --> 00:53:31,200 combination of strategies between statistics and machine learning for analysing data, 379 00:53:31,200 --> 00:53:36,210 or is it fine to separate both branches, because they're doing just the same thing? 380 00:53:36,210 --> 00:53:44,810 Well, separation wouldn't necessarily be a bad idea. 381 00:53:44,810 --> 00:53:54,170 There is a difference in origin, I think: machine learning has largely come from an engineering background, 382 00:53:54,170 --> 00:54:03,620 whereas the most traditional statistical thinking comes from a mathematical and natural-science background. 383 00:54:03,620 --> 00:54:08,300 And I see machine learning, 384 00:54:08,300 --> 00:54:12,680 the work on machine learning, and some of it is very impressive.
385 00:54:12,680 --> 00:54:24,050 And if your object is very short-term prediction in a mechanical fashion, it may well be the best, quickest and safest way to go. 386 00:54:24,050 --> 00:54:31,060 Will it lead you to understand the world better? That I'm less convinced about. 387 00:54:31,060 --> 00:54:39,880 You see, the work of Battey's and mine that I went through so hastily at the end, 388 00:54:39,880 --> 00:54:42,970 that's a kind of machine learning. 389 00:54:42,970 --> 00:54:56,950 But it's focused on trying to say: it might be this, it might be that, it might be that; all of these give a perfectly good fit to the various data. 390 00:54:56,950 --> 00:55:05,890 Is there extra evidence somewhere that will tell you which is the right one, or what can we conclude from what we see? 391 00:55:05,890 --> 00:55:16,500 And I don't see that in the machine learning literature. 392 00:55:16,500 --> 00:55:22,880 And also, I mean, some of 393 00:55:22,880 --> 00:55:40,750 the neural-net work is very, very clever, but it's looking for extremely irregular dependencies in complex systems, whereas 394 00:55:40,750 --> 00:55:52,660 certainly my feeling would be to start with the simple and make it more complicated if you absolutely have to. 395 00:55:52,660 --> 00:56:01,840 So there's a difference of philosophy there. I may be being unfair to the machine learners, 396 00:56:01,840 --> 00:56:15,920 but my impression is that if they find a neural net that describes a dependency very nicely, they don't attempt to say why. 397 00:56:15,920 --> 00:56:23,510 What does this suggest? What are the weaknesses of this? Will this apply in some new, slightly different situation? 398 00:56:23,510 --> 00:56:32,650 They don't seem to ask. At least I've never seen it, but I'm not trying to be unfair. 399 00:56:32,650 --> 00:56:55,100 Yes.
I was just wondering if you could talk a little bit more about the issue of causality in the realm of human action and choice. Because, I mean, 400 00:56:55,100 --> 00:57:02,300 recently I've been thinking a lot that many of our models of causality are really tailored for, 401 00:57:02,300 --> 00:57:05,810 you know, this sort of exogenous shock that happens to people. 402 00:57:05,810 --> 00:57:14,210 And I was thinking that, I mean, in the social world, you know, people make decisions and have, you know, some degree of free will. 403 00:57:14,210 --> 00:57:23,120 And I have a lot of trouble bringing that sort of causal framework into the social aspect. 404 00:57:23,120 --> 00:57:29,810 And if I can just give an example I was thinking of the other day, I'm not entirely sure it's the correct example, 405 00:57:29,810 --> 00:57:35,630 but I think it might touch upon some of the issues at play. 406 00:57:35,630 --> 00:57:42,410 So for instance, if I shoot Felix and kill him, I caused Felix's death. 407 00:57:42,410 --> 00:57:51,020 But if Nicolo pulled a gun to my head and forced me to kill Felix, who killed Felix? Did I kill Felix? 408 00:57:51,020 --> 00:57:55,100 Did Nicolo kill Felix? And I feel that in the social world 409 00:57:55,100 --> 00:58:04,250 we often have these sorts of indirect chains of causal triggers, and it's unclear what caused what, and what causing means. 410 00:58:04,250 --> 00:58:11,040 Thanks. Well, 411 00:58:11,040 --> 00:58:19,150 I share your considerable caution about using the word at all. 412 00:58:19,150 --> 00:58:26,920 And even so, there are various levels. Because if you take the causal notion in the sense that 413 00:58:26,920 --> 00:58:32,620 we have reasonable evidence that this group of people, had they been randomised to that 414 00:58:32,620 --> 00:58:39,120 treatment rather than this treatment, would have done, on the whole, better,
415 00:58:39,120 --> 00:58:44,310 then it may be possible, it's not totally unreasonable, to say 416 00:58:44,310 --> 00:58:52,380 that the treatment, whatever it may be, has caused the outcome. On the other hand, 417 00:58:52,380 --> 00:59:00,060 that is certainly only a causal explanation at one level. 418 00:59:00,060 --> 00:59:07,590 If you ask what was going on in people's minds, or what was the biochemistry underlying what they did, and so forth, 419 00:59:07,590 --> 00:59:16,020 that's another level. And if you pursue that idea, of course, everything then 420 00:59:16,020 --> 00:59:21,060 goes back to quantum mechanics and molecules. 421 00:59:21,060 --> 00:59:27,440 And if you look at the foundations of quantum mechanics, you find something curious. 422 00:59:27,440 --> 00:59:32,480 It's a fantastic, beautiful, extraordinary theory, 423 00:59:32,480 --> 00:59:37,340 one of the great intellectual triumphs of the 20th century. 424 00:59:37,340 --> 00:59:45,990 But I think I'm right that still, if you look, the science doesn't rest on its foundations. 425 00:59:45,990 --> 00:59:58,480 I think one of the conclusions is that the foundations and the superstructure, so to speak, 426 00:59:58,480 --> 01:00:05,620 are mutually supportive. Now, in some situations the support is almost all in one direction, 427 01:00:05,620 --> 01:00:12,520 but I think there's always an element of both. One of the reasons quantum mechanics is so fantastically 428 01:00:12,520 --> 01:00:21,400 successful is that it has predicted extraordinary phenomena with high accuracy, 429 01:00:21,400 --> 01:00:33,200 not that it's a compelling intellectual structure.
430 01:00:33,200 --> 01:00:34,400 Thank you very much for the talk, 431 01:00:34,400 --> 01:00:41,990 and I liked especially what you just mentioned: that at some point we seem to hop onto the train of quantum mechanics all the way down at 432 01:00:41,990 --> 01:00:51,680 the lowest layer, and then somewhere we just start to track things, and we see perhaps predictions of associations with that thought in mind. 433 01:00:51,680 --> 01:00:58,730 I myself was very surprised when I was first told that logistic regression works because 434 01:00:58,730 --> 01:01:04,430 the error term has an extreme-value type I distribution rather than a normal distribution, 435 01:01:04,430 --> 01:01:09,020 because in mathematics there are some arguments for that. Yeah. 436 01:01:09,020 --> 01:01:14,630 So, trying to bridge the gap between the machine-learning conjectures and statistics, 437 01:01:14,630 --> 01:01:19,070 which is very heavily indebted to assumptions on the error term: 438 01:01:19,070 --> 01:01:27,570 I was wondering whether you see potential to retain the standard functional forms we've been using, 439 01:01:27,570 --> 01:01:34,490 or to allow new forms of the error terms that are so fundamental to the way classical statistics works, 440 01:01:34,490 --> 01:01:43,790 and what your thoughts are on that. So what is so fundamental? You said something is fundamental to classical statistical work. 441 01:01:43,790 --> 01:01:48,840 Well, in the linear regression framework, I would say the normal distribution on the error term. 442 01:01:48,840 --> 01:01:59,390 But I was wondering how you see different types of errors, because every time we run an OLS through our data in R or something like that, 443 01:01:59,390 --> 01:02:08,450 we usually assume this normal distribution. But obviously we could do different things there.
444 01:02:08,450 --> 01:02:20,360 If you provisionally assume a normal distribution, you then get your answer, and then you might say, either by subsidiary inspection of the data or in general, 445 01:02:20,360 --> 01:02:27,500 have I made any assumptions in this that might critically affect my answer? 446 01:02:27,500 --> 01:02:32,270 And if you thought that the answer 447 01:02:32,270 --> 01:02:40,130 depended critically on the normality of the errors in the regression, well, you'd have to do something about it. 448 01:02:40,130 --> 01:02:45,840 Quite what, I don't know, but something. I mean, 449 01:02:45,840 --> 01:02:59,370 I don't see any of these things as set in stone, so to speak. Would you then see it as worthwhile to try something like a different distribution, 450 01:02:59,370 --> 01:03:07,700 one which is bimodal and has this kind of shape? That was the conjecture I was thinking about. 451 01:03:07,700 --> 01:03:14,820 Well, if you find a bimodal distribution, you would surely want to know why. 452 01:03:14,820 --> 01:03:28,170 What are the identifiers of the two modes? That would be an instance where an intermediate stage of analysis might lead you to revise 453 01:03:28,170 --> 01:03:36,010 at least some aspects of the question you started with. And I did emphasise at the beginning that this sequence of 454 01:03:36,010 --> 01:03:41,200 stages is provisional. 455 01:03:41,200 --> 01:03:54,920 At any point. The worst, you know, is the day before your paper goes to the journal, to say: oh my goodness, I'd better look at that again. 456 01:03:54,920 --> 01:04:01,080 You'd better look at it. You understand, 457 01:04:01,080 --> 01:04:08,510 I don't mean to sound frivolous about it at all. But these methods, 458 01:04:08,510 --> 01:04:14,880 all the methods, whether AI- or machine-learning-
459 01:04:14,880 --> 01:04:25,740 driven, or more formally statistically based, they all rely on assumptions. 460 01:04:25,740 --> 01:04:30,900 Those assumptions are likely to be a bit wrong. 461 01:04:30,900 --> 01:04:35,040 They may be very wrong; they may be so wrong that they're misleading. 462 01:04:35,040 --> 01:04:42,660 And then you have to try to discover that in some way or other, and rethink. 463 01:04:42,660 --> 01:04:50,380 I mean, this is the limitation of textbooks; textbooks, 464 01:04:50,380 --> 01:04:55,410 particularly some statistics textbooks, make it all sound so formalised. 465 01:04:55,410 --> 01:05:01,500 That's because the mathematicians have got hold of it, and in a way it is a very good 466 01:05:01,500 --> 01:05:12,960 thing to have it very formalised; it's just not a very good thing to take the formalities too seriously. 467 01:05:12,960 --> 01:05:17,220 Take them seriously, but not too seriously; 468 01:05:17,220 --> 01:05:20,720 you understand what I mean. Thank you, David. 469 01:05:20,720 --> 01:05:23,040 It's a pleasure to have the opportunity to listen to you. 470 01:05:23,040 --> 01:05:30,270 I wonder if you could comment on the recent trend in various strands of social science towards preregistration, 471 01:05:30,270 --> 01:05:37,860 something with which I have some sympathy. I'm sorry, but I'm getting a bit deaf. 472 01:05:37,860 --> 01:05:43,930 I wondered if you could comment on the recent move towards preregistration in the social sciences; 473 01:05:43,930 --> 01:05:50,340 it's the same thing: registering the hypotheses that you wish to test and the expectations you have, 474 01:05:50,340 --> 01:06:03,130 and whether this goes against any of the principles you outlined that have to do with, maybe, reassessing provisional expectations.
475 01:06:03,130 --> 01:06:12,670 Well, to be able to say at the beginning of a study, I have this purpose in mind and I'm going to do this and this: 476 01:06:12,670 --> 01:06:20,530 that's a good thing. To say after a year, or five years, or ten years, or fifteen years 477 01:06:20,530 --> 01:06:34,120 that I'm solely stuck with what I said fifteen years ago is an absolute disaster, and a travesty, 478 01:06:34,120 --> 01:06:42,780 really, of what statistical thinking, or what scientific thinking, is about. It has to be 479 01:06:42,780 --> 01:06:50,200 that you get the good elements out of these things. But 480 01:06:50,200 --> 01:07:01,500 pre-registering: I mean, people applying for a research grant, I suppose, will probably say 481 01:07:01,500 --> 01:07:05,970 how they are going to analyse their data and what they might possibly find. Great; 482 01:07:05,970 --> 01:07:10,080 of course they should do that. Should they be totally tied to it? 483 01:07:10,080 --> 01:07:24,090 Absolutely not. And the longer the time period, and the more rapidly evolving the subject, the more disastrous that would be. 484 01:07:24,090 --> 01:07:36,090 It's a bit like the appalling amount of nonsense being written in the statistical literature about significance tests, 485 01:07:36,090 --> 01:07:48,030 which is based on a total misunderstanding of what they are about, but we won't go into that, with people having to say in advance: 486 01:07:48,030 --> 01:07:53,460 this must be significant at such and such a level before I can report my result. 487 01:07:53,460 --> 01:08:00,750 So they get p equal to 0.051: disaster, it missed the five percent level. 488 01:08:00,750 --> 01:08:04,680 And I've actually heard a serious biologist do this: 489 01:08:04,680 --> 01:08:10,740 p equal to 0.049, all right, we can publish. This is absolute rubbish, of course.
490 01:08:10,740 --> 01:08:20,420 Total nonsense, a total misunderstanding of the purpose of these techniques. 491 01:08:20,420 --> 01:08:22,610 Thank you very much. I really enjoyed the talk, 492 01:08:22,610 --> 01:08:33,710 and I would be really interested to know what you think about the possibility of using non-probability samples to draw conclusions about certain things. 493 01:08:33,710 --> 01:08:36,230 In this summer school we, for example, 494 01:08:36,230 --> 01:08:44,150 heard of the opportunity that the internet or big data provides: that if you're interested in a question for which, for example, 495 01:08:44,150 --> 01:08:49,760 you don't have data from a representative survey, or which concerns a peculiar group of people, 496 01:08:49,760 --> 01:08:53,750 then we could easily go on the internet and collect the data. 497 01:08:53,750 --> 01:08:58,880 Obviously this wouldn't have the properties of the probabilistic samples we are used to working with, 498 01:08:58,880 --> 01:09:06,530 and then we learn that there is an opportunity to sort of weight the data, or stratify, and then draw conclusions, 499 01:09:06,530 --> 01:09:11,450 maybe about the proportion of individuals that would vote one way or another. 500 01:09:11,450 --> 01:09:15,230 But I'm more interested in the relationships between variables, 501 01:09:15,230 --> 01:09:21,680 in the same way as you mentioned: how is this thing related to my outcome of interest? 502 01:09:21,680 --> 01:09:25,610 So this was a long-winded question, 503 01:09:25,610 --> 01:09:32,630 but basically it is: do you think that we would be, or will be, able to use these kinds of non-probability samples 504 01:09:32,630 --> 01:09:40,440 to draw similar conclusions to what we are used to, based on regression types of analysis? 505 01:09:40,440 --> 01:09:48,830 What do you think about this in general? Well, probability samples are
506 01:09:48,830 --> 01:10:00,410 a beautiful, fine idea if your response rate is well above 95, 98 percent. 507 01:10:00,410 --> 01:10:12,440 Fantastic. If your response rate is 40 percent or 50 percent, then even if it's on a probability basis, there's no formal guarantee: 508 01:10:12,440 --> 01:10:18,320 you have to use more delicate methods of analysis to detect, well, 509 01:10:18,320 --> 01:10:30,590 how far is your sample unrepresentative of the target population, and does that affect the answer that you're getting? 510 01:10:30,590 --> 01:10:37,400 Maybe, maybe not; that has to be explored. But I did have it on one of my slides: 511 01:10:37,400 --> 01:10:43,310 outside social research, in industrial and physical contexts, 512 01:10:43,310 --> 01:10:54,700 there are some settings in which sampling is done according to strict rules and implemented 100 percent. 513 01:10:54,700 --> 01:10:59,260 Certain kinds of account auditing, for example. 514 01:10:59,260 --> 01:11:08,020 An auditor goes to a company, looks at a large number of accounts that have to be checked, 515 01:11:08,020 --> 01:11:17,470 and takes a sample, if he is doing his job properly, in accordance with clearly specified sampling rules. 516 01:11:17,470 --> 01:11:28,980 That's great. But it's very atypical of most research, of social research situations such as I've seen. 517 01:11:28,980 --> 01:11:37,030 You have your data; you have the population who have contributed to the data. 518 01:11:37,030 --> 01:11:45,430 What's the clue about those who haven't? Is the sample atypical; is that something that should raise alarm? 519 01:11:45,430 --> 01:11:53,110 Maybe not, if you're lucky. But I don't think the formal theory of sampling 520 01:11:53,110 --> 01:12:17,690 is of any help to you once the non-response rate rises above, I'm not going to mention a number, you know, somewhere.
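[Editorial sketch.] Cox's warning about response rates can be made concrete with a toy simulation. The selection mechanism below is entirely hypothetical and of our choosing: when the probability of responding is related to the outcome itself, a sample with roughly 45 percent response stays badly biased however large it is, while a near-complete response rate leaves selection little room to act.

```python
# Entirely hypothetical selection mechanism, to illustrate the point about
# response rates: bias from selective non-response does not shrink as the
# sample grows, but a near-complete response rate bounds how bad it can be.
import random

random.seed(1)
N = 200_000
population = [random.gauss(0.0, 1.0) for _ in range(N)]
true_mean = sum(population) / N   # close to 0 by construction

def survey(pop, base_rate, selectivity):
    """Each unit responds with probability base_rate + selectivity * outcome."""
    sample = []
    for y in pop:
        p = min(1.0, max(0.0, base_rate + selectivity * y))
        if random.random() < p:
            sample.append(y)
    return sample

low = survey(population, base_rate=0.45, selectivity=0.15)   # ~45% respond
high = survey(population, base_rate=0.98, selectivity=0.15)  # ~98% respond

mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
print(f"true mean {true_mean:.3f}")
print(f"~45% response: mean {mean_low:.3f} from n={len(low)}")
print(f"~98% response: mean {mean_high:.3f} from n={len(high)}")
```

Weighting or stratification can repair this only if the variables driving response are observed; when selection depends directly on the outcome, as in this toy setup, more data does not help, which is one reading of the "more delicate methods" Cox mentions.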
521 01:12:17,690 --> 01:12:23,540 So I'll take some liberties and ask you something, if I may. 522 01:12:23,540 --> 01:12:29,510 One of the things you dwelt on, and I thought it was very interesting, is that you 523 01:12:29,510 --> 01:12:36,620 have to deal with uncertainty and think hard about uncertainty in a context where we have big data, 524 01:12:36,620 --> 01:12:39,020 but we also have quality concerns. 525 01:12:39,020 --> 01:12:49,670 So one of the challenges that we've talked about in the Summer Institute, for example, is that we often have big data sources, 526 01:12:49,670 --> 01:12:53,690 some from the internet, that give us a lot of numbers. 527 01:12:53,690 --> 01:12:58,370 But at the same time, we don't always understand the data-generating process because, you know, 528 01:12:58,370 --> 01:13:06,140 they're not custom-made for research in the way that a survey would be, where the researchers design the instrument and so on. 529 01:13:06,140 --> 01:13:13,760 So in this context of big but generally dubious or unclear or black-box data, 530 01:13:13,760 --> 01:13:21,440 what are the general principles we can adopt to think about measurement and uncertainty? 531 01:13:21,440 --> 01:13:26,730 Because you commented on how, of course, standard errors don't work the same way. 532 01:13:26,730 --> 01:13:33,220 What do we do then? Do we just use small samples, or should we, 533 01:13:33,220 --> 01:13:39,740 as some people in this room are very interested in doing, move into the Bayesian paradigm to address this uncertainty issue? 534 01:13:39,740 --> 01:13:43,850 So I'm interested in hearing your thoughts on how we think about uncertainty 535 01:13:43,850 --> 01:13:53,340 in the context of data that is big but of uncertain quality. 536 01:13:53,340 --> 01:14:02,590 Yeah.
537 01:14:02,590 --> 01:14:13,310 Of course, I don't quite know what to say to that. 538 01:14:13,310 --> 01:14:17,510 Big data. 539 01:14:17,510 --> 01:14:27,930 Well, first of all, it may be possible to say some parts of this data are reasonably secure and others are not, 540 01:14:27,930 --> 01:14:39,430 and then one can try to isolate that; or maybe the data has come from many sources and some of the sources are suspect. 541 01:14:39,430 --> 01:14:49,090 There are many things that you might do to check on the quality of the data. But I do think, in the last analysis, 542 01:14:49,090 --> 01:14:56,770 if the data is seriously misleading and 543 01:14:56,770 --> 01:14:58,940 uncorrectable, 544 01:14:58,940 --> 01:15:10,340 the place for it is the wastepaper basket, or the modern equivalent of the wastepaper basket, rather than taking up busy people's time analysing it. 545 01:15:10,340 --> 01:15:17,120 And of course, it's a very difficult judgement as to where that threshold is. 546 01:15:17,120 --> 01:15:31,400 I am very sceptical of the idea that there is some magic number-crunching to correct serious biases in data. 547 01:15:31,400 --> 01:15:43,090 That's not a very helpful answer to the first part. But: 548 01:15:43,090 --> 01:15:52,650 divide the data into rational sections and analyse the sections separately. If you find the conclusions 549 01:15:52,650 --> 01:15:57,610 broadly stable across the sections, that's reassuring. 550 01:15:57,610 --> 01:16:04,980 Either you've got broad stability of everything, or at least some stability. 551 01:16:04,980 --> 01:16:09,750 In other words, rely on 552 01:16:09,750 --> 01:16:22,440 assessments of precision that come directly from the stability of the conclusions across what should be comparable situations, 553 01:16:22,440 --> 01:16:33,880 not from internal calculations.
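[Editor's note: the split-and-compare advice here can be sketched in code — a hypothetical illustration on simulated data, not from the talk: fit the same model on separate sections of the data and judge precision by how much the estimates move across sections, rather than by a single internally computed standard error.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: outcome y depends on x with true slope 2.0, plus noise
x = rng.normal(size=1200)
y = 2.0 * x + rng.normal(scale=1.0, size=1200)

# Divide the data into rational sections (here: four equal blocks)
# and fit the same simple regression in each section separately
slopes = []
for xs, ys in zip(np.array_split(x, 4), np.array_split(y, 4)):
    slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
    slopes.append(slope)

# Stability across sections is the reassurance: the spread of the
# per-section estimates is an external check on precision
print("per-section slopes:", np.round(slopes, 3))
print("spread (max - min):", round(max(slopes) - min(slopes), 3))
```

If the sections came from genuinely comparable situations and the per-section estimates agree, the conclusion is externally stable; if they scatter widely, the neat internal standard error was hiding something, exactly as the next remark about independence assumptions warns.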
In general, the calculations of precision you'd find in the textbooks that lead to the sigma-over-root-n type of standard error — 554 01:16:33,880 --> 01:16:43,830 these are all based on the independence of individuals within the groups, 555 01:16:43,830 --> 01:16:51,600 and that is probably the most critical aspect, the one most likely to be wrong. 556 01:16:51,600 --> 01:16:59,400 But if you find an apparently meaningful relation that is stable across 557 01:16:59,400 --> 01:17:05,610 different parts of the UK or different countries within Europe or whatever it might be, 558 01:17:05,610 --> 01:17:12,930 if there's some external stability in the conclusions, then of course that's it: 559 01:17:12,930 --> 01:17:19,380 they may all be wrong, but it's much, much less likely. 560 01:17:19,380 --> 01:17:27,670 It's not very helpful, but good-quality data is the message. 561 01:17:27,670 --> 01:17:35,860 Kind of building on that: increasingly, a lot of the data that is generated is obviously proprietary to companies, 562 01:17:35,860 --> 01:17:43,540 and the way that we can study it from outside the company is mediated by the algorithms that are inside it. 563 01:17:43,540 --> 01:17:46,120 And so I think at this particular moment, 564 01:17:46,120 --> 01:17:54,880 a lot of us are facing the question of whether we stay outside, in academia, trying to study it while dealing with the fact that we can't get into 565 01:17:54,880 --> 01:17:58,060 the black boxes to know what the algorithms are, 566 01:17:58,060 --> 01:18:07,330 or even just what the decisions behind the mediation of social life on these online platforms are — versus going 567 01:18:07,330 --> 01:18:15,000 into the companies themselves and working as scholars there, where intellectual freedom is often pointed toward 568 01:18:15,000 --> 01:18:22,440 the aims of profitability or usefulness to the company; there's not as much intellectual freedom, I guess I would say.
569 01:18:22,440 --> 01:18:34,170 I don't know if you have any thoughts on that — on needing, in the aim of trying to get at the good data, to get inside, and therefore to leave academia. 570 01:18:34,170 --> 01:18:44,760 Well, that's rather outside my field, I'm afraid. The pharmaceutical industry 571 01:18:44,760 --> 01:18:50,320 has a peculiar reputation in many ways, but on the whole 572 01:18:50,320 --> 01:19:02,940 it seems to me that it has been supportive of good-quality research — at least the bigger companies have been, anyway. 573 01:19:02,940 --> 01:19:10,340 Indeed, in a way it's a model, as I perceive it, 574 01:19:10,340 --> 01:19:20,520 because, other than the research councils, who else could mount a clinical trial of the size necessary to get 575 01:19:20,520 --> 01:19:28,530 pretty clear results? It requires massive resources, both of money and of time. And I mean, 576 01:19:28,530 --> 01:19:44,140 my only experience of that is working for ten years with the management committee of a trial run by a pharmaceutical company. 577 01:19:44,140 --> 01:19:56,300 And I would say that the treatment of the problems seemed to me scrupulously honest. 578 01:19:56,300 --> 01:20:01,820 I know you said it's outside your area to answer, but that's actually a very interesting answer. 579 01:20:01,820 --> 01:20:07,760 I wasn't thinking in the direction of pharmaceuticals, but more of the social media platforms. 580 01:20:07,760 --> 01:20:17,550 But the answer provides a question for all of us, in terms of whether social media platforms, 581 01:20:17,550 --> 01:20:21,780 given the impact that they can have on society, need to take the science 582 01:20:21,780 --> 01:20:29,310 more seriously, in the way that pharmaceutical companies have to run trials on their drugs to see the effects. 583 01:20:29,310 --> 01:20:37,900 So anyway, the answer was still instructive.
584 01:20:37,900 --> 01:20:46,690 So I'm going to ask another one then, because I am interested. An interesting theme that ran throughout your talk was that there was 585 01:20:46,690 --> 01:20:52,960 implicitly this kind of dichotomy or tension between prediction and explanation. 586 01:20:52,960 --> 01:20:57,070 And this is a recurring theme both in the Summer Institute 587 01:20:57,070 --> 01:21:03,550 and in the context of wider discussions about statistics versus machine learning. 588 01:21:03,550 --> 01:21:10,690 And I sometimes find it a bit misleading to dichotomise in this way, 589 01:21:10,690 --> 01:21:17,500 because I do think that the purpose of a lot of theorising is implicitly also prediction. 590 01:21:17,500 --> 01:21:23,650 And so I don't think the two are as separate as they are often made out to be in 591 01:21:23,650 --> 01:21:31,510 the machine learning versus classical theoretical or domain-specific paradigms. 592 01:21:31,510 --> 01:21:40,810 And I wondered what you thought about this explanation-versus-prediction distinction that is being made, and whether 593 01:21:40,810 --> 01:21:50,420 you think that it maps as neatly onto the statistics/machine-learning divide as well. 594 01:21:50,420 --> 01:21:55,670 Now, that's a very, very important question. 595 01:21:55,670 --> 01:22:04,360 I think my reaction would be: prediction without understanding is a perilous business. 596 01:22:04,360 --> 01:22:09,950 You may get a beautiful prediction formula that applies 597 01:22:09,950 --> 01:22:15,390 in the very, very short term or in this particular context.
598 01:22:15,390 --> 01:22:27,680 Unless either you have seen empirically the stability of this prediction formula over quite a wide range of 599 01:22:27,680 --> 01:22:37,480 conditions, or you understand the sociology or the physics or whatever it might be behind the formula, 600 01:22:37,480 --> 01:22:49,680 it's perilous to use it to predict the future. And this seems to me the trouble with much short-term economic forecasting. 601 01:22:49,680 --> 01:22:59,080 People can sometimes get very good short-term predictions, but 602 01:22:59,080 --> 01:23:04,710 then the prediction model changes. So I think 603 01:23:04,710 --> 01:23:19,640 — well, there are two lines one could take. One can say, really, as academics, our object is to understand the world, 604 01:23:19,640 --> 01:23:26,030 hopefully to make it a better place. 605 01:23:26,030 --> 01:23:36,580 If you take the other view and say our object is to produce good prediction — 606 01:23:36,580 --> 01:23:44,380 well, in the longer term, good prediction comes by understanding, not purely by empirical 607 01:23:44,380 --> 01:23:52,690 thinking. I think; I hope so. 608 01:23:52,690 --> 01:24:05,470 Would you then say that the machine learning paradigm is intrinsically geared only to the short term? 609 01:24:05,470 --> 01:24:10,810 On the face of it. I mean, I wouldn't want to accuse anybody; 610 01:24:10,810 --> 01:24:18,910 I'm sure there are people in the machine learning field a little broader and more subtle in their thinking than that. 611 01:24:18,910 --> 01:24:25,540 But machine learning, as I look at it from the outside, does seem very, 612 01:24:25,540 --> 01:24:38,630 very focussed on fitting this particular set of data, even by highly contrived methods. 613 01:24:38,630 --> 01:24:56,560 However successful those methods are.
And you see, this relates to the discussion, and to the distinction I glossed over. Take Tibshirani's 614 01:24:56,560 --> 01:25:03,590 very influential work on the lasso, which is purely about prediction. 615 01:25:03,590 --> 01:25:10,190 He wasn't concerned to say: oh, here's the formula that predicts beautifully — 616 01:25:10,190 --> 01:25:17,960 but isn't there a different formula that does almost the same job with a totally different interpretation? 617 01:25:17,960 --> 01:25:27,850 I mean, I haven't discussed it with him personally, but I don't sense in his writing 618 01:25:27,850 --> 01:25:35,000 an effort to do that. It may be the honest answer 619 01:25:35,000 --> 01:25:47,840 to say: yes, we can predict, this is a predictor, but there are alternative explanations; or, 620 01:25:47,840 --> 01:25:54,830 as it turns out in the example that Heather and I looked at, some of the features were remarkably stable. 621 01:25:54,830 --> 01:26:01,680 We noted that we ended up with a number of prediction equations that predicted 622 01:26:01,680 --> 01:26:09,440 virtually identically well. But there was a substantial common element to them, so we could say: well, 623 01:26:09,440 --> 01:26:16,030 whatever the final collection we use, we had better have this, this and this in there. 624 01:26:16,030 --> 01:26:24,320 And that's an element of stability that seems to me important. 625 01:26:24,320 --> 01:26:30,150 Would you say that you see your work with Heather Battey as in some sense trying to 626 01:26:30,150 --> 01:26:37,430 bridge this prediction/explanation divide? Because the way I understand 627 01:26:37,430 --> 01:26:44,000 that paper is that it's also, in some sense, trying to let the domain 628 01:26:44,000 --> 01:26:50,660 expert adjudicate between different models to see what might be a plausible
629 01:26:50,660 --> 01:26:52,340 story to tell. 630 01:26:52,340 --> 01:27:03,330 So do you see that paper as trying to say that we shouldn't necessarily be thinking of prediction and explanation as separate goals? 631 01:27:03,330 --> 01:27:09,080 Well, I think we were primarily interested in the explanation, 632 01:27:09,080 --> 01:27:14,960 and regarded prediction as an interesting 633 01:27:14,960 --> 01:27:24,750 side issue — important, but not the prime goal. How did the work with Heather start? How does any work start? 634 01:27:24,750 --> 01:27:37,030 I don't know. It started three years ago, and at first it was remote. 635 01:27:37,030 --> 01:27:44,680 Yeah. I just wanted to ask you whether you think there should be — because I'm not sure I have seen this, 636 01:27:44,680 --> 01:27:54,010 maybe you know of it — a field that works at the intersection between machine learning and classical statistics. 637 01:27:54,010 --> 01:27:58,450 That is, generating more strategies for combining both: 638 01:27:58,450 --> 01:28:04,780 for example, using predictions from machine learning algorithms in more inferential statistics, or 639 01:28:04,780 --> 01:28:13,850 generating hybrid models that take the best of both worlds. 640 01:28:13,850 --> 01:28:21,830 Well, I don't quite see it like that. I think we all have to do the best we can. 641 01:28:21,830 --> 01:28:30,410 We all have to be as broad as we can, and recognise that people in fields related to ours are doing similar things, 642 01:28:30,410 --> 01:28:42,350 be interested and helpful and so forth. As for the division — in 10 or 20 years' time, people may ask: 643 01:28:42,350 --> 01:28:53,520 what's the difference? There may be no difference.
At the moment the differences are quite strong — and in fact in Oxford 644 01:28:53,520 --> 01:28:59,040 particularly strong, because the machine learning side of the 645 01:28:59,040 --> 01:29:05,370 statistics department is much bigger than the statistics side. 646 01:29:05,370 --> 01:29:20,970 But it's an accident of time. Statistics is a very broad field, ranging from the very, 647 01:29:20,970 --> 01:29:34,410 very mathematical to the purely descriptive and subject-matter based, and the emphases within it have moved even over my lifetime. 648 01:29:34,410 --> 01:29:40,410 They've evolved and changed: in some periods this is the focus of most interest, 649 01:29:40,410 --> 01:29:52,700 sometimes that. It's a sign of life. 650 01:29:52,700 --> 01:29:59,620 Do we have any further questions? OK, we have one more. 651 01:29:59,620 --> 01:30:05,380 Thank you very much for this very nice talk. It's basically a follow-up to what he was asking, 652 01:30:05,380 --> 01:30:12,100 but I'd like to focus not so much on the distinction between explanation and prediction, or between the computational 653 01:30:12,100 --> 01:30:20,980 field and statistics, as on how you think your general rules extend to the computational world. 654 01:30:20,980 --> 01:30:26,050 For example, is parsimony of models still so important, or actually not any more, 655 01:30:26,050 --> 01:30:30,250 because it only serves to help us better comprehend the models, or something? 656 01:30:30,250 --> 01:30:39,610 How important is it still to be very explicit about your uncertainty, if all you look for is the best model? 657 01:30:39,610 --> 01:30:44,800 You gave very general, very nice goals for statistics: 658 01:30:44,800 --> 01:30:50,920 where do you see the commonalities carrying through to the new world?
659 01:30:50,920 --> 01:30:57,670 Oh, that's a very difficult question. The brutal answer is, I never think of these things. 660 01:30:57,670 --> 01:31:02,350 Like so many working scientists, 661 01:31:02,350 --> 01:31:11,680 one presses on and tries to do something useful. But thinking about these general issues is fantastic, 662 01:31:11,680 --> 01:31:24,960 and that's why it's so good to come and talk to you and hear what you have to say about these things. 663 01:31:24,960 --> 01:31:39,080 The important thing: out of diversity comes strength — somebody must have said that. 664 01:31:39,080 --> 01:31:43,190 The statistical world has changed enormously in my lifetime. 665 01:31:43,190 --> 01:31:47,510 It's very different in different countries; in Western Europe 666 01:31:47,510 --> 01:31:53,760 it's much more mathematical than it is here. Most British statisticians 667 01:31:53,760 --> 01:31:59,020 have a strong interest in some field of application. That's not always the case; 668 01:31:59,020 --> 01:32:02,790 it's certainly not the case in the United States, for example. 669 01:32:02,790 --> 01:32:12,570 So there are national differences of emphasis and style, as between very basic principles, 670 01:32:12,570 --> 01:32:20,260 applying those principles to complicated problems, being involved in real applications and so on. 671 01:32:20,260 --> 01:32:26,340 And these succeed one another. That's a terribly useless answer, 672 01:32:26,340 --> 01:32:34,090 but I can't give a better one. 673 01:32:34,090 --> 01:32:41,770 All right, I think that takes us to time. I want to thank David once again for a very thought-provoking talk, 674 01:32:41,770 --> 01:32:48,850 and for his comments on, I think, a wide range of issues that we've been discussing throughout the Summer Institute. 675 01:32:48,850 --> 01:32:54,869 Yes. Thank you, David. Thank you.