Welcome to today's instalment of our Department of Statistics distinguished speaker series. I'm very glad to introduce a most distinguished speaker. Bin Yu is Chancellor's Distinguished Professor and Class of 1936 Second Chair in the Departments of Statistics and EECS at UC Berkeley. She has won many accolades and prizes: she is a member of the US National Academy of Sciences and of the American Academy of Arts and Sciences, a past president of the Institute of Mathematical Statistics, and a Guggenheim Fellow. And, fortunately for the UK, she also serves on the Scientific Advisory Committee of the Alan Turing Institute for Data Science and AI. She was formally trained as a statistician, but her research now extends well beyond the realm of statistics. As you will see in her talk, her work has leveraged new computational developments to solve important scientific problems, combining novel statistical machine learning approaches with the domain expertise of her many collaborators in neuroscience, genomics and precision medicine. Over to you.

Thank you for the very kind introduction. It's a pleasure to be here. As mentioned, I will share some of our work on a framework we call veridical data science. It has really driven my group's work over roughly the last ten years, starting with a paper of mine on stability, and it has now matured into this framework for veridical data science. It is not just for biomedical research, but the particular case study I will use comes from an attempt to look at cardiovascular disease, searching for gene-gene interactions, or epistatic interactions.

In case you are wondering what I mean by "veridical": the term was suggested by a colleague at Columbia. I had to look it up too; it means truthful, coinciding with reality. We liked the name. My earlier paper had a title along the lines of "three principles of data science", which is super long, and machine learning people like something short and sweet, so now we call it veridical data science.

A lot of the problems my group has been working on are biomedical. AI in medicine has seen huge advances recently, and people are getting organised to really look at electronic health data; UCSF hospitals held a meeting on this a couple of years ago, and all of that is part of precision medicine.
However, I would make the case that AI is like nuclear energy: both promising and dangerous. It is part of life and holds a lot of possibilities for us, but if we are not vigilant and careful, it can also bring a lot of harm. Data science is what sits under the hood of many AI devices. There is more to AI than data science and machine learning, but data science is like the heart or the brain of a lot of AI devices. Machine learning sits at the interface of computer science, statistics and mathematics, together with domain knowledge. You have probably seen the one diagram describing data science that many of us broadly agree with: data science tries to combine data with domain knowledge to make decisions and generate new knowledge in the context of a particular problem.

To give a working definition: veridical data science extracts reliable and reproducible information from data. But I also want to emphasise that we need to develop and enrich the technical language we use to communicate and evaluate empirical evidence in the context of human decisions and domain knowledge. We need new terms, but the new terms should correspond to reality in different problems, so that when we use the language to communicate we know whether we are talking about reality, not just something in our minds. The goal is really to realise the promise, and mitigate the dangers, of AI, data science, machine learning and statistics.

My view of data science is that it is best to think about it as a whole process, not disconnected steps — almost as one system; the concepts are then very easy to understand. And of course data scientists are people; here is a photo of some of my group members. The process starts from the domain: a question, and data collection — or data selection, because there are many public repositories we can access, though then we have less information about how the collection decisions were made. A lot happens here: how you measure and how you clean. We may also have to clean the data after we gather it, either from collectors or from databases, and then comes data exploration. A lot of the emphasis has been put on modelling and algorithms, which are very important, but all the other steps are equally important. Even data cleaning: most of us know that cleaning can take up 70 or 80 percent of our time, yet we rarely talk about it.
Then problem formulation: why this problem? Is it testing whether a coefficient in a linear regression is zero? The domain problem doesn't come formulated like that; you make a judgement call, and you make decisions all through the data analysis. Then you do post hoc analysis, you communicate, and you update domain knowledge. Many of these steps loop back on one another: maybe while doing data cleaning I actually reformulate the problem — it is a very non-linear back and forth. What is missing in this system, if we borrow ideas from quality control, which has been very successful, is standardisation: we should standardise the process so that the results become more reliable and transparent.

For the rest of the talk I will first describe the PCS framework for veridical data science, then go to a case study in biomedical research where we developed a new tree-based method to discover epistatic interactions.

The PCS framework was developed through my work with my collaborators; the paper came out last year with my former student Karl Kumbier, now at UCSF. It really tries to follow up on Breiman's two-cultures thinking and to integrate the two cultures instead of separating them — to streamline and unify many good practices from many groups, including my own, and to put them into a coherent conceptual and philosophical framework that other people can build on and take advantage of.

So: predictability, which has been at the heart of machine learning. In statistics we thought about prediction too, but it was never as central; I think many people don't think about prediction at all. Then computability — machine learning has put computation at the forefront. And stability, which is really an expansion of the concept of uncertainty from sample-to-sample variability to a much bigger scope, so that we can deal with data perturbations, data-cleaning perturbations, and problem-formulation perturbations. Relative to classical statistics: where we have a very well justified probabilistic framework — something like randomised trials — we can go back to it, though even there we will have to do data cleaning. So this does not discard classical inference; it asks us to also worry about the important steps taken along the way. PCS is intended to unify, streamline and expand on ideas and best practices from machine learning and statistics. If you looked at my group's projects before, we did worry about stability, but not in a consistent way; once we formulated the framework, things became coherent and consistent across the different subgroups in my group.
And PCS is really built on a lot of past work. Data science, to me, is both science and engineering. Predictability, in Popper's philosophy of science, is a really important step towards falsifiability, and it also tries to capture the replication idea from scientific research. Computability involves scalability and algorithmic convergence, ideas from computer science; but computability also includes data-inspired simulations, which I think are underused in statistics — we should use the data to come up with models close to reality, simulate from those models, and compare.

On stability: I embarked on promoting stability when I had the opportunity, about ten years ago, to give the Tukey lecture. What prompted it was the sample-to-sample instability of lasso in a neuroscience problem, which connects to robust statistics. Stability is a broad concern: in robust statistics, robustness has a particular technical meaning; there is numerical stability; and control and dynamical-systems people use stability too. The common pattern is: you have a perturbation, you justify the perturbation through documentation, and then you check whether the results change, relative to a tolerable threshold, under a stability measure. It is about interpretability: without stability, you shouldn't interpret your result. And it is about reproducibility, which has now expanded into a core scientific concern; I have been working with scientists to reduce the scope of experimental search spaces, to help design follow-up experiments, and possibly, in causal inference, to design interventions.

To be a little more specific, in 2013 I described stability through this paragraph: reproducibility is imperative for any scientific discovery; more often than not, modern scientific findings rely on statistical analysis of high-dimensional data; at a minimum, reproducibility manifests itself in stability of statistical results relative to "reasonable" perturbations to the data and to the model used. And of course "reasonable" is where the heavy-duty work lies — documentation backs that up.

Coming back to this system view of data science, predictability is really a reality check: nobody should try to interpret anything unless we know the key structures of the data are being captured. In many talks people discuss hypothesis testing as a way of assessing whether you can interpret a model. And stability, in this view, means you try to shake the system — take different parts out, put other parts in — and the system doesn't break. It's really very common sense: these principles are common-sense reality checks and robustness.
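To make the stability check concrete, here is a minimal sketch — not from the talk — of one common form of it: refit a sparse model on subsamples of the data and keep only the features that are selected most of the time. The synthetic data, the lasso penalty, and the 0.5 threshold are illustrative assumptions, not the speaker's specification.

```python
# Stability of feature selection under data perturbation (subsampling).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only features 0, 1 matter

counts = np.zeros(p)
B = 100
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)   # perturb: subsample half the data
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    counts += fit.coef_ != 0                          # record which features were selected

stability = counts / B
stable_features = np.flatnonzero(stability > 0.5)     # keep features selected in >50% of refits
print(stable_features)                                # expect [0, 1] if selection is stable
```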
Now the PCS workflow. PCS consists of two parts: the workflow and the inference. The workflow really tries to apply the same principles at every step of the data science life cycle, recognising that we make so many human judgement calls in the process — let's make those transparent, and perhaps perturb them a bit to make sure the system still works. P, predictability, at the model formulation stage really means future, new data: we should always keep in mind that we develop models not just for the data we have but for some future application or future diagnosis. C: computation is everywhere — whatever you do with the data, you compute.

And stability even includes language stability. I once had the opportunity to talk to a cancer expert from Lawrence Berkeley Lab. She works, in biological terms, with the extracellular matrix in her microenvironment theory of cancer — a theory which actually took 30 years for people to accept, that cancer is not just somatic mutation but is shaped by its microenvironment. We use "matrix" for a table of data, and I did not try to explain to her that her matrix was different from my matrix — it was a social occasion. But if we were working together, that is something we would have to settle: in the context of joint cancer research and analysis, we must make sure "matrix" means the same thing to both of us when we speak.

Then comes PCS inference. It really tries to expand classical inference, using perturbation as the basic concept instead of a probability distribution, and to make everything transparent and algorithmic. Classical inference remains a special case: if the probabilistic model is well justified, we go back to classical inference, with data cleaning as a separate issue.

On data perturbations — this is where the whole PCS story really started, with a very in-depth collaboration. About ten years ago I took a whole year off from my statistics position and sat in Jack Gallant's neuroscience lab to get into neuroscience; we worked on the movie reconstruction project together, and after that on interpreting the fitted models.
This is basic science: we wanted to see which features really drive a particular fMRI voxel's responses. They had recorded data for an hour and fifteen minutes, and I just thought: why an hour and fifteen minutes, and not two hours? And then: if I take ten minutes of the data away, basically nothing should change — do I get the same result? With lasso-type models, beyond the first couple of features, things were just not stable. That was disconcerting, so we developed ES-CV, which puts stability on top of cross-validation; it gave much smaller, more plausible voxel models — useful in case you want to intervene and present different stimuli.

Then, with the opportunity to give the Tukey lecture, I started thinking: there are so many different forms of data perturbation, and usually we do only one. Even if we did all of them, would we get the same result? A newer form of data perturbation we want to include in the PCS framework is synthetic data from mechanistic models: that becomes an admissible form of data perturbation. Even within a probabilistic framework you can have stochastic differential equations: a stochastic model based on one paper's PDE, another based on a different PDE, and you can bring both in as perturbations without committing to one bigger PDE model that includes both. Then there is the choice of data modality: you can have video data or audio data. PIAAC, for example, is a survey that a world organisation sends to different countries to assess the skills of their workforces; the interviewer can go to somebody's house and administer the test by audio or by video, and you would get somewhat different data. Differential privacy is a form of stability. And the form most noted now, with all the attention on deep learning, is adversarial attacks: you can corrupt something like a medical diagnosis by making adversarial perturbations to the medical images. All of these can be viewed as different forms of perturbation, and we can unify them; depending on the purpose, we choose different forms.

And of course robust statistics was Tukey's attempt to look across different models at the centre of a distribution: you want the location statistic to be stable whether the distribution is long-tailed or Gaussian. How about optimisation? With non-convex optimisation you may end up in different local optima depending on the initialisation — a sensitivity analysis of its own.
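A tiny illustration of the robust-statistics point above — again a sketch, not from the talk: under heavy-tailed contamination, the median (a Tukey-style location statistic) stays far more stable across perturbed samples than the mean. The contamination scheme is invented for illustration.

```python
# Robustness as stability: spread of mean vs. median across perturbed samples.
import numpy as np

rng = np.random.default_rng(1)
n, B = 500, 200
means, medians = [], []
for _ in range(B):
    x = rng.normal(size=n)
    x[:25] += rng.standard_cauchy(size=25) * 10   # contaminate 5% with heavy tails
    means.append(x.mean())
    medians.append(np.median(x))

# The median's spread across perturbed samples is far smaller than the mean's.
print("sd of mean:  ", np.std(means))
print("sd of median:", np.std(medians))
```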
Bayesian modelling: the prior is also a form of perturbation. People are aware of that, but we don't act on it very much as a community. There is also researcher perturbation: if we took four statistics faculty colleagues from Oxford and gave them the same problem, would they get the same results, even qualitatively? We don't even try. With software so widely available now, this kind of perturbation analysis is worth promoting. Climate scientists already do it: you often see plots of nine different climate models for global temperature prediction, giving a whole interval. So they already do perturbation analysis across models.

The other side is that we need to make all our human judgement calls transparent. There is no magic: you document, you make an argument, and the argument is documented. There is a whole list of things we do and choose, and all of it has to be documented. You have models, which are mental constructs, and you have reality, and the two don't have to be connected: you write down X, and suddenly X means a gene. Documentation is the bridge. You build that bridge with quantitative and qualitative narratives — say an R Markdown or Jupyter notebook — and you make the judgement calls explicit in writing, so people can judge whether they trust your judgement calls or not. Otherwise we are not on solid ground.

People often ask me how to choose perturbations in the PCS framework. You cannot do every possible perturbation — we would never go home, because you could keep perturbing everything; it is not feasible. But if you commit to the stability principle, you naturally think harder about which perturbations matter, because if you pile on enough perturbations nothing will stay consistent. It requires documenting and making the case, choosing your perturbations carefully; that is part of your evidence.

Now to the part I will follow up on in the case study, which is inference. You can do prediction and stability analyses, but people then want a measure of the strength of the evidence — a role traditionally played by p-values. In all the work I have done, we never had the decision power: we provide evidence to help experts make decisions. Maybe the FDA has that power, but even the FDA presumably has medical experts agreeing on certain procedures. So the key, for most of us, is really to provide data evidence in a transparent manner, so that experts can understand it and make decisions, perhaps together with us.
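As a concrete version of the model-ensemble perturbation mentioned above (the climate-model style of interval), here is a minimal sketch, not from the talk: fit several reasonable models to the same data and report the spread of their predictions rather than a single number. The three model choices are assumptions for illustration.

```python
# A perturbation interval across model choices, instead of one point prediction.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

models = [LinearRegression(), Ridge(alpha=1.0),
          RandomForestRegressor(n_estimators=200, random_state=0)]
x_new = np.array([[1.0, 0.5, -0.3]])
preds = [m.fit(X, y).predict(x_new)[0] for m in models]

# Report the model-perturbation interval, not just one model's prediction.
print(f"perturbation interval: [{min(preds):.2f}, {max(preds):.2f}]")
```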
And the p-value definitely has problems, as we all know. If we take a critical look at the probabilistic statements we use for inference: for many, many years, if not decades, my statistics classes started with random variables, and I would just say "this random variable represents my data" without a second thought. But if you take a step back, as I later did, saying the data is a realisation of a random process is an assumption, even when we just talk about random variables. If you only care about one data set, you don't really need a random variable; the random variable becomes meaningful when you think about two realisations of the same random variable — saying they come from the same distribution is a way to tie two things together, and that is implicitly assuming stability. Otherwise you don't need the machinery: to describe and tabulate one data set, you don't need a random sample. So when we use random variables we are bringing in assumptions, and when those assumptions are not substantiated, the probabilistic statements become questionable — and often the model structure is not right either. We usually proceed as if the assumed model structure were correct and then report quantitative measures for the parameters, with small error bars, but no measure of the model bias is provided. I also definitely think we need to move away from the term "true model": you tell students about a true model even though you know it is not true. It is better to say approximate or postulated model, just for consistency with what we really do.

PCS inference tries to mitigate some of the problems I was just mentioning. We used to do model diagnostics — classical books on applied regression analysis have chapters on diagnostics — but with high-dimensional data that becomes very difficult, so people skip it and jump to inference right away. So it is good to bring in, via computation, predictability — prediction, which machine learning made very popular — and use prediction as model checking. And then stability: we want to expand beyond sample-to-sample variability. Say I formulate the question of whether a gene is important in two different ways, with lasso and with random forests, each with its own importance measure; the rankings are comparable, so I can put them together and go beyond a single class of models to talk about some measure of uncertainty. Of course, this requires computation.
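A minimal sketch of the cross-model comparison just described — not from the talk: rank features by absolute lasso coefficients and by random-forest importances, and quantify how much the two rankings agree. The synthetic data and the choice of Spearman correlation as the agreement measure are illustrative.

```python
# Making a target comparable across model classes via rankings.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n, p = 300, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n)

lasso_score = np.abs(Lasso(alpha=0.05).fit(X, y).coef_)
rf_score = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y).feature_importances_

# Rankings, not raw scores, are comparable across the two model classes.
rho, _ = spearmanr(lasso_score, rf_score)
print("rank agreement (Spearman):", round(rho, 2))
```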
One frontier of PCS research is how to streamline the stability analysis so that it is computationally efficient. The goal is to move away from always making a probabilistic statement when we don't have a justified probabilistic model; when we do, of course, let's use it. But often we just use the probabilistic model as a surrogate to get something going, and then I think it is better not to use the probabilistic frame but to use perturbations, and to be transparent.

To be specific: you perturb the data in multiple ways, and you choose a target of interest that is comparable across the different perturbations — say the ranking of genes from lasso and the ranking of genes from random forests, which are comparable. Then you do a sample split into training and test, which takes a lot of thinking if you are not in the i.i.d. case. Then we screen the models based on predictability: the ones that don't pass the screening on training data — perhaps using cross-validation or a validation set held aside — are dropped. For the ones that pass the screening, you start to worry about stability: in PCS inference you use documentation to argue for appropriate data perturbations and other perturbations, and then you formulate a perturbation interval for reporting, which can be useful, and a PCS p-value as a form of evidence, based on perturbation and stability analysis, after model checking. Any questions?

[Question] Very nice. From what I understand, a lot of your perturbation and EDA is validated using prediction — you check whether the predictions are correct — and that allows you to get away from the modelling assumptions, in a sense, if I'm right. But if you're interested in inference, you don't only want good predictions; you actually want to interpret the inference you are doing — say, presenting it to a regulator — because you want to take a decision on something specific. What kind of stability checking are you going to do then? Can you still use prediction, or is there something else? Can you just do stability checks without a model behind all this? Does that make sense?

[Answer] The next part of the talk will answer your question, to a certain extent. That is where our case study actually goes: there we do want to support decision making.
But we still use prediction as the screening, and then we have models, and we compare a null distribution under perturbation with a reference distribution, and we come up with PCS p-values. So maybe as I go through the next part it will give some answers. The PCS inference is the newest part: in the original paper we had some ranking comparisons, but we did not formally go to p-values or confidence intervals. This project, which I will now share, tries to do that in the context of epistasis discovery. So please do ask questions as I go.

The second part, then, is about epistatic interactions — basically non-linear interactions between genes. This is part of a multi-disciplinary, multi-institution project on deep learning and single-cell models for cardiovascular health, funded by the Biohub, which is next to UCSF — a West Coast counterpart, in a sense, of the Broad Institute, started a couple of years ago. They called for Intercampus Awards, which required people from all three campuses: Stanford, UCSF and Berkeley. We teamed up with cardiologists from Stanford and UCSF; my statistics colleague and I are the data scientists, and we have many postdocs working on this.

The particular thing we did, as the first step of this multi-disciplinary, multi-institution project, was to locate non-linear interactions. We developed a method called epiTree; the paper is out on arXiv — under a different title, about learning epistatic phenotypes via tree-based interactions — and has been submitted to a human genetics journal, so it is tailored to that audience. I want to call out the two bright young scientists: Merle Behr, a postdoc who has now returned to Germany, and Karl Kumbier, my former student, now at UCSF; the two senior people are James Priest, a cardiologist from Stanford, and myself.

First, some history. The name for this non-linear interaction goes back about a century, and Fisher studied it statistically in his famous 1918 paper. The classical textbook example these days is the Drosophila fruit fly: one line with brown eyes, one with scarlet. The brown mutation actually removes the red pigment in Drosophila, and scarlet is in effect the counterpart of brown, removing the brown pigment. If you cross the two inbred lines, in the F1 generation you get red, wild-type eyes: the two genes complement each other.
But if you do the second-generation cross, you actually see a lot more: you get brown, you get scarlet — and you get white, in the classic 9 red : 3 brown : 3 scarlet : 1 white ratio. So this is an example of non-linear interaction: white, a phenotype you never saw in the parental lines, appears in the second generation because the genes interact. Fisher basically formulated a statistical model for this — he took the problem formulation of non-linear interaction pretty seriously — but it then got translated into a logit model with an interaction term, which is statistically and computationally convenient. The problem is that interaction depends on the scale. Here the scale is the logit of the penetrance, where penetrance means the probability of getting a particular phenotype given gene A and gene B. It depends on the scale: if an effect is multiplicative, on the log scale it becomes additive (a one-line derivation appears below). There is even a mathematical theorem saying that essentially any function, on the right scale, can become additive. So the scale question was bypassed — people just took the logit — even though there is evidence, after multiple decades, that the logit might not be the right thing.

Moreover, when we have something like ten million variants, as in UK Biobank data, the computation becomes intractable: the number of polynomial interaction terms explodes quickly. You have to cut through this combinatorial explosion, both for computational reasons and for statistical reasons, because often genes are not important marginally, alone, but in combination — they interact.

We decided to prove out the methodology before going to cardiovascular disease, because finding epistasis for any human trait has been very challenging. We had access to the UK Biobank data, so we went for a relatively easy human trait: red hair, which is self-reported. There are about 500 thousand individuals; roughly 15 thousand of them self-reported red hair, and we matched them with a random sample of the same size, giving us about 30 thousand in total. People think red hair is genetic and believe it is controlled by epistasis, and it is common enough that we had no problem getting data. So we wanted to develop something flexible — non-linear in nature — that uses a scale that makes more sense to us, and that can detect interactions of order higher than two.
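To see the scale-dependence point in one line (a standard identity, not from the talk): a multiplicative penetrance model has no interaction on the log scale, since

\[
p(Y=1 \mid A, B) \;=\; f(A)\,g(B)
\;\Longrightarrow\;
\log p(Y=1 \mid A, B) \;=\; \log f(A) + \log g(B),
\]

so whether an "interaction term" is needed depends entirely on the chosen scale.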
The data came with about ten million SNPs — variants — and there is a very common data-reduction pipeline, trained on expression data from different tissues, called PrediXcan, which imputes gene expression from the variants. That is what we used for the cardiovascular disease study too; there, frankly, it is not ideal, because for some tissues the imputation does not really carry the right signal, but for pigmentation it is OK. So that was the first step: impute gene expression as dimensionality reduction. Then we apply our method — iterative random forests with random intersection trees — to find genes that fall on the same decision paths, with an efficient way of searching; those give the selected candidate interactions, and a model selection as well. We then map the selected genes back to the variants, reducing the dimensionality of the SNPs, and run the same procedure there. So there are two analyses; I will mainly concentrate on the gene-expression pipeline.

For the positive-control red hair phenotype from UK Biobank we ended up with the 30 thousand subjects I mentioned, balanced: 15 thousand cases with red hair and 15 thousand with other self-reported hair colours. The pipeline uses iterative random forests (iRF), which select stable, predictive interactions of order two or higher. Each interaction gets a stability score, that is how we rank them, and we cut at 0.5. Iterative random forests can themselves be viewed as a special case of PCS. The iRF paper was published in 2018; the two bright young people who did the work are Sumanta Basu, now a professor at Cornell, and Karl Kumbier, now at UCSF, and the two senior leaders were Ben Brown and myself.

The project started around 2015 or 2016. Ben had been working on genomics for a long time; I had not. He had been using random forests and really liked them because they gave good predictive results, but they were very hard to interpret: if he changed the data a bit — the stability issue again — the genes on the same paths would change a lot. Using genes on the same path as candidate interactions is not our idea; many people have done that. When two genes fall on the same path of a decision tree in a random forest, there is a leap in saying they might interact biologically: on the same path it is a mathematical, regression-style interaction, but we believe they might interact biologically.
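To make the "genes on the same decision path" idea concrete, here is a minimal sketch — not from the talk — that fits a single tree and collects the feature sets along each root-to-leaf path; sets containing both interacting features are the candidate interactions. It relies on scikit-learn's tree internals (children_left == -1 marks a leaf), and the data are synthetic.

```python
# Collect the features split on along each root-to-leaf path of one tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 10))
y = ((X[:, 2] > 0) & (X[:, 5] > 0)).astype(int)       # features 2 and 5 interact

t = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y).tree_

def paths(node=0, seen=()):
    """Yield the set of features split on along each root-to-leaf path."""
    if t.children_left[node] == -1:                   # leaf node
        yield frozenset(seen)
        return
    seen = seen + (int(t.feature[node]),)
    yield from paths(t.children_left[node], seen)
    yield from paths(t.children_right[node], seen)

print(set(paths()))   # paths containing {2, 5} are interaction candidates
```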
So what we did was add stability to random forests, and following these protocols actually improves predictive accuracy. We use the importance index to do weighted feature sampling for the next random forest, we use random intersection trees — from market-basket analysis — to find the shared paths, and then we have an outer loop of bagging to assess stability. You start with uniform feature weights and then reweight by importance; this iteration usually takes about three to five rounds in my experience. Some genes get emphasised and others get shrunk, but we never delete any genes — of course if a gene's weight is very small you will rarely see it, but this way a gene that looks unimportant at first can still enter, because nothing is permanently discarded. The thing is, the importance measure — which we are still trying to understand — is not just main effects; there is more to it, because there is a lot of correlation, so reweighting by importance is not the same as fitting an additive model and looking at main effects.

Once we have all these trees, we reuse the same random forest algorithm, just changing the weights, and then we have to collect the shared paths. For that we use a generalisation of random intersection trees, which came from market-basket analysis: you have 0-1 sets — say items purchased by men and by women — and you want to find the items shared across many baskets. We turn our data into the same 0-1 form by taking each decision path and turning it into a binary vector: if a gene is split on in that path it gets a one, otherwise a zero. So it becomes the same problem as market-basket analysis, and there is a randomised algorithm to collect the shared patterns, with results showing that if the shared patterns are sparse, the algorithm has good properties for recovering them. Then, in the outer loop, each collected interaction gets a stability score by bootstrap. So: weighted random forests for prediction, random intersection trees for the computation, and bootstrap-based stability to get the selected interactions.

In the original paper we used Drosophila data; there we discovered about 20 pairwise interactions, and 80 percent of them had already been verified — by physical experiments going back to the 1990s — to actually interact biologically.
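Here is a much-simplified sketch of the iterative reweighting loop described above — not the authors' iRF implementation (an R package and a Spark version exist for that): fit many trees on weight-sampled feature subsets, accumulate importances, and reweight. Random intersection trees and the outer bootstrap loop are omitted, and all sizes, plus the 0.9/0.1 smoothing that keeps every feature in play, are illustrative assumptions.

```python
# iRF-style iterative reweighting of feature sampling (heavily simplified).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n, d = 500, 30
X = rng.normal(size=(n, d))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)              # a pure interaction signal

w = np.ones(d) / d                                     # start with uniform feature weights
for it in range(4):                                    # 3-5 iterations, as in the talk
    imp = np.zeros(d)
    for _ in range(100):                               # a small forest
        feats = rng.choice(d, size=8, replace=False, p=w)   # weight-sampled feature subset
        boot = rng.choice(n, size=n, replace=True)          # bootstrap the rows
        tree = DecisionTreeClassifier(max_depth=4, random_state=0)
        tree.fit(X[np.ix_(boot, feats)], y[boot])
        imp[feats] += tree.feature_importances_
    w = 0.9 * imp / imp.sum() + 0.1 / d                # reweight, but keep every feature in play

print(np.argsort(w)[-2:])                              # expect features 0 and 1 to dominate
```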
With the red hair data, the results look like this: ROC curves, four of them. The best is the green one, which is the iterative random forest at the SNP level, after the selection; next is the iterative random forest at the gene level, then lasso at the gene level, and ranger, a base implementation of random forests. You can see that here the stability version actually outperforms the plain random forest. For the Drosophila data we had worked on, it did not improve prediction — performance was similar. People say random forests don't overfit; that is not quite so. When there is not enough regularisation, stability adds more regularisation and can help prediction. This is on held-out test data. Lasso, by the way, had been used for red hair in previous work, and we do quite a bit better.

Then we look at the genes discovered by iRF. You can look at the GO terms, which James did, and they seem to make sense — this is all heuristic. James also looked at protein-protein interaction enrichment, and again the gene pairs seem to make sense. But this is all heuristic, just sanity checks. So this stage is a screening.

Now back to epistasis. I have these candidate interactions — the output is "A and B interact", or "A, B and C interact" — and we want to decide how to test these epistatic interactions. In terms of the interaction form, you can go multiplicative; or you can fit CART, a decision tree, directly, which is an interpretable form of interaction; or you could use random forests, which are not as interpretable but give you a non-linear surface. In terms of scale, you can use the logit, as Fisher did, or you can use penetrance, which is pretty direct. We felt CART is interpretable — consistent with our previous work interpreting interactions in biology — and that penetrance makes more sense, so we went for penetrance with CART; there is no good biological argument for the logit, which seems to be mostly tradition. So what is penetrance? Penetrance is the probability of red hair given genes A and B — just the conditional probability in a binary classification, but in genetics it is called penetrance. Fisher would take the logit of the penetrance and write it as main effects plus an interaction effect. We skip the logit and directly model the probability of red hair given genes A and B with a CART, a classification tree.
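A minimal sketch — not from the talk — contrasting the two modelling choices just described: a Fisher-style logistic regression with an explicit multiplicative interaction term on the logit scale, versus a CART fit directly on the penetrance (probability) scale, whose leaf frequencies estimate P(red hair | A, B). The data are synthetic, and threshold-style epistasis is assumed for illustration.

```python
# Logit scale with an interaction term vs. CART on the penetrance scale.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
n = 2000
A, B = rng.normal(size=n), rng.normal(size=n)
p_true = np.where((A > 0) & (B > 0), 0.8, 0.1)       # threshold-like epistasis
y = rng.binomial(1, p_true)

# Fisher-style: logit scale with an explicit interaction column A*B.
X_logit = np.column_stack([A, B, A * B])
logit = LogisticRegression().fit(X_logit, y)

# Penetrance scale: CART models P(Y=1 | A, B) directly at its leaves.
cart = DecisionTreeClassifier(max_depth=2).fit(np.column_stack([A, B]), y)

print("logit coefficients:", logit.coef_.round(2))
print("CART leaf probabilities:", cart.predict_proba(np.column_stack([A, B]))[:3, 1])
```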
Thanks for the question. To give you some sense that the CART model is reasonable on the penetrance scale: on the right-hand side is the smoothed proportion of red hair for two genes — ASIP, a well-known red-hair gene, and another gene. You see stripes, because this is expression imputed from discrete SNP data. On the left-hand side are four fitted surfaces. The top row is logistic: first additive, then with a multiplicative interaction term; since the imputed gene expression is continuous, both are smooth, and you can see the two look very similar. For CART, we fit a CART to gene A and a CART to gene B and add them — the additive null — and then fit an unconstrained CART jointly, which allows interaction; with our PCS p-value, this pair turned out to be significant, and I will explain how we compute that. You can see that the stripes are captured much better by the CART models, and CART has the interpretability.

Here, by contrast, is an example pair that turned out not to be significant. Fitting the model on the training data: if one gene's expression is less than about -0.5, you predict no red hair, and so on — the probabilities sit at the leaf nodes. The null, additive model is built from two shallow trees, while the epistatic model fits the interaction jointly, so you can see the difference: the first split is on one gene, and then, depending on how strong its effect is, you either split on the other gene or split on the first gene again. That is a non-linear interaction.

So now the inference — I think this is where your earlier question gets an answer. I have all these interactions found through iterative random forests; let me decode what is plotted here. Concentrate on the first two bars, orange and black, for a given interaction. This is the prediction error on the test set: orange is the null, additive model; black is the epistatic model; and the vertical black line is the prediction error if you use all the genes found by iRF, not just the two, with a random forest for the prediction. Of course with only two genes you do worse than with all of them. The PCS framework says we use this as a screening.
Unless the black bar is shorter than the orange bar, I am not going to compute any p-value; if the orange bar is the shorter one, we just set the p-value to 1 and we are done — no further evidence-seeking, because the null, additive model suffices. The epistatic model has to do better: its prediction error has to be shorter, otherwise we stop. That is why you see no further bars for those pairs. When it is the other way around — the orange bar longer than the black bar, which says the epistatic model is better — we go on to the calculation of the PCS p-value, and the legend on the right gives you that value. So: all the interactions are first screened by prediction, and for the ones that pass the prediction screening we go on to calculate the PCS p-value.

At a high level, as I said: if the epistatic model gives the worse prediction, the p-value is set to 1. Otherwise the p-value is calculated from a refined comparison between the two models that takes the test-set variability into account: we resample from the test data. As a result we get more reasonable p-values. If instead you just fit the multiplicative model and use a chi-squared test — itself an approximation — the p-values come out very, very small, sometimes 10 to the minus 17. I am illustrating with two genes, but it works for higher-order interactions too.

Here is some detail. After passing the screening, we have the models already fitted on the training data. From the null, additive model you can predict probabilities at the test points. Then you sample responses from the null distribution, using those probabilities on the genes in the test set — that is the null sample. For the alternative, we also draw a bootstrap sample from the test data. We then compute a test statistic — a binomial likelihood ratio of the alternative over the null — so for each bootstrap round we have a null sample and a bootstrapped data sample, and we compare which shows stronger evidence; the p-value is the average over the bootstrap samples. You can use a normal approximation when the number of bootstrap samples would otherwise have to be very big; otherwise it gets very computational.
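Here is a minimal sketch of the screening-plus-p-value logic just described — a simplified reading of the procedure, not the epiTree code: after the prediction screening, bootstrap the test set, draw responses from the fitted null model's probabilities, and compare a binomial likelihood-ratio statistic on the null draws against the one on the observed data. The helper names and B = 2000 are assumptions.

```python
# A simplified PCS p-value after prediction screening.
import numpy as np

def binom_loglik(y, p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def pcs_pvalue(y_test, p_null, p_alt, B=2000, rng=None):
    """p_null, p_alt: predicted P(Y=1) on the test set, from models fit on training data."""
    rng = rng or np.random.default_rng(0)
    n = len(y_test)
    exceed = 0
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)      # bootstrap the test set
        lr_obs = binom_loglik(y_test[idx], p_alt[idx]) - binom_loglik(y_test[idx], p_null[idx])
        y_null = rng.binomial(1, p_null[idx])          # responses drawn under the null model
        lr_null = binom_loglik(y_null, p_alt[idx]) - binom_loglik(y_null, p_null[idx])
        exceed += lr_null >= lr_obs                    # null evidence at least as strong?
    return exceed / B                                  # average over bootstrap rounds

# Screening rule from the talk: if the null (additive) model predicts better on
# the held-out test set, report p = 1 and stop; only otherwise call pcs_pvalue.
rng = np.random.default_rng(6)
y_obs = rng.binomial(1, 0.5, size=400)                 # data generated under the null
print(pcs_pvalue(y_obs, p_null=np.full(400, 0.5), p_alt=np.full(400, 0.55), rng=rng))
```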
Some comments are in order. The PCS p-value is conservative; in the red hair example it gives larger p-values than the classical alternative. For some stylised models we can do calculations — still ongoing — suggesting that the PCS null distribution is a fattened version of the sharp null distribution you would get by committing to a precise model, whose p-value comes out smaller than the PCS one. So this is a kind of robustified traditional hypothesis testing. What we did not deal with here is data-cleaning perturbation — that could be brought in as another form of perturbation — so we are only addressing perturbations at the modelling stage.

I am mindful of the time, so let me just say that the interactions we find at the biological level identify the two important red-hair genes: ASIP, which is on chromosome 20, and MC1R, on chromosome 16. We also find a lot of genes near those two, because neighbouring genes sit next to them and have similar functions.

And we can visualise this with superheat, an R package by my student Rebecca Barter. On the right is a superheat plot: the rows are features based on the different interactions, and the columns are different individuals. You almost see three groups. For the red-haired individuals, the first block lights up across the different gene interactions; the middle group lights up in a different pattern; and the last group shows yet other patterns. It is a way of looking at who the red-haired people are and who are not — and some people without red hair actually show a somewhat similar signature.

To summarise the case study: we found third-order interactions, which had not been reported before, and we have some new discoveries, though these need to be validated — they are more suggestive. And we recovered the known genes. So we have this whole pipeline: imputation; iRF for selecting non-linear interaction candidates; and the PCS p-value to make a decision on each of the discovered interacting gene sets; and you can run the analogous pipeline for the SNPs.

To summarise overall: we propose this veridical data science framework, through the three principles plus documentation. My group has now applied it in eight different studies, so we are fairly confident in it as a conceptual framework.
468 00:51:46,890 --> 00:51:53,930 There's still a lot of detail to be worked out. And three or four of the studies were actually the motivating examples I raised. 469 00:51:53,930 --> 00:52:00,540 And one of them was a medical case. 470 00:52:00,540 --> 00:52:06,840 The most recent work we did is with an ER doctor from UCSF. 471 00:52:06,840 --> 00:52:13,350 It really uses PCS to stress-test an already existing clinical decision rule, to evaluate it. 472 00:52:13,350 --> 00:52:19,820 And actually the decision rule passed. The decision was how to treat a kid 473 00:52:19,820 --> 00:52:30,150 in the paediatric emergency room (ER) with an abdominal trauma injury: whether to send this kid to a CT scan or not. 474 00:52:30,150 --> 00:52:34,230 So we basically used PCS as a stress test to evaluate this. 475 00:52:34,230 --> 00:52:36,540 The rule passed pretty well, also on an external study. 476 00:52:36,540 --> 00:52:45,150 So it's useful both for the development of decision rules and for evaluating existing ones, and domain knowledge is extremely important. 477 00:52:45,150 --> 00:52:50,700 You can see my judgement calls about why the CART decision tree, as in the genomics case, makes sense. 478 00:52:50,700 --> 00:52:56,700 And we hope to generate hypotheses for external validation. 479 00:52:56,700 --> 00:53:01,350 So back to where I started: my project about heart health. 480 00:53:01,350 --> 00:53:08,450 This is a lot more challenging, right? We actually need new methods to really deal with 481 00:53:08,450 --> 00:53:17,070 the data we have. We have MRI, and we extracted different one-dimensional features, and we saw 482 00:53:17,070 --> 00:53:21,700 no predictability. And this is a rare disease: 483 00:53:21,700 --> 00:53:30,010 only around 500 people have this disease, a particular one called HCM, hypertrophic cardiomyopathy. 484 00:53:30,010 --> 00:53:37,640 It's a thickening of your heart, of the left ventricle wall. 485 00:53:37,640 --> 00:53:47,410 We don't see predictability and stability of those variables. So we need to get better data, or better phenotypes, and also the HCM diagnoses: 486 00:53:47,410 --> 00:53:54,620 we have the older people from the UK Biobank, and we need younger patient data and the genetics to really show a signal. 487 00:53:54,620 --> 00:54:03,690 So right now we really need some help; we are kind of in a dark age for this project, 488 00:54:03,690 --> 00:54:13,290 which is a lot more difficult. It's well known that human disease is really difficult, but we hope we can access Stanford's data; 489 00:54:13,290 --> 00:54:16,080 there they actually started collecting their own data. 490 00:54:16,080 --> 00:54:23,460 So eventually we may have some good data, but right now, in the UK Biobank, the signal is very, very weak or it's not there. 491 00:54:23,460 --> 00:54:28,440 Again, I want my group to really concentrate on problem solving, 492 00:54:28,440 --> 00:54:36,190 but we recently returned to theory, deep learning theory, which, you know, has been picked up by a number of other people working on foundations. 493 00:54:36,190 --> 00:54:42,630 So we have a deep learning theory group. I also have a data science grant funding this many people. 494 00:54:42,630 --> 00:54:53,950 And we just finished the first paper looking at the model selection property of random forests, by Merle and also my students.
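For the stress-testing idea above, here is a minimal sketch under stated assumptions: take a fixed, already published decision rule, re-evaluate its sensitivity and specificity across bootstrap perturbations of the evaluation data, and report the spread. The rule and data below are toy stand-ins, not the actual UCSF paediatric trauma rule or data.

```python
# Hedged sketch of "stress-testing" a fixed clinical decision rule:
# re-evaluate its operating characteristics across data perturbations.
import numpy as np

rng = np.random.default_rng(2)

def stress_test(rule, X, y, B=500):
    sens, spec = [], []
    n = len(y)
    for _ in range(B):
        idx = rng.integers(0, n, n)              # one bootstrap perturbation
        pred, yy = rule(X[idx]), y[idx]
        if (yy == 1).any() and (yy == 0).any():  # need both classes present
            sens.append(np.mean(pred[yy == 1] == 1))
            spec.append(np.mean(pred[yy == 0] == 0))
    # 5th-95th percentile bands: how stable is the rule's performance?
    return np.percentile(sens, [5, 95]), np.percentile(spec, [5, 95])

# Toy usage with simulated data and a hypothetical threshold rule.
X = rng.normal(size=(300, 3))
y = rng.binomial(1, 0.3, 300)
rule = lambda X: (X[:, 0] > 0.0).astype(int)
print(stress_test(rule, X, y))
```

A rule whose sensitivity band stays tight and high across perturbations is, in this PCS sense, the kind that "passes pretty well".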
495 00:54:53,950 --> 00:54:58,830 We're also finishing a book with my former student and postdoc Rebecca Barter, 496 00:54:58,830 --> 00:55:05,490 using the PCS framework and dealing with the entire data science lifecycle. 497 00:55:05,490 --> 00:55:16,930 It tries to help make the bridge from reality to symbols, and we use maths, but we do not intend it to be a maths book. 498 00:55:16,930 --> 00:55:22,690 So thank you. I hope some of the ideas will be useful for your projects, and all the papers and code for the last two are available, 499 00:55:22,690 --> 00:55:28,650 both for iRF, where we also have a Spark parallel computation version. 500 00:55:28,650 --> 00:55:33,340 There is documented code, and we have all the code available. 501 00:55:33,340 --> 00:55:37,750 Thank you. Thank you. 502 00:55:37,750 --> 00:55:42,480 [INAUDIBLE]. 503 00:55:42,480 --> 00:55:48,740 I think we have a few more minutes for questions, and I'll start off with Judith. 504 00:55:48,740 --> 00:55:56,280 Hi, a very nice talk. Let me ask: do you 505 00:55:56,280 --> 00:56:03,460 have a framework so that you can prove something theoretically when you're using your sort of workflow, 506 00:56:03,460 --> 00:56:08,590 in a sense, some theoretical type of assurance? 507 00:56:08,590 --> 00:56:19,430 Yeah, we have a little bit of derivation, plus simulations, for a very simple interaction model, in that epiTree model. 508 00:56:19,430 --> 00:56:22,680 But to do that, you have to assume a stochastic model, and then 509 00:56:22,680 --> 00:56:26,900 we can probably derive something, leaving out the data cleaning. 510 00:56:26,900 --> 00:56:34,000 You can do analysis on data cleaning too, but for the model part, if you do idealised cases, that's what we hope to do. 511 00:56:34,000 --> 00:56:39,370 So to prove anything, you would need to refer to some kind of probabilistic model of the data? 512 00:56:39,370 --> 00:56:47,050 Yeah, yeah. So we can probably do this. We're doing simulations under different mis-specified models to show that our 513 00:56:47,050 --> 00:56:55,890 PCS p-value is more robust. I think that's possible. And we did quite a bit of simulation in a very simple case of simple linear regression, 514 00:56:55,890 --> 00:57:04,740 with only one covariate. And then we showed from simulation that when your null model is wrong, 515 00:57:04,740 --> 00:57:08,670 with the variance wrong, not homoscedastic, 516 00:57:08,670 --> 00:57:14,380 then we are more robust: we give more reliable results than the classical way. 517 00:57:14,380 --> 00:57:20,930 Yeah. So that's actually ongoing; we do want to write that paper, to do 518 00:57:20,930 --> 00:57:26,760 the analysis following such models and get some, you know, theoretical insight into what's going on. 519 00:57:26,760 --> 00:57:30,510 So I've been taking the approach that I want to see things be useful first, 520 00:57:30,510 --> 00:57:34,890 before I add the theory. So I've convinced myself that it's useful, 521 00:57:34,890 --> 00:57:41,960 and now I'm getting ready to set up different stylised models and do some theoretical work. 522 00:57:41,960 --> 00:57:47,070 Yeah. So hopefully we'll finish this year. 523 00:57:47,070 --> 00:57:55,930 Yeah. Okay. The next question will be from [INAUDIBLE]. 524 00:57:55,930 --> 00:58:00,900 Yeah, hi, thanks for the talk. Is there any connexion between
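A rough sketch of the kind of simulation just described: one-covariate linear regression with heteroscedastic noise, where the classical slope standard error (which assumes constant variance) is compared to a pairs-bootstrap standard error. The exact setup here is my illustration, not the group's.

```python
# Illustrative simulation: a mis-specified (heteroscedastic) simple
# linear regression, classical vs. pairs-bootstrap slope uncertainty.
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2 + 2.0 * x, n)  # variance grows with x

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
# Classical SE assumes constant error variance (wrong here).
se_classical = np.sqrt(resid @ resid / (n - 2) * XtX_inv[1, 1])

boot = []
for _ in range(2000):                                 # pairs bootstrap
    idx = rng.integers(0, n, n)
    boot.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][1])
se_boot = np.std(boot)

print(f"classical SE (homoscedastic assumption): {se_classical:.3f}")
print(f"pairs-bootstrap SE (robust):             {se_boot:.3f}")
```

Under this mis-specification the bootstrap standard error is the more honest of the two, which is the sense of "more reliable than the classical way" above.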
525 00:58:00,900 --> 00:58:08,730 the methods you're using and the kind of things that are in the paper by Dukes and Vansteelandt, which kind of attempts 526 00:58:08,730 --> 00:58:15,240 to parameterise the model in such a way that even if your model is not quite correctly specified, 527 00:58:15,240 --> 00:58:23,030 it will still give useful answers, and where you can kind of build in machine learning 528 00:58:23,030 --> 00:58:30,350 on top, some kind of machine learning algorithm for learning the nuisance parameters. Are you familiar with it? 529 00:58:30,350 --> 00:58:33,260 I don't know that paper, but from your description 530 00:58:33,260 --> 00:58:40,330 I would think that, at a high level, they are probably trying to do the same thing, to get robustness. 531 00:58:40,330 --> 00:58:46,240 In the detailed implementation, they are probably more prescriptive than we are right now. 532 00:58:46,240 --> 00:58:52,750 We are pretty agnostic and just let people choose their models. For example, in our framework, 533 00:58:52,750 --> 00:58:56,830 I think Bayesian models and non-Bayesian models can be integrated together, 534 00:58:56,830 --> 00:59:01,300 if you have the same target of interest that's comparable across them. 535 00:59:01,300 --> 00:59:04,810 I think what they do is probably more like what people do when you compute 536 00:59:04,810 --> 00:59:14,380 confidence intervals in, say, regression: you have a model, but you don't use the Fisher information, you use a sandwich covariance matrix. 537 00:59:14,380 --> 00:59:20,080 My guess is they're probably working in the spirit of that work. 538 00:59:20,080 --> 00:59:24,500 So, I mean, obviously it's focussed on robustness. 539 00:59:24,500 --> 00:59:29,700 But the point is that they didn't just fit the model. 540 00:59:29,700 --> 00:59:35,730 They fit it in such a way that the estimating equations match the likelihood, and 541 00:59:35,730 --> 00:59:40,080 the estimating equation is chosen in such a way that even if the model is not correctly specified, 542 00:59:40,080 --> 00:59:43,350 you still get a result that you can interpret. 543 00:59:43,350 --> 00:59:50,160 And that allows them to sort of play around with the nuisance model, so they can have something really, really complicated for that. 544 00:59:50,160 --> 01:00:03,120 And even if it's not doing exactly what they want, they know that the sort of metric they're interested in will still be interpretable in a nice way. 545 01:00:03,120 --> 01:00:09,510 So, yeah, I'm sure there's a high-level connexion, but specifically, I think we're not very prescriptive yet. 546 01:00:09,510 --> 01:00:13,750 We're leaving a lot of room for the user to make choices. 547 01:00:13,750 --> 01:00:20,850 So I would say that's a particular way they made a choice to bring in stability. 548 01:00:20,850 --> 01:00:26,120 And I think, if I can think of what they do as a 549 01:00:26,120 --> 01:00:31,570 model perturbation they introduce, I probably can put it under this broad framework. 550 01:00:31,570 --> 01:00:33,220 But for them, it's pretty specific. 551 01:00:33,220 --> 01:00:39,280 It's kind of related to what they did: there's something called double robustness, or double machine learning, in causal inference, 552 01:00:39,280 --> 01:00:45,430 where people fit two different models and then have a two-step method which will adapt to either.
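For readers unfamiliar with the sandwich covariance just mentioned, here is a small self-contained illustration for least squares: the model-based variance sigma^2 (X'X)^{-1} versus the robust estimate (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}. The data are simulated with non-constant variance so that the two visibly disagree; everything here is illustrative.

```python
# Sandwich (heteroscedasticity-robust) covariance vs. the model-based
# covariance for an OLS slope, on purposely heteroscedastic toy data.
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(0, 1 + np.abs(x), n)   # variance depends on x

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

V_classical = (e @ e / (n - 2)) * XtX_inv       # model-based (Fisher-type)
meat = X.T @ (X * (e**2)[:, None])              # X' diag(e^2) X
V_sandwich = XtX_inv @ meat @ XtX_inv           # robust "sandwich"

print("slope SE, model-based:", np.sqrt(V_classical[1, 1]))
print("slope SE, sandwich:   ", np.sqrt(V_sandwich[1, 1]))
```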
553 01:00:45,430 --> 01:00:49,930 But people have also shown that if neither is true, then the thing can be worse. 554 01:00:49,930 --> 01:00:54,860 So I think it definitely sounds 555 01:00:54,860 --> 01:01:01,160 related, but we are pretty conceptual right now, and we don't really prescribe exactly. 556 01:01:01,160 --> 01:01:06,950 We made different choices in the different contexts of the problems, mostly related to random forests. 557 01:01:06,950 --> 01:01:12,200 And there seem to be multiple methods. But I think the paper you mention is probably very interesting; it's worth 558 01:01:12,200 --> 01:01:18,390 taking a look. You can reach me by email at my Berkeley address. 559 01:01:18,390 --> 01:01:28,210 Thanks. We're running out of time, but I wonder whether there might be some more questions. 560 01:01:28,210 --> 01:01:38,970 Yeah, feel free to send me an e-mail. I'd be happy to continue the discussion. 561 01:01:38,970 --> 01:01:48,230 I'm waiting to see whether there are any more questions. 562 01:01:48,230 --> 01:01:54,630 Cool. I guess if there are no more questions, we [INAUDIBLE]. 563 01:01:54,630 --> 01:01:58,321 We have your e-mail, which we've collected. [INAUDIBLE].
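As a footnote to the double-robustness exchange above, here is a minimal sketch of an AIPW (augmented inverse-probability weighting) estimator, the canonical two-model, two-step method mentioned: it stays consistent if either the outcome model or the propensity model is correct, and, as noted, can behave badly when neither is. Everything below, data and model choices alike, is illustrative.

```python
# Toy AIPW estimator of an average treatment effect: combines an outcome
# model and a propensity model; correct if either one is right.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=(n, 1))
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))   # confounded treatment
y = 1.0 * t + x[:, 0] + rng.normal(size=n)        # true effect = 1

prop = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]  # model 1
m1 = LinearRegression().fit(x[t == 1], y[t == 1]).predict(x)  # model 2a
m0 = LinearRegression().fit(x[t == 0], y[t == 0]).predict(x)  # model 2b

# AIPW: outcome-model contrast plus inverse-probability-weighted residuals.
aipw = np.mean(m1 - m0
               + t * (y - m1) / prop
               - (1 - t) * (y - m0) / (1 - prop))
print("AIPW estimate of the treatment effect:", round(aipw, 3))
```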