...for our Corcoran Memorial Lecture. I wanted to start by just giving an explanation of what this series of memorial lectures is. This is this year's Corcoran Memorial Lecture. It's named in memory of Stephen Corcoran, who was a graduate student in the Department of Statistics until his death in nineteen ninety-six. He had been an undergraduate at Oxford and got first class honours in mathematics in nineteen ninety-one. He then studied for a diploma in mathematical statistics at Cambridge and returned to study for his DPhil in statistics at Oxford. Every other year we have a Corcoran Memorial Prize, but without fail, annually, we have a Corcoran Memorial Lecture. This year is not a prize year, so we had free rein to choose the topic and our speaker, and we asked Kerrie Mengersen to come and speak to us. We're very pleased that she agreed to do so.

She is a distinguished professor of statistics at Queensland University of Technology, which is in Brisbane, Australia. She is the deputy director of the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers. She's a fellow of the Australian Academy of Science and the Australian Academy of Social Sciences, and she's also a member of various national and international statistical societies. She has wide-ranging research interests in both mathematical statistics and its applications in areas including health, environment and industry; in particular, she focuses on Bayesian methods. She has very kindly agreed to present to us today, and we have scheduled the lecture this morning because she is speaking to us from Australia in these COVID-restricted times. So I'm just about to hand over to her. We will have a question and answer session at the end, so if you have questions, please type them into the chat and I will read them out to her, or a selection of them, because we probably won't be able to get through all of them at the end. Otherwise, thank you very much for joining us, and I will hand over to Kerrie for the lecture.

Thank you very much, Christl, and thank you very much for the invitation to present this lecture, and thank you to the people who have given up some morning time to attend. As you heard, I'm in Australia; it would be great to be there, but in fact this is the way it is at the moment. So I'd like to just share my screen. Can you see my screen? Yes. That's great. Okay. Thank you.
So I'd like to talk to you about not aggregating data, and I'm going to do this by first presenting a couple of case studies where the imperative for this area of work has come about, and then I'd like to talk about some of the work in progress in this area. This is new work that we're doing, and if people are involved in this or have some suggestions or comments, then I'd be very excited to hear about them. It's certainly not complete at the moment.

So, the two case studies. The first one is the Australian Cancer Atlas. The Australian Cancer Atlas is an interactive online atlas that brings together cancer data, GIS location details, digital earth technology and statistical models to provide an online interactive map of about twenty cancers in Australia at the small area (SA2) level; there are about two thousand of these areas across Australia. The underpinning technology, or modelling, in this atlas is a Bayesian statistical model.

So we have here y_i, the number of cancer cases in a small area, and E_i, the expected number of cancer cases in that area, age-sex matched. This is, of course, a Poisson model, so we model the number of cancer cases as Poisson, with mu_i being the relative risk of that kind of cancer in that particular area. The relative risk mu_i is then modelled, on the log scale, in terms of some potential covariates x_i and also a spatial term S_i. This spatial term is going to incorporate the information of the neighbours of the area, given that we are in geographical space, and we have a number of options for that spatial term.

For example, the Besag-York-Mollié (BYM) model, proposed by Besag and colleagues, splits the residual S_i into two parts: one is the spatial component u_i, and one is an unstructured component v_i. The u_i, the spatial component, depends on the neighbours u_~i: that random effect is normally distributed around the average of the random effects in the neighbourhood of area i.

An alternative to that is from Leroux and colleagues in two thousand. Here we model S_i as u_i, and u_i has two components; it's almost like a mixture. It has a component that relates to the spatial neighbourhood and also a component that is the overall variance. This gives us a mixture model governed by a parameter rho, which says how much weight we want to put on the neighbourhood,
the spatial neighbourhood, and how much we want to regress towards the overall mean. Now, this Leroux model then has only this one parameter, rho, and it turns out to be quite a good model for the kind of spatial analysis that we want to do in the Cancer Atlas.

So the Cancer Atlas, as I said, is this online product, and I'll just show you some of its features. You can pull down any of the cancers you like and get information for those cancers, in particular visualisations of the probability of being above or below the national average. We're talking here about uncertainty in the estimates, and we're also talking about probabilities. We're also able to compare areas and provide the information in the form of the modelled estimates at the small area level. We're going to extend the Cancer Atlas to space-time, and already it has revealed quite a lot of information in terms of, for example, spatial inequalities.

This chart here shows us the difference between major cities, regional and remote areas, for females, for different kinds of cancers, and you can see some really startling differences in the standardised incidence ratio on the left axis there. This is all compared to the national average, which is one; there are some stark differences for people in remote areas. From this information we've also been able to provide relative survival curves and relative survival estimates at the small area level. We can then look at localised and advanced relative survival curves and the differences between areas, so remote areas and cities, and you can see here the kind of real differences that this lets us reveal. We can ask questions like how many premature deaths could be prevented if there were no spatial inequalities.

By being able to quantify these terms, this has really led to public acknowledgement of spatial differences in cancer across our country. It has also led to changes in government policy in terms of the travel subsidy for people in the bush to get treatment in the city. So we think this has been successful, and what's interesting for us is that this is a product that has, underpinning it, a spatial statistical model, a Bayesian model.

More recently, we've also been looking at spatial Bayesian empirical likelihood, so we can avoid some of the parametric assumptions in our modelling.
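For reference, the parametric spatial model just described is commonly written in the following form. This is a reconstruction from the spoken description rather than the exact specification on the slides, so the parameterisation should be read as an assumption.

```latex
% Counts y_i in area i, with expected counts E_i and relative risk mu_i
y_i \sim \mathrm{Poisson}(E_i \,\mu_i), \qquad \log \mu_i = x_i^{\top}\beta + S_i
% BYM: structured spatial component u_i plus unstructured component v_i
S_i = u_i + v_i, \qquad
u_i \mid u_{\sim i} \sim \mathrm{N}\!\left(\tfrac{1}{n_i}\textstyle\sum_{j \sim i} u_j,\ \tfrac{\sigma_u^2}{n_i}\right), \qquad
v_i \sim \mathrm{N}(0,\ \sigma_v^2)
% Leroux: a single parameter rho balances spatial and unstructured variation
S_i \mid S_{\sim i} \sim \mathrm{N}\!\left(\tfrac{\rho \sum_{j \sim i} S_j}{\rho\, n_i + 1 - \rho},\ \tfrac{\sigma^2}{\rho\, n_i + 1 - \rho}\right)
```

Here n_i is the number of neighbours of area i. With rho equal to one the Leroux prior reduces to the intrinsic CAR component of the BYM model, and with rho equal to zero it reduces to independent noise, which is the sense in which rho controls the weight on the neighbourhood versus regression towards the overall mean.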
The model here looks the same as before, but instead of assuming a normal distribution for the log relative risk, we're going to have estimating equations, and we use these estimating equations to derive our parameter estimates. We have the typical estimating equations here: we have constraints on the mean and the variance, and then we also have priors on this, and we can obtain the estimates. The new part here is including the spatial random effect in these empirical likelihood models, and the kinds of priors that we can have for our random effects: for example, an independent Gaussian prior (excuse the spelling of Gaussian), conditional autoregressive priors like the BYM and Leroux as before, and also, for example, a generalised Moran basis prior that has been used in previous work on spatial empirical likelihood, in the paper by Porter et al. There's not a lot of work on spatial empirical likelihood, and this was a contribution to that effort.

These are the simulation results when we look at the empirical likelihood approach compared with the parametric approach. You can see here a simulation study where we have high autocorrelation, low autocorrelation and also outliers, and we have that for a small number of areas and also a larger number of areas. It turns out, as we might expect, that the empirical likelihood approach really shows its colours when either we have a small number of areas, so the normality assumptions may not hold up, or we have these outliers. When we had 20 percent outliers, our Bayesian empirical likelihood approaches seemed to outperform the parametric approaches. So this has promise for our atlas.

Now, that's fine in terms of being able to model, but one of our remaining concerns is the issue that we needed to aggregate the data. Australia is made up of a number of states, each of those states holds its data, and we needed to have agreements with each of the states in order to bring those data together and then create the models. This has to happen regularly, and if we want to change anything or obtain new data, we need to go through those agreements again. There are very real issues of privacy, and there are also issues of obtaining permission for these data. The methods that we have at the moment don't address those issues. They certainly address the privacy issue in terms of providing the information at the small area level,
but we still need to aggregate the data into one database.

Our second case study is the Virtual Reef Diver project. The Great Barrier Reef is one of the world's treasures. It's two thousand three hundred kilometres long, and there is monitoring at certain sites along the reef, but there is a lot of area where there's no monitoring. So what can we do in those areas to help us develop accurate and useful predictive models across the whole of the reef? Well, there are underwater drones, there's satellite imagery, and there are numbers of divers out there collecting information. So we built an online interactive reef, called Virtual Reef Diver. We are inviting organisations and citizens to contribute and geotag their photos, their images of the reef, and we are also asking citizens to go into the images and annotate them. So if you are interested in contributing to the health of the Great Barrier Reef, please visit virtualreef.org.au.

What we have, then, are these underwater images of the reef, which you can download and also classify. We ask citizens to annotate the circles here as to whether they are coral, algae, sand and so on, and we can use that information to improve our statistical models. There's been a huge impact from this work: the project was selected as one of the three shortlisted for a Eureka Prize in Australia last year, which was very exciting for us.

So we have citizen scientists who can annotate the images, and we can use those annotations in their own right, but we can also use them to test automated image classification, so machine learning and statistical methods. You can see there the list of all kinds of statistical methods that we've looked at for classifying images in the environmental space, but also in the medical space. The last of these is matrix factorisation, and I'd like to concentrate on that for a moment.

In particular, we've been looking at kernelised sparse Bayesian matrix factorisation. The aim here is to extract low-rank or sparse structures, for example our classes in an image, and we're going to use matrix factorisation techniques to do it. We have data Y, an M by N matrix like an image, which can be the values at each of the pixels in the image, and we want to recover the low-rank matrix X, where Y = X + E. This X is then decomposed into two factors, U and V,
and these factors U and V are going to be much smaller than the original matrix X. So if we have X being M by N, then we have U being M by R and V being N by R, and when we put this together, the R here is much smaller than M or N, and this induces the low-rank structure in this approach.

We can also use our priors to induce sparsity. We have Gaussian priors for the columns of U and V, as you can see here, and one of the features of this approach is that the same parameter, gamma_j, appears in the variance for both U and V. We can then control that gamma_j to induce the sparsity: the prior on the gamma is a gamma distribution, and the hyperparameters of that prior distribution are going to be small, and this is the way that we can really induce the sparsity in U and V. We couple U with a kernel matrix K, and that gives us a latent matrix G, and this G then has a prior, as you can see here. We also have Jeffreys priors on the other parameters, and we can do the same kind of prior for V as well. Just to complete the model, we then have a residual term, and we have a prior on that as well, but we're focusing really on the U and V here.

Just to show you a graphical model of what we're doing here: y is now governed by g_m and h_n, where g_m is a combination of K_u and u_m, and h_n is a combination of K_v and v_n, and then there are the hyperparameters on those as well. So we're going to induce the sparsity through these priors. We have conditional and joint distributions: a conditional distribution for our observational model, which has the G and H terms in there, and then a joint distribution, which of course is the product of each of these individual distributions. We can then use variational Bayes to undertake the analysis.

Now, for the choice of kernel there are many different kinds of kernels, and for images we want one that incorporates similarity information between patches. So what we want to do is patch-group matrix factorisation: we want areas that are similar to be grouped together, to stay together effectively. What we do is look at the Euclidean distance between a pair of patches; this is the D,
and we're going to define the similarity between the pair of patches in terms of that distance D. So we have a K, as you can see here, being a function of that Euclidean distance between the pair of patches. We take a pixel, and it and its nearest neighbours are modelled as a column vector, and we can construct this M by N patch-group matrix Y by grouping the other patches with similar local spatial structure to the underlying one in the local window. So we're bringing together the common information in what we call a patch group, and since these patches have a lot in common, we can induce the low-rank sparsity that we wish.

The overall algorithm, then, is to cluster the patches with similar spatial structure to form a patch matrix, apply our matrix factorisation approach in succession on each of these patch matrices, and then aggregate the patches to reconstruct the whole image. So this is one approach for dealing with a large image in order to understand the low-rank structures in that image. When we look at Virtual Reef Diver, for example, we might want image restoration, or we might want classification of components like coral and so on in the image. You can see here the utility of this approach: we have an original image, we've added noise to that image, and then we can restore that image using this method.

The method allows for integration of side information through the patches, infers parameters and latent variables, including the reduced rank, using variational Bayesian inference, and gives us low-rank sparsity induced by an enforced constraint on those latent factor matrices. We can show that this improves on some of the state-of-the-art approaches for image restoration tasks.

So that's great, except we still have the issue of not wanting to aggregate the data. In this case we have two issues. One is that these images are quite large, so if we're going to create a single database to analyse the images, that's going to become very unwieldy very quickly. Even analysing a single image is difficult, because we've already seen how we break it into patches and then need to do the analysis on the patches. The second aspect is that we have citizens contributing information. Again, because of the provenance and data ownership laws in the UK, in Europe and in the US, and coming into Australia, we're going to be really restricted soon in how we can actually deal with individual people's data.
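Before moving on, here is a rough sketch of the patch-group pipeline just described: group similar patches, factorise each patch-group matrix at low rank, then aggregate back into the image. It is only a minimal illustration; a plain truncated SVD stands in for the sparse Bayesian factorisation, and the patch size, group size and rank are illustrative assumptions rather than the settings used in the actual work.

```python
import numpy as np

def extract_patches(img, size=8):
    """Collect non-overlapping size x size patches as flattened vectors."""
    h, w = img.shape
    patches, coords = [], []
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            patches.append(img[i:i + size, j:j + size].ravel())
            coords.append((i, j))
    return np.array(patches), coords

def group_similar(patches, ref, k=16):
    """Indices of the k patches closest (Euclidean distance) to a reference patch."""
    d = np.linalg.norm(patches - patches[ref], axis=1)
    return np.argsort(d)[:k]

def low_rank(Y, r=3):
    """Rank-r approximation of a patch-group matrix (SVD stand-in for the Bayesian factorisation)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def restore(img, size=8, k=16, r=3):
    patches, coords = extract_patches(img, size)
    out = np.zeros_like(img, dtype=float)
    counts = np.zeros_like(img, dtype=float)
    for ref in range(len(patches)):
        idx = group_similar(patches, ref, k)
        Xg = low_rank(patches[idx].T, r)      # columns are similar patches
        for col, p in enumerate(idx):
            i, j = coords[p]
            out[i:i + size, j:j + size] += Xg[:, col].reshape(size, size)
            counts[i:i + size, j:j + size] += 1
    # Aggregate the (possibly overlapping) patch estimates; uncovered border pixels stay zero.
    return out / np.maximum(counts, 1)

# Example: restored = restore(clean + np.random.normal(0, 0.1, clean.shape))
```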
So we want to think about ways that we can avoid creating one database to rule them all. What are our options? Well, the single-database option says that we put all the data in a single database and then we do the analysis on that.

We can also look at distributed computing. Here we have a number of databases, and we can have a centralised source that distributes the commands to the different databases and then receives the information back. This kind of distributed computing is very useful. We have horizontal scalability, which means that we can add more databases, or more nodes, to this distributed system and obtain greater capacity by doing so. Of course, that comes at a cost, because now we have computational interchange, and that overhead may be quite large in some cases. So there is a utility in having vertical scalability, in other words a larger database, and then after a point we get more utility from this horizontal scalability. There's a huge amount of work on distributed computing; this whole area exists in computer science and in statistical areas focused on distributed computing. We have greater fault tolerance: if one node falls over, we have the other nodes. There's low latency, and we can also use methods like sharding and so on, and I'll show an example of that in a minute. This distributed computing setup has motivated approaches like MapReduce, Apache Spark, Hadoop and so on.

So those are two of the approaches. Another one is decentralised computing. Whereas our distributed computing had a central node that governed the information going out and back, decentralised computing doesn't do that: the information is processed in the cloud and there is no one actor that owns that information. This has led to new work on distributed apps and has also led to edge computing; many of you will be familiar with this, but I'll just explain it quickly. Edge computing means that we don't process the information in the cloud, filtered through remote data centres; instead we do the analysis on the individual nodes, and then the results are aggregated as needed. This has then led to areas like federated learning and federated analysis.
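To make the divide-and-combine pattern behind these distributed and federated setups concrete, here is a toy sketch in which each node returns only local summaries and a coordinator combines them, so a pooled mean and variance are obtained without the raw records ever leaving the nodes. The shard sizes and helper names are purely illustrative.

```python
import numpy as np

def local_summary(x):
    """Run on each node: return only sufficient statistics, never the raw data."""
    x = np.asarray(x, dtype=float)
    return {"n": x.size, "sum": x.sum(), "sumsq": np.square(x).sum()}

def combine(summaries):
    """Run on the coordinator: combine per-node summaries into a global mean and variance."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    sumsq = sum(s["sumsq"] for s in summaries)
    mean = total / n
    variance = sumsq / n - mean ** 2
    return mean, variance

# Illustrative shards held by three separate nodes
rng = np.random.default_rng(0)
shards = [rng.normal(1.0, 2.0, size=m) for m in (100, 250, 80)]
print(combine([local_summary(s) for s in shards]))
```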
There are lots of advantages in this, because this is very much the area that we would like to get into with the two case studies that I showed. In the first one, the Australian Cancer Atlas, we have these different states that would like to retain their data, and we would like to be able to analyse the data in situ and then bring the results together. We would also like, in the Virtual Reef Diver case, to be able to analyse the images in a more decentralised manner. And when we come to integrating personal information, we would like to be able to come up with systems where we can retain the provenance and the individual privacy of citizens, for example, and subjects in a broader sense.

So if we go back to our Australian Cancer Atlas and look at a single database, we ask the question of how we retain privacy when we have the data at this small area level. One option is that we actually analyse it at that small area level, and then we're in the business of a single database. So we've been looking at, for example, summary data analysis that we might perform on the actual modelled estimates at the small area level. A natural way to think about this is through a hierarchical meta-analysis, where we have our modelled log SIR for each of the SA2s, the small areas, and we might be interested in remoteness: cities, regional and remote areas. We then have an associated standard deviation for that log SIR; that's actually part of the information provided by the underlying Bayesian spatial model that I described before.

So we can set up a hierarchical meta-analysis in which we have y_ij, the modelled log SIR for a particular area within one of the remoteness regions (cities, regional and remote areas), and it is normally distributed: we imagine that there's some measurement error, or some uncertainty, around y_ij, described by sigma-squared_ij. The mu_ij is then the true log SIR for a particular area in that remoteness region, and these have an overall value theta_j for each remoteness region; those theta_j, if we wish, can also be considered to be draws from an overall log SIR described by theta-nought. So we have our individual estimates of the log SIR for each of the areas; they are drawn from an overall mean log SIR distribution
for that particular remoteness region, and then the means of those remoteness regions come from some overall log SIR that governs the whole structure. The sigma-squared_ij can then be related to our observed standard deviations s_ij through the usual chi-squared relationship. We can also add covariates to this, as you can see in the middle panel here; this is a fairly common approach for this kind of analysis. And then we can look at some new work, where we use our spatial Bayesian empirical likelihood approach in a meta-analysis context, and that's on the right panel there.

So this is one approach that we might use for modelling the summary data, and we overcome privacy issues here. Different states could then be more encouraged to provide information at this small area level, and we could perform analyses like this. We can show that if we do this, we get quite similar results for the overall estimates for the remoteness regions, major cities, regional and remote: we get posterior means, 95 percent credible intervals and the probability that one region is greater than another region, which are the natural outputs from our Bayesian models. We can also see, if we go to the individual SA2s and look at those estimates, that we get differences between when we analyse the data at the individual level and when we analyse it at the SA2 aggregate level, as anticipated. And we can see that our empirical likelihood approach gives us a larger interval, or spread, compared to our parametric approach, as anticipated, because these data are quite moderate and don't have those kinds of outliers we were talking about before.

So centralised summary analysis has some benefits and some drawbacks. The benefits are that we can preserve privacy because we have this aggregate approach; it's a simple model, and it's computationally straightforward and fast. But the inferential capability is challenging, because we're modelling at the small area level, and therefore we have to be careful to control for biases. We also have some questions about what we can say about covariates: for example, is the spatial distribution of our response consistent with the spatial distribution patterns of our covariates? And we have to be careful about what kinds of inferences we can draw.
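Written out, the hierarchical meta-analysis described above takes roughly the following form. This is reconstructed from the spoken description, so the exact variance parameterisation and the way covariates enter should be treated as assumptions.

```latex
% Area i in remoteness region j (major city, regional, remote):
% y_{ij} is the modelled log SIR with reported standard deviation s_{ij}
y_{ij} \sim \mathrm{N}(\mu_{ij},\ \sigma_{ij}^{2}), \qquad \sigma_{ij} \approx s_{ij}
% True area-level log SIRs vary around a region mean
\mu_{ij} \sim \mathrm{N}(\theta_{j},\ \tau_{j}^{2})
% Region means vary around an overall log SIR
\theta_{j} \sim \mathrm{N}(\theta_{0},\ \tau_{0}^{2})
% Covariates, where available, can enter through the mean,
% e.g. \mu_{ij} = x_{ij}^{\top}\beta + \theta_{j} + \epsilon_{ij}
```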
And those inferences will change as we change the scale of the data. We could still scale this to different states and countries hierarchically, and you can imagine how this would go: we add a hierarchy to those models that we've already described.

We come now to distributed computing, where we're starting to think about how we don't have to create a single database. I've just put this slide up because this was a BIRS workshop in Canada. I wasn't at this workshop, unfortunately, but I just wanted to show you some of the topics that were covered, even in 2018, on developments in statistical theory and methods based on distributed computing. As I said, this is a huge area in computer science, but there's also a substantial amount of work in the statistical community. You can see here a focus on statistical methods with computational efficiency, with statistical properties, with guarantees about the estimates; these things are really important for these kinds of approaches to be robust. There are the divide-and-conquer algorithms that really underpin a lot of the work in this area, and rates of convergence for our estimators; mathematical theory for deep convolutional neural networks and distributed learning, and a lot of work on distributed neural networks; and, as I said, divide-and-conquer methods, for example for correlated data and spatial kriging, which is of interest for us in our spatial analysis. So you can see that there are these kinds of issues in distributed computing that really focus on the statistical underpinnings of these particular methods, and again, one of the main approaches here is one that many of us are very familiar with: divide-and-conquer algorithms.

Just as an example of some of this work, we have distributed Bayesian inference, for example from Wang and Dunson and colleagues in this 2015 work. Here we're looking at decomposing the global posterior: we take the global posterior distribution and break it up into a product of subset posteriors. That's the second equation you can see there, which is conditional on the subset Z_j instead of all of the data Z, so we have the prior raised to the power one over K, and we have this d_j, which is a normalising constant. What we're really doing here is that we've got these subsets, and we can compute the posterior distributions from those subsets.
We bring those together and we weight them, so it's like a weighted combination of the likelihoods. This approach then really motivates the parallel MCMC approach: we run separate Markov chains in parallel on different machines based on the local data, we transmit these local posterior draws back to a central node, and then we combine those draws to form an approximation, or a surrogate likelihood, that approximates our true global likelihood. This kind of approach, as I said, is familiar to us; there's been a huge amount of work over the last ten years in this area.

If we come back to Virtual Reef Diver, we can think about how this kind of approach might work in our case. An example I'd like to talk through with you is where we were looking at crowdsourcing via Mechanical Turk. We wanted lots of citizens to classify our images, so we used Amazon Mechanical Turk to recruit people. This is an online system where people undertake tasks, often for payment. Here we asked them to classify our images: each person was assigned up to 40 images, they were asked to classify points, and we paid them for this. We got thousands of people classifying these images, and of course we then needed to do something with the data.

One of the issues is that we need to take into account the ability of the citizens who are annotating these data, and when we're looking at their ability we have to take into account the difficulty of the images that we asked them to annotate. We think that a Rasch model is a common sort of model to use in this case. This is an item response model, and a three-parameter logistic Rasch model is useful for us. So we expanded this model to say, all right, we'll use the three-parameter logistic model, we'll add spatially dependent item difficulties to this model, and then we'll also expand it.

Can you hear me? Yes, you're back; you left us for a little while, and your screen is no longer sharing. Right, okay. Great. Thank you. Okay, so. Okay. Thank you. All right.

So we want to be able to analyse the data from the citizens, and we can use this three-parameter logistic Rasch model that has a spatial term in it; the spatial term is based on the images, so that images in a particular area are going to be similar in terms of their difficulty.
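For readers unfamiliar with item response models, the standard three-parameter logistic (3PL) form is given below. The spatial extension sketched in the comments, where item difficulties get a spatially structured prior, is my reading of the description above, not the exact specification used in the work.

```latex
% Probability that user u classifies item (image point) k correctly:
P(y_{uk} = 1) \;=\; c_k + (1 - c_k)\,\frac{1}{1 + \exp\{-a_k(\theta_u - b_k)\}}
% \theta_u : latent ability of user u
% b_k      : difficulty of item k
% a_k      : discrimination (how quickly the probability changes)
% c_k      : guessing / pseudo-guessing parameter
% Assumed spatial extension: difficulties of items from nearby images share a spatial prior,
% e.g. b \sim \mathrm{N}(0, \Sigma) with covariance decaying in the distance between image locations.
```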
This kind of model then allows us to identify the latent ability of each of the users, adjust for the difficulty of the images, allow for a discrimination parameter, which is how quickly the probabilities change, and also account for guessing, or pseudo-guessing, in this case. What we want to do here is identify the abilities of each of the citizens and adjust for that. If we do that, as I'll just show you here, we can estimate the ability, and this plot shows that we did quite well in that: we have some test data, we can assess their ability, and in a small sample of cases we looked at their estimated ability and we can pretty much get that right with this approach. This means that we can take only the people who are doing a good job, who have high ability, and we can also identify training needs or personalise the training for our citizens.

The option here, when we have all of this information, is again a divide-and-recombine approach. If we were to use a common divide-and-recombine approach, we'd split the data into multiple shards or subsets, fit the model to independent subsets on independent machines, and combine those posterior estimates into a global estimate using some sort of consensus Monte Carlo, with weighted averages of the posterior MCMC chains, as I described. An alternative is to divide the users into ten equal groups with respect to the number of classifications that they did. This works very well in our case, because what we end up with is about ten shards with about half a million classifications per shard. We then fit the shards in parallel and combine them using some stratified sampling approaches, and we found this kind of alternative to the usual divide and recombine works really well here.

The benefits and drawbacks of this distributed computing approach: we have efficient computing, we avoid repetition of tasks, and because we have this distributed setup we have a wealth of tools to draw on. We have privacy preservation in some sense, because we have the distributed analysis, and we can also increase inferential power by combining the datasets. On the other hand, we do have the potential for data leaks, we can have possibly biased and variable inferences if we use just the naive approaches, and it's usually limited to point estimates;
it's more difficult to get uncertainty estimates, although that's changing. We still have some issues around degeneracy: we have to be careful of degeneracy of the weights that we have on the partial likelihoods, or the local likelihoods, and there's a lot of work on avoiding or addressing those issues.

I'd like to come now to the approach that we're investigating currently, which is around federated learning. This is the next step out: a really decentralised approach, where we want to analyse the data without the data leaving the source. This can help us to avoid the ethical, legal, political, administrative and computational barriers to combining data from multiple sources. It increases control for the owners of the data; there is a potential to improve data quality and timeliness, because the data will be managed by the data owners; and we also have increased inferential capability by bringing together small or dispersed datasets.

The way that federated analysis works is that we have the groups: the parties, the manager and the communication and computational framework. We then have the components: we want to partition our data, we have a model, we have a privacy mechanism and the communication architecture, that is, how we are going to get all these actors to talk to each other, and then we also have the modelling approaches. This is where we can really play a role, combined with the computer scientists and the information technology people, in making this federated approach work. Modelling approaches range from neural networks to decision trees, regressions, support vector machines and so on.

You'll see the image of the federated analysis, and below that I have a figure of the types of datasets that we might have. We might have data that are split, or partitioned, horizontally, so each of the local nodes has all of the variables, the Ys and the Xs that are required for the analysis; it's just that they hold subsets of that whole dataset. The other option is that we have data that are split vertically, so one node holds one of the Xs, another node holds another X and another one holds the Y, and then we want to bring all of these data together. So there are different ways that the data are partitioned.

Now, there are a lot of approaches for federated analysis, but this has really emerged just in the last few years, and there are some commentaries that are useful for getting an overview of this.
There's the idea of having a repository with data sharing agreements; these repositories are very common now, and there's a whole lot of work on developing them, but the question is then about doing the analysis once we have these repositories. We also have Bayesian networks as a way of being able to include data from different areas, used, for example, in survival prediction. There are neural networks too, a lot of work on neural networks, and then sequential and hierarchical Bayesian models; now we're starting to get into correlated data, for time series data and also for spatial data. And there's also some very nice work on communication-efficient approaches; I'll talk about those briefly in a minute.

For the federated sharing of genomic datasets, we have an example of genetic data in this case, and then there's federated sharing of international registries here. This is really about control: each group controls its own data, they can dynamically add or drop instances from the group, and they can make agreements about the group. So there's quite a lot of autonomy in the way that these groups might engage with the overall registry.

This example is from Jordan and colleagues. This work started around two thousand and fifteen or sixteen, when the arXiv paper appeared; it was then published in 2018, and there have been extensions of the work since. Basically, the way this works is again a surrogate likelihood approach. We start off by initialising theta-nought, and then we transmit that current theta to the local machines. We now compute the local gradient of the likelihood at each of the machines m_j and transmit that to machine m_1. We calculate a global gradient on machine m_1 and form the surrogate function there, and then we do one of two things: we either update the thetas directly, or we use some sort of one-step quadratic approximation to get our new values of the thetas. Then we do the whole thing again: we transmit those thetas, update the gradients, bring those gradients back in, combine the gradients, compute the surrogate likelihood, and so on. So it's a very computationally efficient way of using a surrogate likelihood through the gradients.
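To give a feel for this kind of gradient-sharing loop, here is a toy sketch of a federated logistic regression in which each node sends back only its local gradient and the central machine aggregates the gradients and takes an update step. It is a plain gradient-descent caricature of the surrogate-likelihood scheme just described, not the actual algorithm, and the learning rate, iteration count and simulated node data are illustrative assumptions.

```python
import numpy as np

def local_gradient(theta, X, y):
    """Run on a node: gradient of the local negative log-likelihood for
    logistic regression. Only this vector, never (X, y), leaves the node."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (p - y)

def federated_fit(nodes, dim, lr=0.1, iters=200):
    """Central machine: broadcast theta, aggregate local gradients, update, repeat."""
    theta = np.zeros(dim)
    n_total = sum(len(y) for _, y in nodes)
    for _ in range(iters):
        grad = sum(local_gradient(theta, X, y) for X, y in nodes) / n_total
        theta -= lr * grad
    return theta

# Illustrative horizontally partitioned data held by three nodes
rng = np.random.default_rng(1)
true_theta = np.array([1.0, -2.0])

def make_node(n):
    X = rng.normal(size=(n, 2))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)
    return X, y

nodes = [make_node(n) for n in (300, 500, 200)]
print(federated_fit(nodes, dim=2))  # should land near true_theta
```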
There are Bayesian approaches to this too; I won't go through them, but it's really a similar sort of thing, where now we're not transmitting the posterior draws any more, which is computationally intensive; we're transmitting these gradients instead. Then, for logistic regression on images of this type by Jordan et al., we can see from the plot here that this approach really gives us a much better classification rate compared to the usual approach. I'm going to skip this example, other than to say this is a spatial modelling approach, which is very useful, and one of the keys to future approaches that actually take into account the strongly correlated, spatial data that we have here. Am I going over time, Christl? Oh, yes, if you could finish relatively soon, then we'll have time for some questions, because we've got a few in the chat.

So, coming back to our Cancer Atlas, we can combine cancer datasets across each state and across countries; that's what we would like to do. And so we partnered with a group at the Netherlands Cancer Agency. About this time last year I was in the Netherlands to talk to them about this, and they've done some excellent work in the meantime, a collaboration called EUROCARE. This is open source federated analysis, which really demonstrates that this can be achieved. If any of you are interested in this, I would recommend checking this site out. That is the site vantage6, which is open source, privacy-preserving federated learning infrastructure. They've got some great examples of combining data from the Netherlands and Taiwan, combining data from different agencies, and then combining data from the Netherlands and Italy, so different kinds of federated approaches. You can see on the right of the screen an approach that shows how these individual nodes work, if people are interested.

You can have a federated generalised linear model approach, where you calculate the MLE for beta via Fisher scoring. What you do is calculate the individual terms at each of the nodes, then aggregate them at the global node, get a beta out of that, and then the beta goes back to the individual nodes, and so on. The original version of EUROCARE was where each of the individual countries
all provided information to a single source, and then you got your estimates. The federated version is that the data stay where they are, and the approach works as I described before: you get these individual values, then you get your estimates, and you can show the estimates are essentially the same.

For Virtual Reef Diver, we've done a similar thing: federated analysis, here with federated matrix factorisation. We've taken the matrix factorisation approach that I showed before, included an adaptive learning rate, and used stochastic gradient descent; we've effectively decomposed the stochastic gradient descent approach so that we can use it in this federated setting. What we get is a dynamic method with privacy protection and learning efficiency; it significantly reduces the training time and maintains high predictive accuracy. I won't go through the details, but you're welcome to talk to me afterwards or to look at the slides. We do show that this has some strong advantages in the federated approach.

So, in summary, our data are changing: we have size, privacy, provenance, quality and diversity issues. Federated analysis and federated learning offer some solutions, but we do need statistical methods, combined with their implementation. We've talked about two case studies and then looked at a number of approaches. As I said, this is work in progress, and I'd be very happy for any comments and suggestions. I'll hand it back to you, Christl. Thank you.

Thank you very much. Unfortunately we can't give you a round of applause, but on behalf of the people here, thank you very much indeed. We have some questions already in the chat. One of them is about what assumptions you've made about surveillance when you're analysing the cancer data: are you getting a fixed proportion of cases, or does it vary across areas? And there was a bit of a follow-up from the same person: how do you account for different data collection methods in the different datasets that you're bringing together?

In the cancer data, we're fortunate in the sense that cancer is basically a notifiable disease, and we have a count; we have registries. The registries aren't perfect, but they're very good, and so we take these data as given.
455 00:52:34,270 --> 00:52:37,660 So in summary, then: our data are changing. 456 00:52:37,660 --> 00:52:46,180 We have size, privacy, provenance, quality and diversity issues. Federated analysis and federated learning offer some solutions, 457 00:52:46,180 --> 00:52:48,190 but we do need statistical methods, 458 00:52:48,190 --> 00:52:56,590 and we need to combine them with their implementation. We've talked about two case studies and then looked at a number of approaches. 459 00:52:56,590 --> 00:53:03,500 As I said, this is work in progress, I'd be very happy for any comments and suggestions, and I'll hand it back to you, 460 00:53:03,500 --> 00:53:09,670 Crystal. Thank you. Thank you very much. Unfortunately, we can't give you a round of applause, 461 00:53:09,670 --> 00:53:17,950 but on behalf of the people here, thank you very much indeed. So we have some questions already in the chat. 462 00:53:17,950 --> 00:53:27,880 One of them is about what assumptions you've made, when you're analysing cancer data, about the surveillance: 463 00:53:27,880 --> 00:53:33,940 are you getting a fixed proportion of cases, or do they vary across areas? 464 00:53:33,940 --> 00:53:37,270 And then there was a bit of a follow-up along the same lines: how do you account for 465 00:53:37,270 --> 00:53:43,080 different data collection methods in the different datasets that you're bringing together? 466 00:53:43,080 --> 00:53:53,300 For the cancer data, we're fortunate in that it's a census, basically, since cancer is a notifiable disease, 467 00:53:53,300 --> 00:53:59,600 and we have a count. We have registries. Now, the registries aren't perfect, but they're very good, 468 00:53:59,600 --> 00:54:03,740 and so we take these data as given. You're right, though, 469 00:54:03,740 --> 00:54:13,310 in terms of bringing together data sets in this federated approach, that they may be of different quality and also of different representativeness. 470 00:54:13,310 --> 00:54:25,220 For those kinds of issues, as statisticians, we have some good ways of dealing with the data: 471 00:54:25,220 --> 00:54:34,700 we have sampling approaches, and we have ways of quantifying the uncertainty we might associate with different kinds of issues around sampling, 472 00:54:34,700 --> 00:54:38,420 and so we can bring those to bear on this federated approach. 473 00:54:38,420 --> 00:54:43,010 And I think it's really important for us as statisticians to be part of 474 00:54:43,010 --> 00:54:51,860 this move to these sorts of more decentralised analysis approaches. 475 00:54:51,860 --> 00:55:02,420 Also relating to federated learning and how things are moving: would you be able to do model checking or model learning in this context? 476 00:55:02,420 --> 00:55:07,950 How does that affect these approaches? Yes, 477 00:55:07,950 --> 00:55:15,510 I think this is a good question. It's the same as the whole issue of model uncertainty in the distributed computing case, 478 00:55:15,510 --> 00:55:23,070 and there are people at Oxford who have been working in this area a lot. 479 00:55:23,070 --> 00:55:31,650 In some ways, I think the federated approach actually might help us to understand more about model robustness. 480 00:55:31,650 --> 00:55:34,830 If you think about it, we can actually think about 481 00:55:34,830 --> 00:55:41,030 what the surrogate likelihoods are, so that when we create these surrogate likelihoods from the local likelihoods, 482 00:55:41,030 --> 00:55:50,520 perhaps we can learn more about how robust our model is, taking into account the assumptions at the different nodes, 483 00:55:50,520 --> 00:55:53,400 and about what our likelihoods are telling us. 484 00:55:53,400 --> 00:56:01,530 It's almost like we have replicates of the experiment, if you like, depending on whether we have vertical or horizontal partitioning. 485 00:56:01,530 --> 00:56:09,410 But we do have an opportunity, I think, to learn more about our models in this situation, in the way we set it up.
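One simple, purely hypothetical way the node structure could support this kind of model checking: fit the model from shared summary statistics, then have each node report its average log-likelihood under that global fit, so that a node with a markedly lower value flags a possible departure from the shared-model assumptions. The Gaussian model, the summary-based fit and the diagnostic itself are assumptions for illustration only, not a method from the talk.

```python
import numpy as np
from scipy.stats import norm

def node_avg_loglik(y, mu, sigma):
    """Per-observation average log-likelihood of one node's data under the global fit."""
    return norm.logpdf(y, loc=mu, scale=sigma).mean()

# three hypothetical nodes; the third departs from the common model
rng = np.random.default_rng(2)
nodes = [rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)]

# global fit built from node-level summaries (sums and sums of squared deviations), not raw data
n = sum(len(y) for y in nodes)
mu_hat = sum(y.sum() for y in nodes) / n
sigma_hat = np.sqrt(sum(((y - mu_hat) ** 2).sum() for y in nodes) / n)

# node-level diagnostics: a markedly lower value flags a discordant node
for j, y in enumerate(nodes):
    print(f"node {j}: average log-likelihood = {node_avg_loglik(y, mu_hat, sigma_hat):.3f}")
```

In a real federated setting, each node would compute its summary and its diagnostic locally and share only those scalars, in the same spirit as the surrogate-likelihood idea above.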
486 00:56:09,410 --> 00:56:21,750 OK. And then a more political-context question: what would achieving spatial equality mean in the context of a cancer atlas? 487 00:56:21,750 --> 00:56:29,530 In other words, is it that the worst-performing areas are raised to the level of the better-performing areas? 488 00:56:29,530 --> 00:56:37,210 Yes, well, that's what you would hope for. 489 00:56:37,210 --> 00:56:44,840 That's why we were looking at, for example, how many deaths we might avoid, or other kinds of measures like that: 490 00:56:44,840 --> 00:56:51,260 how much morbidity would we avoid if we were able to resolve that spatial inequality? 491 00:56:51,260 --> 00:56:57,710 Typically, even in a hospital situation, you might say that what we would like to do is get to the 80 per cent mark, 492 00:56:57,710 --> 00:57:02,460 so we would like everybody to be at least at that 80 per cent mark. 493 00:57:02,460 --> 00:57:06,400 You know, it's a bit like saying I look forward to the day when everybody earns above the average wage, 494 00:57:06,400 --> 00:57:11,970 but that's pretty much what we're trying to aim for here. OK. 495 00:57:11,970 --> 00:57:16,470 And then there was a more technical question, which was about a specific parameter: 496 00:57:16,470 --> 00:57:27,030 what is the impact of the choice of kappa_u? For instance, why is this taken to the power of four? 497 00:57:27,030 --> 00:57:33,630 Yes, the power of four was because of the distance metric that we used, 498 00:57:33,630 --> 00:57:37,920 and I'm happy to go through the details, so we can talk about that. 499 00:57:37,920 --> 00:57:44,970 That distance metric was the way that we created some of these shards, 500 00:57:44,970 --> 00:57:51,230 and then the kappa_u was actually being developed on each of the shards. 501 00:57:51,230 --> 00:57:59,820 So the parameter that was really inducing the sparsity of those vectors was the gamma. 502 00:57:59,820 --> 00:58:10,900 That was in the variances for both the U and the V terms. 503 00:58:10,900 --> 00:58:18,630 Thanks. And then I have a question about the people that you've got evaluating your images. 504 00:58:18,630 --> 00:58:25,050 When you're recruiting non-experts to do that, is it possible to somehow... 505 00:58:25,050 --> 00:58:29,040 I mean, if you have only a limited number of images for each person, it probably doesn't matter so much, 506 00:58:29,040 --> 00:58:37,700 but if you were going forward with people over a longer time, is it possible to provide feedback to help improve the performance of individuals? 507 00:58:37,700 --> 00:58:43,020 I mean, I suppose you could smooth everybody in the same direction, which wouldn't necessarily help, 508 00:58:43,020 --> 00:58:50,780 but I just wondered whether or not you can have a sort of feedback loop to improve performance. 509 00:58:50,780 --> 00:59:01,100 Yes, definitely. A couple of things: first, there is a learning parameter in that equation, in the model that we have, 510 00:59:01,100 --> 00:59:05,540 and so people will learn as they go along as well. 511 00:59:05,540 --> 00:59:07,850 But we can certainly have feedback. In fact, 512 00:59:07,850 --> 00:59:18,290 what we're doing at the moment is creating an AR-to-VR package, so we create virtual reality and people can then annotate within it. 513 00:59:18,290 --> 00:59:30,860 And we have AIs, artificial intelligences, so we can have questions and reminders and training actually appear in the virtual reality world. 514 00:59:30,860 --> 00:59:37,040 It's pretty exciting at the moment, but that's a topic for another time. 515 00:59:37,040 --> 00:59:43,400 OK. One of the bits of feedback we got was: what an inspiring talk 516 00:59:43,400 --> 00:59:49,500 this has been to finish the week with, which is perhaps a bit premature for many of our weeks; 517 00:59:49,500 --> 00:59:55,250 it depends on how close you are to the end of your week. But it's been great to hear from you. 518 00:59:55,250 --> 01:00:03,350 It's been really interesting to have such specific examples that are both important and also very different, 519 01:00:03,350 --> 01:00:06,890 so you can see how this is being used in a number of contexts, 520 01:00:06,890 --> 01:00:13,340 and that, I think, really helps us envision the wide range of other opportunities where this can be brought together. 521 01:00:13,340 --> 01:00:19,400 We're also in a particular context within the U.K. of just having gone through Brexit.
522 01:00:19,400 --> 01:00:24,590 And so that does change some of our data sharing relationships with other countries. 523 01:00:24,590 --> 01:00:29,930 So it would be important for us to think about these federated approaches and how they might be 524 01:00:29,930 --> 01:00:38,720 used to avoid running into complications with data sharing and with where different datasets are stored, 525 01:00:38,720 --> 01:00:45,770 so that we can still get the benefits of all the data that are out there without having to put them all in the same place. 526 01:00:45,770 --> 01:00:52,510 Yes, that's true. I think it's a challenge that all of us are going to need to face and work through. 527 01:00:52,510 --> 01:01:03,780 We can certainly learn from each other. Well, thank you very much again for spending part of your evening with us. 528 01:01:03,780 --> 01:01:05,700 Thank you very much for the opportunity. 529 01:01:05,700 --> 01:01:13,500 It's a great honour to be able to speak, and especially for such a great cause and a great memory. 530 01:01:13,500 --> 01:01:23,970 It's fantastic that you remember people in this way, and I'm proud and pleased to be part of it. 531 01:01:23,970 --> 01:01:27,750 Thank you. Thank you very much. And we still owe you a dinner, 532 01:01:27,750 --> 01:01:33,280 so at some point in the future we hope you'll speak again, if you're in the Oxford area. 533 01:01:33,280 --> 01:01:38,550 Oh, OK. It's a deal. Thank you. In a more traditional manner. 534 01:01:38,550 --> 01:01:47,663 Okay. Thanks very much, Crystal. And thank you very much, everybody, for attending today.