1 00:00:05,160 --> 00:00:09,930 So, ladies and gentlemen and Charlotte, thank you very much. 2 00:00:09,930 --> 00:00:14,730 It's fantastic to be here today when this new building is going to be out. 3 00:00:14,730 --> 00:00:27,420 So that's a real privilege, but also to be here and to see Charlotte, who was a Ph.D. student and now is head of department and obviously thriving. 4 00:00:27,420 --> 00:00:33,750 And I had a lovely lunch with some of her students and post-docs, so that's really great. 5 00:00:33,750 --> 00:00:40,650 But also being asked to give the Florence Nightingale lecture is quite an honour. 6 00:00:40,650 --> 00:00:46,750 Actually, you get these emails, and I thought: the Florence Nightingale lecture. Wow. 7 00:00:46,750 --> 00:00:48,010 Sounds really, really good. 8 00:00:48,010 --> 00:00:58,140 So of course, I did a little bit of homework about Florence Nightingale, and Charlotte's already told us a little bit about her. 9 00:00:58,140 --> 00:01:03,540 She lived till she was 90. She was a social reformer and a statistician. 10 00:01:03,540 --> 00:01:07,860 Of course, I had no idea that she was a statistician before. 11 00:01:07,860 --> 00:01:14,220 And clearly she cared about people, being a founder of nursing. 12 00:01:14,220 --> 00:01:21,150 That was really important to her. She wrote very much, and effectively, what we would now call PR. 13 00:01:21,150 --> 00:01:28,020 It wasn't called PR then, but she wrote about medical knowledge using simple English. 14 00:01:28,020 --> 00:01:35,880 And I think this is a message for all of us as scientists to try and use simple English to explain to people what we're doing. 15 00:01:35,880 --> 00:01:41,460 But also, she helped to popularise graphical representation of statistical data, 16 00:01:41,460 --> 00:01:49,560 and this, as I'm sure some of you will know, is a letter that she sent to Queen Victoria. 
17 00:01:49,560 --> 00:01:53,520 I can't imagine us sending letters to the Queen even today, 18 00:01:53,520 --> 00:02:05,490 but she sent it with these sort of charts showing the causes of mortality in the army in the Crimea. 19 00:02:05,490 --> 00:02:16,530 And what this shows beautifully, I think, is that actually the mortality from wounds here was very small, 20 00:02:16,530 --> 00:02:23,310 whereas the mortality from infectious diseases, the great part, was much, much bigger. 21 00:02:23,310 --> 00:02:31,770 And it was that that needed to be addressed as much as the wounds and the sort of surgery that went on. 22 00:02:31,770 --> 00:02:40,200 And so as somebody who loves data myself and also loves presenting graphical presentations of data, 23 00:02:40,200 --> 00:02:46,640 I was really quite delighted to find this, because I think it 24 00:02:46,640 --> 00:02:51,970 explained to me why this lecture would be called the Florence Nightingale lecture. 25 00:02:51,970 --> 00:02:56,160 So actually, I'm not going to do exactly what I wrote in the abstract. 26 00:02:56,160 --> 00:03:01,080 You'll be pleased to know I'm going to say a little bit. 27 00:03:01,080 --> 00:03:11,790 What I wanted to try and do was talk about how I've been involved in bioinformatics through my career. 28 00:03:11,790 --> 00:03:20,820 And that is like a history, a little bit, of structural bioinformatics, and then show how this work then feeds into the genomic 29 00:03:20,820 --> 00:03:28,350 medicine that I think will change all of our lives and is perhaps relevant for this sort of lecture. 30 00:03:28,350 --> 00:03:36,060 So I started out, as Charlotte mentioned, as a physicist. 
31 00:03:36,060 --> 00:03:42,900 I came to the National Institute for Medical Research knowing no biology whatsoever, 32 00:03:42,900 --> 00:03:51,180 and my project was to study this little molecule, which is NAD, which the scientists in the audience, the biologists, will know 33 00:03:51,180 --> 00:03:55,770 is a critical coenzyme that we all have and we all need. 34 00:03:55,770 --> 00:04:04,980 And it was my job to do some experimental work, wet and dry, but also to do some computational work on this tiny molecule. 35 00:04:04,980 --> 00:04:11,880 So for the chemists amongst us, this molecule has 10 rotatable bonds. Now, compared with a protein, 36 00:04:11,880 --> 00:04:16,470 it's tiny, but I was in this National Institute for Medical Research. 37 00:04:16,470 --> 00:04:24,120 There were statisticians there, because there is a long history of statistics in medical science. 38 00:04:24,120 --> 00:04:32,490 But there was nobody else really doing computing, and the computer room was up in the loft. 39 00:04:32,490 --> 00:04:39,600 It was a little room about the size of my office now, and I've not got a big office, I should say. 40 00:04:39,600 --> 00:04:50,100 And up there, we had some graphics, and I was trying to do computations on this molecule, 41 00:04:50,100 --> 00:04:57,630 and it became very clear that actually this was quite difficult to do. 42 00:04:57,630 --> 00:05:02,610 There was nobody else in the whole institute doing this sort of stuff. 43 00:05:02,610 --> 00:05:06,750 There wasn't a discipline of bioinformatics at all. 44 00:05:06,750 --> 00:05:10,200 It was just, you know, I was doing it, 45 00:05:10,200 --> 00:05:17,940 and the great thing about this computer, actually, was that it had two screens and I could look at these molecules in three dimensions. 46 00:05:17,940 --> 00:05:22,680 Now, of course, the young kids, they're used to doing three-dimensional things on screens. 
47 00:05:22,680 --> 00:05:26,790 But then it was really fun, and I could programme the potentiometers. 48 00:05:26,790 --> 00:05:35,340 There were 10 potentiometers and I could programme them so that I had one for each of these rotatable bonds, and I could make this 49 00:05:35,340 --> 00:05:40,230 molecule change its conformation just by twiddling these little potentiometers. 50 00:05:40,230 --> 00:05:46,860 Now, I made a big cock-up. You know, you learn in science that you go wrong sometimes. 51 00:05:46,860 --> 00:05:54,210 I did this twiddling, and I thought that I would minimise the energy by hand. 52 00:05:54,210 --> 00:05:58,050 This was my idea. Ten variables. I thought, no problem. 53 00:05:58,050 --> 00:06:04,980 I'll just minimise it by hand. It took me six months to write the programme to allow me to do that. 54 00:06:04,980 --> 00:06:10,500 It took me about two minutes to realise there was no way I could cope with ten variables in my head. 55 00:06:10,500 --> 00:06:17,910 It's just too hard to do it. So anyway, I learnt a lot about sort of the computing. 56 00:06:17,910 --> 00:06:33,030 But then I came to Oxford and I was in the zoology building, which was then beautifully white, and we were up on the top floor here. 57 00:06:33,030 --> 00:06:40,560 And this was the time when David Phillips and Dorothy Hodgkin were beginning to solve the structures of proteins. 58 00:06:40,560 --> 00:06:44,340 And I came, actually, and this is rather ironic, 59 00:06:44,340 --> 00:06:50,790 to be the systems manager of the computer. Now, anybody who knows me knows I hate systems management. 60 00:06:50,790 --> 00:06:59,730 And this was a complete disaster from one perspective, but from another perspective, I began to learn about protein structures. 61 00:06:59,730 --> 00:07:01,570 And you can see just from these. 
62 00:07:01,570 --> 00:07:11,190 So for those who don't know about protein structures, they are long molecules that fold up in three dimensions to create these beautiful things. 63 00:07:11,190 --> 00:07:16,560 Some of them are very symmetric. You can see that one up there; they're rather like flowers. 64 00:07:16,560 --> 00:07:18,990 When you look at them, they're beautiful things. 65 00:07:18,990 --> 00:07:29,160 And during this time, when I came, so I started my Ph.D. sort of here, and there was only one structure. 66 00:07:29,160 --> 00:07:32,880 And about this time they created something called the Protein Data Bank, 67 00:07:32,880 --> 00:07:37,740 where everybody who solved a structure around the world deposited these structures 68 00:07:37,740 --> 00:07:44,820 into the databank, and I, as systems manager, used to get tapes that were about this big, 69 00:07:44,820 --> 00:07:53,940 with these new molecules on. Every three months we'd get a tape with the new molecules on, and you can see that during my time, 70 00:07:53,940 --> 00:07:58,080 so I was in Oxford in the seventies, basically, 71 00:07:58,080 --> 00:08:04,020 the number of structures that we had gradually increased. 72 00:08:04,020 --> 00:08:10,650 And of course, it was a big event. We'd all go and look at Nature when a new structure came out, because we didn't have many to look at. 73 00:08:10,650 --> 00:08:18,420 And so looking at those was fantastic, and we began to learn how to handle them. 74 00:08:18,420 --> 00:08:25,290 But actually, looking at six is very different from looking at one hundred and fifty. 75 00:08:25,290 --> 00:08:31,290 Today we have a hundred and eighteen thousand entries in the PDB. 76 00:08:31,290 --> 00:08:34,710 So during my academic lifetime, 77 00:08:34,710 --> 00:08:42,120 the number of structures has really increased from almost nothing to this vast body of knowledge that we have to handle. 
78 00:08:42,120 --> 00:08:47,400 And we have to develop the computational tools to deal with those. 79 00:08:47,400 --> 00:08:55,350 And when I started, we were very keen. Being a physicist, of course, physicists think they can solve all these problems. 80 00:08:55,350 --> 00:09:01,410 And the idea was that you could take the amino acid sequence, which is just a list of letters, 81 00:09:01,410 --> 00:09:09,360 as you can see here, and predict, ab initio, how it folded into these wonderful three-dimensional structures. 82 00:09:09,360 --> 00:09:14,070 Now, that is a problem that a three-year-old really can understand. 83 00:09:14,070 --> 00:09:18,510 If you give them a shoelace and say, wind it up so that it goes into a ball structure, 84 00:09:18,510 --> 00:09:27,030 they would understand that. Even today we are some way away from being able to answer this question. 85 00:09:27,030 --> 00:09:36,120 So the really hard question of doing this ab initio, without previous knowledge, is really hard. 86 00:09:36,120 --> 00:09:43,230 And that was what really drove much of my research during the seventies, when we looked at all different aspects. 87 00:09:43,230 --> 00:09:51,130 And this is when it became sort of statistical, because solving one of those structures generally took a long time. 88 00:09:51,130 --> 00:09:58,830 Originally, it would take five years, or 20 years even, for somebody to solve just one protein structure. 89 00:09:58,830 --> 00:10:07,310 As you can tell from the numbers today, it doesn't take that long. It's still difficult sometimes, but what we did 90 00:10:07,310 --> 00:10:15,950 that other people perhaps didn't do was, instead of just looking at one structure, as I was in a lab where lots of structures were coming out, 91 00:10:15,950 --> 00:10:25,760 we looked at many structures and began to do more statistical analysis of what was likely and what was unlikely; we'd need our random distributions. 
92 00:10:25,760 --> 00:10:31,640 So we had this different approach to looking at these structures practically. 93 00:10:31,640 --> 00:10:39,680 And so salt bridges: when I started, I thought salt bridges, which are plus-minus interactions between parts of the chain, 94 00:10:39,680 --> 00:10:45,020 would be the thing that really stabilised the protein. Completely wrong. 95 00:10:45,020 --> 00:10:53,750 They lie on the outside and they provide stability, but they are not the main contribution to stability for proteins. 96 00:10:53,750 --> 00:11:01,160 And so we looked at lots of different things and did analyses that still are really used today in terms of, 97 00:11:01,160 --> 00:11:04,760 I mean, Charlotte mentioned the beta turns; that's a little beta turn, 98 00:11:04,760 --> 00:11:07,880 so that's a tiny little part of a structure, 99 00:11:07,880 --> 00:11:14,690 but they're very common in proteins and they help the chain to fold around into a compact structure. 100 00:11:14,690 --> 00:11:23,660 So we looked at all these details. But increasingly, as by this time we probably had 300 folds, 101 00:11:23,660 --> 00:11:31,940 we began to think, well, actually, we need to find a way to classify these because we can't cope with them. 102 00:11:31,940 --> 00:11:38,900 So I started off in Oxford, I had a blue book, and every time a new structure came out, I'd write in the book. 103 00:11:38,900 --> 00:11:46,130 And of course, as a physicist, I wasn't really conscious of evolution and the impact that that has on proteins. 104 00:11:46,130 --> 00:11:50,030 And these proteins can be called something completely different 105 00:11:50,030 --> 00:11:55,100 and yet look almost identical. And that was really confusing to me. 106 00:11:55,100 --> 00:12:02,210 Or the other way round, I thought: why does this cytochrome P450 not look like cytochrome b? 
107 00:12:02,210 --> 00:12:06,560 You know, why aren't they the same? They've got the same name. They should look the same. 108 00:12:06,560 --> 00:12:12,320 They look completely different. And so we had to find ways to really think about things. 109 00:12:12,320 --> 00:12:18,530 And this is almost identical to what people did with plants or with animals in 110 00:12:18,530 --> 00:12:24,470 terms of trying to classify them into their different species in some way. 111 00:12:24,470 --> 00:12:29,990 And so we developed the thing that Charlotte mentioned, CATH, and this was very much with Christine Orengo, 112 00:12:29,990 --> 00:12:38,600 who still looks after it at UCL and has made it really excellent, which provides a classification of these structures. 113 00:12:38,600 --> 00:12:43,250 And this is essential, and in applied statistics, of course, 114 00:12:43,250 --> 00:12:52,550 you always have to get your data set right, because if your data set's wrong, everything that follows is wrong. 115 00:12:52,550 --> 00:12:58,910 And so deciding what you include and what you don't to generate a random distribution is just critical. 116 00:12:58,910 --> 00:13:03,230 And so this was what allowed us, and that's why we started doing it. 117 00:13:03,230 --> 00:13:06,200 But in fact, it's been used for many other things, 118 00:13:06,200 --> 00:13:12,680 and we developed our classification of structures and we could see which structures were more common and which structures were less common, 119 00:13:12,680 --> 00:13:21,620 et cetera, et cetera. And from this also, which is interesting, we began to develop validation methods. 120 00:13:21,620 --> 00:13:26,420 So when people put structures into the Protein Data Bank, 121 00:13:26,420 --> 00:13:36,230 originally it was done by hand and it was all hand-curated, and during about the late eighties, structures were solved 122 00:13:36,230 --> 00:13:40,880 that proved to be wrong. 
123 00:13:40,880 --> 00:13:47,990 Now, this was amazing, because in structural biology that never happened and still actually now doesn't happen. 124 00:13:47,990 --> 00:13:56,090 Structural biology is fantastic because you know when you've got it right or wrong, fundamentally, if you do it properly. 125 00:13:56,090 --> 00:14:05,450 And so what we found was that we could take all of these structures and join them together and develop parameters that 126 00:14:05,450 --> 00:14:12,860 would tell you whether a given new structure looked like the previous structures that had been deposited, 127 00:14:12,860 --> 00:14:16,010 whether it had the right stereochemical properties. 128 00:14:16,010 --> 00:14:23,270 And so Roman Laskowski, who's in the group and still works with me, he developed this tool called PROCHECK. 129 00:14:23,270 --> 00:14:35,270 And in fact, wrote the paper describing it in the Journal of Applied Crystallography, that well-known journal that nobody has heard of, 130 00:14:35,270 --> 00:14:38,930 and this is one of the top 100 cited papers. 131 00:14:38,930 --> 00:14:45,740 So you don't have to publish it in the top journals to get citations; if people are going to use the stuff, that's that. 132 00:14:45,740 --> 00:14:51,410 But this was built on a statistical analysis of those data. 133 00:14:51,410 --> 00:15:02,130 And so when you get new datasets, being able to really understand, you know, mean values, deviations, et cetera, et cetera, 134 00:15:02,130 --> 00:15:12,300 is really important. And then we moved on to look at function, but I'm not going to go into that because really there isn't time to do that. 135 00:15:12,300 --> 00:15:25,560 But I should say that during this time, up to about 1990, well, even up to 1995, there were never any jobs for bioinformatics 136 00:15:25,560 --> 00:15:35,370 people advertised. 
You know, when you're at this stage, you go through Nature looking for adverts for a job. 137 00:15:35,370 --> 00:15:37,150 There were no jobs. 138 00:15:37,150 --> 00:15:48,330 In fact, when I was made a professor, the person who was doing the, you know, whatever it's called, where you say, Oh, congratulations, 139 00:15:48,330 --> 00:15:58,710 said, Well, of course, she works in what used to be known as the dustbin area of crystallography, because this was the computational biology part. 140 00:15:58,710 --> 00:16:10,320 Now, in fact, I think probably the bioinformatics world is actually, I would say, probably five times larger than the structural biology community. 141 00:16:10,320 --> 00:16:21,300 It's not a dustbin area; it's very powerful and very important, and I honestly think it really is at the heart of biology. 142 00:16:21,300 --> 00:16:28,080 So this is what I was doing for many, many years. 143 00:16:28,080 --> 00:16:34,290 But it became apparent to me that actually, using this structural 144 00:16:34,290 --> 00:16:42,690 data to understand how proteins in the body work and how diseases occur, 145 00:16:42,690 --> 00:16:49,440 you don't just need structural biology data, you need many other sorts of data as well. 146 00:16:49,440 --> 00:16:56,400 And because I've been involved with the PDB, the Protein Data Bank, I've been on their advisory board. 147 00:16:56,400 --> 00:17:10,090 I've done lots of things, and because of PROCHECK and doing things like that, I was deeply involved in that, and it frustrated me that 148 00:17:10,090 --> 00:17:15,580 when these structures were deposited, there would be many errors in them. 149 00:17:15,580 --> 00:17:20,650 And this is part of what PROCHECK sought to clean up. 150 00:17:20,650 --> 00:17:28,000 And so I realised that if this is done once, then the whole community benefits from it. 
151 00:17:28,000 --> 00:17:31,930 It's not just me or that person who solved that structure. 152 00:17:31,930 --> 00:17:40,150 If you clean it up, if you make it as good as it can be, once, then everybody else can benefit from that. 153 00:17:40,150 --> 00:17:46,900 And so at that point, I had been, starting at Mill Hill, in Oxford, 154 00:17:46,900 --> 00:17:58,300 but was back at UCL, and I was asked if I would consider going as director to EMBL-EBI, and EBI has as its 155 00:17:58,300 --> 00:18:05,890 chief mission to provide biomolecular data of all sorts, really, to the world. 156 00:18:05,890 --> 00:18:14,230 And because of my experience with the protein structures, I really felt that this was a job worth doing, was very important, 157 00:18:14,230 --> 00:18:23,260 and it was also something that I knew something about, although only in one little part of biology, not in the whole part. 158 00:18:23,260 --> 00:18:37,840 And so I went off up to Cambridge, and basically the key, the USP, of the EBI is that, essentially, labs around the world 159 00:18:37,840 --> 00:18:47,650 send their data, we archive it, we classify it, we share it with other data providers and analyse it, 160 00:18:47,650 --> 00:18:51,130 and then we provide tools to help researchers use it. 161 00:18:51,130 --> 00:19:00,160 So actually, when I went, there were about one hundred and fifty people at EBI, but it was clear this was going to be key. 162 00:19:00,160 --> 00:19:08,560 Now this shows all the different types of data resources that we are responsible for. 163 00:19:08,560 --> 00:19:17,530 And you can see here we have our structures and we also have the protein sequences here. 164 00:19:17,530 --> 00:19:21,640 And when I got there, there was the European Nucleotide Archive, 165 00:19:21,640 --> 00:19:32,350 as it was called, and ArrayExpress, and that was all. Now there are over 40 different resources that are looked after by the EBI. 
166 00:19:32,350 --> 00:19:35,500 And the reason, as I said, that I thought so. 167 00:19:35,500 --> 00:19:45,310 The idea of this arrow here is that we go from the molecular at one end, to the systems, to the whole organisms at the other end. 168 00:19:45,310 --> 00:19:53,530 Now, at the moment, we really only handle properly, I think, the molecules, and they're not perfect by a long way. 169 00:19:53,530 --> 00:20:02,320 And we were having a discussion at lunch time actually about the PDB and its pluses and minuses, but they're relatively under control. 170 00:20:02,320 --> 00:20:11,290 If we think about the whole organism data or even the cellular data, that is buried in the literature, 171 00:20:11,290 --> 00:20:18,460 and we really don't yet have a good handle on that or good data resources that tackle that. 172 00:20:18,460 --> 00:20:25,600 And so, of the resources that have been assembled over time, you can see we have the literature, which is critical. 173 00:20:25,600 --> 00:20:30,970 We have pathways and interactions. The chemical biology is important. 174 00:20:30,970 --> 00:20:43,240 The expression, the metabolomics and proteomics, all of those now are part and parcel of us trying to understand the molecular composition of life. 175 00:20:43,240 --> 00:20:46,990 One of the things that's happened, of course, is that during this time, 176 00:20:46,990 --> 00:20:53,230 during the time I've been there, all of the data have increased very, very rapidly. 177 00:20:53,230 --> 00:21:01,480 And in fact, if we compare our data with what is at CERN, which everybody always thinks is big, big data. 178 00:21:01,480 --> 00:21:06,850 So at CERN, they store about 30 petabytes of data a year. 179 00:21:06,850 --> 00:21:18,790 Their doubling time is about two years, and they effectively established a whole network of computers across the world to analyse this data. 180 00:21:18,790 --> 00:21:29,170 Our data are very different. 
They're not coming out of one big LHC in Geneva; they're developed by many people all around the world, 181 00:21:29,170 --> 00:21:34,210 all the life science community, which is running into the millions around the world. 182 00:21:34,210 --> 00:21:41,770 And so our data are now large. We have now sixty-five petabytes of storage. 183 00:21:41,770 --> 00:21:47,210 The data aren't all the same. They're heterogeneous. I showed you all those different data resources. 184 00:21:47,210 --> 00:21:53,440 They're all different, and that makes it much more complicated to handle. Our doubling time is about a year. 185 00:21:53,440 --> 00:22:00,700 And obviously, this is a joint enterprise by everybody around the world. 186 00:22:00,700 --> 00:22:05,260 And if we look, so now, this is a log plot. 187 00:22:05,260 --> 00:22:09,220 This is the number of bytes here going up. So that's 10 petabytes 188 00:22:09,220 --> 00:22:22,480 right here. We've got from 2006 to 2016, and these lines represent doubling every year, so you can see that the sequence data, 189 00:22:22,480 --> 00:22:30,670 which is by far the largest, by a long way, and now the human sequence data, that is about along the doubling-every-year line. 190 00:22:30,670 --> 00:22:39,700 Not quite. But something like that. You can also see metabolomics, so this is looking at small molecules and their distributions. 191 00:22:39,700 --> 00:22:44,680 So if we look at the small molecules, that is increasing very rapidly. 192 00:22:44,680 --> 00:22:53,920 But of course, because it's a log plot, this is tiny compared with the big data. 193 00:22:53,920 --> 00:22:58,840 And the other thing that's happened, well, actually, maybe I'll go back to that slide, 194 00:22:58,840 --> 00:23:06,770 is that the way that biology is done has changed radically with big projects. 195 00:23:06,770 --> 00:23:13,480 And this is just three or four big projects that EBI has been involved with. 
196 00:23:13,480 --> 00:23:20,860 And these are global projects that happen around the world. So they are projects that generate massive data. 197 00:23:20,860 --> 00:23:26,740 So this is looking at marine microbial biodiversity. 198 00:23:26,740 --> 00:23:35,410 This is the big International Cancer Genome Consortium. This is ENCODE, which was looking at expression data, regulatory data. 199 00:23:35,410 --> 00:23:40,990 And Tara Oceans was looking at organisms around the world. 200 00:23:40,990 --> 00:23:55,010 And so this is the way that the science is done now, and the need to create these communities that use all of this data really becomes important. 201 00:23:55,010 --> 00:24:04,700 The other thing is that as part of this, we gradually started to get information that is relevant to medicine. 202 00:24:04,700 --> 00:24:15,050 So again, when I first went to EBI, I can remember saying, Oh well, the big frontier is how we engage with medical data. 203 00:24:15,050 --> 00:24:21,560 Fifteen years ago, I have to say that the medics really weren't interested. 204 00:24:21,560 --> 00:24:29,570 It was too far away. It was too big a jump. Fifteen years later, that has completely inverted. 205 00:24:29,570 --> 00:24:39,020 And now there is a very clear recognition that these data are relevant to what goes on. 206 00:24:39,020 --> 00:24:47,780 So, big data; we have big demand: 11 million users a day to the website, 207 00:24:47,780 --> 00:24:54,950 five million unique sites, 9.2 million jobs a month, etc. So it's a lot. 208 00:24:54,950 --> 00:25:03,130 It's a big setup, this infrastructure. And if we just look, if this is going to work, 209 00:25:03,130 --> 00:25:08,980 see whether it works. So this is live, as we speak, 210 00:25:08,980 --> 00:25:17,860 usage of EBI data resources around the world on these maps, and the colours actually represent the different resources. 
211 00:25:17,860 --> 00:25:22,300 So you can see who's using what at any time, and what you can see here. 212 00:25:22,300 --> 00:25:28,420 So now actually, this year for the first time, our biggest users are China. 213 00:25:28,420 --> 00:25:34,150 And 20 percent of the hits on the website are from China. 214 00:25:34,150 --> 00:25:39,430 And this has changed completely because when I started, there was almost nothing. 215 00:25:39,430 --> 00:25:44,980 So this has changed completely. And, you know, you can see, well, the US. 216 00:25:44,980 --> 00:25:51,130 Yes, it obviously changes with the time of day. 217 00:25:51,130 --> 00:25:59,110 So Europe is obviously a big splurge here, but usually Australia goes to sleep round about now, 218 00:25:59,110 --> 00:26:05,530 so there won't be much going on in Australia, and the Americas wake up. 219 00:26:05,530 --> 00:26:14,140 But I remember I was giving a talk in Iceland. And of course, everybody always looks at their place when you show this sort of a map. 220 00:26:14,140 --> 00:26:18,880 And I said, Oh yes, Iceland uses it. But of course there were no users, and I thought, Oh no. 221 00:26:18,880 --> 00:26:23,710 And just as I was speaking up came a hit from Iceland. They're fantastic. 222 00:26:23,710 --> 00:26:37,090 So people all around the world really use the data, and I think that is fantastic, because that's what science is about. 223 00:26:37,090 --> 00:26:43,330 It's about sharing the data and building on everybody else's data as well. 224 00:26:43,330 --> 00:26:52,670 So bioinformatics effectively is able to bridge different biological disciplines, 225 00:26:52,670 --> 00:26:59,570 and that's very important because it brings data together from different aspects of biology. 226 00:26:59,570 --> 00:27:05,150 And increasingly, you can see that we go into it. 
227 00:27:05,150 --> 00:27:12,680 So most of the time the computational people have been down at the molecular end, or certainly in structural biology, 228 00:27:12,680 --> 00:27:19,700 obviously at the molecular end. But increasingly, the need to model the cellular, 229 00:27:19,700 --> 00:27:28,700 the organ and the organism scale is going to become more and more important, and we will get more and more data about that. 230 00:27:28,700 --> 00:27:34,190 And this is a terrific opportunity where people who can do applied statistics honestly 231 00:27:34,190 --> 00:27:41,690 will be king, because it really makes a difference being able to handle all of this data. 232 00:27:41,690 --> 00:27:49,280 And just to emphasise the importance of this, if I could say that for EMBL. 233 00:27:49,280 --> 00:27:54,830 So EMBL is a European organisation and we have five-year plans. 234 00:27:54,830 --> 00:27:58,610 As Ian is wont to say, we're one of the last; you know, 235 00:27:58,610 --> 00:28:05,060 the Soviet five-year plans, we still have our five-year plans of what we're going to do. 236 00:28:05,060 --> 00:28:15,530 But the next indicative scheme, which is what this five-year plan is called, is entitled Digital Biology. 237 00:28:15,530 --> 00:28:20,480 And I think that, of all of the scientists at EMBL 238 00:28:20,480 --> 00:28:30,740 now, about half of them spend on average half of their time doing computational biology and not just the experimental. 239 00:28:30,740 --> 00:28:39,110 And so the growth of this bioinformatics, computational biology, is really important. 240 00:28:39,110 --> 00:28:45,800 And you can see we have the experimental technology, which obviously is critical, generating the data. 241 00:28:45,800 --> 00:28:51,080 But of equal importance is the computational approach to these things. 242 00:28:51,080 --> 00:29:01,400 And clearly, if we have better methods, we will get better biology and ultimately better medicine as well. 
243 00:29:01,400 --> 00:29:05,150 And so this is a list, and I'm not going to go through them, 244 00:29:05,150 --> 00:29:20,570 you can read them all, of areas which are sort of hot, where the need for good computational biology and bioinformatics is very, very clear. 245 00:29:20,570 --> 00:29:25,760 So what I have not mentioned so far is the human genome. 246 00:29:25,760 --> 00:29:31,910 And so this is the sort of second-to-last part of the talk. 247 00:29:31,910 --> 00:29:36,380 I'm just talking about the impact that this has had. 248 00:29:36,380 --> 00:29:45,800 So I should say when I was in Oxford, I drew a graph of the number of protein structures versus the date, 249 00:29:45,800 --> 00:29:52,670 and you can predict how many structures there are. And actually, that prediction was pretty accurate. 250 00:29:52,670 --> 00:29:57,710 You know that you're going to get better at doing it, you're going to get quicker and everything. 251 00:29:57,710 --> 00:30:03,950 The one thing which I didn't in any shape or form see was the fact that we would 252 00:30:03,950 --> 00:30:11,630 have the whole human genome determined, and the ability not just to sequence humans, 253 00:30:11,630 --> 00:30:23,360 but to sequence a human and look at that. And the impact of this, this is just the stats for this, is enormous in biology. 254 00:30:23,360 --> 00:30:29,420 And I think it's only just started. This is sort of a fun thing. 255 00:30:29,420 --> 00:30:35,990 In 2003, the cost of sequencing a genome was equivalent to 256 00:30:35,990 --> 00:30:51,320 the price then of a house, the most expensive house in London. By 2013, it was the price of an Arsenal season ticket. 257 00:30:51,320 --> 00:30:57,410 Now it's probably the price of maybe an Oxford City season ticket. 258 00:30:57,410 --> 00:31:07,040 I don't know. 
I'm not sure how much those cost, but clearly most of us could today actually afford to have our genome sequenced, 259 00:31:07,040 --> 00:31:12,800 and that is a huge total change in perspective. 260 00:31:12,800 --> 00:31:18,770 And of course, we all have basically the same sequence: 99.9 percent. 261 00:31:18,770 --> 00:31:20,510 We've all got the same sequence, 262 00:31:20,510 --> 00:31:27,890 but it's these little bits where we differ, these odd letters in the sequence, that make us different from one another. 263 00:31:27,890 --> 00:31:39,110 And so understanding this human variation is huge and inspiring; it is what makes us human and what makes us different from each other. 264 00:31:39,110 --> 00:31:46,450 And so there are many sequencing efforts around the world generating data, 265 00:31:46,450 --> 00:31:53,200 some of which ends up at EBI, to look at all these differences. 266 00:31:53,200 --> 00:31:57,310 So in the UK we probably have the biggest: 267 00:31:57,310 --> 00:32:08,710 the 100,000 Genomes project, which started just last year. On campus, we have the building, the sequencing factory, and it is like a factory. 268 00:32:08,710 --> 00:32:18,820 The ground floor is taken by Illumina, who will sequence these 100,000 genomes, and these are supposed to be all done by 2017. 269 00:32:18,820 --> 00:32:20,170 So it's very soon. 270 00:32:20,170 --> 00:32:32,080 I mean, it's really very quick, and the only reason it's possible is because the technology has changed so radically over the last few years. 271 00:32:32,080 --> 00:32:44,140 And of course, the problem now isn't to determine the genome, which, as I say, I never even envisaged, but it's actually to decipher what it means. 272 00:32:44,140 --> 00:32:50,290 And that is a huge challenge for the whole of biology and indeed for the whole of medicine. 273 00:32:50,290 --> 00:32:58,720 Because when it goes wrong, then you can look to ask why you have a certain phenotype.
274 00:32:58,720 --> 00:33:04,420 So no, I don't think there's going to be time for all of this. 275 00:33:04,420 --> 00:33:11,830 Perhaps I could talk about the coding, the protein-coding variants. 276 00:33:11,830 --> 00:33:19,690 So when you get this variation, one variation here, you can say that some of these variations, 277 00:33:19,690 --> 00:33:24,880 not most of them, but some of them will occur in a protein structure. 278 00:33:24,880 --> 00:33:29,800 And so how does that variant lead to a disease? 279 00:33:29,800 --> 00:33:31,060 What's the effect? 280 00:33:31,060 --> 00:33:43,210 So this is a summary of the ways in which these causative variants cause or have an impact on the disease. 281 00:33:43,210 --> 00:33:48,790 So here we've got some of those variants that lead to a truncation. 282 00:33:48,790 --> 00:33:55,990 So this is like a protein chopped in half with the blood coming out. These are Roman's pictures, I should say. 283 00:33:55,990 --> 00:34:04,450 So you can truncate the protein, you can have a variation in the middle that causes the thing to disrupt and unfold. 284 00:34:04,450 --> 00:34:13,180 You can have a variation at the heart of the enzyme, the residue, and that can cause the enzyme not to work. 285 00:34:13,180 --> 00:34:20,080 You can have a variation at a metal binding site so that the metal doesn't bind anymore, and that causes a problem. 286 00:34:20,080 --> 00:34:25,300 You can have variations that stop a DNA-binding protein being able to recognise 287 00:34:25,300 --> 00:34:31,270 DNA. Or protein-protein interactions: if you have a variant in the middle, 288 00:34:31,270 --> 00:34:35,890 the protein can't recognise its partner, and that causes problems. 289 00:34:35,890 --> 00:34:39,490 Or here you can have something that's affecting the substrate binding. 290 00:34:39,490 --> 00:34:50,050 So this is a summary of some of the variants that we all carry and what their impact might be.
291 00:34:50,050 --> 00:34:59,950 Now the question is, we all have individually, I think it's somewhere here, between any two individuals 292 00:34:59,950 --> 00:35:03,760 there are about, oh gosh, now I'm going to get this number. 293 00:35:03,760 --> 00:35:11,440 I always get these numbers wrong. There are about 10,000 different DNA bases. 294 00:35:11,440 --> 00:35:15,130 Some of those will be in proteins, but not all of them. 295 00:35:15,130 --> 00:35:22,390 And so the question is how. Some of these variants, in fact most of them, have no effect. 296 00:35:22,390 --> 00:35:26,050 They don't cause disease, so we can carry them around quite happily. 297 00:35:26,050 --> 00:35:35,210 We can pass them on to our children. No problem. But others of these variants really do lead to disaster. 298 00:35:35,210 --> 00:35:44,890 Although that disaster, of course, is only a moderate disaster, because if it was a real disaster, then the foetus would die. 299 00:35:44,890 --> 00:35:51,070 If you pass it on to your children, then the child has to. 300 00:35:51,070 --> 00:36:00,610 It's only modulating the effect. And this is one of the big challenges, that actually the variants we see are often rather gentle variants. 301 00:36:00,610 --> 00:36:08,050 They just modulate something. And so identifying what's going on becomes quite a challenge. 302 00:36:08,050 --> 00:36:13,600 So, oh, I'm just conscious of the time here. 303 00:36:13,600 --> 00:36:19,780 Perhaps I could just say, I'll go through these very quickly.
304 00:36:19,780 --> 00:36:25,150 One thing that people have done and we've done is to compare those variants, 305 00:36:25,150 --> 00:36:31,120 those changes in our sequences that are associated with diseases, and those that are, 306 00:36:31,120 --> 00:36:38,590 if you like, natural mutations that we all carry. And you can do that by looking at this new data, 307 00:36:38,590 --> 00:36:46,190 and you can generate matrices that tell you how often different amino acid changes occur. 308 00:36:46,190 --> 00:36:52,790 And then you can compare those matrices with those that occur for disease-associated ones. 309 00:36:52,790 --> 00:36:57,050 So I've not got time to go into the detail of what you see. 310 00:36:57,050 --> 00:37:01,700 You can then take the amino acids, the residues in the protein. 311 00:37:01,700 --> 00:37:06,110 You can colour-code them according to how likely they are to change. 312 00:37:06,110 --> 00:37:13,880 And you can do that for the natural variants, i.e. the ones that don't cause diseases, and the ones that do. 313 00:37:13,880 --> 00:37:19,310 And what you see is that there's a big difference between these two. 314 00:37:19,310 --> 00:37:25,340 So some amino acids, when they change, are very often associated with diseases. 315 00:37:25,340 --> 00:37:36,260 Others are not. This is a purely statistical way of looking at it, and you can derive quite good on-average values of what's good and what's bad. 316 00:37:36,260 --> 00:37:43,790 And the predictions are really accurate, about eighty-five, eighty-seven percent; they're pretty good predictions. 317 00:37:43,790 --> 00:37:46,700 The problem comes and in particular, 318 00:37:46,700 --> 00:37:59,450 what you find is that the disease-associated variants occur in those residues that are best conserved during evolution. 319 00:37:59,450 --> 00:38:09,140 So if you have a residue that's really well conserved and then you change it, that is likely to lead to a disease. At the opposite end,
320 00:38:09,140 --> 00:38:14,120 if you have an unconserved residue, then it doesn't matter if it changes or not. 321 00:38:14,120 --> 00:38:21,770 So it's more or less what you'd expect. And we did a whole analysis of structural data, but there really isn't time to talk about that. 322 00:38:21,770 --> 00:38:24,410 So I'm not going to go into that. 323 00:38:24,410 --> 00:38:35,570 But I wanted then to think about this. This is, if you like, basic biology. The question is, is this really relevant for medicine? 324 00:38:35,570 --> 00:38:37,290 And what are the challenges? 325 00:38:37,290 --> 00:38:48,560 How can we move from our molecular knowledge to the medical challenges, which are often more organismal rather than an individual molecule? 326 00:38:48,560 --> 00:38:53,750 And where will this sequencing really have an impact in the clinic? 327 00:38:53,750 --> 00:38:58,820 And so you can really immediately identify, here 328 00:38:58,820 --> 00:39:08,600 I've got five areas where the sequencing will begin and is already beginning quite clearly to have a major impact. 329 00:39:08,600 --> 00:39:13,820 The first, of course, is understanding the molecular basis of disease. 330 00:39:13,820 --> 00:39:20,840 As we understand more about diseases, we stand a chance of being able to do something about it and solve it. 331 00:39:20,840 --> 00:39:32,180 Human variation and disease risk: which of those variants that we hold is likely to make us susceptible to cardiac arrest or diabetes or whatever? 332 00:39:32,180 --> 00:39:36,020 You can begin to estimate this. Then cancer genomics. 333 00:39:36,020 --> 00:39:42,500 We all know that cancer is a genetic disease, because we get changes to our DNA; we get those variants.
334 00:39:42,500 --> 00:39:46,490 How does that happen? And the work going on in cancer, 335 00:39:46,490 --> 00:39:53,600 some of it on the campus, is phenomenal in terms of understanding how these cancers evolve and 336 00:39:53,600 --> 00:40:01,670 what we're going to have to do to actually develop therapeutics that can really work long term. 337 00:40:01,670 --> 00:40:08,300 Another amazing use of the sequencing is tracking the source of infectious diseases. 338 00:40:08,300 --> 00:40:17,480 And by looking at the genome of the bacteria or the virus, you can actually watch, from 339 00:40:17,480 --> 00:40:25,160 the sequence data, how the bug can transfer from one bed to another or from one ward to another. 340 00:40:25,160 --> 00:40:32,180 So you can really begin to track these infectious diseases. And then you can ask, 341 00:40:32,180 --> 00:40:38,330 can the structural data really help us to explain the causes of disease? 342 00:40:38,330 --> 00:40:48,710 And the answer, as is often the case, is that to really ask these questions, you need to be very specific where you look and what you're looking at. 343 00:40:48,710 --> 00:40:55,670 And this is a study that was done across the UK, led from the Sanger Institute, 344 00:40:55,670 --> 00:41:01,190 funded by the Wellcome Trust, in trying to decipher developmental disorders. 345 00:41:01,190 --> 00:41:12,200 So these are children who often before the age of five will present to the hospital with some sort of problem. 346 00:41:12,200 --> 00:41:23,480 And in fact, they did a study of just over a thousand children who were identified as having one of these developmental disorders. 347 00:41:23,480 --> 00:41:31,820 They did screening and they did whole exome sequencing, and they sequenced these children, 348 00:41:31,820 --> 00:41:37,160 and these were the sort of disorders that they presented with. 349 00:41:37,160 --> 00:41:44,820 So this is lots of difficult disorders.
The problem for these children is that they present and they are not diagnosed. 350 00:41:44,820 --> 00:41:45,990 So some of these children, 351 00:41:45,990 --> 00:41:54,390 when they present when they're five, they're not diagnosed until they're 12 or later, and this can be really challenging not only for the child, 352 00:41:54,390 --> 00:42:00,840 but also for the parents because they get passed from one part of the system to the other part of the system. 353 00:42:00,840 --> 00:42:03,210 And so they collect data. 354 00:42:03,210 --> 00:42:12,930 So the reason this is such a very nice dataset is that they collected data from the two parents and the child, or sometimes the siblings, 355 00:42:12,930 --> 00:42:21,060 and they could get all the data. The average age of the child when they first presented was about five. 356 00:42:21,060 --> 00:42:26,280 So with that data, for those children, 357 00:42:26,280 --> 00:42:33,570 these were mainly the disorders that these children had. So, varied: they had very many different sorts of disorders. 358 00:42:33,570 --> 00:42:40,710 What they were able to do from the sequencing was to identify in these children different. 359 00:42:40,710 --> 00:42:50,640 So the key thing about doing the parents is that instead of 10,000 variants, you only have 100 de novo variants. 360 00:42:50,640 --> 00:43:01,560 And that cuts it down, because one of the major challenges, which still is really relevant, is that you can't identify which is the causal variant. 361 00:43:01,560 --> 00:43:10,800 You can't say what's going on. And in this dataset, instead of having 10,000 variants to look at, you only have 100. 362 00:43:10,800 --> 00:43:18,150 And so they looked at this, and what they found was 12 novel recurring genes. 363 00:43:18,150 --> 00:43:23,130 And these are genes that were mutated in more than one child. 364 00:43:23,130 --> 00:43:26,570 Now this seems to me like a fairly hairy way of going about things.
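[Editorial illustration, not part of the lecture.] The trio idea described here, keeping only the child's variants that are absent from both parents, can be sketched as a set difference. This is a minimal sketch under stated assumptions: the `Variant` record, the `de_novo_candidates` function and the toy data are all hypothetical, and a real analysis would also have to handle genotype quality, coverage and sequencing error, none of which is modelled here.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A hypothetical minimal variant record (not a real VCF parser)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def de_novo_candidates(child, mother, father):
    """Return the child's variants that are seen in neither parent."""
    inherited = set(mother) | set(father)
    return sorted(v for v in set(child) if v not in inherited)


# Toy data, invented purely for illustration.
mother = [Variant("1", 100, "A", "G"), Variant("2", 250, "C", "T")]
father = [Variant("1", 100, "A", "G"), Variant("3", 75, "G", "A")]
child = [
    Variant("1", 100, "A", "G"),   # inherited (seen in both parents)
    Variant("3", 75, "G", "A"),    # inherited (seen in the father)
    Variant("7", 4242, "T", "C"),  # in neither parent: de novo candidate
]

candidates = de_novo_candidates(child, mother, father)
print(candidates)
```

The point of the design is only that sequencing the parents shrinks the search space: every variant shared with a parent is removed before anyone has to ask which one is causal.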
365 00:43:26,570 --> 00:43:36,450 So you've got this mutation twice. But actually, the probability of that happening and not having some cause is very low. 366 00:43:36,450 --> 00:43:42,390 And so they can look at these recurring mutations and try and understand it. 367 00:43:42,390 --> 00:43:46,050 So we looked at one example of these. 368 00:43:46,050 --> 00:43:54,720 So this was a protein, this protein, and there were four different variants that were observed in different children. 369 00:43:54,720 --> 00:44:04,830 Each of these was a different child. This protein is quite a long protein, and it's made up of these WD40 repeats. 370 00:44:04,830 --> 00:44:16,380 So it's this, as I said, like flowers and symmetry. This shows the structure of the protein, and it's basically got these repeats in it. 371 00:44:16,380 --> 00:44:23,550 Now, when you look at these variants, those are where the variants occur. 372 00:44:23,550 --> 00:44:30,570 This is what's called a sequence logo of the variants, and you can see that there are some places where the letters are big. 373 00:44:30,570 --> 00:44:40,950 It means that residue is conserved. So this H, histidine, is conserved, or the aspartic acid, the D, or the tryptophan, the W. It's D, H, S, 374 00:44:40,950 --> 00:44:44,910 W. And it turned out that those variants occurred there. 375 00:44:44,910 --> 00:44:49,080 So this is now going back to our protein structures. In this structure, 376 00:44:49,080 --> 00:44:54,930 you have this beautiful network of hydrogen bonds that ties together 377 00:44:54,930 --> 00:44:59,550 the Asp, His, serine and tryptophan. 378 00:44:59,550 --> 00:45:08,670 And that link, which occurs up here, is actually the thing that holds the whole of that repeat together. 379 00:45:08,670 --> 00:45:14,520 And so it's a very nice example where the structure is able to explain these variants.
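[Editorial illustration, not part of the lecture.] The remark that big letters in a sequence logo mean a conserved residue can be made concrete: the height of a logo column is commonly taken as its information content, log2(20) minus the Shannon entropy of the residues in that column. The sketch below assumes a gap-free alignment and omits the small-sample correction that logo software often applies; the function names and the four-sequence toy alignment are hypothetical and are not the WD40 data from the talk.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies; no small-sample correction is applied.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def conservation_profile(alignment):
    """Information content for every column of a gap-free alignment."""
    return [column_information("".join(col)) for col in zip(*alignment)]


# Toy alignment, invented for illustration: D, H and W never vary,
# while the third position is S, S, T, A.
toy_alignment = ["DHSW", "DHSW", "DHTW", "DHAW"]
profile = conservation_profile(toy_alignment)
for position, bits in enumerate(profile, start=1):
    print(position, round(bits, 2))
```

Invariant columns reach the maximum of log2(20) bits, so they would be drawn tallest in a logo, while the variable third column scores lower.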
380 00:45:14,520 --> 00:45:19,740 And so we saw variants in the Asp and in the histidine here. 381 00:45:19,740 --> 00:45:27,180 And when we look at them on the protein, we see that they're all at the top of the protein, which is actually a recognition. 382 00:45:27,180 --> 00:45:32,700 This is where it recognises another protein, and the natural variants are at the bottom. 383 00:45:32,700 --> 00:45:42,720 So it's segregated very, very nicely. So let me just say that also, I'll skip that. 384 00:45:42,720 --> 00:45:53,610 So this basic research can enable you to understand more about the variants and to link in some way, 385 00:45:53,610 --> 00:46:01,470 at least make the first step between the sequence and the variant and how that disrupts the protein. 386 00:46:01,470 --> 00:46:10,000 The question then, of course, is how that disruption goes on to create the disorder. 387 00:46:10,000 --> 00:46:19,860 You know, why does that protein being disrupted lead to developmental delay or psychiatric problems? 388 00:46:19,860 --> 00:46:23,880 So there's still a long, long way to go. 389 00:46:23,880 --> 00:46:33,210 So let's just step right back from this and say, well, suppose we sequence every child born in Europe. That's not going to be too far away, 390 00:46:33,210 --> 00:46:42,780 I don't think. I would think, given a fair wind, it would be within the next 10 years that we will begin to sequence every child in Europe. 391 00:46:42,780 --> 00:46:47,760 I hope so because it's getting cheaper and cheaper and we'll get. 392 00:46:47,760 --> 00:46:56,790 Obviously, what this is going to do is generate huge amounts of medical data, so we're going to get individual patient data. It's closed data. 393 00:46:56,790 --> 00:47:02,040 There's a privacy constraint to it. Right.
394 00:47:02,040 --> 00:47:13,590 So we have different sorts of data and obviously we need to bridge between our biomolecular data, which is open, free for everybody to use, 395 00:47:13,590 --> 00:47:17,760 etc., and this very private clinical data, 396 00:47:17,760 --> 00:47:28,110 which nevertheless needs to be shared so that we can really understand what are the causes of diseases and develop methods to improve it. 397 00:47:28,110 --> 00:47:38,340 So perhaps I could just finish by saying, if we think about the 100,000 genomes as it is at the moment, the genomic medicine centres, 398 00:47:38,340 --> 00:47:44,040 they send their samples to Illumina to be sequenced and then all the data goes 399 00:47:44,040 --> 00:47:50,460 into the dataset, and it's actually going to be the interpretation of this data. 400 00:47:50,460 --> 00:48:01,530 It's here that we're going to begin to understand what's going on and be able to then feed back to the clinicians in some way. 401 00:48:01,530 --> 00:48:11,430 And this is the new part, I think, and to my mind, this is at the heart of genomic medicine because clearly, 402 00:48:11,430 --> 00:48:15,720 if we don't get that part of it right, then none of the other parts can follow. 403 00:48:15,720 --> 00:48:18,570 It's not enough, but it's the beginning. 404 00:48:18,570 --> 00:48:29,230 And so just to finish, then, I think it's clear that in biology, the life sciences have developed a substantial data infrastructure, 405 00:48:29,230 --> 00:48:38,550 of which EBI is the European part, and there's also in the States and in Japan and increasingly in China. 406 00:48:38,550 --> 00:48:43,890 So we have many public tools and methods to analyse the data carefully. 407 00:48:43,890 --> 00:48:50,940 It's also clear to me that genomic medicine will lead and even that BigQuery is going to be more complicated.
408 00:48:50,940 --> 00:49:02,640 But we'll need new methods to handle and integrate the data so that we can make sense of the genome data and the outcomes from that. 409 00:49:02,640 --> 00:49:09,330 And it seems to me that this endeavour is really only just beginning. For the young people in the audience, 410 00:49:09,330 --> 00:49:15,390 I think it's a fantastic area to work in, and if I were starting again, this is exactly what I would do. 411 00:49:15,390 --> 00:49:21,682 So thank you very much.