OK, I have to consent to being recorded. OK, I consent. Excellent, right. So, just by way of introduction: my name is Fergus Boyles, and I'm a research software engineer in the Oxford Protein Informatics Group here in the Department of Statistics. Prior to that, I was a student in the Department of Statistics, so I've been around for a while.

I'm going to talk to you a little bit about the use of machine learning in drug discovery. I'm going to touch on quite a few topics, both things that I've worked on and things that other people have worked on, but the latter half of the talk is really going to focus on examples of research done either by myself or by other members of the Oxford Protein Informatics Group. This is by no means intended to be an exhaustive discussion of all of the things that people have done using machine learning in drug discovery; that would take an entire course, and you probably still wouldn't cover all of it. But I hope that this will give you some idea of how computational methods in general, and machine learning in particular, are really benefiting the drug discovery process, and what the practical implications of that are.

So, just a brief overview of what I want to talk about. I'm aware that this is very much a statistics audience, so I'm going to open with an introduction to the drug discovery process: what it entails, why we care about it, and why we might want to apply computational methods to it at all.
Then I'll briefly introduce the concept of computer-aided drug design and discuss what sort of computational methods have historically been employed, prior to the machine learning hype train, or revolution, depending on your stance on it. Then I'll give some examples of well-established machine learning techniques in drug discovery: what they're used for, what sort of problems they solve, and why they're beneficial over other methods. Towards the end, I'd like to spend some time highlighting recent developments in drug discovery that have really benefited from recent advancements in deep learning techniques.

Just before I start, I'm aware that quite a few people in this audience are either members of OPIG or have been through the doctoral training centre in some capacity, so you may have seen the introduction-to-drug-discovery talk anywhere between one and 50 times before. If you want to tune out for the first ten minutes, I won't be offended.

So, to get us started: what is drug discovery? To really answer that question, we first need to understand what a drug is. We intuitively think of a drug as a medicine: we take a medicine to cure ourselves. But what actually are drugs? Biological processes, infection and disease are just the result of the behaviour of macromolecules in the body. Proteins perform pretty much every task in the body.
Proteins have specific functions. If they're carrying out their function correctly, the body is doing OK. If a protein starts to misbehave, functions too much or too little, or a foreign body like a virus introduces a foreign protein into the body, bad things can happen. This is how you get diseases; this is how you get the symptoms of infection. So the key to trying to treat or manage diseases or infections is really trying to figure out what is causing the problem, and how we can either make that molecule behave properly or stop that molecule doing the thing it's not supposed to do.

In pharmaceutical research, we have this concept of a drug target. A drug target is a key molecule, typically a protein or occasionally a nucleic acid, that has been implicated in an infection or disease. This can be, as I said, a protein in the body that's misbehaving, or a protein that's part of, for example, the life cycle of a virus, and I'll get to an example of both of these in just a moment. There are all sorts of ways of identifying and validating whether a target is indeed implicated in a condition that I'm not going to go into today; that's really a topic of research in and of itself. But the key idea is that in order to treat a disease, we want to target, usually, a protein, occasionally a nucleic acid, in the body and alter or inhibit its function.

Now, in pharmacology at least, a drug is any molecule
that interacts with a drug target in order to obtain a therapeutic effect. That therapeutic effect could be alleviating a condition, managing symptoms, or restoring the function of a protein; it could be treating an infection by disrupting the life cycle of a pathogen. It's really a broad catch-all term.

Now, just to distinguish between different types of drugs, because it's an incredibly broad umbrella term, I'd like to distinguish between two key fundamental classes of drugs. The first is small molecule drugs: things such as paracetamol, anything that you take in tablet form, for example. These are small chemical compounds that are typically produced by chemical synthesis. In contrast to this, we have a class of drugs known as biopharmaceuticals, an incredibly broad category of drugs that are extracted, synthesised or otherwise obtained from biological sources. The obvious topical example of this is a vaccine. These can potentially be very large molecules: an antibody, for example, is an entire protein and is much larger than something like a paracetamol molecule. Today, I'm going to focus just on small molecule drug discovery, but be aware that there is an enormous field of different applications of computational methods in medical research.

So, to give an example of what a target is and how a drug functions,
I'd like to start with an example of a protein in the human body. This is a protein called thrombin. The grey structure on the right is an experimentally determined structure of thrombin, an enzyme that acts as a catalyst in the blood clotting process. Blood clotting is this entire cascade of biological processes that results in blood cells aggregating, which obviously seals wounds, but when it misbehaves you get conditions like blood clots, thrombosis and strokes. So it's something that we need to be very aware of.

An example of a drug that targets thrombin for a therapeutic effect is a peptide known as hirudin, whose structure is shown. It's a naturally occurring peptide produced by leeches, which, as we know, feed on blood. In order to feed on blood, they need to prevent the blood from clotting, and their salivary glands naturally produce a peptide that binds to thrombin and stops the thrombin molecule interacting with other things, because it's already interacting with the hirudin, thereby preventing it from catalysing the blood clotting process. This makes hirudin useful as an anticoagulant, and indeed several anticoagulant drugs on the market are based on hirudin or chemical derivatives of it.

The second example is from a pathogen. The example I'm going to use here is the human immunodeficiency virus, HIV.
A key protein that plays a role in the HIV life cycle is a protein called HIV-1 protease. Now, a protease is an enzyme that breaks up a large chain of amino acids into distinct subunits. This is important for the life cycle of HIV because the proteins that are involved in the life cycle of HIV are produced as a single amino acid chain, so you have multiple proteins all joined together. In order for these proteins to be functional, they need to be split up into independent units, and that is the job of the HIV protease. Where my pointer is, it has this sort of groove or channel in the middle, and this is where it sticks to the peptide chain and breaks it up.

Now, the way antiretroviral treatments for HIV work is by inhibiting the function of HIV protease, thus preventing it from breaking up these proteins and therefore disrupting the life cycle of the virus. The way this works is that an inhibitor, and this is the molecular structure you see on the right, is designed to bind in that groove, in that binding site on the protease. The protease can't do anything to this molecule; it can't cleave it like a peptide chain. So it just stays stuck in there, preventing the HIV protease from sticking to the peptides that it's supposed to be cleaving, and thereby the drug disrupts the life cycle of the virus.
So those are just two examples of very different types of drug targets that we treat using small molecule drugs.

OK, so that's what a drug is. How do we actually develop drugs in practice? If you've been to any drug discovery talks before, you'll have seen a variant of this diagram in one form or another. The first thing to understand about the pharmaceutical development process is that it is a very long-winded, very expensive process. An enormous amount of time is invested just getting from identifying a target to having a candidate drug that binds that target. That initial phase is known as drug discovery, and it takes anywhere from a couple of years up to over 10 years, with an average of around four years across UK pharmaceutical companies. But even once you have such a candidate, you then have to go through several stages of preclinical animal models and clinical trials in order to verify that the drug works, that the drug is safe, and that the drug is effective enough to warrant any potential side effects. Each of these steps can take between one and two years and cost millions of pounds. So from target identification to actually having an approved drug on the market can take in excess of 10 years and cost well in excess of a billion pounds. It's this early-stage drug discovery process, where you develop drug candidates, that we're really going to focus on today.
The drug discovery process, once you have a target identified, is a cyclical process. You start from a collection of compounds that you have access to: compounds you can buy, compounds you can make in the lab, compounds somebody else can make for you, whatever. You take your library of compounds and screen that entire library, or a section of that library, against your biological target; the compounds in the library that bind to the target are called hits. So first you're trying to just identify hits, and then in subsequent stages you take your initial hits and try to optimise both their affinity for the target, how strongly they bind to that target, and also their selectivity, so that they don't bind to other targets. The side effects of medicines are often caused by a molecule also interacting in some way with a protein other than the intended target: off-target effects. So trying to balance affinity and selectivity is a really important part of this process. Once you have a molecule that you think has satisfactory affinity and selectivity, you then have to go into a further process where you optimise other desirable pharmacological properties, for example ensuring it's not toxic and that it doesn't aggregate, all the while retaining the desired affinity and selectivity. The diagram on the right really emphasises the iterative nature of this research.
You identify some hits, you check for toxicity, you optimise this property, you optimise that property, you check that you can actually make the compound, because it doesn't matter how good an inhibitor it is if you can't synthesise it, and so on. This can take many repetitive cycles, so it really is a very long, very expensive, multifaceted process.

Just to give an idea of how this is done in practice: the initial identification stage has traditionally been performed in a process known as high-throughput screening, where you have robots in a lab that rapidly test, or assay, very large numbers of chemical compounds against the biological target of interest, to see if any of them bind at all. Although advances in technology and methodology have continually increased the speed and efficiency and reduced the cost of this process, high-throughput screening in general is expensive: even if you can do it very quickly with certain set-ups, you need that set-up in place, you need the resources to do it, and you need the expertise to do it. So it's an incredibly laborious process, and you can start to understand why drug discovery is such a slow and expensive task indeed.

Something that you may have read headlines about at various points is the well-known productivity problem in the pharmaceutical industry.
It's been observed that despite continuous advances in technology and research methodology, and increasingly available resources, the productivity of the pharmaceutical industry has continued to decline. To put some solid numbers on that: in 2012, a paper by Scannell et al. showed that ever since 1950, the cost of bringing a new drug to market has doubled roughly every nine years. And indeed, if you look at more recent data, that trend continued from 2012 to 2021. So it's really quite terrifying, and there are a myriad of reasons for it. In part it can be attributed to market issues, such as the phenomenon known as "better than the Beatles": if you're designing a new drug, it doesn't just have to work, it has to work better than anything else that we have, and it has to work sufficiently better than anything else to be worth the investment, to be worth making, to be worth marketing. In addition to this, for a very good reason that I'll get on to in a moment, we've seen increasingly stringent requirements from government regulators to ensure the safety and efficacy of drugs. These fundamentally can't be addressed by just throwing computers at the problem. However, other problems, such as inefficient resource allocation and brute-forcing by throwing money at the problem, certainly contribute to this productivity crisis, and there we really can try to optimise the process to bring costs down.
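To make the "doubling every nine years" observation concrete, here is a toy calculation. The function name and the idea of expressing the trend as a growth factor are mine, added for this transcript; the only input taken from the talk is the nine-year doubling period.

```python
# Eroom's law as quoted in the talk: the cost of bringing a new drug to
# market has doubled roughly every nine years since 1950.

def cost_multiplier(years_elapsed: float, doubling_period: float = 9.0) -> float:
    """Factor by which development cost has grown after `years_elapsed`
    years, given a fixed doubling period."""
    return 2.0 ** (years_elapsed / doubling_period)

# From 1950 to the 2012 Scannell et al. paper is 62 years, implying costs
# roughly a hundredfold higher than in 1950.
multiplier_2012 = cost_multiplier(2012 - 1950)
```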
146 00:15:54,900 --> 00:16:01,490 Just an aside on why there are very good reasons for having regulations in place that that increase the 147 00:16:01,490 --> 00:16:09,840 cost and time taken to develop a drug is a historical drug called thalidomide that you may have heard of. 148 00:16:09,840 --> 00:16:20,490 Now, thalidomide was initially marketed as an over-the-counter sedative in the late 1950s in Europe of things like insomnia, anxiety and such like. 149 00:16:20,490 --> 00:16:28,680 And initially, it was noted as safe for use in pregnancy, however, glowing evidence in the late 1950s, 150 00:16:28,680 --> 00:16:35,700 early 1960s linked thalidomide to birth defects in children of mothers who had been taking 151 00:16:35,700 --> 00:16:41,830 thalidomide during pregnancy has led to most countries withdrawing its use in the early 1970s. 152 00:16:41,830 --> 00:16:52,140 However, precisely due to a lack of clear regulation, it remained in use in Spain well into the 1970s and possibly estimated. 153 00:16:52,140 --> 00:16:59,280 Anywhere between 10 and 20000 people are now affected by the horrific birth defects that were caused by the misuse. 154 00:16:59,280 --> 00:17:08,280 And it's really because there was really no regulation or formal requirements for proving efficacy or safety and drugs in the 1950s. 155 00:17:08,280 --> 00:17:15,060 Now, in the aftermath of the thalidomide tragedy, many countries introduced stricter regulations for drug testing and approval. 156 00:17:15,060 --> 00:17:15,660 So, for example, 157 00:17:15,660 --> 00:17:25,350 the U.K. Medicines Act of 1968 that required all current and future drug inefficacy was a direct consequence of the thalidomide disaster. 158 00:17:25,350 --> 00:17:32,850 And just put some numbers on this in. 
In the late 1950s and early 1960s, when this was happening, there were on the order of 30,000 to 40,000 drugs legally available in some form in the UK. By the start of the 1990s, when all of these drugs had finally been tested in accordance with the Medicines Act, only around 5,000 were licensed and approved for use. So a really terrifying number of drugs were just thrown onto the market with no real care for whether they were safe.

So there are very good reasons why we can't just try to cut back on the clinical trials phase; we can't save time or money there. So what can we do? Well, it turns out that very few candidates that enter clinical trials make it to the market, with most failing due to lack of efficacy or safety concerns. This in itself contributes enormously to costs, because a successful drug has to pay not only for its own development, but for all of the work, the optimisation and the development that went into the drugs that did fail. So one thing that we can do to try and address this productivity crisis is to try and replace the expensive and laborious steps preceding clinical trials with computational methods. There are really two aspects to this. The first is reducing the cost of designing drug candidates by automating processes. The second is improving the quality of the candidates that enter clinical trials. For example:
Can we predict beforehand that a molecule is going to be toxic? That immediately allows you to remove things from the clinical trials pool.

This brings us on to the concept of computer-aided drug design, which refers to any of a set of computational methods that are used in the preclinical drug discovery process in order to identify compounds and develop them into clinical drug candidates. Fundamentally, the goal of computer-aided drug design, or CADD as it's often known, is just to predict whether a molecule binds to a biological target and, if so, how strongly. In analogy to the high-throughput screening I mentioned previously, the process of applying computational methods to screen a large compound library is known as virtual screening. Just like traditional lab-based drug design, this is an iterative process: you perform virtual screening, then go and try to optimise your hits from the virtual screening, then go back to a computational method to see if you think it still binds. Again, this can carry on for quite a few iterations.

In this talk I'm really going to focus on the virtual screening task, but, and I'll mention this a few times, computational models have been successfully used for all sorts of tasks in computer-aided drug design: analysing properties such as, for example, trying to model how a compound is going to be metabolised.
Is it going to be toxic? Is it going to aggregate? All sorts of things. I'll give a few examples of this later on.

So, just to break down what virtual screening entails: you can typically divide virtual screening into two types of approaches. The first of these is ligand-based virtual screening, where you're using methods that are entirely based on the chemical properties of your molecules. Ligand-based virtual screening is the process of saying: OK, I already have some ligands that I know bind my target of interest, so can I use that information to screen all of my other compounds to see if anything else is also likely to bind my target of interest? Do I have anything that's similar to things that I know interact? So if you have some known binders for a target, you can directly go and apply ligand-based methods. In contrast to this, we also have structure-based virtual screening, which instead uses information about the 3D structure of the biological target to predict not only whether a molecule will bind, but, if so, where and how. How is it going to bind? What interactions does it make, and how strongly does it bind? These two families of methods use very different forms of information: if you have known ligands, you might use a ligand-based method; if you don't have any known ligands, but you do have a 3D structure of the protein, you might use a structure-based method.
And of course, there are some, quote unquote, hybrid methods that combine these two approaches when you have both of those forms of data available.

To give an idea of what ligand-based virtual screening entails: one of the key concepts in comparing and screening molecules computationally is that we need a way of representing a molecule in a computer and, given such a representation, a way of rationally comparing the similarity of molecules. One example of how this is done is a technique known as molecular fingerprinting. The idea is that we know the structure and the composition of our molecule; on here I have, as an example, the 2D structure of a paracetamol molecule. Different molecules have very different sizes and shapes, so it's not necessarily clear how to compare them analytically. Molecular fingerprinting looks at the structure of the molecule and what features are present in it, what atoms are next to each other, what groups are next to each other, and converts molecules of potentially very different size and composition into a fixed-length, finite-size bit vector, in which each bit encapsulates a certain functionality or part of the chemical structure.
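As a minimal sketch of this idea: the snippet below hashes structural features into a fixed-length bit vector and compares two such vectors with the Jaccard/Tanimoto coefficient. This is not any real fingerprint scheme (Morgan/ECFP fingerprints, for instance, derive their features from atom environments in the molecular graph); the hand-written feature strings and function names here are illustrative assumptions.

```python
import hashlib

def fingerprint(features, n_bits=128):
    """Toy molecular fingerprint: hash each structural feature into a
    fixed-length bit vector, so molecules of any size map to vectors of
    the same length. The feature strings stand in for the atom
    environments a real scheme would enumerate."""
    bits = [0] * n_bits
    for feature in features:
        digest = int(hashlib.sha1(feature.encode()).hexdigest(), 16)
        bits[digest % n_bits] = 1
    return bits

def tanimoto(a, b):
    """Jaccard/Tanimoto similarity of two bit vectors:
    shared on-bits divided by total on-bits."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x, y in zip(a, b) if x or y)
    return shared / total if total else 0.0

# Hand-written, illustrative feature sets for two hypothetical molecules:
query = fingerprint(["aromatic ring", "C-O", "N-C=O", "O-H"])
other = fingerprint(["aromatic ring", "C-O", "C-Cl"])
similarity = tanimoto(query, other)  # 0 (disjoint) up to 1 (identical)
```

Screening a library then amounts to computing `tanimoto` between the query fingerprint and every library fingerprint and keeping the top scorers.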
Once you have this sort of vector representation, you're in a very good position to apply any number of methods. Coming back to the point: similarity searching, for example, means computing these sorts of bit vectors for all of your molecules and then comparing your library of compounds to the fingerprints of your known molecules using a metric such as the Jaccard coefficient, also known as the Tanimoto coefficient, or some other similarity score. This sort of approach is known as similarity searching, and fingerprinting is just one way of representing a small molecule in a computer; that's how this is done in practice.

Then we have structure-based virtual screening, where instead we're trying to make use of the 3D structure of the target, so we might try to explore how the molecule might interact with the target. There are two main contrasting computational techniques that might be used for this. The first of these is a technique known as protein-ligand docking, where you try to rapidly sample possible bound conformations of the ligand, to see how you think it might bind. You might use, for example, a Monte Carlo search algorithm to do this, and then try to rapidly estimate the binding affinity using what's known as a scoring function; I'll go into more detail on that in just a little bit. The key idea is that you're trying to do this quickly, because you have millions of compounds to screen. In contrast to this sort of rapid-fire approach is molecular dynamics, which has all sorts of applications.
But in this context, you run physical simulations of the protein-ligand interactions to try and predict where the molecule wants to sit in the active site of the protein: you set off the simulation and let it decide where it wants to sit. Based off of that, you can try and gain an understanding of the dynamics of binding, because this is fundamentally a dynamic biological process, not a static snapshot, so you really want to understand those binding dynamics. And from that you can try, again using force fields, to actually compute the interaction energy, the binding affinity, between the protein and the ligand. A high-affinity ligand, something with a greater change in free energy upon binding, is more tightly bound: it doesn't want to separate. And that's what you're looking for in a binder, in a drug.

So, protein-ligand docking is much faster than molecular dynamics; it can be orders of magnitude faster, depending on how you configure things. But it sacrifices this dynamical information, and detailed, accurate free energy calculations, for speed. That is always the trade-off that you're making when you're trying to efficiently screen large numbers of compounds. In practice, protein-ligand docking is the most common structure-based technique used in drug discovery, just because it's efficient; you couldn't use molecular dynamics to screen millions of compounds.
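As a quick numerical aside on what "free energy of binding" means: the standard binding free energy relates to an experimentally measured dissociation constant Kd (in molar, relative to a 1 M standard state) by ΔG = RT ln(Kd). The helper name and the example Kd values below are my own, for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.0):
    """Standard binding free energy (kcal/mol) from a dissociation
    constant Kd in molar units: dG = RT ln(Kd). More negative means
    the complex is more tightly bound."""
    return R_KCAL * temperature_k * math.log(kd_molar)

tight = binding_free_energy(1e-9)  # a 1 nM binder: about -12 kcal/mol
weak = binding_free_energy(1e-4)   # a 100 uM binder: about -5.5 kcal/mol
```

Docking scoring functions try to approximate this quantity from a single static pose, while molecular dynamics free energy methods estimate it from sampled dynamics.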
243 00:26:17,180 --> 00:26:22,640 That would be crazy. So, more on protein-ligand docking, 244 00:26:22,640 --> 00:26:30,540 because one of the active areas of research is trying to use machine learning to improve it. 245 00:26:30,540 --> 00:26:34,200 Protein-ligand docking, in addition to its search algorithm, uses, 246 00:26:34,200 --> 00:26:38,040 as I said, what is known as a scoring function, which is a quick, 247 00:26:38,040 --> 00:26:44,160 dirty, approximate function that tries to estimate the free energy of binding based on a single static snapshot: this is where the protein is, 248 00:26:44,160 --> 00:26:46,080 this is where the ligand is. 249 00:26:46,080 --> 00:26:56,220 This gives you a quick estimate with which to rapidly assess the poses predicted by the docking algorithm and decide: 250 00:26:56,220 --> 00:26:58,890 do I think this is a reasonable pose? 251 00:26:58,890 --> 00:27:05,730 How strongly do I think it binds? And can I rank all of my different ligands by how strongly my scoring function thinks they bind? 252 00:27:05,730 --> 00:27:10,140 And so that lets you prioritise the things that you think bind more strongly. 253 00:27:10,140 --> 00:27:14,580 There are many, many pieces of docking software that are regularly used for this process. 254 00:27:14,580 --> 00:27:22,780 They all have different strengths and weaknesses; I'm not going to name names here, in case I upset certain people. 255 00:27:22,780 --> 00:27:30,800 So that was all a lot of theory, but just to give an example of what a protein-ligand docking result might look like in practice: 256 00:27:30,800 --> 00:27:34,900 if we go back to our example of thrombin inhibitors, we have the structure of thrombin in 257 00:27:34,900 --> 00:27:41,110 grey, and we have an experimentally determined binding pose of the inhibitor molecule in cyan, 258 00:27:41,110 --> 00:27:47,620 and this is determined by X-ray crystallography.
Now, just using a docking algorithm to try and sample that binding pose, 259 00:27:47,620 --> 00:27:53,560 the best result that was returned by the algorithm is the pose in magenta. 260 00:27:53,560 --> 00:28:01,750 And you can see that a lot of the structure aligns very well; apart from on the left, we have one ring that is clearly out of place. 261 00:28:01,750 --> 00:28:07,990 This is an example of the sort of quick-and-dirty docking and scoring process that gives you a rough idea of where the molecule sits. 262 00:28:07,990 --> 00:28:17,450 And based on that conformation, your scoring function will give you some estimate of the free energy of binding. 263 00:28:17,450 --> 00:28:25,370 So that's what docking might look like in practice today. The scoring function is really quite a key component of this process. 264 00:28:25,370 --> 00:28:32,390 A scoring function is any sort of approximate method that tries to estimate the free energy of binding. 265 00:28:32,390 --> 00:28:33,950 And classically, in docking, 266 00:28:33,950 --> 00:28:44,630 this is done as a sum of physical or empirical energy terms, the key property being that they're all easy to compute rapidly. 267 00:28:44,630 --> 00:28:53,240 This might include, for example, terms that represent van der Waals potentials, terms that represent electrostatic potentials, 268 00:28:53,240 --> 00:29:00,040 terms that try to quantify the energy of hydrophobic contacts, hydrogen-bonding terms, 269 00:29:00,040 --> 00:29:04,880 all sorts of things like this that might go on in molecular interactions. 270 00:29:04,880 --> 00:29:10,310 A very common thing to do is to define some of these terms, approximate them quickly, 271 00:29:10,310 --> 00:29:14,280 and then just use a linear regression to assign weights to each of these terms.
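That recipe, cheap energy terms combined by regression-fitted weights, can be sketched as follows. The term values and affinities below are invented so the fit is exact; a real scoring function would compute these terms from the 3D complex.

```python
# Toy illustration (invented numbers, not a real force field) of the
# classical empirical scoring-function recipe: compute a few cheap
# energy terms per complex, then fit linear-regression weights so the
# weighted sum best reproduces measured binding affinities.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_weights(terms, affinities):
    """Least-squares weights: solve the normal equations (X^T X) w = X^T y."""
    n = len(terms[0])
    XtX = [[sum(row[i] * row[j] for row in terms) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * y for row, y in zip(terms, affinities)) for i in range(n)]
    return solve(XtX, Xty)

# Per-complex terms: [van der Waals, hydrogen bonding, hydrophobic contact]
terms = [
    [1.0, 2.0, 0.5],
    [0.5, 1.0, 1.5],
    [2.0, 0.0, 1.0],
    [1.5, 1.5, 0.0],
]
# Affinities generated from weights (0.5, 1.0, 2.0), so the fit recovers them.
affinities = [0.5 * t[0] + 1.0 * t[1] + 2.0 * t[2] for t in terms]

w = fit_weights(terms, affinities)
print([round(x, 3) for x in w])  # → [0.5, 1.0, 2.0]
```

Once fitted, scoring a new pose is just a dot product of its terms with the weights, which is why this kind of function is fast enough for screening.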
272 00:29:14,280 --> 00:29:21,800 This weighted sum gives you the best estimate of binding affinity that you can compute rapidly, and the scoring function is really a multi-tool 273 00:29:21,800 --> 00:29:28,290 of structure-based drug discovery: it's used to determine whether a pose is physically reasonable, 274 00:29:28,290 --> 00:29:33,050 used to rank ligands by their likelihood of binding, and used to try and actually predict the strength 275 00:29:33,050 --> 00:29:39,890 of that binding, the binding affinity of that ligand. They really are used for a lot of different tasks. 276 00:29:39,890 --> 00:29:41,990 And that brings us on. 277 00:29:41,990 --> 00:29:53,480 So that's a very brief overview of the sorts of techniques that are used in computer-aided drug design, particularly in virtual screening. 278 00:29:53,480 --> 00:30:01,100 And with that in place, I'd like to finally talk about how machine learning methods are being used in drug discovery, 279 00:30:01,100 --> 00:30:04,050 particularly for this virtual screening process. 280 00:30:04,050 --> 00:30:09,680 For context, statistical modelling and machine learning are well-established tools in drug discovery, 281 00:30:09,680 --> 00:30:14,570 and I could give you an exhaustive list of things that people have done in the past 30 years. 282 00:30:14,570 --> 00:30:22,940 But just a few examples. Using representations such as the molecular fingerprints we introduced earlier as 283 00:30:22,940 --> 00:30:29,510 features for support vector machines has been successfully used for virtual screening, 284 00:30:29,510 --> 00:30:42,560 for example by Jabat et al. in 2018. An interesting example of substituting secondary assays with computational methods has been the 285 00:30:42,560 --> 00:30:48,280 use of decision tree classifiers to try and predict whether or not a molecule crosses the blood-brain barrier, 286 00:30:48,280 --> 00:30:54,800 which is a very important topic in pharmacology.
287 00:30:54,800 --> 00:31:00,680 And just as a third example of this, there was in 2006 288 00:31:00,680 --> 00:31:08,510 a very good paper on the use of naive Bayes classifiers to try and predict whether a molecule is likely to be toxic or not. 289 00:31:08,510 --> 00:31:19,340 So those are just a few examples of the things that people have historically applied machine learning to in this field. 290 00:31:19,340 --> 00:31:25,430 But in recent years, there's been a lot of interest in the use of machine learning in drug discovery, 291 00:31:25,430 --> 00:31:31,580 and arguably one of the big reasons for this is the ever-increasing quantity of data that's actually available: 292 00:31:31,580 --> 00:31:37,290 traditional methods, such as using a linear regression to fit a scoring function, 293 00:31:37,290 --> 00:31:43,190 simply can't leverage all the data that's available. So, just to give a feel for the sort of data that's available: 294 00:31:43,190 --> 00:31:47,660 among the publicly available databases there is a database known as ZINC, 295 00:31:47,660 --> 00:31:56,360 which contains 230 million purchasable compounds with 3D conformations, ready to use 296 00:31:56,360 --> 00:32:02,870 in docking, and a further 750 million compounds that are known to be commercially available. 297 00:32:02,870 --> 00:32:08,840 The idea being that you can take your compounds from ZINC, you can screen them, and you know you can go and buy them somewhere else. 298 00:32:08,840 --> 00:32:16,190 Another example of this is a database known as ChEMBL, which records biological assay data, 299 00:32:16,190 --> 00:32:23,900 so measuring: do things interact? It contains around 17 million recorded biological activities, 300 00:32:23,900 --> 00:32:30,140 you know, how strongly do things bind, for two million different compounds across around 14,000 targets.
301 00:32:30,140 --> 00:32:35,420 That really is an enormous amount of data that you might use to try and fit some predictive model. 302 00:32:35,420 --> 00:32:39,770 And thinking about structure-based drug discovery, a database known as 303 00:32:39,770 --> 00:32:44,930 PDBbind is the largest collection of solved structures of protein-ligand 304 00:32:44,930 --> 00:32:53,600 complexes; it contains around 18,000 of these complexes, ready to use. 305 00:32:53,600 --> 00:32:59,150 So that's a lot of data. But just to give a feel for 306 00:32:59,150 --> 00:33:11,290 whether this data really is representative of chemistry, and indeed whether any data can be representative of chemistry: 307 00:33:11,290 --> 00:33:15,880 one thing that's quite interesting to do is to say, OK, 308 00:33:15,880 --> 00:33:22,270 I know what properties a drug-like molecule typically exhibits, and that puts constraints on the size of the molecule. 309 00:33:22,270 --> 00:33:22,990 Based on that, 310 00:33:22,990 --> 00:33:31,540 you can use combinatorics to estimate the size of the space of molecules that could possibly exist and be drug-like. 311 00:33:31,540 --> 00:33:37,830 And a very common estimate of the size of this space is ten to the power of 60 molecules. 312 00:33:37,830 --> 00:33:46,650 That is enormous. That is impossibly enormous. Just to give an idea of how impossibly enormous that is: if you're boring like me, 313 00:33:46,650 --> 00:33:56,730 you can sit down and do a back-of-envelope estimate of how many atoms there are in the solar system, and arrive at a figure of around 10 to the 57. 314 00:33:56,730 --> 00:34:01,980 So there are potentially a thousand times as many drug-like molecules that you could possibly be 315 00:34:01,980 --> 00:34:10,530 interested in as there are atoms in the solar system. It is a physical impossibility to make all of these molecules.
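That back-of-envelope estimate is easy to reproduce. The constants below are rough, assumed values (solar-system mass dominated by the Sun, treated as pure hydrogen):

```python
# Back-of-envelope estimate (rough, assumed constants) of the number of
# atoms in the solar system, for comparison with the ~1e60 figure often
# quoted for the size of drug-like chemical space.

import math

solar_system_mass_kg = 2.0e30     # dominated by the Sun's mass
hydrogen_atom_mass_kg = 1.67e-27  # the Sun is mostly hydrogen

n_atoms = solar_system_mass_kg / hydrogen_atom_mass_kg
print(f"~10^{math.log10(n_atoms):.0f} atoms in the solar system")
print(f"chemical space is ~10^{60 - round(math.log10(n_atoms))} times larger")
```

Which lands on roughly 10^57 atoms, three orders of magnitude short of the estimated number of drug-like molecules.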
316 00:34:10,530 --> 00:34:17,720 So there is a very important question here of: can this data be relied on to be truly representative? 317 00:34:17,720 --> 00:34:25,280 The answer to that is an open question, and it's something that needs to be borne in mind. But nevertheless, 318 00:34:25,280 --> 00:34:30,390 the availability of this data has really spurred a lot of use of machine learning. 319 00:34:30,390 --> 00:34:35,420 Of course, machine learning methods require robust validation; it's not enough to fit a linear 320 00:34:35,420 --> 00:34:39,020 regression on one hundred data points and test on another hundred any more. 321 00:34:39,020 --> 00:34:48,170 Some examples of data sets that have been used for this in drug discovery include a database known as the Directory of Useful Decoys, 322 00:34:48,170 --> 00:34:57,750 which consists of 102 different protein targets, around 22,000 ligands spread across those targets, and around a million of what are known as decoys. 323 00:34:57,750 --> 00:35:03,080 These are molecules that are believed not to bind to those targets, 324 00:35:03,080 --> 00:35:11,010 the idea being that you now have a large data set that simulates the real-world situation of a large compound library with a small number of binders. 325 00:35:11,010 --> 00:35:17,210 And you can use this to test your algorithm, to see: does it rank the binders more highly than the non-binders? 326 00:35:17,210 --> 00:35:23,090 One of the obvious issues with this is potential biases in how you identify decoys. 327 00:35:23,090 --> 00:35:30,710 And several people, such as Rohrer and Baumann in 2009, have come up with various ways of ensuring that ligands 328 00:35:30,710 --> 00:35:35,300 are embedded next to decoys in chemical space, to make them hard to differentiate.
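A sketch of how such a benchmark is used in practice, with made-up scores: given scoring-function outputs for the known binders and the decoys, a ROC AUC measures how often a binder is ranked above a decoy.

```python
# Sketch of a DUD-style evaluation (invented scores, not real data):
# the AUC is the probability that a randomly chosen binder outscores a
# randomly chosen decoy, with ties counting half.

def roc_auc(binder_scores, decoy_scores):
    """Pairwise-comparison form of the ROC AUC."""
    wins = 0.0
    for b in binder_scores:
        for d in decoy_scores:
            if b > d:
                wins += 1.0
            elif b == d:
                wins += 0.5
    return wins / (len(binder_scores) * len(decoy_scores))

binders = [9.1, 7.4, 6.8]          # higher score = predicted stronger binding
decoys = [6.9, 5.2, 4.8, 3.3, 7.0]

print(round(roc_auc(binders, decoys), 3))  # → 0.867
```

An AUC of 0.5 would mean the scoring function ranks binders no better than chance; 1.0 would mean every binder outscores every decoy.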
329 00:35:35,300 --> 00:35:44,000 But it's very much an ongoing area of research. And finally, specific to the task of actually developing a good scoring function, 330 00:35:44,000 --> 00:35:48,260 Cheng et al. in 2009 started what's known as the Comparative Assessment of Scoring Functions, 331 00:35:48,260 --> 00:35:56,360 or CASF: a curated set of PDBbind-derived complexes with affinity measurements that 332 00:35:56,360 --> 00:36:00,890 you can use to directly measure how well your scoring function predicts binding affinity. 333 00:36:00,890 --> 00:36:06,520 And it's sort of become a de facto standard in the field, and something that will crop up again. 334 00:36:06,520 --> 00:36:12,790 OK, I want to focus in particular on the use of machine learning to develop scoring functions, 335 00:36:12,790 --> 00:36:15,730 because it's something I've worked on over the course of my DPhil, 336 00:36:15,730 --> 00:36:24,220 and it's still very much an active area of research. Just to establish why this problem in particular has drawn a lot of attention: 337 00:36:24,220 --> 00:36:28,450 the classical scoring functions used in docking are often very good at saying whether 338 00:36:28,450 --> 00:36:34,600 a predicted binding pose is good, and at identifying binders over non-binders. 339 00:36:34,600 --> 00:36:41,740 But the energies that they estimate often completely fail to correlate with the actual experimentally observed binding affinity, 340 00:36:41,740 --> 00:36:47,800 and so their application to actually measuring affinity is incredibly limited.
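A CASF-style evaluation boils down to a correlation between predicted and measured affinities. A minimal sketch, with illustrative numbers standing in for a real scoring function's output:

```python
# CASF-style evaluation sketch (illustrative numbers): score each complex
# with your scoring function, then compute the Pearson correlation between
# predicted and experimentally measured binding affinities.

import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

experimental = [4.2, 5.1, 6.3, 7.0, 8.4]   # e.g. measured pKd values
predicted = [4.0, 5.5, 5.9, 7.4, 8.1]      # scoring-function output

print(round(pearson(experimental, predicted), 3))  # → 0.971
```

The point made in the talk is exactly that classical scoring functions tend to produce a low value here even when they rank binders over non-binders well.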
341 00:36:47,800 --> 00:36:56,590 Now, in the last decade or so, starting around 2010, many different machine learning approaches, using all sorts of featurisations 342 00:36:56,590 --> 00:37:02,260 and algorithms, have been shown to consistently outperform these classical scoring functions at the affinity prediction task 343 00:37:02,260 --> 00:37:11,920 on common benchmark sets such as CASF. And just to emphasise: these all relied on engineered features, such as counting 344 00:37:11,920 --> 00:37:16,540 how many pairwise interactions there are between atoms in the protein and ligand, 345 00:37:16,540 --> 00:37:25,540 or fingerprints describing those protein-ligand interactions, so there's human input in engineering those features. 346 00:37:25,540 --> 00:37:26,470 And in addition to this, 347 00:37:26,470 --> 00:37:34,060 a lot of these methods appear to be strongly dependent on the data they're trained on, and often generalise poorly to unseen targets, 348 00:37:34,060 --> 00:37:41,740 which is not ideal, given that in the real world we're trying to screen ligands against a potentially novel drug target. 349 00:37:41,740 --> 00:37:46,090 And these scoring functions, although primarily optimised for predicting affinity, 350 00:37:46,090 --> 00:37:50,080 have been applied to the virtual screening classification task of identifying binders, 351 00:37:50,080 --> 00:37:56,360 but again, they underperform on an unseen, novel target. 352 00:37:56,360 --> 00:38:04,010 And finally, and this is quite an important concept here, most of these studies have relied on training and validating 353 00:38:04,010 --> 00:38:09,620 using only experimentally determined binding poses of ligands, determined by crystallography, 354 00:38:09,620 --> 00:38:15,800 and only a few have explored how models can be expected to perform on docked poses, even though in reality, 355 00:38:15,800 --> 00:38:20,240 in a virtual screening campaign, you don't have crystal structures of all of your protein-ligand complexes.
356 00:38:20,240 --> 00:38:26,270 Because if you did, you'd be fine, and you wouldn't need to screen them. 357 00:38:26,270 --> 00:38:29,630 And this leads on to some of the work I did during my DPhil. 358 00:38:29,630 --> 00:38:36,860 The first thing I'd like to briefly mention is that one of the things we looked at was combining structure-based and ligand-based methods 359 00:38:36,860 --> 00:38:45,170 using random forests. And in this figure, just as a brief illustration, don't worry about the details: 360 00:38:45,170 --> 00:38:50,720 the solid lines indicate the correlation obtained by a method combining structure-based and ligand-based 361 00:38:50,720 --> 00:38:55,520 information; the dotted lines indicate the corresponding method using only the structure-based information. 362 00:38:55,520 --> 00:39:00,210 We found that regardless of how you train and validate the model, a model using both structure-based and ligand-based 363 00:39:00,210 --> 00:39:07,070 information was consistently superior at predicting the protein-ligand binding affinity. 364 00:39:07,070 --> 00:39:11,300 However, regardless of the features and algorithm used, the same caveat applied: 365 00:39:11,300 --> 00:39:18,370 as mentioned previously, the similarity between your training and validation data had a strong influence on your model's performance. 366 00:39:18,370 --> 00:39:23,710 This is a problem that clearly needs to be addressed. 367 00:39:23,710 --> 00:39:30,550 One of the common criticisms of machine learning is its somewhat earned reputation as a black box. It's not entirely true, though: 368 00:39:30,550 --> 00:39:35,890 an advantage of the random forest algorithm, for example, is the ability to actually look at the importance of each feature in the model. 369 00:39:35,890 --> 00:39:43,690 And indeed, in our work,
what we found when we inspected this was that, again, regardless of how you train the model, 370 00:39:43,690 --> 00:39:48,930 both ligand-based and structure-based information was consistently found to be important in making these predictions. 371 00:39:48,930 --> 00:39:56,020 So on the right, the red and yellow bars are the importance of structure-based features, and the blue bars are the importance of ligand-based features. 372 00:39:56,020 --> 00:40:00,940 No matter how you train the model, you consistently see this combination of features 373 00:40:00,940 --> 00:40:07,210 being important, which suggests that they're capturing useful, orthogonal information. 374 00:40:07,210 --> 00:40:11,870 The next thing we looked at was this problem of, well, how do we do in the real world on docked poses, 375 00:40:11,870 --> 00:40:18,400 rather than crystal poses, allowing for the fact that the poses might not necessarily be that accurate? 376 00:40:18,400 --> 00:40:21,850 And again, don't worry about the details, but briefly: 377 00:40:21,850 --> 00:40:27,400 the solid lines correspond to a model that was trained and validated using experimentally determined poses; 378 00:40:27,400 --> 00:40:35,230 the dotted lines are the same model trained and validated using docked poses, some of which were really good, some of which were not so good. 379 00:40:35,230 --> 00:40:41,380 And again, what jumped out at us was, firstly, that regardless of how you trained and validated the model, 380 00:40:41,380 --> 00:40:48,280 the model using crystal poses always performed better than the model using docked poses, sometimes by very little, sometimes by quite a lot. 381 00:40:48,280 --> 00:40:56,830 So the pose quality clearly has an impact on the model, and using crystal poses gives you an overly optimistic estimate of how you're going to do in the real world.
382 00:40:56,830 --> 00:41:04,630 But in addition to this, what we also found was that the relative drop in performance of a hybrid method using both 383 00:41:04,630 --> 00:41:10,720 structure-based and ligand-based information was much smaller, when using docked poses, than that of the model 384 00:41:10,720 --> 00:41:16,200 using only structure-based information, which intuitively makes sense: information about the ligand is independent of the pose, 385 00:41:16,200 --> 00:41:24,160 so it appears that it helps to compensate for errors introduced by using these imperfect poses. 386 00:41:24,160 --> 00:41:29,350 And indeed, we had a look at what happened when we provided multiple training examples of the same ligand in different poses: 387 00:41:29,350 --> 00:41:39,490 does seeing different poses help or hurt the algorithm? What we actually found was that, firstly, the performance of the model, 388 00:41:39,490 --> 00:41:45,130 regardless of how it was trained, consistently dropped when it was given multiple examples of poses for a ligand, 389 00:41:45,130 --> 00:41:54,850 but also that, when this happened, the ligand-based features in the model became far more dominant in its ability to make predictions, 390 00:41:54,850 --> 00:42:01,090 corroborating this idea that when you have noise introduced through docking errors, 391 00:42:01,090 --> 00:42:07,060 leveraging ligand-based features that you know work can really help to recover your performance. 392 00:42:07,060 --> 00:42:08,500 And, you know, 393 00:42:08,500 --> 00:42:14,980 this illustrates the advantage of using more interpretable algorithms: it really allows you to drill into the model and see what's going on. 394 00:42:14,980 --> 00:42:25,650 Why is your performance being affected?
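The talk uses the random forest's built-in feature importances; as a simple stand-in, here is a permutation-importance sketch on a toy model with invented data, shuffling one feature at a time and measuring how much the error grows. A feature the model relies on hurts more when shuffled.

```python
# Permutation-importance sketch (toy model, invented data; the actual
# work used random-forest importances): y depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.

import random

random.seed(0)

X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + 0.5 * row[1] + random.gauss(0, 0.05) for row in X]

def model(row):
    """A stand-in 'trained' model: here we just use the generating weights."""
    return 3.0 * row[0] + 0.5 * row[1]

def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

baseline = mse(X, y)
importances = []
for feat in range(3):
    shuffled = [row[:] for row in X]
    col = [row[feat] for row in shuffled]
    random.shuffle(col)
    for row, v in zip(shuffled, col):
        row[feat] = v
    imp = mse(shuffled, y) - baseline  # error increase = importance
    importances.append(imp)
    print(f"feature {feat}: importance = {imp:.3f}")
```

Feature 0 comes out far more important than feature 1, and feature 2's importance is zero, matching how the data were generated.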
So, to sum up: what we've observed, and what others have found, is that your performance often depends greatly 395 00:42:25,650 --> 00:42:32,280 on the target of interest, and that generalising to novel data can be incredibly challenging, 396 00:42:32,280 --> 00:42:35,850 even if your model appears to perform well, which is really quite 397 00:42:35,850 --> 00:42:41,460 damning when you think about the real-world application to a novel drug target. 398 00:42:41,460 --> 00:42:43,990 What we also found was that 399 00:42:43,990 --> 00:42:51,250 the performance gained on standard benchmarks like CASF when you add more training data could often be attributed not to having more data, 400 00:42:51,250 --> 00:42:56,710 but to the data you add being similar in some way to the data in that benchmark, 401 00:42:56,710 --> 00:43:00,550 for example different ligands binding against the same protein as in that benchmark set. 402 00:43:00,550 --> 00:43:08,020 And as soon as you remove that similar data, even if you have more data, you're back to where you were with the smaller data set. 403 00:43:08,020 --> 00:43:12,370 So it's just an artificial performance gain that's masking what's really going on. 404 00:43:12,370 --> 00:43:15,610 I'd just like to sort of leave this there, 405 00:43:15,610 --> 00:43:24,430 but this is a really glaring problem, and one that clearly can't be addressed by simply scoring better on a standard benchmark. 406 00:43:24,430 --> 00:43:31,390 It's still very much an active area of research. So, conscious that we started a little bit late, 407 00:43:31,390 --> 00:43:40,420 I'd like to just quickly introduce some recent developments in applying deep learning to drug discovery, both, again, 408 00:43:40,420 --> 00:43:41,560 for this virtual screening task, 409 00:43:41,560 --> 00:43:51,180 but also some really interesting ideas about molecule generation that have been enabled by deep learning techniques.
410 00:43:51,180 --> 00:43:56,880 So clearly, even with machine learning scoring functions, they depend on well-engineered features, 411 00:43:56,880 --> 00:43:59,700 and even good features can introduce human biases. 412 00:43:59,700 --> 00:44:08,040 What we'd ideally like to be able to do is find some way of taking a raw representation of the data, without human bias, and have the model learn for itself 413 00:44:08,040 --> 00:44:12,690 what a molecule looks like, what a good interaction looks like, what a bad interaction looks like. 414 00:44:12,690 --> 00:44:22,810 And this is the natural application of deep learning: you let the model engineer features for itself, in a hierarchical manner. 415 00:44:22,810 --> 00:44:28,300 There have been all sorts of applications of deep learning to tasks in drug discovery: solubility prediction, 416 00:44:28,300 --> 00:44:33,250 toxicity prediction, predicting reaction outcomes for synthesis planning, 417 00:44:33,250 --> 00:44:39,700 again going back to the idea that it doesn't matter how good the ligand is if you can't synthesise it to test it, 418 00:44:39,700 --> 00:44:44,050 molecular design using reinforcement learning, so iteratively modifying the molecule 419 00:44:44,050 --> 00:44:49,210 to make it better, and also improving docking and virtual screening results. 420 00:44:49,210 --> 00:44:50,050 And in the last few minutes, 421 00:44:50,050 --> 00:45:00,350 I'm just going to talk about some recent work from the Oxford Protein Informatics Group that's touched on applying deep learning to drug discovery. 422 00:45:00,350 --> 00:45:10,370 The first piece of work that came out of the group was by a former member, Fergus Imrie, in 2018, where they used a convolutional neural network. 423 00:45:10,370 --> 00:45:17,890 The idea is as follows.
You can take a 3D structure of a protein-ligand complex and split it up into different atom types, for example, you know, 424 00:45:17,890 --> 00:45:23,890 where the aromatic carbons are in the structures of the protein and the ligand, and 425 00:45:23,890 --> 00:45:30,130 generate from that sort of voxel maps of the density of these atoms in the structure. 426 00:45:30,130 --> 00:45:36,790 And then you can treat these voxel maps as analogous to colour channels in an RGB image. 427 00:45:36,790 --> 00:45:41,050 So in an RGB image, you have three channels, one for red, one for green, one for blue, 428 00:45:41,050 --> 00:45:50,140 that build up the whole image. Here, you have these maps of where different atom types are, which together represent a full protein-ligand complex. 429 00:45:50,140 --> 00:45:53,510 And with this sort of 3D representation, you know, 430 00:45:53,510 --> 00:46:01,480 you now have a representation of the data that feeds quite naturally into the convolutional 431 00:46:01,480 --> 00:46:05,530 neural networks that have been applied so successfully in computer vision tasks: 432 00:46:05,530 --> 00:46:18,120 image recognition, video processing, things like this. And this was shown by David Koes and co-workers in 2017 to be quite effective for virtual screening. 433 00:46:18,120 --> 00:46:27,850 The piece of work that Fergus Imrie did was to take these networks and apply architectural advancements from computer vision, 434 00:46:27,850 --> 00:46:35,740 in this case using densely connected layers in the network, to see if this improved your ability to screen compounds, 435 00:46:35,740 --> 00:46:38,830 as it had been shown to improve the ability to classify images. 436 00:46:38,830 --> 00:46:45,280 And indeed, they found that by introducing these densely connected blocks, exactly as you would in computer vision, 437 00:46:45,280 --> 00:46:52,030 you immediately got an improvement in your virtual screening performance.
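Coming back to the voxel representation: a minimal sketch of the idea, with invented coordinates and a crude four-channel atom typing (real implementations use finer grids, many more channels, and density smoothing rather than hard counts):

```python
# Minimal voxelisation sketch (toy coordinates, crude atom typing):
# atoms of a protein-ligand complex are binned into a coarse 3D grid
# with one channel per atom type, analogous to RGB channels in an image.

GRID = 8       # 8 x 8 x 8 voxels
SIZE = 16.0    # box edge length covered along each axis, centred on the origin
CHANNELS = ["protein_carbon", "protein_nitrogen", "ligand_carbon", "ligand_oxygen"]

def voxelise(atoms):
    """atoms: list of (channel_index, x, y, z). Returns grid[c][i][j][k] counts."""
    grid = [[[[0.0] * GRID for _ in range(GRID)] for _ in range(GRID)]
            for _ in CHANNELS]
    for c, x, y, z in atoms:
        idx = [int((v + SIZE / 2) / SIZE * GRID) for v in (x, y, z)]
        if all(0 <= i < GRID for i in idx):  # atoms outside the box are dropped
            i, j, k = idx
            grid[c][i][j][k] += 1.0
    return grid

atoms = [
    (0, -3.2, 1.1, 0.4),   # protein carbon
    (1, -2.8, 1.5, 0.9),   # protein nitrogen
    (2, 0.6, -0.3, 0.2),   # ligand carbon
    (3, 1.4, -0.9, 0.8),   # ligand oxygen
    (2, 40.0, 0.0, 0.0),   # far outside the box: ignored
]
grid = voxelise(atoms)
print(sum(v for ch in grid for plane in ch for row in plane for v in row))  # → 4.0
```

The resulting 4-channel 8x8x8 tensor is exactly the shape of input a 3D convolutional network expects.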
438 00:46:52,030 --> 00:46:56,720 So it does seem that virtual screening can, in some sense, be treated as a computer vision task. However, 439 00:46:56,720 --> 00:47:01,190 some early analysis of these sorts of CNNs revealed that oftentimes the CNN was 440 00:47:01,190 --> 00:47:04,790 actually just using the channels that represent the ligand to make its predictions, 441 00:47:04,790 --> 00:47:11,270 you know, implicitly learning biases about ligands, even though it's not been given engineered information about them, 442 00:47:11,270 --> 00:47:14,570 and it wasn't actually using the structure of the protein. 443 00:47:14,570 --> 00:47:21,540 Some work by a current member of OPIG, Jack Scantlebury, however, showed that by augmenting the training data, 444 00:47:21,540 --> 00:47:29,120 taking the known binders, repositioning the ligand in obviously non-physical poses, and labelling these as non-binders, 445 00:47:29,120 --> 00:47:35,810 you force the network to classify them as incorrect, and thereby force it to use the structure 446 00:47:35,810 --> 00:47:41,800 of the protein to differentiate between these different poses of the bound ligand. 447 00:47:41,800 --> 00:47:48,240 So what they found was that if you don't do this augmentation, on the right, 448 00:47:48,240 --> 00:47:55,250 you basically get the same result whether you use the structure of the protein or not. 449 00:47:55,250 --> 00:48:02,420 However, in the figure in the middle, what they find is that if you do this data augmentation, 450 00:48:02,420 --> 00:48:06,110 you force the model to actually use the information about the structure, and 451 00:48:06,110 --> 00:48:15,290 your model generalises a lot better.
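The augmentation idea can be sketched like this. The coordinates and the uniform random shifts are toy choices; the actual work used docked and repositioned poses of real ligands.

```python
# Sketch of the pose-augmentation idea (toy coordinates, uniform random
# shifts): take a known binder, generate deliberately non-physical poses
# by translating the ligand away from the active site, and label those
# poses as non-binders.

import random

random.seed(1)

def translate(coords, dx, dy, dz):
    return [(x + dx, y + dy, z + dz) for x, y, z in coords]

def augment(ligand_coords, n_negatives=3, shift=10.0):
    """Return (coords, label) pairs: the true pose plus displaced negatives."""
    examples = [(ligand_coords, 1)]  # the experimentally determined pose: binder
    for _ in range(n_negatives):
        d = [random.uniform(shift, 2 * shift) * random.choice((-1, 1))
             for _ in range(3)]
        examples.append((translate(ligand_coords, *d), 0))  # non-binder label
    return examples

pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.1, 0.0)]
for coords, label in augment(pose):
    print(label, coords[0])
```

Each ligand now contributes one positive and several guaranteed-wrong poses, so the network can no longer score the complex from the ligand channels alone.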
And this is what that looks like: in the third image on the bottom row here, 452 00:48:15,290 --> 00:48:22,280 we have a ligand sitting in the active site of a target, and 453 00:48:22,280 --> 00:48:27,830 parts of this image correspond to parts of the structure that were masked during the test process. 454 00:48:27,830 --> 00:48:37,400 Areas in green contributed favourably to the prediction, and areas in red very unfavourably. 455 00:48:37,400 --> 00:48:49,220 What we see is that the network trained with augmentation picks out the favourable hydrogen bonds, indicated by the yellow dots, as green areas contributing positively. 456 00:48:49,220 --> 00:48:55,130 So it's learning to make the prediction for the right reason. However, in the other images, where you're not using this form of data augmentation, 457 00:48:55,130 --> 00:49:00,230 you're still getting correct predictions, but you're making these predictions for the wrong reasons: 458 00:49:00,230 --> 00:49:08,570 you're not actually learning that those interactions are there. So clearly, this is a key component of training these models. 459 00:49:08,570 --> 00:49:14,030 And again, conscious that we're running a little bit late, I'm going to gloss over this little bit at the end, 460 00:49:14,030 --> 00:49:18,170 but something that I've been working on recently with a Part II student, 461 00:49:18,170 --> 00:49:22,040 Oliver Turnbull, in the department, is this idea of: well, OK, 462 00:49:22,040 --> 00:49:28,190 we've seen that this sort of data augmentation helps your model generalise for the virtual screening classification task. 463 00:49:28,190 --> 00:49:33,770 Can we leverage that to perform better at a regression task? Now, in regression, 464 00:49:33,770 --> 00:49:39,380 it's not clear how you would label a non-physical pose, because you can't just label it as binder or non-binder.
465 00:49:39,380 --> 00:49:44,300 You have to give it a binding affinity value, and it doesn't make sense to assign that to a non-physical pose. 466 00:49:44,300 --> 00:49:52,340 So how you do that data augmentation is not clear. But what you can do is use transfer learning: take the model that was trained for the virtual 467 00:49:52,340 --> 00:49:59,390 screening classification task, which has acquired that generalisability, and then fine-tune 468 00:49:59,390 --> 00:50:06,920 the final layer using a regression dataset such as PDBbind, training that final layer for 469 00:50:06,920 --> 00:50:14,000 the regression task, and see if you actually retain the benefit of that data augmentation. 470 00:50:14,000 --> 00:50:17,530 And this is something that Oliver looked at recently. 471 00:50:17,530 --> 00:50:25,430 So, if you perform the same masking process using a network 472 00:50:25,430 --> 00:50:28,940 that was fine-tuned from Jack Scantlebury's network, 473 00:50:28,940 --> 00:50:33,500 you get something like the image on the left, where we have a ligand that we know binds with a certain binding affinity. 474 00:50:33,500 --> 00:50:39,980 And again, we can mask atoms: atoms that appear in green 475 00:50:39,980 --> 00:50:45,980 are those that contributed favourably to the final affinity prediction, and atoms in red are those that contributed unfavourably. 476 00:50:45,980 --> 00:50:50,990 What we see on the left is that when we fine-tune a model that's had that data augmentation applied, 477 00:50:50,990 --> 00:50:54,260 the model is correctly learning where the important hydrogen bonds are.
478 00:50:54,260 --> 00:51:02,030 It's clearly rewarding those interactions being present, and penalising parts of the molecule that don't contribute 479 00:51:02,030 --> 00:51:07,760 to those sorts of interactions. Whereas when we just train the same neural network from scratch, purely for affinity prediction, with 480 00:51:07,760 --> 00:51:12,860 no data augmentation, what we see, even though we got a correct prediction for this compound, 481 00:51:12,860 --> 00:51:18,740 is that actually there's no real rhyme or reason to which atoms or parts 482 00:51:18,740 --> 00:51:22,400 of the protein the network thought were important for predicting binding affinity. 483 00:51:22,400 --> 00:51:29,960 For example, up here, what should be an important bond is marked as completely unfavourable, and down here, 484 00:51:29,960 --> 00:51:36,020 what should be an important patch on the protein surface, again, is not contributing strongly. 485 00:51:36,020 --> 00:51:45,520 So clearly, the network can actually retain some of that information from the virtual screening data augmentation process 486 00:51:45,520 --> 00:51:49,630 when you fine-tune it for this regression task. 487 00:51:49,630 --> 00:51:58,790 This seems to be a really promising lead for actually improving your ability to predict binding affinity in this virtual screening setting. 488 00:51:58,790 --> 00:52:09,190 So, I was going to talk a little bit more, but I want to make sure we stop at a reasonable time, 489 00:52:09,190 --> 00:52:11,410 so I'm going to gloss over that.
But if you want to ask about it, 490 00:52:11,410 --> 00:52:19,090 feel free. Just to emphasise what we've learnt so far from all the experiments that people 491 00:52:19,090 --> 00:52:25,720 have done on this topic: machine learning methods at this stage are ubiquitous in drug discovery, 492 00:52:25,720 --> 00:52:31,750 and they often outperform traditional methods, either in accuracy or in accomplishing the same task faster and cheaper: 493 00:52:31,750 --> 00:52:34,990 for example, virtual screening versus high-throughput screening, 494 00:52:34,990 --> 00:52:42,420 or computational synthesis planning versus having a chemist plan out the synthesis of every single compound. 495 00:52:42,420 --> 00:52:50,730 And although they have a reputation as uninterpretable black boxes, using appropriately chosen algorithms or sensible 496 00:52:50,730 --> 00:52:54,630 approaches to exploring your data can actually make your model quite interpretable, 497 00:52:54,630 --> 00:52:57,540 giving you insight into how predictions are being made and into 498 00:52:57,540 --> 00:53:05,970 whether the model is actually learning the underlying biophysics or simply spurious correlations in the data. 499 00:53:05,970 --> 00:53:11,280 And on that note, deep learning, provided you perform careful data 500 00:53:11,280 --> 00:53:18,630 augmentation and training of the model, can enable virtual screening with a model that learns directly from the 501 00:53:18,630 --> 00:53:24,930 underlying data what favourable interactions look like, without inheriting human biases from engineered features, 502 00:53:24,930 --> 00:53:30,310 however sensible those features might be. 503 00:53:30,310 --> 00:53:33,900 And something I'd like to mention, even though we don't have time to talk about it, 504 00:53:33,900 --> 00:53:41,310 is generative models such as gated graph neural networks, which are used to build up graphs.
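To make the molecule-as-graph idea concrete, here is a minimal sketch of the representation such generative models operate on. The dictionary encoding and the `add_atom` step are invented for illustration: a gated graph neural network would learn to propose elaboration steps like this rather than apply a hard-coded rule.

```python
# A molecule as a plain graph: nodes are atoms, edges are bonds.
# Ethanol's heavy-atom skeleton: C-C-O.
molecule = {
    "atoms": ["C", "C", "O"],   # node labels, indexed 0..2
    "bonds": [(0, 1), (1, 2)],  # undirected edges between atom indices
}

def add_atom(mol, element, attach_to):
    """One elaboration step: grow the graph by one atom bonded to an existing one."""
    new_index = len(mol["atoms"])
    mol["atoms"].append(element)
    mol["bonds"].append((attach_to, new_index))
    return new_index

# Attach a carbon to the middle atom, giving the heavy-atom
# skeleton of propan-2-ol.
idx = add_atom(molecule, "C", attach_to=1)
print(molecule["atoms"])  # ['C', 'C', 'O', 'C']
```

Repeatedly choosing where to attach the next atom or bond, with the choices scored by a learned model, is what lets these methods walk through chemical space one small edit at a time.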
505 00:53:41,310 --> 00:53:45,300 You can represent a molecule as a graph and use something like a gated graph neural 506 00:53:45,300 --> 00:53:51,690 network to rapidly elaborate on that graph to generate possible new molecules. 507 00:53:51,690 --> 00:53:58,560 Coming back to this idea of chemical space being impossibly vast, this gives you another effective 508 00:53:58,560 --> 00:54:04,380 tool for exploring chemical space in a way that a chemist might not be able to on paper. 509 00:54:04,380 --> 00:54:11,760 All of this work also raises some important questions and challenges. As I alluded to when I spoke about the size of chemical space, 510 00:54:11,760 --> 00:54:18,450 there is the question of whether our available data is sufficient for our purposes, and sufficiently 511 00:54:18,450 --> 00:54:23,400 representative of chemical space, to actually allow us to train truly generalisable 512 00:54:23,400 --> 00:54:30,490 models that can perform these sorts of tasks without inheriting human biases. 513 00:54:30,490 --> 00:54:39,370 An important question that sort of underpins all of this is: can we currently rely on docking to generate binding poses 514 00:54:39,370 --> 00:54:45,200 that are accurate and useful enough to enable all of these machine learning models to work effectively? 515 00:54:45,200 --> 00:54:51,460 A common maxim in machine learning, of course, is garbage in, garbage out, and it most certainly applies here. 516 00:54:51,460 --> 00:54:57,790 Another very important question that I alluded to previously concerns molecular dynamics: all of these methods for virtual screening 517 00:54:57,790 --> 00:55:04,400 and predicting binding affinity fundamentally operate from a static snapshot of the protein-ligand complex, a single docked pose. 518 00:55:04,400 --> 00:55:08,530 But in reality, molecular interaction is a very dynamic biological process.
519 00:55:08,530 --> 00:55:17,980 You know, things are wobbling around. So does it really make sense to expect to be able to solve this problem using a single static snapshot? 520 00:55:17,980 --> 00:55:23,170 Or do we need to explore these dynamic processes more? 521 00:55:23,170 --> 00:55:27,790 And finally, we've seen a lot of promising work with deep learning methods 522 00:55:27,790 --> 00:55:33,010 being able not only to screen molecules but indeed to generate new molecules. 523 00:55:33,010 --> 00:55:41,470 So can we expect deep learning methods to fully remove the need for slow and costly human involvement in things like design and synthesis planning? 524 00:55:41,470 --> 00:55:47,530 Or do we still need the expert human on hand to guide this process? 525 00:55:47,530 --> 00:55:52,530 I'd just like to wrap up there. 526 00:55:52,530 --> 00:55:55,960 So I'd like to thank the entire Oxford Protein Informatics Group, 527 00:55:55,960 --> 00:56:04,170 which you can see all looking very professional on the right-hand side, but in particular former and present members Fergus Imrie, 528 00:56:04,170 --> 00:56:12,840 Tom Hadfield, Jack Scantlebury and Oliver Turnbull, whose work underpins a lot of the things I've spoken about today, 529 00:56:12,840 --> 00:56:18,870 particularly on the deep learning side of things. And thanks to Garrett and Beverley for inviting me to speak today. 530 00:56:18,870 --> 00:56:25,410 And thanks to all of you for listening. So I'd like to leave it there. 531 00:56:25,410 --> 00:56:29,820 And since we have some time available, I'd be happy to answer any questions you might have. 532 00:56:29,820 --> 00:56:32,800 Thank you. OK, thank you very much, Fergus. 533 00:56:32,800 --> 00:56:44,000 That was wonderful, and I invite everyone to use your yellow or whatever preferred colours and symbols to indicate your appreciation.
534 00:56:44,000 --> 00:56:49,058 I will stop recording now.