OK, I have to consent to being recorded. OK, I consent. Excellent, right. So, just by way of introduction: my name is Fergus Boyles, and I'm a research software engineer in the Oxford Protein Informatics Group here in the Department of Statistics. Prior to that, I was a student in the Department of Statistics, so I've been around for a while.

I'm going to talk to you a little bit about the use of machine learning in drug discovery. I'm going to touch on quite a few topics, both things that I've worked on and things that other people have worked on, but the latter half of the talk is really going to focus on examples of research done either by myself or by other members of the Oxford Protein Informatics Group. This is by no means intended to be an exhaustive discussion of all of the things that people have done using machine learning in drug discovery; that would take an entire course, and you probably still wouldn't cover all of it. But I hope that this will give you some idea of how computational methods in general, and machine learning in particular, are really benefiting the drug discovery process, and what the practical implications of that are.

So, just a brief overview of what I want to talk about. I'm aware that this is very much a statistics audience, so I'm going to open with an introduction to the drug discovery process: what it entails, why we care about it, and why we might want to apply computational methods to it at all.
Then I'll briefly introduce the concept of computer-aided drug design and discuss what sort of computational methods have historically been employed, prior to the machine learning hype train, or revolution, depending on your stance on it. Then I'll give some examples of well-established machine learning techniques in drug discovery: what they're used for, what sort of problems they solve, and why they're beneficial over other methods. Towards the end, I'd like to spend some time highlighting recent developments in drug discovery that have really benefited from recent advancements in deep learning techniques.

Just before I start, I'm aware that quite a few people in this audience are either members of OPIG or have been through the doctoral training centre in some capacity, so you may have seen the introduction-to-drug-discovery talk anywhere between one and 50 times before. If you want to tune out for the first ten minutes, I won't be offended.

So, to get us started: what is drug discovery? To really answer that question, we first need to understand what a drug is. We intuitively think of a drug as a medicine: we take a medicine to cure ourselves. But what actually are drugs? Biological processes, infection and disease are just the result of the behaviour of macromolecules in the body. Proteins perform pretty much every task in the body.
Proteins have specific functions. If they're carrying out their function correctly, the body is doing OK. If a protein starts to misbehave, functions too much or too little, or a foreign body like a virus introduces a foreign protein into the body, bad things can happen. This is how you get diseases; this is how you get the symptoms of infection. So the key to trying to treat or manage diseases or infections is really trying to figure out what is causing the problem, and how we can either make that molecule behave properly or stop that molecule doing the thing it's not supposed to do.

In pharmaceutical research, we have this concept of a drug target. A drug target is a key molecule, typically a protein or occasionally a nucleic acid, that has been implicated in an infection or disease. This can be, as I said, a protein in the body that's misbehaving, or a protein that's part of, for example, the life cycle of a virus, and I'll get to an example of both of these in just a moment. There are all sorts of ways of identifying and validating whether a target is indeed implicated in a condition that I'm not going to go into today; that's really a topic of research in and of itself. But the key idea is that in order to treat a disease, we want to target, usually, a protein, occasionally a nucleic acid, in the body and alter or inhibit its function.

Now, in pharmacology at least, a drug is any molecule
that interacts with a drug target in order to obtain a therapeutic effect. That therapeutic effect could be alleviating a condition, managing symptoms, or restoring the function of a protein; it could be treating an infection by disrupting the life cycle of a pathogen. It's really a broad catch-all term.

Now, just to distinguish between different types of drugs, because it's an incredibly broad umbrella term, I'd like to distinguish between two key fundamental classes of drugs. The first is small molecule drugs: things such as paracetamol, anything that you take in tablet form, for example. These are small chemical compounds that are typically produced by chemical synthesis. In contrast to this, we have a class of drugs known as biopharmaceuticals, an incredibly broad category of drugs that are extracted, synthesised or otherwise obtained from biological sources. The obvious topical example of this is a vaccine. These can potentially be very large molecules: an antibody, for example, is an entire protein and is much larger than something like a paracetamol molecule. Today, I'm going to focus just on small molecule drug discovery, but be aware that there is an enormous field of different applications of computational methods in medical research.

So, to give an example of what a target is and how a drug functions,
I'd like to start with an example of a protein in the human body. This is a protein called thrombin. The grey structure on the right is an experimentally determined structure of thrombin, an enzyme that acts as a catalyst in the blood clotting process. Blood clotting is this entire cascade of biological processes that results in blood cells aggregating, which obviously seals wounds, but when it misbehaves you get conditions like blood clots, thrombosis and strokes. So it's something that we need to be very aware of.

An example of a drug that targets thrombin for a therapeutic effect is a peptide known as hirudin, whose structure is shown. It's a naturally occurring peptide produced by leeches, which, as we know, feed on blood. In order to feed on blood, they need to prevent the blood from clotting, and their salivary glands naturally produce a peptide that binds to thrombin and stops the thrombin molecule interacting with other things, because it's already interacting with the hirudin, thereby preventing it from catalysing the blood clotting process. This makes hirudin useful as an anticoagulant, and indeed several anticoagulant drugs on the market are based on hirudin or chemical derivatives of it.

The second example is from a pathogen. The example I'm going to use here is the human immunodeficiency virus, HIV.
A key protein that plays a role in the HIV life cycle is a protein called HIV-1 protease. Now, a protease is an enzyme that breaks up a large chain of amino acids into distinct subunits. This is important for the life cycle of HIV because the proteins that are involved in the life cycle of HIV are produced as a single amino acid chain, so you have multiple proteins all joined together. In order for these proteins to be functional, they need to be split up into independent units, and that is the job of the HIV protease. Where my pointer is, it has this sort of groove or channel in the middle, and this is where it sticks to the peptide chain and breaks it up.

Now, the way antiretroviral treatments for HIV work is by inhibiting the function of HIV protease, thus preventing it from breaking up these proteins and therefore disrupting the life cycle of the virus. The way this works is that an inhibitor, and this is the molecular structure you see on the right, is designed to bind in that groove, in that binding site on the protease. The protease can't do anything to this molecule; it can't cleave it like a peptide chain. So it just stays stuck in there, preventing the HIV protease from sticking to the peptides that it's supposed to be cleaving, and thereby the drug disrupts the life cycle of the virus.
So those are just two examples of very different types of drug targets that we treat using small molecule drugs.

OK, so that's what a drug is. How do we actually develop drugs in practice? If you've been to any drug discovery talks before, you'll have seen a variant of this diagram in one form or another. The first thing to understand about the pharmaceutical development process is that it is a very long-winded, very expensive process. An enormous amount of time is invested just getting from identifying a target to having a candidate drug that binds that target. That initial phase is known as drug discovery, and it takes anywhere from a couple of years up to over 10 years, with an average of around four years across UK pharmaceutical companies. But even once you have such a candidate, you then have to go through several stages of preclinical animal models and clinical trials in order to verify that the drug works, that the drug is safe, and that the drug is effective enough to warrant any potential side effects. Each of these steps can take between one and two years and cost millions of pounds. So from target identification to actually having an approved drug on the market can take in excess of 10 years and cost well in excess of a billion pounds. It's this early-stage drug discovery process, where you develop drug candidates, that we're really going to focus on today.
The drug discovery process, once you have a target identified, is a cyclical process. You start from a collection of compounds that you have access to: compounds you can buy, compounds you can make in the lab, compounds somebody else can make for you, whatever. You take your library of compounds and screen that entire library, or a section of that library, against your biological target; the compounds in the library that bind to the target are called hits. So first you're trying to just identify hits, and then in subsequent stages you take your initial hits and try to optimise both their affinity for the target, how strongly they bind to that target, and also their selectivity, so that they don't bind to other targets. The side effects of medicines are often caused by a molecule also interacting in some way with a protein other than the intended target: off-target effects. So trying to balance affinity and selectivity is a really important part of this process. Once you have a molecule that you think has satisfactory affinity and selectivity, you then have to go into a further process where you optimise other desirable pharmacological properties, for example ensuring it's not toxic and that it doesn't aggregate, all the while retaining the desired affinity and selectivity. The diagram on the right really emphasises the iterative nature of this research.
You identify some hits, you check for toxicity, you optimise this property, you optimise that property, you check that you can actually make the compound, because it doesn't matter how good an inhibitor it is if you can't synthesise it, and so on. This can take many repetitive cycles, so it really is a very long, very expensive, multifaceted process.

Just to give an idea of how this is done in practice: the initial identification stage has traditionally been performed in a process known as high-throughput screening, where you have robots in a lab that rapidly test, or assay, very large numbers of chemical compounds against the biological target of interest, to see if any of them bind at all. Although advances in technology and methodology have continually increased the speed and efficiency and reduced the cost of this process, high-throughput screening in general is expensive: even if you can do it very quickly with certain set-ups, you need that set-up in place, you need the resources to do it, and you need the expertise to do it. So it's an incredibly laborious process, and you can start to understand why drug discovery is such a slow and expensive task indeed.

Something that you may have read headlines about at various points is the well-known productivity problem in the pharmaceutical industry.
It's been observed that despite continuous advances in technology and research methodology, and increasingly available resources, the productivity of the pharmaceutical industry has continued to decline. To put some solid numbers on that: in 2012, a paper by Scannell et al. showed that ever since 1950, the cost of bringing a new drug to market has doubled roughly every nine years. And indeed, if you look at more recent data, that trend continued from 2012 to 2021. So it's really quite terrifying, and there are a myriad of reasons for it. In part it can be attributed to market issues, such as the phenomenon known as "better than the Beatles": if you're designing a new drug, it doesn't just have to work, it has to work better than anything else that we have, and it has to work sufficiently better than anything else to be worth the investment, to be worth making, to be worth marketing. In addition to this, for a very good reason that I'll get on to in a moment, we've seen increasingly stringent requirements from government regulators to ensure the safety and efficacy of drugs. These fundamentally can't be addressed by just throwing computers at the problem. However, other problems, such as inefficient resource allocation and brute-forcing by throwing money at the problem, certainly contribute to this productivity crisis, and there we really can try to optimise the process to bring costs down.
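To make the "doubling every nine years" observation concrete, here is a toy calculation. The function name and the idea of expressing the trend as a growth factor are mine, added for this transcript; the only input taken from the talk is the nine-year doubling period.

```python
# Eroom's law as quoted in the talk: the cost of bringing a new drug to
# market has doubled roughly every nine years since 1950.

def cost_multiplier(years_elapsed: float, doubling_period: float = 9.0) -> float:
    """Factor by which development cost has grown after `years_elapsed`
    years, given a fixed doubling period."""
    return 2.0 ** (years_elapsed / doubling_period)

# From 1950 to the 2012 Scannell et al. paper is 62 years, implying costs
# roughly a hundredfold higher than in 1950.
multiplier_2012 = cost_multiplier(2012 - 1950)
```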
146 00:15:54,900 --> 00:16:01,490 Just an aside on why there are very good reasons for having regulations in place that that increase the 147 00:16:01,490 --> 00:16:09,840 cost and time taken to develop a drug is a historical drug called thalidomide that you may have heard of. 148 00:16:09,840 --> 00:16:20,490 Now, thalidomide was initially marketed as an over-the-counter sedative in the late 1950s in Europe of things like insomnia, anxiety and such like. 149 00:16:20,490 --> 00:16:28,680 And initially, it was noted as safe for use in pregnancy, however, glowing evidence in the late 1950s, 150 00:16:28,680 --> 00:16:35,700 early 1960s linked thalidomide to birth defects in children of mothers who had been taking 151 00:16:35,700 --> 00:16:41,830 thalidomide during pregnancy has led to most countries withdrawing its use in the early 1970s. 152 00:16:41,830 --> 00:16:52,140 However, precisely due to a lack of clear regulation, it remained in use in Spain well into the 1970s and possibly estimated. 153 00:16:52,140 --> 00:16:59,280 Anywhere between 10 and 20000 people are now affected by the horrific birth defects that were caused by the misuse. 154 00:16:59,280 --> 00:17:08,280 And it's really because there was really no regulation or formal requirements for proving efficacy or safety and drugs in the 1950s. 155 00:17:08,280 --> 00:17:15,060 Now, in the aftermath of the thalidomide tragedy, many countries introduced stricter regulations for drug testing and approval. 156 00:17:15,060 --> 00:17:15,660 So, for example, 157 00:17:15,660 --> 00:17:25,350 the U.K. Medicines Act of 1968 that required all current and future drug inefficacy was a direct consequence of the thalidomide disaster. 158 00:17:25,350 --> 00:17:32,850 And just put some numbers on this in. 
In the late 1950s and early 1960s, when this was happening, there were on the order of 30,000 to 40,000 drugs legally available in some form in the UK. By the start of the 1990s, when all of these drugs had finally been tested in accordance with the Medicines Act, only around 5,000 were licensed and approved for use. So a really terrifying number of drugs were just thrown onto the market with no real care for whether they were safe.

So there are very good reasons why we can't just try to cut back on the clinical trials phase; we can't save time or money there. So what can we do? Well, it turns out that very few candidates that enter clinical trials make it to the market, with most failing due to lack of efficacy or safety concerns. This in itself contributes enormously to costs, because a successful drug has to pay not only for its own development, but for all of the work, the optimisation and the development that went into the drugs that did fail. So one thing that we can do to try and address this productivity crisis is to try and replace the expensive and laborious steps preceding clinical trials with computational methods. There are really two aspects to this. The first is reducing the cost of designing drug candidates by automating processes. The second is improving the quality of the candidates that enter clinical trials. For example:
Can we predict beforehand that a molecule is going to be toxic? That immediately allows you to remove things from the clinical trials pool.

This brings us on to the concept of computer-aided drug design, which refers to any of a set of computational methods that are used in the preclinical drug discovery process in order to identify compounds and develop them into clinical drug candidates. Fundamentally, the goal of computer-aided drug design, or CADD as it's often known, is just to predict whether a molecule binds to a biological target and, if so, how strongly. In analogy to the high-throughput screening I mentioned previously, the process of applying computational methods to screen a large compound library is known as virtual screening. Just like traditional lab-based drug design, this is an iterative process: you perform virtual screening, then go and try to optimise your hits from the virtual screening, then go back to a computational method to see if you think it still binds. Again, this can carry on for quite a few iterations.

In this talk I'm really going to focus on the virtual screening task, but, and I'll mention this a few times, computational models have been successfully used for all sorts of tasks in computer-aided drug design: analysing properties such as, for example, trying to model how a compound is going to be metabolised.
Is it going to be toxic? Is it going to aggregate? All sorts of things. I'll give a few examples of this later on.

So, just to break down what virtual screening entails: you can typically divide virtual screening into two types of approaches. The first of these is ligand-based virtual screening, where you're using methods that are entirely based on the chemical properties of your molecules. Ligand-based virtual screening is the process of saying: OK, I already have some ligands that I know bind my target of interest, so can I use that information to screen all of my other compounds to see if anything else is also likely to bind my target of interest? Do I have anything that's similar to things that I know interact? So if you have some known binders for a target, you can directly go and apply ligand-based methods. In contrast to this, we also have structure-based virtual screening, which instead uses information about the 3D structure of the biological target to predict not only whether a molecule will bind, but, if so, where and how. How is it going to bind? What interactions does it make, and how strongly does it bind? These two families of methods use very different forms of information: if you have known ligands, you might use a ligand-based method; if you don't have any known ligands, but you do have a 3D structure of the protein, you might use a structure-based method.
And of course, there are some, quote unquote, hybrid methods that combine these two approaches when you have both of those forms of data available.

To give an idea of what ligand-based virtual screening entails: one of the key concepts in comparing and screening molecules computationally is that we need a way of representing a molecule in a computer and, given such a representation, a way of rationally comparing the similarity of molecules. One example of how this is done is a technique known as molecular fingerprinting. The idea is that we know the structure and the composition of our molecule; on here I have, as an example, the 2D structure of a paracetamol molecule. Different molecules have very different sizes and shapes, so it's not necessarily clear how to compare them analytically. Molecular fingerprinting looks at the structure of the molecule and what features are present in it, what atoms are next to each other, what groups are next to each other, and converts molecules of potentially very different size and composition into a fixed-length, finite-size bit vector, in which each bit encapsulates a certain functionality or part of the chemical structure.
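As a minimal sketch of this idea: the snippet below hashes structural features into a fixed-length bit vector and compares two such vectors with the Jaccard/Tanimoto coefficient. This is not any real fingerprint scheme (Morgan/ECFP fingerprints, for instance, derive their features from atom environments in the molecular graph); the hand-written feature strings and function names here are illustrative assumptions.

```python
import hashlib

def fingerprint(features, n_bits=128):
    """Toy molecular fingerprint: hash each structural feature into a
    fixed-length bit vector, so molecules of any size map to vectors of
    the same length. The feature strings stand in for the atom
    environments a real scheme would enumerate."""
    bits = [0] * n_bits
    for feature in features:
        digest = int(hashlib.sha1(feature.encode()).hexdigest(), 16)
        bits[digest % n_bits] = 1
    return bits

def tanimoto(a, b):
    """Jaccard/Tanimoto similarity of two bit vectors:
    shared on-bits divided by total on-bits."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x, y in zip(a, b) if x or y)
    return shared / total if total else 0.0

# Hand-written, illustrative feature sets for two hypothetical molecules:
query = fingerprint(["aromatic ring", "C-O", "N-C=O", "O-H"])
other = fingerprint(["aromatic ring", "C-O", "C-Cl"])
similarity = tanimoto(query, other)  # 0 (disjoint) up to 1 (identical)
```

Screening a library then amounts to computing `tanimoto` between the query fingerprint and every library fingerprint and keeping the top scorers.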
Once you have this sort of vector representation, you're in a very good position to apply any number of methods. Coming back to the point: similarity searching, for example, means computing these sorts of bit vectors for all of your molecules and then comparing your library of compounds to the fingerprints of your known molecules using a metric such as the Jaccard coefficient, also known as the Tanimoto coefficient, or some other similarity score. This sort of approach is known as similarity searching, and fingerprinting is just one way of representing a small molecule in a computer; that's how this is done in practice.

Then we have structure-based virtual screening, where instead we're trying to make use of the 3D structure of the target, so we might try to explore how the molecule might interact with the target. There are two main contrasting computational techniques that might be used for this. The first of these is a technique known as protein-ligand docking, where you try to rapidly sample possible bound conformations of the ligand, to see how you think it might bind. You might use, for example, a Monte Carlo search algorithm to do this, and then try to rapidly estimate the binding affinity using what's known as a scoring function; I'll go into more detail on that in just a little bit. The key idea is that you're trying to do this quickly, because you have millions of compounds to screen. In contrast to this sort of rapid-fire approach is molecular dynamics, which has all sorts of applications.
But in this context, you run physical simulations of the protein-ligand interactions to try and predict where the molecule wants to sit in the active site of the protein: you set off the simulation and let it decide where it wants to sit. Based off of that, you can try and gain an understanding of the dynamics of binding, because this is fundamentally a dynamic biological process, not a static snapshot, so you really want to understand those binding dynamics. And from that you can try, again using force fields, to actually compute the interaction energy, the binding affinity, between the protein and the ligand. A high-affinity ligand, something with a greater change in free energy upon binding, is more tightly bound: it doesn't want to separate. And that's what you're looking for in a binder, in a drug.

So, protein-ligand docking is much faster than molecular dynamics; it can be orders of magnitude faster, depending on how you configure things. But it sacrifices this dynamical information, and detailed, accurate free energy calculations, for speed. That is always the trade-off that you're making when you're trying to efficiently screen large numbers of compounds. In practice, protein-ligand docking is the most common structure-based technique used in drug discovery, just because it's efficient; you couldn't use molecular dynamics to screen millions of compounds.
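As a quick numerical aside on what "free energy of binding" means: the standard binding free energy relates to an experimentally measured dissociation constant Kd (in molar, relative to a 1 M standard state) by ΔG = RT ln(Kd). The helper name and the example Kd values below are my own, for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.0):
    """Standard binding free energy (kcal/mol) from a dissociation
    constant Kd in molar units: dG = RT ln(Kd). More negative means
    the complex is more tightly bound."""
    return R_KCAL * temperature_k * math.log(kd_molar)

tight = binding_free_energy(1e-9)  # a 1 nM binder: about -12 kcal/mol
weak = binding_free_energy(1e-4)   # a 100 uM binder: about -5.5 kcal/mol
```

Docking scoring functions try to approximate this quantity from a single static pose, while molecular dynamics free energy methods estimate it from sampled dynamics.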
243 00:26:17,180 --> 00:26:22,640 That would be crazy. So, more on protein-ligand docking, 244 00:26:22,640 --> 00:26:30,540 because one of the active areas of research is trying to use machine learning to improve it. 245 00:26:30,540 --> 00:26:34,200 Protein-ligand docking, in addition to its search algorithm, uses, 246 00:26:34,200 --> 00:26:38,040 as I said, what is known as a scoring function, which is a quick, 247 00:26:38,040 --> 00:26:44,160 dirty, approximate function that tries to estimate the free energy of binding based on a single static snapshot: this is where the protein is, 248 00:26:44,160 --> 00:26:46,080 this is where the ligand is. 249 00:26:46,080 --> 00:26:56,220 This gives you a quick estimate with which to rapidly assess the poses predicted by the docking algorithm and decide: 250 00:26:56,220 --> 00:26:58,890 do I think this is a reasonable pose? 251 00:26:58,890 --> 00:27:05,730 How strongly do I think it binds? And can I rank all of my different ligands by how strongly my scoring function thinks they bind? 252 00:27:05,730 --> 00:27:10,140 And so that lets you prioritise the things that you think bind more strongly. 253 00:27:10,140 --> 00:27:14,580 There are many, many pieces of docking software that are regularly used for this process. 254 00:27:14,580 --> 00:27:22,780 They all have different strengths and weaknesses; I'm not going to name names here, in case I upset certain people. 255 00:27:22,780 --> 00:27:30,800 So that was all a lot of theory, but just to give an example of what a protein-ligand docking result might look like in practice: 256 00:27:30,800 --> 00:27:34,900 if we go back to our example of thrombin inhibitors, we have the structure of thrombin in 257 00:27:34,900 --> 00:27:41,110 grey, and we have an experimentally determined binding pose of the inhibitor molecule in cyan, 258 00:27:41,110 --> 00:27:47,620 and this is determined by X-ray crystallography.
Now, just using a docking algorithm to try and sample that binding pose, 259 00:27:47,620 --> 00:27:53,560 the best result that was returned by the algorithm is the pose in magenta. 260 00:27:53,560 --> 00:28:01,750 And you can see that a lot of the structure aligns very well; apart from on the left, we have one ring that is clearly out of place. 261 00:28:01,750 --> 00:28:07,990 This is an example of the sort of quick-and-dirty docking and scoring process that gives you a rough idea of where the molecule sits. 262 00:28:07,990 --> 00:28:17,450 And based on that conformation, your scoring function will give you some estimate of the free energy of binding. 263 00:28:17,450 --> 00:28:25,370 So that's what docking might look like in practice today. The scoring function is really quite a key component of this process. 264 00:28:25,370 --> 00:28:32,390 A scoring function is any sort of approximate method that tries to estimate the free energy of binding. 265 00:28:32,390 --> 00:28:33,950 And classically, in docking, 266 00:28:33,950 --> 00:28:44,630 this is done as a sum of physical or empirical energy terms, the key property being that they're all easy to compute rapidly. 267 00:28:44,630 --> 00:28:53,240 This might include, for example, terms that represent van der Waals potentials, terms that represent electrostatic potentials, 268 00:28:53,240 --> 00:29:00,040 terms that try to quantify the energy of hydrophobic contacts, hydrogen-bonding terms, 269 00:29:00,040 --> 00:29:04,880 all sorts of things like this that might go on in molecular interactions. 270 00:29:04,880 --> 00:29:10,310 A very common thing to do is to define some of these terms, approximate them quickly, 271 00:29:10,310 --> 00:29:14,280 and then just use a linear regression to assign weights to each of these terms.
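That recipe, cheap energy terms combined by regression-fitted weights, can be sketched as follows. The term values and affinities below are invented so the fit is exact; a real scoring function would compute these terms from the 3D complex.

```python
# Toy illustration (invented numbers, not a real force field) of the
# classical empirical scoring-function recipe: compute a few cheap
# energy terms per complex, then fit linear-regression weights so the
# weighted sum best reproduces measured binding affinities.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_weights(terms, affinities):
    """Least-squares weights: solve the normal equations (X^T X) w = X^T y."""
    n = len(terms[0])
    XtX = [[sum(row[i] * row[j] for row in terms) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * y for row, y in zip(terms, affinities)) for i in range(n)]
    return solve(XtX, Xty)

# Per-complex terms: [van der Waals, hydrogen bonding, hydrophobic contact]
terms = [
    [1.0, 2.0, 0.5],
    [0.5, 1.0, 1.5],
    [2.0, 0.0, 1.0],
    [1.5, 1.5, 0.0],
]
# Affinities generated from weights (0.5, 1.0, 2.0), so the fit recovers them.
affinities = [0.5 * t[0] + 1.0 * t[1] + 2.0 * t[2] for t in terms]

w = fit_weights(terms, affinities)
print([round(x, 3) for x in w])  # → [0.5, 1.0, 2.0]
```

Once fitted, scoring a new pose is just a dot product of its terms with the weights, which is why this kind of function is fast enough for screening.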
272 00:29:14,280 --> 00:29:21,800 This weighted sum gives you the best estimate of binding affinity that you can compute rapidly, and the scoring function is really a multi-tool 273 00:29:21,800 --> 00:29:28,290 of structure-based drug discovery: it's used to determine whether a pose is physically reasonable, 274 00:29:28,290 --> 00:29:33,050 used to rank ligands by their likelihood of binding, and used to try and actually predict the strength 275 00:29:33,050 --> 00:29:39,890 of that binding, the binding affinity of that ligand. They really are used for a lot of different tasks. 276 00:29:39,890 --> 00:29:41,990 And that brings us on. 277 00:29:41,990 --> 00:29:53,480 So that's a very brief overview of the sorts of techniques that are used in computer-aided drug design, particularly in virtual screening. 278 00:29:53,480 --> 00:30:01,100 And with that in place, I'd like to finally talk about how machine learning methods are being used in drug discovery, 279 00:30:01,100 --> 00:30:04,050 particularly for this virtual screening process. 280 00:30:04,050 --> 00:30:09,680 For context, statistical modelling and machine learning are well-established tools in drug discovery, 281 00:30:09,680 --> 00:30:14,570 and I could give you an exhaustive list of things that people have done in the past 30 years. 282 00:30:14,570 --> 00:30:22,940 But just a few examples. Using representations such as the molecular fingerprints we introduced earlier as 283 00:30:22,940 --> 00:30:29,510 features for support vector machines has been successfully used for virtual screening, 284 00:30:29,510 --> 00:30:42,560 for example by Jabat et al. in 2018. An interesting example of substituting secondary assays with computational methods has been the 285 00:30:42,560 --> 00:30:48,280 use of decision tree classifiers to try and predict whether or not a molecule crosses the blood-brain barrier, 286 00:30:48,280 --> 00:30:54,800 which is a very important topic in pharmacology.
287 00:30:54,800 --> 00:31:00,680 And just as a third example of this, there was in 2006 288 00:31:00,680 --> 00:31:08,510 a very good paper on the use of naive Bayes classifiers to try and predict whether a molecule is likely to be toxic or not. 289 00:31:08,510 --> 00:31:19,340 So those are just a few examples of the things that people have historically applied machine learning to in this field. 290 00:31:19,340 --> 00:31:25,430 But in recent years, there's been a lot of interest in the use of machine learning in drug discovery, 291 00:31:25,430 --> 00:31:31,580 and arguably one of the big reasons for this is the ever-increasing quantity of data that's actually available: 292 00:31:31,580 --> 00:31:37,290 traditional methods, such as using a linear regression to fit a scoring function, 293 00:31:37,290 --> 00:31:43,190 simply can't leverage all the data that's available. So, just to give a feel for the sort of data that's available: 294 00:31:43,190 --> 00:31:47,660 among the publicly available databases there is a database known as ZINC, 295 00:31:47,660 --> 00:31:56,360 which contains 230 million purchasable compounds with 3D conformations, ready to use 296 00:31:56,360 --> 00:32:02,870 in docking, and a further 750 million compounds that are known to be commercially available. 297 00:32:02,870 --> 00:32:08,840 The idea being that you can take your compounds from ZINC, you can screen them, and you know you can go and buy them somewhere else. 298 00:32:08,840 --> 00:32:16,190 Another example of this is a database known as ChEMBL, which records biological assay data, 299 00:32:16,190 --> 00:32:23,900 so measuring: do things interact? It contains around 17 million recorded biological activities, 300 00:32:23,900 --> 00:32:30,140 you know, how strongly do things bind, for two million different compounds across around 14,000 targets.
301 00:32:30,140 --> 00:32:35,420 That really is an enormous amount of data that you might use to try and fit some predictive model. 302 00:32:35,420 --> 00:32:39,770 And thinking about structure-based drug discovery, a database known as 303 00:32:39,770 --> 00:32:44,930 PDBbind is the largest collection of solved structures of protein-ligand 304 00:32:44,930 --> 00:32:53,600 complexes; it contains around 18,000 of these complexes, ready to use. 305 00:32:53,600 --> 00:32:59,150 So that's a lot of data. But just to give a feel for 306 00:32:59,150 --> 00:33:11,290 whether this data really is representative of chemistry, and indeed whether any data can be representative of chemistry: 307 00:33:11,290 --> 00:33:15,880 one thing that's quite interesting to do is to say, OK, 308 00:33:15,880 --> 00:33:22,270 I know what properties a drug-like molecule typically exhibits, and that puts constraints on the size of the molecule. 309 00:33:22,270 --> 00:33:22,990 Based on that, 310 00:33:22,990 --> 00:33:31,540 you can use combinatorics to estimate the size of the space of molecules that could possibly exist and be drug-like. 311 00:33:31,540 --> 00:33:37,830 And a very common estimate of the size of this space is ten to the power of 60 molecules. 312 00:33:37,830 --> 00:33:46,650 That is enormous. That is impossibly enormous. Just to give an idea of how impossibly enormous that is: if you're boring like me, 313 00:33:46,650 --> 00:33:56,730 you can sit down and do a back-of-envelope estimate of how many atoms there are in the solar system, and arrive at a figure of around 10 to the 57. 314 00:33:56,730 --> 00:34:01,980 So there are potentially a thousand times as many drug-like molecules that you could possibly be 315 00:34:01,980 --> 00:34:10,530 interested in as there are atoms in the solar system. It is a physical impossibility to make all of these molecules.
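That back-of-envelope estimate is easy to reproduce. The constants below are rough, assumed values (solar-system mass dominated by the Sun, treated as pure hydrogen):

```python
# Back-of-envelope estimate (rough, assumed constants) of the number of
# atoms in the solar system, for comparison with the ~1e60 figure often
# quoted for the size of drug-like chemical space.

import math

solar_system_mass_kg = 2.0e30     # dominated by the Sun's mass
hydrogen_atom_mass_kg = 1.67e-27  # the Sun is mostly hydrogen

n_atoms = solar_system_mass_kg / hydrogen_atom_mass_kg
print(f"~10^{math.log10(n_atoms):.0f} atoms in the solar system")
print(f"chemical space is ~10^{60 - round(math.log10(n_atoms))} times larger")
```

Which lands on roughly 10^57 atoms, three orders of magnitude short of the estimated number of drug-like molecules.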
316 00:34:10,530 --> 00:34:17,720 So there is a very important question here of: can this data be relied on to be truly representative? 317 00:34:17,720 --> 00:34:25,280 The answer to that is an open question, and it's something that needs to be borne in mind. But nevertheless, 318 00:34:25,280 --> 00:34:30,390 the availability of this data has really spurred a lot of use of machine learning. 319 00:34:30,390 --> 00:34:35,420 Of course, machine learning methods require robust validation; it's not enough to fit a linear 320 00:34:35,420 --> 00:34:39,020 regression on one hundred data points and test on another hundred any more. 321 00:34:39,020 --> 00:34:48,170 Some examples of data sets that have been used for this in drug discovery include a database known as the Directory of Useful Decoys, 322 00:34:48,170 --> 00:34:57,750 which consists of 102 different protein targets, around 22,000 ligands spread across those targets, and around a million of what are known as decoys. 323 00:34:57,750 --> 00:35:03,080 These are molecules that are believed not to bind to those targets, 324 00:35:03,080 --> 00:35:11,010 the idea being that you now have a large data set that simulates the real-world situation of a large compound library with a small number of binders. 325 00:35:11,010 --> 00:35:17,210 And you can use this to test your algorithm, to see: does it rank the binders more highly than the non-binders? 326 00:35:17,210 --> 00:35:23,090 One of the obvious issues with this is potential biases in how you identify decoys. 327 00:35:23,090 --> 00:35:30,710 And several people, such as Rohrer and Baumann in 2009, have come up with various ways of ensuring that ligands 328 00:35:30,710 --> 00:35:35,300 are embedded next to decoys in chemical space, to make them hard to differentiate.
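A sketch of how such a benchmark is used in practice, with made-up scores: given scoring-function outputs for the known binders and the decoys, a ROC AUC measures how often a binder is ranked above a decoy.

```python
# Sketch of a DUD-style evaluation (invented scores, not real data):
# the AUC is the probability that a randomly chosen binder outscores a
# randomly chosen decoy, with ties counting half.

def roc_auc(binder_scores, decoy_scores):
    """Pairwise-comparison form of the ROC AUC."""
    wins = 0.0
    for b in binder_scores:
        for d in decoy_scores:
            if b > d:
                wins += 1.0
            elif b == d:
                wins += 0.5
    return wins / (len(binder_scores) * len(decoy_scores))

binders = [9.1, 7.4, 6.8]          # higher score = predicted stronger binding
decoys = [6.9, 5.2, 4.8, 3.3, 7.0]

print(round(roc_auc(binders, decoys), 3))  # → 0.867
```

An AUC of 0.5 would mean the scoring function ranks binders no better than chance; 1.0 would mean every binder outscores every decoy.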
329 00:35:35,300 --> 00:35:44,000 But it's very much an ongoing area of research. And finally, specific to the task of actually developing a good scoring function, 330 00:35:44,000 --> 00:35:48,260 Cheng et al. in 2009 started what's known as the Comparative Assessment of Scoring Functions, 331 00:35:48,260 --> 00:35:56,360 or CASF: a curated set of PDBbind-derived complexes with affinity measurements that 332 00:35:56,360 --> 00:36:00,890 you can use to directly measure how well your scoring function predicts binding affinity. 333 00:36:00,890 --> 00:36:06,520 And it's sort of become a de facto standard in the field, and something that will crop up again. 334 00:36:06,520 --> 00:36:12,790 OK, I want to focus in particular on the use of machine learning to develop scoring functions, 335 00:36:12,790 --> 00:36:15,730 because it's something I've worked on over the course of my DPhil, 336 00:36:15,730 --> 00:36:24,220 and it's still very much an active area of research. Just to establish why this problem in particular has drawn a lot of attention: 337 00:36:24,220 --> 00:36:28,450 the classical scoring functions used in docking are often very good at saying whether 338 00:36:28,450 --> 00:36:34,600 a predicted binding pose is good, and at identifying binders over non-binders. 339 00:36:34,600 --> 00:36:41,740 But the energies that they estimate often completely fail to correlate with the actual experimentally observed binding affinity, 340 00:36:41,740 --> 00:36:47,800 and so their application to actually measuring affinity is incredibly limited.
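A CASF-style evaluation boils down to a correlation between predicted and measured affinities. A minimal sketch, with illustrative numbers standing in for a real scoring function's output:

```python
# CASF-style evaluation sketch (illustrative numbers): score each complex
# with your scoring function, then compute the Pearson correlation between
# predicted and experimentally measured binding affinities.

import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

experimental = [4.2, 5.1, 6.3, 7.0, 8.4]   # e.g. measured pKd values
predicted = [4.0, 5.5, 5.9, 7.4, 8.1]      # scoring-function output

print(round(pearson(experimental, predicted), 3))  # → 0.971
```

The point made in the talk is exactly that classical scoring functions tend to produce a low value here even when they rank binders over non-binders well.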
341 00:36:47,800 --> 00:36:56,590 Now, in the last decade or so, starting around 2010, many different machine learning approaches, using all sorts of featurisations 342 00:36:56,590 --> 00:37:02,260 and algorithms, have been shown to consistently outperform these classical scoring functions at the affinity prediction task 343 00:37:02,260 --> 00:37:11,920 on common benchmark sets such as CASF. And just to emphasise: these all relied on engineered features, such as counting 344 00:37:11,920 --> 00:37:16,540 how many pairwise interactions there are between atoms in the protein and ligand, 345 00:37:16,540 --> 00:37:25,540 or fingerprints describing those protein-ligand interactions, so there's human input in engineering those features. 346 00:37:25,540 --> 00:37:26,470 And in addition to this, 347 00:37:26,470 --> 00:37:34,060 a lot of these methods appear to be strongly dependent on the data they're trained on, and often generalise poorly to unseen targets, 348 00:37:34,060 --> 00:37:41,740 which is not ideal, given that in the real world we're trying to screen ligands against a potentially novel drug target. 349 00:37:41,740 --> 00:37:46,090 And these scoring functions, although primarily optimised for predicting affinity, 350 00:37:46,090 --> 00:37:50,080 have been applied to the virtual screening classification task of identifying binders, 351 00:37:50,080 --> 00:37:56,360 but again, they underperform on an unseen, novel target. 352 00:37:56,360 --> 00:38:04,010 And finally, and this is quite an important concept here, most of these studies have relied on training and validating 353 00:38:04,010 --> 00:38:09,620 using only experimentally determined binding poses of ligands, determined by crystallography, 354 00:38:09,620 --> 00:38:15,800 and only a few have explored how models can be expected to perform on docked poses, even though in reality, 355 00:38:15,800 --> 00:38:20,240 in a virtual screening campaign, you don't have crystal structures of all of your protein-ligand complexes.
356 00:38:20,240 --> 00:38:26,270 Because if you did, you'd be fine, and you wouldn't need to screen them. 357 00:38:26,270 --> 00:38:29,630 And this leads on to some of the work I did during my DPhil. 358 00:38:29,630 --> 00:38:36,860 The first thing I'd like to briefly mention is that one of the things we looked at was combining structure-based and ligand-based methods 359 00:38:36,860 --> 00:38:45,170 using random forests. And in this figure, just as a brief illustration, don't worry about the details: 360 00:38:45,170 --> 00:38:50,720 the solid lines indicate the correlation obtained by a method combining structure-based and ligand-based 361 00:38:50,720 --> 00:38:55,520 information; the dotted lines indicate the corresponding method using only the structure-based information. 362 00:38:55,520 --> 00:39:00,210 We found that regardless of how you train and validate the model, a model using both structure-based and ligand-based 363 00:39:00,210 --> 00:39:07,070 information was consistently superior at predicting the protein-ligand binding affinity. 364 00:39:07,070 --> 00:39:11,300 However, regardless of the features and algorithm used, the same caveat applied: 365 00:39:11,300 --> 00:39:18,370 as mentioned previously, the similarity between your training and validation data had a strong influence on your model's performance. 366 00:39:18,370 --> 00:39:23,710 This is a problem that clearly needs to be addressed. 367 00:39:23,710 --> 00:39:30,550 One of the common criticisms of machine learning is its somewhat earned reputation as a black box. It's not entirely true, though: 368 00:39:30,550 --> 00:39:35,890 an advantage of the random forest algorithm, for example, is the ability to actually look at the importance of each feature in the model. 369 00:39:35,890 --> 00:39:43,690 And indeed, in our work,
what we found when we inspected this was that, again, regardless of how you train the model, 370 00:39:43,690 --> 00:39:48,930 both ligand-based and structure-based information was consistently found to be important in making these predictions. 371 00:39:48,930 --> 00:39:56,020 So on the right, the red and yellow bars are the importance of structure-based features, and the blue bars are the importance of ligand-based features. 372 00:39:56,020 --> 00:40:00,940 No matter how you train the model, you consistently see this combination of features 373 00:40:00,940 --> 00:40:07,210 being important, which suggests that they're capturing useful, orthogonal information. 374 00:40:07,210 --> 00:40:11,870 The next thing we looked at was this problem of, well, how do we do in the real world on docked poses, 375 00:40:11,870 --> 00:40:18,400 rather than crystal poses, allowing for the fact that the poses might not necessarily be that accurate? 376 00:40:18,400 --> 00:40:21,850 And again, don't worry about the details, but briefly: 377 00:40:21,850 --> 00:40:27,400 the solid lines correspond to a model that was trained and validated using experimentally determined poses; 378 00:40:27,400 --> 00:40:35,230 the dotted lines are the same model trained and validated using docked poses, some of which were really good, some of which were not so good. 379 00:40:35,230 --> 00:40:41,380 And again, what jumped out at us was, firstly, that regardless of how you trained and validated the model, 380 00:40:41,380 --> 00:40:48,280 the model using crystal poses always performed better than the model using docked poses, sometimes by very little, sometimes by quite a lot. 381 00:40:48,280 --> 00:40:56,830 So the pose quality clearly has an impact on the model, and using crystal poses gives you an overly optimistic estimate of how you're going to do in the real world.
382 00:40:56,830 --> 00:41:04,630 But in addition to this, what we also found was that the relative drop in performance of a hybrid method using both 383 00:41:04,630 --> 00:41:10,720 structure-based and ligand-based information was much smaller, when using docked poses, than that of the model 384 00:41:10,720 --> 00:41:16,200 using only structure-based information, which intuitively makes sense: information about the ligand is independent of the pose, 385 00:41:16,200 --> 00:41:24,160 so it appears that it helps to compensate for errors introduced by using these imperfect poses. 386 00:41:24,160 --> 00:41:29,350 And indeed, we had a look at what happened when we provided multiple training examples of the same ligand in different poses: 387 00:41:29,350 --> 00:41:39,490 does seeing different poses help or hurt the algorithm? What we actually found was that, firstly, the performance of the model, 388 00:41:39,490 --> 00:41:45,130 regardless of how it was trained, consistently dropped when it was given multiple examples of poses for a ligand, 389 00:41:45,130 --> 00:41:54,850 but also that, when this happened, the ligand-based features in the model became far more dominant in its ability to make predictions, 390 00:41:54,850 --> 00:42:01,090 corroborating this idea that when you have noise introduced through docking errors, 391 00:42:01,090 --> 00:42:07,060 leveraging ligand-based features that you know work can really help to recover your performance. 392 00:42:07,060 --> 00:42:08,500 And, you know, 393 00:42:08,500 --> 00:42:14,980 this illustrates the advantage of using more interpretable algorithms: it really allows you to drill into the model and see what's going on. 394 00:42:14,980 --> 00:42:25,650 Why is your performance being affected?
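The talk uses the random forest's built-in feature importances; as a simple stand-in, here is a permutation-importance sketch on a toy model with invented data, shuffling one feature at a time and measuring how much the error grows. A feature the model relies on hurts more when shuffled.

```python
# Permutation-importance sketch (toy model, invented data; the actual
# work used random-forest importances): y depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.

import random

random.seed(0)

X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + 0.5 * row[1] + random.gauss(0, 0.05) for row in X]

def model(row):
    """A stand-in 'trained' model: here we just use the generating weights."""
    return 3.0 * row[0] + 0.5 * row[1]

def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

baseline = mse(X, y)
importances = []
for feat in range(3):
    shuffled = [row[:] for row in X]
    col = [row[feat] for row in shuffled]
    random.shuffle(col)
    for row, v in zip(shuffled, col):
        row[feat] = v
    imp = mse(shuffled, y) - baseline  # error increase = importance
    importances.append(imp)
    print(f"feature {feat}: importance = {imp:.3f}")
```

Feature 0 comes out far more important than feature 1, and feature 2's importance is zero, matching how the data were generated.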
So, to sum up: what we've observed, and what others have found, is that your performance often depends greatly 395 00:42:25,650 --> 00:42:32,280 on the target of interest, and that generalising to novel data can be incredibly challenging, 396 00:42:32,280 --> 00:42:35,850 even if your model appears to perform well, which is really quite 397 00:42:35,850 --> 00:42:41,460 damning when you think about the real-world application to a novel drug target. 398 00:42:41,460 --> 00:42:43,990 What we also found was that 399 00:42:43,990 --> 00:42:51,250 the performance gained on standard benchmarks like CASF when you add more training data could often be attributed not to having more data, 400 00:42:51,250 --> 00:42:56,710 but to the data you add being similar in some way to the data in that benchmark, 401 00:42:56,710 --> 00:43:00,550 for example different ligands binding against the same protein as in that benchmark set. 402 00:43:00,550 --> 00:43:08,020 And as soon as you remove that similar data, even if you have more data, you're back to where you were with the smaller data set. 403 00:43:08,020 --> 00:43:12,370 So it's just an artificial performance gain that's masking what's really going on. 404 00:43:12,370 --> 00:43:15,610 I'd just like to sort of leave this there, 405 00:43:15,610 --> 00:43:24,430 but this is a really glaring problem, and one that clearly can't be addressed by simply scoring better on a standard benchmark. 406 00:43:24,430 --> 00:43:31,390 It's still very much an active area of research. So, conscious that we started a little bit late, 407 00:43:31,390 --> 00:43:40,420 I'd like to just quickly introduce some recent developments in applying deep learning to drug discovery, both, again, 408 00:43:40,420 --> 00:43:41,560 for this virtual screening task, 409 00:43:41,560 --> 00:43:51,180 but also some really interesting ideas about molecule generation that have been enabled by deep learning techniques.
410 00:43:51,180 --> 00:43:56,880 So clearly, even with machine learning scoring functions, they depend on well-engineered features, 411 00:43:56,880 --> 00:43:59,700 and even good features can introduce human biases. 412 00:43:59,700 --> 00:44:08,040 What we'd ideally like to be able to do is find some way of taking a raw representation of the data, without human bias, and have the model learn for itself 413 00:44:08,040 --> 00:44:12,690 what a molecule looks like, what a good interaction looks like, what a bad interaction looks like. 414 00:44:12,690 --> 00:44:22,810 And this is the natural application of deep learning: you let the model engineer features for itself, in a hierarchical manner. 415 00:44:22,810 --> 00:44:28,300 There have been all sorts of applications of deep learning to tasks in drug discovery: solubility prediction, 416 00:44:28,300 --> 00:44:33,250 toxicity prediction, predicting reaction outcomes for synthesis planning, 417 00:44:33,250 --> 00:44:39,700 again going back to the idea that it doesn't matter how good the ligand is if you can't synthesise it to test it, 418 00:44:39,700 --> 00:44:44,050 molecular design using reinforcement learning, so iteratively modifying the molecule 419 00:44:44,050 --> 00:44:49,210 to make it better, and also improving docking and virtual screening results. 420 00:44:49,210 --> 00:44:50,050 And in the last few minutes, 421 00:44:50,050 --> 00:45:00,350 I'm just going to talk about some recent work from the Oxford Protein Informatics Group that's touched on applying deep learning to drug discovery. 422 00:45:00,350 --> 00:45:10,370 The first piece of work that came out of the group was by a former member, Fergus Imrie, in 2018, where they used a convolutional neural network. 423 00:45:10,370 --> 00:45:17,890 The idea is as follows.
You can take a 3D structure of a protein-ligand complex and split it up into different atom types, for example, you know, 424 00:45:17,890 --> 00:45:23,890 where the aromatic carbons are in the structures of the protein and the ligand, and 425 00:45:23,890 --> 00:45:30,130 generate from that sort of voxel maps of the density of these atoms in the structure. 426 00:45:30,130 --> 00:45:36,790 And then you can treat these voxel maps as analogous to colour channels in an RGB image. 427 00:45:36,790 --> 00:45:41,050 So in an RGB image, you have three channels, one for red, one for green, one for blue, 428 00:45:41,050 --> 00:45:50,140 that build up the whole image. Here, you have these maps of where different atom types are, which together represent a full protein-ligand complex. 429 00:45:50,140 --> 00:45:53,510 And with this sort of 3D representation, you know, 430 00:45:53,510 --> 00:46:01,480 you now have a representation of the data that feeds quite naturally into the convolutional 431 00:46:01,480 --> 00:46:05,530 neural networks that have been applied so successfully in computer vision tasks: 432 00:46:05,530 --> 00:46:18,120 image recognition, video processing, things like this. And this was shown by David Koes and co-workers in 2017 to be quite effective for virtual screening. 433 00:46:18,120 --> 00:46:27,850 The piece of work that Fergus Imrie did was to take these networks and apply architectural advancements from computer vision, 434 00:46:27,850 --> 00:46:35,740 in this case using densely connected layers in the network, to see if this improved your ability to screen compounds, 435 00:46:35,740 --> 00:46:38,830 as it had been shown to improve the ability to classify images. 436 00:46:38,830 --> 00:46:45,280 And indeed, they found that by introducing these densely connected blocks, exactly as you would in computer vision, 437 00:46:45,280 --> 00:46:52,030 you immediately got an improvement in your virtual screening performance.
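Coming back to the voxel representation: a minimal sketch of the idea, with invented coordinates and a crude four-channel atom typing (real implementations use finer grids, many more channels, and density smoothing rather than hard counts):

```python
# Minimal voxelisation sketch (toy coordinates, crude atom typing):
# atoms of a protein-ligand complex are binned into a coarse 3D grid
# with one channel per atom type, analogous to RGB channels in an image.

GRID = 8       # 8 x 8 x 8 voxels
SIZE = 16.0    # box edge length covered along each axis, centred on the origin
CHANNELS = ["protein_carbon", "protein_nitrogen", "ligand_carbon", "ligand_oxygen"]

def voxelise(atoms):
    """atoms: list of (channel_index, x, y, z). Returns grid[c][i][j][k] counts."""
    grid = [[[[0.0] * GRID for _ in range(GRID)] for _ in range(GRID)]
            for _ in CHANNELS]
    for c, x, y, z in atoms:
        idx = [int((v + SIZE / 2) / SIZE * GRID) for v in (x, y, z)]
        if all(0 <= i < GRID for i in idx):  # atoms outside the box are dropped
            i, j, k = idx
            grid[c][i][j][k] += 1.0
    return grid

atoms = [
    (0, -3.2, 1.1, 0.4),   # protein carbon
    (1, -2.8, 1.5, 0.9),   # protein nitrogen
    (2, 0.6, -0.3, 0.2),   # ligand carbon
    (3, 1.4, -0.9, 0.8),   # ligand oxygen
    (2, 40.0, 0.0, 0.0),   # far outside the box: ignored
]
grid = voxelise(atoms)
print(sum(v for ch in grid for plane in ch for row in plane for v in row))  # → 4.0
```

The resulting 4-channel 8x8x8 tensor is exactly the shape of input a 3D convolutional network expects.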
438 00:46:52,030 --> 00:46:56,720 So it does seem that virtual screening can, in some sense, be treated as a computer vision task. However, 439 00:46:56,720 --> 00:47:01,190 some early analysis of these sorts of CNNs revealed that oftentimes the CNN was 440 00:47:01,190 --> 00:47:04,790 actually just using the channels that represent the ligand to make its predictions, 441 00:47:04,790 --> 00:47:11,270 you know, implicitly learning biases about ligands, even though it's not been given engineered information about them, 442 00:47:11,270 --> 00:47:14,570 and it wasn't actually using the structure of the protein. 443 00:47:14,570 --> 00:47:21,540 Some work by a current member of OPIG, Jack Scantlebury, however, showed that by augmenting the training data, 444 00:47:21,540 --> 00:47:29,120 taking the known binders, repositioning the ligand in obviously non-physical poses, and labelling these as non-binders, 445 00:47:29,120 --> 00:47:35,810 you force the network to classify them as incorrect, and thereby force it to use the structure 446 00:47:35,810 --> 00:47:41,800 of the protein to differentiate between these different poses of the bound ligand. 447 00:47:41,800 --> 00:47:48,240 So what they found was that if you don't do this augmentation, on the right, 448 00:47:48,240 --> 00:47:55,250 you basically get the same result whether you use the structure of the protein or not. 449 00:47:55,250 --> 00:48:02,420 However, in the figure in the middle, what they find is that if you do this data augmentation, 450 00:48:02,420 --> 00:48:06,110 you force the model to actually use the information about the structure, and 451 00:48:06,110 --> 00:48:15,290 your model generalises a lot better.
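The augmentation idea can be sketched like this. The coordinates and the uniform random shifts are toy choices; the actual work used docked and repositioned poses of real ligands.

```python
# Sketch of the pose-augmentation idea (toy coordinates, uniform random
# shifts): take a known binder, generate deliberately non-physical poses
# by translating the ligand away from the active site, and label those
# poses as non-binders.

import random

random.seed(1)

def translate(coords, dx, dy, dz):
    return [(x + dx, y + dy, z + dz) for x, y, z in coords]

def augment(ligand_coords, n_negatives=3, shift=10.0):
    """Return (coords, label) pairs: the true pose plus displaced negatives."""
    examples = [(ligand_coords, 1)]  # the experimentally determined pose: binder
    for _ in range(n_negatives):
        d = [random.uniform(shift, 2 * shift) * random.choice((-1, 1))
             for _ in range(3)]
        examples.append((translate(ligand_coords, *d), 0))  # non-binder label
    return examples

pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.1, 0.0)]
for coords, label in augment(pose):
    print(label, coords[0])
```

Each ligand now contributes one positive and several guaranteed-wrong poses, so the network can no longer score the complex from the ligand channels alone.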
And this is what that looks like: in the third image on the bottom row here, 452 00:48:15,290 --> 00:48:22,280 we have a ligand sitting in the active site of a target, and 453 00:48:22,280 --> 00:48:27,830 parts of this image correspond to parts of the structure that were masked during the test process. 454 00:48:27,830 --> 00:48:37,400 Areas in green contributed favourably to the prediction, and areas in red very unfavourably. 455 00:48:37,400 --> 00:48:49,220 What we see is that the network trained with augmentation picks out the favourable hydrogen bonds, indicated by the yellow dots, as green areas contributing positively. 456 00:48:49,220 --> 00:48:55,130 So it's learning to make the prediction for the right reason. However, in the other images, where you're not using this form of data augmentation, 457 00:48:55,130 --> 00:49:00,230 you're still getting correct predictions, but you're making these predictions for the wrong reasons: 458 00:49:00,230 --> 00:49:08,570 you're not actually learning that those interactions are there. So clearly, this is a key component of training these models. 459 00:49:08,570 --> 00:49:14,030 And again, conscious that we're running a little bit late, I'm going to gloss over this little bit at the end, 460 00:49:14,030 --> 00:49:18,170 but something that I've been working on recently with a Part II student, 461 00:49:18,170 --> 00:49:22,040 Oliver Turnbull, in the department, is this idea of: well, OK, 462 00:49:22,040 --> 00:49:28,190 we've seen that this sort of data augmentation helps your model generalise for the virtual screening classification task. 463 00:49:28,190 --> 00:49:33,770 Can we leverage that to perform better at a regression task? Now, in regression, 464 00:49:33,770 --> 00:49:39,380 it's not clear how you would label a non-physical pose, because you can't just label it as binder or non-binder.
465 00:49:39,380 --> 00:49:44,300 You have to give it a binding affinity value, and it doesn't make sense to assign that to a non-physical pose. 466 00:49:44,300 --> 00:49:52,340 So how you do that data augmentation is not clear. But what you can do is use transfer learning: take the model that was trained for the virtual 467 00:49:52,340 --> 00:49:59,390 screening classification task, which has acquired that generalisability, and then fine-tune 468 00:49:59,390 --> 00:50:06,920 the final layer using a regression dataset such as PDBbind, training that final layer for 469 00:50:06,920 --> 00:50:14,000 the regression task, and see if you actually retain the benefit of that data augmentation. 470 00:50:14,000 --> 00:50:17,530 And this is something that Oliver looked at recently. 471 00:50:17,530 --> 00:50:25,430 So, if you perform the same masking process using a network 472 00:50:25,430 --> 00:50:28,940 that was fine-tuned from Jack Scantlebury's network, 473 00:50:28,940 --> 00:50:33,500 you get something like the image on the left, where we have a ligand that we know binds with a certain binding affinity. 474 00:50:33,500 --> 00:50:39,980 And again, we can mask atoms: atoms that appear in green 475 00:50:39,980 --> 00:50:45,980 are those that contributed favourably to the final affinity prediction, and atoms in red are those that contributed unfavourably. 476 00:50:45,980 --> 00:50:50,990 What we see on the left is that when we fine-tune a model that's had that data augmentation applied, 477 00:50:50,990 --> 00:50:54,260 the model is correctly learning where the important hydrogen bonds are.
478 00:50:54,260 --> 00:51:02,030 It's clearly rewarding those interactions being present, and penalising parts of the molecule that don't contribute 479 00:51:02,030 --> 00:51:07,760 to those sorts of interactions. Whereas when we just train the same neural network from scratch, purely for affinity prediction, with 480 00:51:07,760 --> 00:51:12,860 no data augmentation, what we see, even though we got a correct prediction for this compound, 481 00:51:12,860 --> 00:51:18,740 is that actually there's no real rhyme or reason to which atoms or parts 482 00:51:18,740 --> 00:51:22,400 of the protein the network thought were important for predicting binding affinity. 483 00:51:22,400 --> 00:51:29,960 For example, up here, what should be an important bond is marked as completely unfavourable, and down here, 484 00:51:29,960 --> 00:51:36,020 what should be an important patch on the protein surface, again, is not contributing strongly. 485 00:51:36,020 --> 00:51:45,520 So clearly, the network can actually retain some of that information from the virtual screening data augmentation process 486 00:51:45,520 --> 00:51:49,630 when you fine-tune it for this regression task. 487 00:51:49,630 --> 00:51:58,790 This seems to be a really promising lead for actually improving your ability to predict binding affinity in this virtual screening setting. 488 00:51:58,790 --> 00:52:09,190 So, I was going to talk a little bit more, but I want to make sure we stop at a reasonable time, 489 00:52:09,190 --> 00:52:11,410 so I'm going to gloss over that.
But if you want to ask about it, 490 00:52:11,410 --> 00:52:19,090 feel free. Just to emphasise what we've learnt so far from all the experiments that people 491 00:52:19,090 --> 00:52:25,720 have done on this topic: machine learning methods at this stage are ubiquitous in drug discovery, 492 00:52:25,720 --> 00:52:31,750 and they often outperform traditional methods, either in accuracy or in accomplishing the same task faster and cheaper: 493 00:52:31,750 --> 00:52:34,990 for example, virtual screening versus high-throughput screening, 494 00:52:34,990 --> 00:52:42,420 or computational synthesis planning versus having a chemist plan out the synthesis of every single compound. 495 00:52:42,420 --> 00:52:50,730 And although they have a reputation as uninterpretable black boxes, using appropriately chosen algorithms or sensible 496 00:52:50,730 --> 00:52:54,630 approaches to exploring your data can actually make your model quite interpretable, 497 00:52:54,630 --> 00:52:57,540 giving you insight into how predictions are being made and into 498 00:52:57,540 --> 00:53:05,970 whether the model is actually learning the underlying biophysics or simply spurious correlations in the data. 499 00:53:05,970 --> 00:53:11,280 And on that note, deep learning, provided you perform careful data 500 00:53:11,280 --> 00:53:18,630 augmentation and training of the model, can enable virtual screening with a model that learns directly from the 501 00:53:18,630 --> 00:53:24,930 underlying data what favourable interactions look like, without inheriting human biases from engineered features, 502 00:53:24,930 --> 00:53:30,310 however sensible those features might be. 503 00:53:30,310 --> 00:53:33,900 And something I'd like to mention, even though we don't have time to talk about it, 504 00:53:33,900 --> 00:53:41,310 is generative models such as gated graph neural networks, which are used to build up graphs.
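To make the molecule-as-graph idea concrete, here is a minimal sketch of the representation such generative models operate on. The dictionary encoding and the `add_atom` step are invented for illustration: a gated graph neural network would learn to propose elaboration steps like this rather than apply a hard-coded rule.

```python
# A molecule as a plain graph: nodes are atoms, edges are bonds.
# Ethanol's heavy-atom skeleton: C-C-O.
molecule = {
    "atoms": ["C", "C", "O"],   # node labels, indexed 0..2
    "bonds": [(0, 1), (1, 2)],  # undirected edges between atom indices
}

def add_atom(mol, element, attach_to):
    """One elaboration step: grow the graph by one atom bonded to an existing one."""
    new_index = len(mol["atoms"])
    mol["atoms"].append(element)
    mol["bonds"].append((attach_to, new_index))
    return new_index

# Attach a carbon to the middle atom, giving the heavy-atom
# skeleton of propan-2-ol.
idx = add_atom(molecule, "C", attach_to=1)
print(molecule["atoms"])  # ['C', 'C', 'O', 'C']
```

Repeatedly choosing where to attach the next atom or bond, with the choices scored by a learned model, is what lets these methods walk through chemical space one small edit at a time.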
505 00:53:41,310 --> 00:53:45,300 You can represent a molecule as a graph and use something like a gated graph neural 506 00:53:45,300 --> 00:53:51,690 network to rapidly elaborate on that graph to generate possible new molecules. 507 00:53:51,690 --> 00:53:58,560 Coming back to this idea of chemical space being impossibly vast, this gives you another effective 508 00:53:58,560 --> 00:54:04,380 tool for exploring chemical space in a way that a chemist might not be able to on paper. 509 00:54:04,380 --> 00:54:11,760 All of this work also raises some important questions and challenges. As I alluded to when I spoke about the size of chemical space, 510 00:54:11,760 --> 00:54:18,450 there is the question of whether our available data is sufficient for our purposes, and sufficiently 511 00:54:18,450 --> 00:54:23,400 representative of chemical space, to actually allow us to train truly generalisable 512 00:54:23,400 --> 00:54:30,490 models that can perform these sorts of tasks without inheriting human biases. 513 00:54:30,490 --> 00:54:39,370 An important question that sort of underpins all of this is: can we currently rely on docking to generate binding poses 514 00:54:39,370 --> 00:54:45,200 that are accurate and useful enough to enable all of these machine learning models to work effectively? 515 00:54:45,200 --> 00:54:51,460 A common maxim in machine learning, of course, is garbage in, garbage out, and it most certainly applies here. 516 00:54:51,460 --> 00:54:57,790 Another very important question that I alluded to previously concerns molecular dynamics: all of these methods for virtual screening 517 00:54:57,790 --> 00:55:04,400 and predicting binding affinity fundamentally operate from a static snapshot of the protein-ligand complex, a single docked pose. 518 00:55:04,400 --> 00:55:08,530 But in reality, molecular interaction is a very dynamic biological process.
519 00:55:08,530 --> 00:55:17,980 You know, things are wobbling around. So does it really make sense to expect to be able to solve this problem using a single static snapshot? 520 00:55:17,980 --> 00:55:23,170 Or do we need to explore these dynamic processes more? 521 00:55:23,170 --> 00:55:27,790 And finally, we've seen a lot of promising work with deep learning methods 522 00:55:27,790 --> 00:55:33,010 being able not only to screen molecules but indeed to generate new molecules. 523 00:55:33,010 --> 00:55:41,470 So can we expect deep learning methods to fully remove the need for slow and costly human involvement in things like design and synthesis planning? 524 00:55:41,470 --> 00:55:47,530 Or do we still need the expert human on hand to guide this process? 525 00:55:47,530 --> 00:55:52,530 I'd just like to wrap up there. 526 00:55:52,530 --> 00:55:55,960 So I'd like to thank the entire Oxford Protein Informatics Group, 527 00:55:55,960 --> 00:56:04,170 which you can see all looking very professional on the right-hand side, but in particular former and present members Fergus Imrie, 528 00:56:04,170 --> 00:56:12,840 Tom Hadfield, Jack Scantlebury and Oliver Turnbull, whose work underpins a lot of the things I've spoken about today, 529 00:56:12,840 --> 00:56:18,870 particularly on the deep learning side of things. And thanks to Garrett and Beverley for inviting me to speak today. 530 00:56:18,870 --> 00:56:25,410 And thanks to all of you for listening. So I'd like to leave it there. 531 00:56:25,410 --> 00:56:29,820 And since we have some time available, I'd be happy to answer any questions you might have. 532 00:56:29,820 --> 00:56:32,800 Thank you. OK, thank you very much, Fergus. 533 00:56:32,800 --> 00:56:44,000 That was wonderful, and I invite everyone to use your yellow or whatever preferred colours and symbols to indicate your appreciation.
534 00:56:44,000 --> 00:56:49,058 I will stop recording now.