1
00:00:01,450 --> 00:00:09,340
OK, good, so it gives me great pleasure to welcome Graham Lee from Computer Science,

2
00:00:09,340 --> 00:00:17,400
who is going to be talking about automated testing, so please take it away.

3
00:00:17,400 --> 00:00:25,710
Great. Thank you very much for the introduction. Hopefully, what you're now looking at is the slides.

4
00:00:25,710 --> 00:00:37,050
So yes, I will rely on Garrett to moderate the chat and the Q&A.

5
00:00:37,050 --> 00:00:38,700
Welcome questions at any point.

6
00:00:38,700 --> 00:00:48,870
So feel free to nudge when you've got a question, but I'm going to be focussing on my presenter notes and my slides,

7
00:00:48,870 --> 00:00:55,830
so I probably won't notice and I'll rely on him to let me know if anything's come up.

8
00:00:55,830 --> 00:01:07,110
So yes, as Karen said, I'm in the Research Software Engineering group, which is based in the computer science department,

9
00:01:07,110 --> 00:01:12,060
although my background is in professional software engineering.

10
00:01:12,060 --> 00:01:22,590
So I did a degree in Oxford in physics back in 2004 and then moved into like commercial computing.

11
00:01:22,590 --> 00:01:32,520
I've worked at a bunch of companies, but I have always had software testing as a sort of thread running through my career,

12
00:01:32,520 --> 00:01:36,870
so a larger number of years ago than I care to think about,

13
00:01:36,870 --> 00:01:40,650
I wrote the book Test-Driven iOS Development,

14
00:01:40,650 --> 00:01:52,440
which is about how developers writing software apps for the iPhone and the iPad can test their software.

15
00:01:52,440 --> 00:02:01,800
I was the manager of an engineering group at Facebook who developed the mobile testing frameworks that Facebook used,

16
00:02:01,800 --> 00:02:07,800
and one of the things you have to ask in doing testing is what are the benefits?

17
00:02:07,800 --> 00:02:17,430
And obviously we hope that by having test coverage, by knowing something about the behaviour of our code,

18
00:02:17,430 --> 00:02:25,620
then we're going to have increased confidence in its correctness and in its function.

19
00:02:25,620 --> 00:02:36,890
But there are fringe benefits as well. And at Facebook, we actually reduced the time it took to release a new update to the Facebook mobile app,

20
00:02:36,890 --> 00:02:45,840
so the Facebook app for iOS and for Android, from four weeks to one and a half weeks, and it's probably even shorter now,

21
00:02:45,840 --> 00:02:55,470
just by taking some of the testing that was being done manually every time there was a new release candidate

22
00:02:55,470 --> 00:03:05,650
of the app and automating that so that we could very quickly get information on the quality of the software.

23
00:03:05,650 --> 00:03:17,650
And I also did some work for Apple last year documenting their test tools and the techniques for using software testing with their technology.

24
00:03:17,650 --> 00:03:26,200
And that's on the Apple developer website so that, you know, that's the bit where I just tell you what my CV is.

25
00:03:26,200 --> 00:03:32,410
So there's some legitimacy for me being the guy giving this talk.

26
00:03:32,410 --> 00:03:34,600
And then the question, you know,

27
00:03:34,600 --> 00:03:47,500
who is this Oxford Research Software Engineering group? Well, we're effectively a sort of service or a facility in the

28
00:03:47,500 --> 00:03:57,310
university for helping researchers to achieve their goals using custom software and bespoke software development.

29
00:03:57,310 --> 00:04:08,170
That often means that we get involved in a project like a grant funded research project or a spin out actually writing software.

30
00:04:08,170 --> 00:04:17,140
But that's not the only thing we do. We also obviously do outreach, like these seminars, and we do teaching.

31
00:04:17,140 --> 00:04:25,780
And really, one of our main goals is to sort of bring up the standard of software development across the university.

32
00:04:25,780 --> 00:04:33,190
So rather than just being, you know, some sort of gatekeeper or central clearinghouse for software,

33
00:04:33,190 --> 00:04:39,650
we're actually sort of building a community of expertise and practice across the university.

34
00:04:39,650 --> 00:04:46,810
So that's really why, you know, I'm very happy to be given the opportunity to.

35
00:04:46,810 --> 00:04:52,420
Yeah, to waffle on at, like, four on a Friday afternoon about software testing,

36
00:04:52,420 --> 00:04:56,650
but less waffle, because there's only me between us and the weekend now.

37
00:04:56,650 --> 00:05:00,280
So let's get going.

38
00:05:00,280 --> 00:05:10,180
So this is really an introduction to the idea of software testing in a sort of scientific or like computational research context.

39
00:05:10,180 --> 00:05:17,380
So I'm going to try and stick mostly to sort of principles about how we think about testing,

40
00:05:17,380 --> 00:05:25,630
why we think about testing and how to sort of plan for creating tests for your software.

41
00:05:25,630 --> 00:05:30,940
I'm not going to go in-depth on any particular like tools or technologies,

42
00:05:30,940 --> 00:05:38,770
partly because I think telling you how to use something before you've got motivation for using it is,

43
00:05:38,770 --> 00:05:42,790
you know, is kind of off-putting and not relevant or useful.

44
00:05:42,790 --> 00:05:51,520
And partly because, you know, there's a wealth of different technologies out there and it depends on what you're trying to do.

45
00:05:51,520 --> 00:05:56,470
If you're writing some sort of data manipulation in R,

46
00:05:56,470 --> 00:06:03,130
you're going to have a very different experience from if you're writing a web application in JavaScript.

47
00:06:03,130 --> 00:06:08,890
And so like picking any one of those would lose a bunch of audience and not

48
00:06:08,890 --> 00:06:13,270
necessarily even be useful for the people who are using that particular technology.

49
00:06:13,270 --> 00:06:18,290
So what are we doing when we test software?

50
00:06:18,290 --> 00:06:22,210
You know, what are we trying to get out of this thing?

51
00:06:22,210 --> 00:06:33,820
And I've come up with four sort of goals for testing, which I call continuity, correctness, reproducibility and recovery.

52
00:06:33,820 --> 00:06:37,990
So let's take a look at those. By continuity,

53
00:06:37,990 --> 00:06:46,330
I mean that what the software does today should be, you know, somewhat related to what it's going to do tomorrow.

54
00:06:46,330 --> 00:06:52,990
We obviously do evolve software. We add new things, we fix bugs.

55
00:06:52,990 --> 00:07:04,510
The idea is any of these should be an improvement. It's very rare for us to deliberately remove capabilities from software.

56
00:07:04,510 --> 00:07:14,140
It does happen. Sometimes we realise that we're supporting an old platform that's no longer relevant, or we've

57
00:07:14,140 --> 00:07:23,290
got some old algorithm that the community has moved on from and that we don't need to have that algorithm anymore.

58
00:07:23,290 --> 00:07:29,580
But those are, like, specific events that we can plan for. What we really don't want

59
00:07:29,580 --> 00:07:37,830
are unplanned breakages or loss of functionality, which are called regressions in the software industry.

60
00:07:37,830 --> 00:07:47,400
You can imagine that if you've published research based on a code that performs a simulation or does some

61
00:07:47,400 --> 00:07:56,340
analysis of the data and someone comes along and wants to replicate that analysis or rerun that simulation,

62
00:07:56,340 --> 00:08:01,650
they may want to do it in a newer context. They may want to try new ideas.

63
00:08:01,650 --> 00:08:05,760
They may want to use newer techniques,

64
00:08:05,760 --> 00:08:14,970
but they want the thing to basically work, so they still want to get the results that they were able to get before.

65
00:08:14,970 --> 00:08:22,710
So one thing that having tests gives us is not only the knowledge that our software works now,

66
00:08:22,710 --> 00:08:31,260
it's knowledge about whether future versions of the software still have that earlier capability and that

67
00:08:31,260 --> 00:08:40,660
earlier behaviour because we can always keep these tests and run them against new versions of the software.

68
00:08:40,660 --> 00:08:47,590
Correctness is perhaps the one that makes a lot of people doing scientific

69
00:08:47,590 --> 00:08:53,130
computation kind of stop and wonder whether testing is really relevant for them.

70
00:08:53,130 --> 00:08:59,920
You know, I'm doing research, I'm trying to find out the results

71
00:08:59,920 --> 00:09:05,890
for a question for which I don't know the answer. By definition, if I knew what the answer to the question was,

72
00:09:05,890 --> 00:09:11,530
it wouldn't be research. So how can I write a test when

73
00:09:11,530 --> 00:09:16,030
I don't know what the outcome is going to be? And that is a good question.

74
00:09:16,030 --> 00:09:28,480
It's an important question. Yeah, we could have some complex problem domain that we're trying to model and a new context to explore with that model.

75
00:09:28,480 --> 00:09:36,770
And while we may not know what the outcome is going to be in terms of the research problem,

76
00:09:36,770 --> 00:09:46,500
we want to at least have an idea that the model that we have come up with conceptually

77
00:09:46,500 --> 00:09:48,900
is correctly implemented in our code.

78
00:09:48,900 --> 00:09:58,170
So, you know, say I'm simulating this sort of thing, which looks like some many-body problem, maybe in gravitation.

79
00:09:58,170 --> 00:10:08,310
Well, we have models of many-body problems in gravitation and we know how a model like this behaves over time.

80
00:10:08,310 --> 00:10:17,730
We know that if we set it into some initial condition or some initial situation and then progress it by some amount,

81
00:10:17,730 --> 00:10:22,560
we know where everything should end up. And if we know where everything should end up,

82
00:10:22,560 --> 00:10:34,730
we also know that if it does end up there, it is correct, and if it doesn't end up there, then something has gone wrong.

83
00:10:34,730 --> 00:10:38,960
Obviously, we're not building physical models, we're building software models,

84
00:10:38,960 --> 00:10:47,810
but software models of complex gravitational problems still have the aspect

85
00:10:47,810 --> 00:10:57,080
that they are implementing some part of a simulation of a problem domain.

86
00:10:57,080 --> 00:11:10,160
And if we design the simulation, then we can know how that simulation behaves and we can validate that it is behaving in the way that we expect.

87
00:11:10,160 --> 00:11:14,990
And there are things that we can do to help that.

88
00:11:14,990 --> 00:11:28,700
So in a many-body problem, we know what happens when there are two bodies in a gravitational interaction. Or, in a more complex system,

89
00:11:28,700 --> 00:11:37,670
it's easier to work in, say, the sort of low-velocity relativity domain, where, you know, space and time are

90
00:11:37,670 --> 00:11:45,150
basically constant and don't change, than it is in the sort of Einstein domain,

91
00:11:45,150 --> 00:11:49,730
with velocities approaching the speed of light. That's not necessarily the problem that we're trying to solve here.

92
00:11:49,730 --> 00:12:06,080
It's just an example. And this brings us on to a principle that software testers use called equivalence partitioning. You know,

93
00:12:06,080 --> 00:12:14,100
another problem we may have is the scientific problem we're trying to model could be incredibly large.

94
00:12:14,100 --> 00:12:17,690
Just to give a different example.

95
00:12:17,690 --> 00:12:28,970
Garrett and I were talking before the start of the seminar about the behaviour of particular proteins in a biological system.

96
00:12:28,970 --> 00:12:37,820
Now, one thing that we might want to do with a computer is simulate the structure of these proteins by saying, well, they're built

97
00:12:37,820 --> 00:12:44,570
of all of these components, all of these atoms organised in this particular way.

98
00:12:44,570 --> 00:12:50,680
What structure is that going to sort of collapse into, you know,

99
00:12:50,680 --> 00:13:01,550
when the various electromagnetic forces on the different ions and atoms in the thing are in a sort of stable state?

100
00:13:01,550 --> 00:13:07,100
And that is a common computational problem to solve,

101
00:13:07,100 --> 00:13:15,320
you know, when you're working with proteins. Now, if you take something big, like a virus, and say, how is this going to fold?

102
00:13:15,320 --> 00:13:18,240
A, you may not know the answer in advance.

103
00:13:18,240 --> 00:13:27,830
B, it could take a very long time, you know, even on, like, a supercomputer cluster, in order to find out what the answer is.

104
00:13:27,830 --> 00:13:37,340
But let's take a simpler problem. We know what the angle between the two hydrogen-oxygen bonds in a water molecule is.

105
00:13:37,340 --> 00:13:40,430
Does our model get that right?

106
00:13:40,430 --> 00:13:47,690
If it, you know, if it doesn't, then we probably shouldn't be particularly confident in using it for any more complex problem.

107
00:13:47,690 --> 00:13:54,590
If it works for water, then let's try a sugar. Well, let's try a really simple protein and see whether it gets the right answer.

108
00:13:54,590 --> 00:13:58,190
We're not doing anything weird in a software context here.

109
00:13:58,190 --> 00:14:02,750
What we're saying is: for this problem, where I know what the conditions are,

110
00:14:02,750 --> 00:14:10,220
I also know what the outcome is, and I can run my software and then verify that I get the same outcome.

111
00:14:10,220 --> 00:14:23,490
If I run this multiple times with different inputs and always get the expected outcomes, then I increase my confidence that my software is correct.

112
00:14:23,490 --> 00:14:30,420
And then the two remaining goals are reproducibility and recovery.

113
00:14:30,420 --> 00:14:42,370
So reproducibility is obviously very important in research; that's why we have the reproducible research network.

114
00:14:42,370 --> 00:14:48,040
We want someone who's running our analysis with our code to get the same results,

115
00:14:48,040 --> 00:14:55,390
and that may mean that they're running it on our computer; it may mean that they're running it on a different computer.

116
00:14:55,390 --> 00:15:00,040
It may just mean that they're doing exactly the same thing at a different time.

117
00:15:00,040 --> 00:15:04,810
But we would expect to get the same results in that context.

118
00:15:04,810 --> 00:15:15,040
Someone running our analysis, but with different data, should get consistent results in many circumstances.

119
00:15:15,040 --> 00:15:21,160
If you're running a simulation in different but related domains and the simulation

120
00:15:21,160 --> 00:15:28,550
correctly behaves and represents the outcome of the model in those particular domains,

121
00:15:28,550 --> 00:15:31,960
then the results should be comparable somehow.

122
00:15:31,960 --> 00:15:41,350
We would also like someone who takes our ideas, takes our model and recodes it, to get consistent results,

123
00:15:41,350 --> 00:15:51,610
or at least, if they don't get consistent results, then it's possible for us to investigate where the disparity comes from.

124
00:15:51,610 --> 00:15:57,010
And that's another thing that we're going to get from automated testing, that we'll look at later: a bit more fine-grained

125
00:15:57,010 --> 00:16:08,720
information about how the different parts of our software system interact and which bits of it are behaving in particular ways.

126
00:16:08,720 --> 00:16:19,010
And there's a really important thing to bear in mind when we're talking about, like, reusing software, reproducing the results

127
00:16:19,010 --> 00:16:26,750
we get out of software, and recovering the behaviour of software that we used a long time ago,

128
00:16:26,750 --> 00:16:34,160
and that's that sometimes the poor person who is having to deal with the software is, like, me or you.

129
00:16:34,160 --> 00:16:41,840
It's the same person who wrote it. And you know, we get distracted by another project or like, we get some teaching that we have to do for a term.

130
00:16:41,840 --> 00:16:49,250
And a few months later, we come back and we don't quite remember what we were doing.

131
00:16:49,250 --> 00:16:56,090
There was obviously some stroke of genius when we wrote that function there, but why did we write it that way?

132
00:16:56,090 --> 00:17:04,190
What does it do? When we've got a collection of tests that say, here's what this part of the software does in these circumstances,

133
00:17:04,190 --> 00:17:11,810
that's more documentation, that's more help, both to us to kind of recover our mental model of what

134
00:17:11,810 --> 00:17:19,700
the software does and for other people to reconstruct that mental model and get an idea of

135
00:17:19,700 --> 00:17:25,050
how this software works so that they can either reuse it or develop it.

136
00:17:25,050 --> 00:17:30,420
So, you know, how would testing help in these scenarios?

137
00:17:30,420 --> 00:17:35,880
So someone else wants to use our code and run it with the same data.

138
00:17:35,880 --> 00:17:46,230
If we've got a collection of tests that explain what the software does and how it should behave for particular inputs,

139
00:17:46,230 --> 00:17:56,400
then before someone else runs the simulation or runs the analysis and checks whether they get the same outcome,

140
00:17:56,400 --> 00:18:03,570
they can run those tests and see whether they all pass to see whether all of our expectations are satisfied.

141
00:18:03,570 --> 00:18:08,260
And if any of those things fails,

142
00:18:08,260 --> 00:18:15,900
that's going to give some information about what the assumption is that isn't satisfied in this new context.

143
00:18:15,900 --> 00:18:20,280
Maybe the software expects some files to be present,

144
00:18:20,280 --> 00:18:29,370
like configuration files or inputs set up in a particular way, that I've got in my home directory and that definitely works for me.

145
00:18:29,370 --> 00:18:39,160
But someone else needs to know that information to set the same thing up in the same way if they want to get compatible results.

146
00:18:39,160 --> 00:18:44,590
Someone using different data wants to get consistent results. Well, again,

147
00:18:44,590 --> 00:18:55,630
if we know if we can prove that the software does correctly implement the model or the other sort of scientific concepts that we're trying to embody,

148
00:18:55,630 --> 00:19:06,340
then they can be somewhat confident that the results they get out from using it with their data are the result of our model being applied to

149
00:19:06,340 --> 00:19:16,240
that data and not the result of something weird going on with some code or with there being some mistakes in the behaviour somewhere.

150
00:19:16,240 --> 00:19:22,690
And then if someone else wants to take and, like, re-implement our model,

151
00:19:22,690 --> 00:19:33,490
be that just for a cross-check to make sure that they understand what the model is, maybe because they're using a different context,

152
00:19:33,490 --> 00:19:36,940
like maybe their supercomputer or their cluster doesn't have the same

153
00:19:36,940 --> 00:19:41,920
libraries as ours and they want to build a version that is compatible with their setup.

154
00:19:41,920 --> 00:19:46,570
Or maybe they're using a Mac and we were using Linux or for whatever reason,

155
00:19:46,570 --> 00:19:52,540
they want to rebuild it. If they can see the tests and they can see the expected results,

156
00:19:52,540 --> 00:19:59,170
and they can compare the results they get from their implementation with the results that they get from our implementation,

157
00:19:59,170 --> 00:20:01,990
then they know something about the compatibility of those without having to just, like,

158
00:20:01,990 --> 00:20:05,890
run the whole experiment and see what the outcome is at the end.

159
00:20:05,890 --> 00:20:12,010
So really, all of these things are about increasing confidence in the software and increasing

160
00:20:12,010 --> 00:20:21,160
the rate at which we get feedback that informs that confidence in that software.

161
00:20:21,160 --> 00:20:26,120
That was my strategic pause, just to check whether there were any questions, obviously not right now.

162
00:20:26,120 --> 00:20:33,370
So I'll carry on. So how do we design a software test?

163
00:20:33,370 --> 00:20:44,620
You can think of the behaviour of any software as being a form of contract that you make with the user of the software,

164
00:20:44,620 --> 00:20:51,160
whether that's yourself or other people in your group or members of the public or whoever's using the software,

165
00:20:51,160 --> 00:20:57,040
you can think about a form of contract where you say:

166
00:20:57,040 --> 00:21:01,740
if you arrange for this collection of things to be true

167
00:21:01,740 --> 00:21:14,310
and then use the software, then I will make this guarantee about the outcome or, if there is a calculation, about the result of using the software.

168
00:21:14,310 --> 00:21:21,210
And like that sort of design principle, whatever tools you're using to write your tests,

169
00:21:21,210 --> 00:21:32,250
whatever sort of level (we'll come on later to discuss the different sort of levels of abstraction that exist in designing tests),

170
00:21:32,250 --> 00:21:37,050
this idea of a contract is universal.

171
00:21:37,050 --> 00:21:44,530
If you set the world up in this way and then use my software, I will do this thing as a result.

172
00:21:44,530 --> 00:21:54,870
And so, you know, we could think back to the many-body problem. I can say, if you have a mass, a point mass,

173
00:21:54,870 --> 00:22:05,410
of mass m1 here, and a mass of mass m2 over here, and the distance between them is r,

174
00:22:05,410 --> 00:22:14,230
and you then ask my software to calculate the gravitational force on the first mass,

175
00:22:14,230 --> 00:22:19,850
it's going to say that result is G m1 m2 over r squared, which is, like,

176
00:22:19,850 --> 00:22:29,050
the Newtonian gravitational force equation, in the direction from the first point to the second point.

177
00:22:29,050 --> 00:22:34,210
If you set the mass of one of these things to be negative,

178
00:22:34,210 --> 00:22:39,370
then my software is going to generate an error because we haven't worked out how to do

179
00:22:39,370 --> 00:22:47,570
negative mass or we decided that negative mass isn't a problem we want to solve.

180
00:22:47,570 --> 00:22:58,160
If you have multiple masses in your simulation and you ask what is the force due to gravity over here,

181
00:22:58,160 --> 00:23:04,880
we're going to work out each of those individual contributions and sum them.

182
00:23:04,880 --> 00:23:11,720
And so, you know, you can see that this idea of the contract is coming into play: if you have done this,

183
00:23:11,720 --> 00:23:17,720
if there is a mass here and a mass there, and then you ask for the gravitational forces,

184
00:23:17,720 --> 00:23:23,480
then, you know, what the software is going to do is to give you this answer.

185
00:23:23,480 --> 00:23:34,640
And so a test takes that contract, takes that idea of the preconditions and then the action and then the postconditions,

186
00:23:34,640 --> 00:23:39,610
and what you do is you create a single concrete example of that,

187
00:23:39,610 --> 00:23:50,200
where you know what the answer is for a given question. So if the mass of the other object is zero, then the gravitational force is zero.

188
00:23:50,200 --> 00:23:57,640
Super simple one, and that is a valid case, and we can write that as a test.

189
00:23:57,640 --> 00:24:09,740
If each mass is one kilogram and the distance is one metre, then, yeah, the force is just, numerically, the gravitational constant G.
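
The concrete examples just described (a zero mass, unit masses at unit distance, and a negative mass that is rejected) could be written down roughly as follows. This is only a sketch: `gravitational_force` is a hypothetical helper invented for illustration, the value of `G` is approximate, the check on the distance is an added assumption, and the tests are plain functions of the kind a runner such as pytest could collect.

```python
import math

G = 6.674e-11  # gravitational constant in N m^2 kg^-2 (approximate value)


def gravitational_force(m1, m2, r):
    """Magnitude of the Newtonian force between two point masses.

    Hypothetical helper for illustration, not code from the talk.
    """
    if m1 < 0 or m2 < 0:
        # The contract says negative mass is an error, not a result.
        raise ValueError("negative mass is outside the contract")
    if r <= 0:
        # Assumption: the distance must be positive for the formula to apply.
        raise ValueError("distance must be positive")
    return G * m1 * m2 / r**2


def test_zero_mass_gives_zero_force():
    assert gravitational_force(5.0, 0.0, 2.0) == 0.0


def test_unit_masses_at_unit_distance_give_G():
    assert math.isclose(gravitational_force(1.0, 1.0, 1.0), G)


def test_negative_mass_is_rejected():
    try:
        gravitational_force(-1.0, 1.0, 1.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for negative mass")
```

Each test is one concrete instance of the contract: known preconditions, one action, one known outcome.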

190
00:24:09,740 --> 00:24:16,670
Again, another example. And we, you know, we start to think, well,

191
00:24:16,670 --> 00:24:25,340
aren't there an infinity of examples? Just taking this sort of scenario of a many-body gravitational problem,

192
00:24:25,340 --> 00:24:31,790
I could have anywhere from zero to an infinite number of different masses.

193
00:24:31,790 --> 00:24:37,640
each at any of infinite points in space,

194
00:24:37,640 --> 00:24:47,510
And with infinite initial velocities, do I really have to write that many infinities of different tests?

195
00:24:47,510 --> 00:24:55,250
And so...

196
00:24:55,250 --> 00:25:09,650
What software testers do is they look for what are actually the meaningfully distinct regions, so distinct domains in the problem space.

197
00:25:09,650 --> 00:25:17,090
And then they write tests that capture one example of each of those regions.

198
00:25:17,090 --> 00:25:28,190
So, you know, there's the trivially degenerate case that there are no masses in your many-body problem simulation;

199
00:25:28,190 --> 00:25:33,260
there's a very simple answer there. There's having one mass.

200
00:25:33,260 --> 00:25:45,600
That's, again, a trivial situation of which there's one example. Having two masses is another similarly simple example.

201
00:25:45,600 --> 00:25:49,740
And then, you know, the idea that as you add more,

202
00:25:49,740 --> 00:25:58,680
what happens is that you sum the contributions to gravity from each mass, tells us that as soon as you've got more than two,

203
00:25:58,680 --> 00:26:01,260
if it works for any number of more than two,

204
00:26:01,260 --> 00:26:06,840
it works for all numbers more than two because it's just got to do the same maths, but with with more inputs.

205
00:26:06,840 --> 00:26:17,000
So a tester would write a test for zero, one, two and three masses and would then be entirely happy.
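
As a rough illustration of that partitioning, the sketch below writes one test per region: no masses, one mass, two masses, and three masses standing in for "more than two". The function `forces`, its tuple layout and the value of `G` are assumptions made for this example, and it assumes no two bodies share a position.

```python
import math

G = 6.674e-11  # gravitational constant in N m^2 kg^-2 (approximate value)


def forces(bodies):
    """Net Newtonian force on each body in two dimensions.

    bodies is a list of (mass, x, y) tuples at distinct positions; the
    result is a list of (fx, fy) tuples in the same order. Hypothetical
    helper for illustration only.
    """
    result = []
    for i, (mi, xi, yi) in enumerate(bodies):
        fx = fy = 0.0
        for j, (mj, xj, yj) in enumerate(bodies):
            if i == j:
                continue  # a body exerts no force on itself
            dx, dy = xj - xi, yj - yi
            r = math.hypot(dx, dy)
            magnitude = G * mi * mj / r**2
            # Sum each contribution along the direction towards body j.
            fx += magnitude * dx / r
            fy += magnitude * dy / r
        result.append((fx, fy))
    return result


def test_no_masses():
    assert forces([]) == []


def test_one_mass_feels_no_force():
    assert forces([(1.0, 0.0, 0.0)]) == [(0.0, 0.0)]


def test_two_masses_attract_equally_and_oppositely():
    (f1x, f1y), (f2x, f2y) = forces([(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)])
    assert math.isclose(f1x, G) and math.isclose(f2x, -G)
    assert f1y == 0.0 and f2y == 0.0


def test_three_masses_sum_contributions():
    left, middle, right = forces(
        [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    )
    # The middle body is pulled equally both ways; the left body feels
    # G from the middle body plus G/4 from the right-hand one.
    assert math.isclose(middle[0], 0.0, abs_tol=1e-20)
    assert math.isclose(left[0], 1.25 * G)
    assert math.isclose(right[0], -1.25 * G)
```

The three-body test is the representative of its whole region: the same summation runs for any larger number of bodies.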

206
00:26:17,000 --> 00:26:24,080
Which works really well for sort of discrete variables like that. Where you have continuous variables, there's

207
00:26:24,080 --> 00:26:34,430
a related idea where, as well as there being different ranges or domains in the problem space,

208
00:26:34,430 --> 00:26:45,200
you also then do what's called boundary value analysis, where at the boundary between two of these regimes you say,

209
00:26:45,200 --> 00:26:49,070
you know, are the results effectively continuous through the boundary?

210
00:26:49,070 --> 00:26:53,420
Does it do the right thing as you move from one domain to the other?

211
00:26:53,420 --> 00:27:00,920
And so if we had some simulation that had, say, like, relativistic corrections, and at small velocities

212
00:27:00,920 --> 00:27:05,780
it just used, like, a normal linear space and time,

213
00:27:05,780 --> 00:27:12,110
and then when you went to higher velocities, it used the relativistic corrections,

214
00:27:12,110 --> 00:27:18,350
There would be some point in between where it started to use these corrections,

215
00:27:18,350 --> 00:27:24,440
and the tester would look at what happens just below this point,

216
00:27:24,440 --> 00:27:29,330
what happens on that point and what happens just after, and make sure that there are sort of

217
00:27:29,330 --> 00:27:36,030
three consistent values which make sure that the transition through the regime is smooth.
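
A boundary value test for that kind of switch-over might look like the sketch below. Everything specific here is an assumption for illustration: the quantity (kinetic energy), the name `kinetic_energy`, the switching speed of one percent of the speed of light, and the tolerance, which is chosen to be looser than the roughly 0.0075% gap between the two formulas at that speed.

```python
import math

C = 299_792_458.0  # speed of light in m/s
SWITCH_SPEED = 0.01 * C  # hypothetical speed at which the correction turns on


def kinetic_energy(mass, speed):
    """Classical formula at low speed, relativistic formula at or above the switch."""
    if speed < SWITCH_SPEED:
        return 0.5 * mass * speed**2
    gamma = 1.0 / math.sqrt(1.0 - (speed / C) ** 2)
    return (gamma - 1.0) * mass * C**2


def test_energy_is_smooth_across_the_switch():
    just_below = kinetic_energy(1.0, SWITCH_SPEED * (1 - 1e-6))
    on_boundary = kinetic_energy(1.0, SWITCH_SPEED)
    just_above = kinetic_energy(1.0, SWITCH_SPEED * (1 + 1e-6))
    # The three values should increase with speed and agree closely,
    # so there is no visible jump where the regime changes.
    assert just_below < on_boundary < just_above
    assert math.isclose(just_below, on_boundary, rel_tol=1e-3)
    assert math.isclose(on_boundary, just_above, rel_tol=1e-3)
```

If the switch were placed at a speed where the two formulas disagree noticeably, or one branch had a bug, the three values would stop being consistent and the test would fail.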

218
00:27:36,030 --> 00:27:45,150
And this is so common in software testing that there are a couple of little sort of mantras that are used,

219
00:27:45,150 --> 00:27:56,880
either in particular technologies or by particular communities, to sort of encapsulate this idea of the contract and of the preconditions,

220
00:27:56,880 --> 00:28:01,330
the action and the postconditions. One of them,

221
00:28:01,330 --> 00:28:14,110
which is very common in sort of communicating the meaning of a test between software developers and, say, problem domain

222
00:28:14,110 --> 00:28:24,190
experts, say researchers who are working on the software, is to use the phrase "given, when, then", which encapsulates that idea.

223
00:28:24,190 --> 00:28:28,810
Given that this set of conditions was created,

224
00:28:28,810 --> 00:28:33,580
when the software does this, then this is the outcome.

225
00:28:33,580 --> 00:28:44,220
So again, we've got stuff that happened first: given this set of initial conditions. We've got an action: when this happens in the software.

226
00:28:44,220 --> 00:28:48,790
And we've got an outcome: then this will be the result.

227
00:28:48,790 --> 00:29:01,570
People using unit test frameworks, which are a way of testing small components as they are like little pieces of a bigger software system,

228
00:29:01,570 --> 00:29:07,180
use a phrase called "Assemble, Act, Assert".

229
00:29:07,180 --> 00:29:14,650
And again, the precondition is you have assembled this thing in this state.

230
00:29:14,650 --> 00:29:25,200
The act is the action that the software takes, and then asserting is saying, I am telling you that the result of the software will be this.
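
In a unit test framework, those three steps usually appear as three short sections of one test method. The sketch below uses Python's standard `unittest` module; the `Particle` class is a hypothetical example invented here, and the comments show how "given, when, then" lines up with "assemble, act, assert".

```python
import unittest


class Particle:
    """Hypothetical point particle moving at constant velocity in one dimension."""

    def __init__(self, position, velocity):
        self.position = position
        self.velocity = velocity

    def advance(self, dt):
        """Move the particle forward by dt seconds."""
        self.position += self.velocity * dt


class ParticleTest(unittest.TestCase):
    def test_particle_moves_with_constant_velocity(self):
        # Assemble (given): a particle at the origin moving at 2 m/s
        particle = Particle(position=0.0, velocity=2.0)

        # Act (when): the simulation advances by 3 seconds
        particle.advance(3.0)

        # Assert (then): the particle is 6 metres from the origin
        self.assertAlmostEqual(particle.position, 6.0)
```

Saved in a file, this could be run with `python -m unittest`, which reports each test as either passed or failed.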

231
00:29:25,200 --> 00:29:32,070
So a software test is always a binary outcome,

232
00:29:32,070 --> 00:29:37,410
and it's an assertion of what the correct behaviour should be, and a failure to satisfy that

233
00:29:37,410 --> 00:29:47,640
assertion means not having confidence in the software means believing that something has gone wrong.

234
00:29:47,640 --> 00:29:54,300
So with that idea of how to build a test in mind, what's the best way to get started?

235
00:29:54,300 --> 00:30:01,050
The easiest thing to do is just take your existing software and think about this given when

236
00:30:01,050 --> 00:30:07,470
then think about this idea of the contract and apply it to running the software as a whole.

237
00:30:07,470 --> 00:30:17,970
And this is called an end-to-end or E2E test in the sort of jargon of professional software testing.

238
00:30:17,970 --> 00:30:30,080
If you've got, like, a big problem, like a machine learning training problem or a massive, like, supercomputer simulation, that's going to take

239
00:30:30,080 --> 00:30:36,830
a long time and a lot of resources to run, then yeah, this is not necessarily the optimal thing to do.

240
00:30:36,830 --> 00:30:44,240
You may end up waiting a very long time, or even costing a large amount of money, just to get the results of your tests.

241
00:30:44,240 --> 00:30:55,880
And this is why we look for sort of sample problems, toy problems, smaller datasets, something where not only do we know what the outcome is,

242
00:30:55,880 --> 00:31:00,410
but also the sort of computational effort in getting to that outcome is going

243
00:31:00,410 --> 00:31:07,670
to be small, because the less time it takes to run through your tests,

244
00:31:07,670 --> 00:31:14,180
the more frequently you're going to do it. Yeah, that's just the way that people work.

245
00:31:14,180 --> 00:31:18,200
If it takes more than a few minutes to do something,

246
00:31:18,200 --> 00:31:25,190
we get distracted and we're going to look at something else: we go and check social media or our emails or go make a cup of coffee or whatever.

247
00:31:25,190 --> 00:31:31,670
And so we tend to, like, save this for, oh, it's lunchtime,

248
00:31:31,670 --> 00:31:37,990
I'm going to set off my tests and then go and get some lunch and then come back and see the result.

249
00:31:37,990 --> 00:31:43,540
But if I'm only running my tests every lunchtime, then if they passed on Monday lunchtime,

250
00:31:43,540 --> 00:31:47,500
they passed on Tuesday lunchtime and they fail on Wednesday lunchtime,

251
00:31:47,500 --> 00:31:54,670
The only thing I know is I did something on Tuesday afternoon or Wednesday morning that broke.

252
00:31:54,670 --> 00:31:59,210
That made this software behave in a way that I believe is incorrect.

253
00:31:59,210 --> 00:32:07,400
So I've now got to kind of go back through my entire set of changes I made over that day and try and understand what it was.

254
00:32:07,400 --> 00:32:16,340
If it takes me like a minute to run through the tests, I might just do it every time I've changed the software, and then

255
00:32:16,340 --> 00:32:22,430
if I run them and I find that something's failed, I've only got to go back to the thing I was doing a minute ago,

256
00:32:22,430 --> 00:32:27,560
which is still fresh in my head, and (a) I know when the change

257
00:32:27,560 --> 00:32:32,560
was, and (b) I know what I was changing because, yeah,

258
00:32:32,560 --> 00:32:40,400
I know there's a limited amount of stuff you can do in that time and I know what I was trying to achieve.

259
00:32:40,400 --> 00:32:48,240
So I've got some idea of what I introduced that could make the thing go wrong.

260
00:32:48,240 --> 00:32:52,950
So we tend not to build massive batteries of end to end tests,

261
00:32:52,950 --> 00:33:02,180
we tend to build a small number of highly important tests that show that basic things work

262
00:33:02,180 --> 00:33:05,940
and that our system is basically glued together the right way.

263
00:33:05,940 --> 00:33:15,210
So if I think back to the work I was doing at Facebook, we would have a smoke test that was can I launch the iPhone app,

264
00:33:15,210 --> 00:33:22,970
log into Facebook and then post some text as a status to my newsfeed?

265
00:33:22,970 --> 00:33:29,990
And that would get run by every developer every time they made a change to the application.

266
00:33:29,990 --> 00:33:36,950
Most of these changes weren't going to break that behaviour, but as soon as someone did break that behaviour,

267
00:33:36,950 --> 00:33:42,920
you wanted to know about it because if you had a version of the Facebook app where you couldn't post to your newsfeed,

268
00:33:42,920 --> 00:33:47,030
that wouldn't be useful to almost anybody using the application.

269
00:33:47,030 --> 00:33:57,000
So this is a very high value test, a very small, focussed piece of the functionality that we were exploring.

270
00:33:57,000 --> 00:34:05,640
And these tests typically don't need any changes to your software if you manage to change your

271
00:34:05,640 --> 00:34:12,930
dataset or your problem specification so that you're running a very small problem.

272
00:34:12,930 --> 00:34:22,240
You're just using your existing software. There are no real design changes required, and you can just run through these with

273
00:34:22,240 --> 00:34:32,110
a script, if they're sort of simulation tools that you just run from the command line, or if you've got something that's got a user interface,

274
00:34:32,110 --> 00:34:40,680
you can find a tool for automating, like, pressing buttons on the user interface that will just run your software as is.
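
A minimal sketch of that kind of scripted end-to-end test, in Python. It assumes the software is a command-line program; here a tiny stand-in script that sums the numbers in a file plays that role, and the file names, the `run_end_to_end` helper and the expected total are all invented for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for "your existing software": sums the numbers in a text file.
PROGRAM = """
import sys
with open(sys.argv[1]) as handle:
    total = sum(float(line) for line in handle if line.strip())
print(f"total={total:.1f}")
"""


def run_end_to_end(input_text):
    """Run the whole program on a small known input and return what it printed."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "simulate.py"
        data = Path(tmp) / "toy_input.txt"
        script.write_text(PROGRAM)
        data.write_text(input_text)
        completed = subprocess.run(
            [sys.executable, str(script), str(data)],
            capture_output=True,
            text=True,
            check=True,  # a crash in the program fails the test too
        )
        return completed.stdout.strip()


def test_toy_problem():
    # Given a toy dataset, when the program runs, then we get the known answer.
    assert run_end_to_end("1.5\n2.5\n") == "total=4.0"
```

The program itself is not changed at all; only the input is small enough that the run takes a fraction of a second.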

275
00:34:40,680 --> 00:34:51,780
These tests are very useful because they tell you whether your software is kind of all plumbed together properly and is dealing with data

276
00:34:51,780 --> 00:35:00,490
as you would expect. But they're also very low signal in that if it goes wrong, what you know is there's a problem in your software somewhere.

277
00:35:00,490 --> 00:35:05,920
You know, think about that Facebook example. Let's say that the

278
00:35:05,920 --> 00:35:13,510
ability to post to a feed didn't work. Is that because the little submit button in the UI is broken?

279
00:35:13,510 --> 00:35:23,080
Is it because the thing that sends the data to the network is broken? And how is it broken: because it can't connect to the network?

280
00:35:23,080 --> 00:35:27,280
Or is it because it isn't reading the data out of the UI?

281
00:35:27,280 --> 00:35:34,600
Or is it that that all got sent, and then the UI does not update to show the new results?

282
00:35:34,600 --> 00:35:38,440
Or is it that the server ignored this data coming in?

283
00:35:38,440 --> 00:35:48,190
You know, there's so many different ways in which this test could fail in any given way that all we know really is that the software is broken.

284
00:35:48,190 --> 00:35:56,230
We don't really have any information on what is broken and any way to narrow down our investigation on how to fix it.

285
00:35:56,230 --> 00:36:06,820
So the community has this idea of the test pyramid. Going back to my example of Kerbal Space Program,

286
00:36:06,820 --> 00:36:13,540
we certainly could test the Space Shuttle by building a space shuttle and then seeing whether the Space Shuttle works.

287
00:36:13,540 --> 00:36:19,840
But that's a really expensive and time consuming way to test the Space Shuttle.

288
00:36:19,840 --> 00:36:28,450
It's built out of all of these different components, right? One of the earliest things that NASA did when they were building, not the Kerbal shuttle,

289
00:36:28,450 --> 00:36:35,230
but the actual real Space Shuttle, was to build the bit that's got the delta wings.

290
00:36:35,230 --> 00:36:44,530
Don't bother putting any engines on it. Just strap that to the back of a jumbo jet, take off, let go of the thing and then try and land it.

291
00:36:44,530 --> 00:36:50,680
And that tells you whether the aerodynamics and the controls work without having to build these massive solid

292
00:36:50,680 --> 00:36:56,080
rocket boosters or the main engine, without having to assemble all of that and then stick it on a launch pad,

293
00:36:56,080 --> 00:37:06,070
fuel it up and send the thing up, without even having to build the little, sort of, control engines that are on the back of the main body.

294
00:37:06,070 --> 00:37:14,380
So they took a component of this complete system, isolated that component,

295
00:37:14,380 --> 00:37:22,840
set that into some reasonable starting condition and then saw how that behaved once they initiated some action,

296
00:37:22,840 --> 00:37:30,440
which was landing the Space Shuttle. And we can do that kind of thing with software as well.

297
00:37:30,440 --> 00:37:43,070
So now we do get into the stage where we're having to think about the design of our software, which components are actually distinct.

298
00:37:43,070 --> 00:37:55,960
Which functionality, or distinct behaviours, in this software are responsible for some subset of the overall system?

299
00:37:55,960 --> 00:38:02,440
How are those related to each other? If we were to take out all of the rest of the software,

300
00:38:02,440 --> 00:38:07,210
what would we have to supply for this thing, to have enough information to be able to work?

301
00:38:07,210 --> 00:38:12,430
What would it expect to be able to do? Does it want to read from a file or write to a file?

302
00:38:12,430 --> 00:38:24,160
Does it expect a database to be present? Does it expect some variables in the programme that are outside of its control to be set?
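
One way to picture "what would we have to supply" is a unit that is handed the thing it reads from, so a test can supply an in-memory stand-in instead of a real file. This is a sketch under that assumption; `count_records` and its input format are hypothetical.

```python
import io


def count_records(source):
    """Hypothetical unit: count non-blank lines from any file-like object."""
    return sum(1 for line in source if line.strip())


def test_count_records_ignores_blank_lines():
    # The test supplies an in-memory stand-in for the file the unit expects.
    fake_file = io.StringIO("a,1\n\nb,2\n")
    assert count_records(fake_file) == 2
```

In the real programme the same function would simply be passed an open file, which is the reusability benefit described next.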

303
00:38:24,160 --> 00:38:27,820
So we are now making changes to our software,

304
00:38:27,820 --> 00:38:32,080
but these changes are themselves potentially useful because what we're doing

305
00:38:32,080 --> 00:38:38,170
is taking each of these components and reusing it outside the domain of our,

306
00:38:38,170 --> 00:38:50,590
you know, immediate science problem, and in the domain of a test. This means the changes we make are changes to the reusability of this module.

307
00:38:50,590 --> 00:38:53,830
We can now take this software, this component,

308
00:38:53,830 --> 00:39:04,540
and apply it to different contexts because we now know what we need to do and how to set this thing up so that we can use it elsewhere.

309
00:39:04,540 --> 00:39:10,870
And what we get from doing this is we get much, much more precise feedback when a test fails.

310
00:39:10,870 --> 00:39:18,460
We know there is a failure in this component. If I took my landing-the-Space-Shuttle test and my space shuttle didn't land,

311
00:39:18,460 --> 00:39:22,660
I wouldn't need to check the solid rocket boosters because I didn't use them.

312
00:39:22,660 --> 00:39:29,830
I only used the aerodynamic part, and you can imagine going even smaller.

313
00:39:29,830 --> 00:39:35,710
So the so-called cockpit windscreen on the front of the shuttle,

314
00:39:35,710 --> 00:39:45,520
you could test how impact-resistant that is just by exposing it to a large force, like the equivalent of hitting it with a hammer.

315
00:39:45,520 --> 00:39:48,160
If it breaks, then you know that the problem is with the windscreen,

316
00:39:48,160 --> 00:39:53,320
not with the rest of the shuttle, and certainly not with all of the engines and other components.

317
00:39:53,320 --> 00:40:00,770
So we're getting much, much more localised and immediately actionable feedback from our test results.

318
00:40:00,770 --> 00:40:07,560
But what we're not doing is answering the question, does this software actually solve a problem

319
00:40:07,560 --> 00:40:22,790
I have? Yes, I need this unit, which is just a way of describing a class or a function, some very small part of software behaviour.

320
00:40:22,790 --> 00:40:28,970
Yes, I need this to work in particular ways,

321
00:40:28,970 --> 00:40:38,030
but it's only going to be providing a valuable contribution to solving my overall problem if that working in

322
00:40:38,030 --> 00:40:48,670
particular ways is then used by the rest of the units in a way that's enabling my problem to be solved.

323
00:40:48,670 --> 00:40:53,500
So it's very, you know, it's very common in commercial software, for example,

324
00:40:53,500 --> 00:41:04,180
to find projects that have a large number of unit tests at a very high level of coverage, by which we mean the fraction of the

325
00:41:04,180 --> 00:41:19,190
statements or the different logic flows through the programme that are tested. So you can find that the tests at the unit level,

326
00:41:19,190 --> 00:41:21,700
that's the very small, separate component

327
00:41:21,700 --> 00:41:31,070
tests, are very well specified in code because it's easy for a programmer to think, what do I need this function to do?

328
00:41:31,070 --> 00:41:36,140
But then there are gaps as you get further up the pyramid into the integration and the end

329
00:41:36,140 --> 00:41:41,420
to end levels such that you actually don't know whether the programme works.

330
00:41:41,420 --> 00:41:46,220
You know that every function in it does what the programmer thought it needed to do.

331
00:41:46,220 --> 00:41:49,700
But you don't know, does this actually solve a problem that anybody has?

332
00:41:49,700 --> 00:42:02,030
So the sort of motivation of having this pyramid graphic is to say, here's a good idea for how you should spend your testing effort.

333
00:42:02,030 --> 00:42:10,610
Lots at the small level, which gives you high fidelity, actionable feedback or sorry, high precision, actionable feedback.

334
00:42:10,610 --> 00:42:17,180
And then some at the top level that say that you actually are able to achieve your goals using the software.

335
00:42:17,180 --> 00:42:26,570
And then some bits in the middle that sort of provide the impedance match between "these separate functions work" and "this whole software works".

336
00:42:26,570 --> 00:42:33,150
These bits, when assembled together, are also correct.
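
To make the layers of the pyramid concrete, here is a small Python sketch with two unit tests and one integration test that checks the units also work when assembled together. Both functions, `parse_readings` and `mean`, are hypothetical examples rather than anything from the talk.

```python
def parse_readings(text):
    """Hypothetical unit 1: turn comma-separated text into floats."""
    return [float(field) for field in text.split(",") if field.strip()]


def mean(values):
    """Hypothetical unit 2: arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def test_parse_unit():
    # Unit level: a failure here points straight at parse_readings.
    assert parse_readings("1, 2, 3") == [1.0, 2.0, 3.0]


def test_mean_unit():
    # Unit level: a failure here points straight at mean.
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_units_work_together():
    # Integration level: the output of one unit is valid input to the other.
    assert mean(parse_readings("1, 2, 3")) == 2.0
```

If only the last test fails, the units are individually fine and the problem is in how they are glued together.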

337
00:42:33,150 --> 00:42:37,770
So as I said, this was really an introduction to the concepts of testing.

338
00:42:37,770 --> 00:42:46,350
Here are some specific tools you can look at that are relevant to using particular programming languages.

339
00:42:46,350 --> 00:42:53,850
I've tried to cover most of the things I've seen in the world of scientific computing.

340
00:42:53,850 --> 00:43:03,870
There may be others. I apologise wholeheartedly to any Fortran programmers who feel left out at the moment,

341
00:43:03,870 --> 00:43:12,420
but I don't have experience with testing Fortran, so I didn't have any recommendations for tools to examine that.

342
00:43:12,420 --> 00:43:16,470
So quick summary,

343
00:43:16,470 --> 00:43:27,000
the reason we want testing in a scientific context is partly to improve the confidence that we have in our software and partly to improve the

344
00:43:27,000 --> 00:43:37,410
reproducibility of results that we get with the software because we know how the software acts and how it responds to particular inputs.

345
00:43:37,410 --> 00:43:42,150
Even if we don't know what our scientific outcomes are going to be,

346
00:43:42,150 --> 00:43:52,860
we should at least understand what conceptual model we're trying to express in our software and can say that we have correctly expressed this model,

347
00:43:52,860 --> 00:43:58,800
even if we can't say a priori what the scientific outcomes are going to be.

348
00:43:58,800 --> 00:44:10,920
And the way that we design tests is using the idea of a contract, that given-when-then idea that if I set things up in a particular way,

349
00:44:10,920 --> 00:44:18,840
then use my software, I will get this outcome. And that expression is an assertion.

350
00:44:18,840 --> 00:44:24,810
If it is satisfied, then the software is correct for that case.

351
00:44:24,810 --> 00:44:30,240
If it is not satisfied, then the software is incorrect. It fails to meet our expectations.

352
00:44:30,240 --> 00:44:36,920
The easiest way to get started is just to run your entire programme with a known input.

353
00:44:36,920 --> 00:44:47,520
That way, you know what outcome to expect. And of course, the RSE group can help, and we run these things called software surgeries,

354
00:44:47,520 --> 00:44:53,960
which are like a sort of half-hour discussion with one or two research software engineers about your software projects.

355
00:44:53,960 --> 00:45:01,560
So if you want some help getting started with testing or finding out how to use software tests in a particular way,

356
00:45:01,560 --> 00:45:06,120
drop us a line. There's our email address; you can find us at that.

357
00:45:06,120 --> 00:45:11,370
So that's me done. And I guess now it's time for some questions. All right.

358
00:45:11,370 --> 00:45:16,110
Thank you very much. It was wonderful. Thank you. OK.

359
00:45:16,110 --> 00:45:23,264
So I'm going to stop the recording and then people can freely ask questions.