1 00:00:01,450 --> 00:00:09,340 OK, good, so it gives me great pleasure to welcome Graham Lee from Computer Science, 2 00:00:09,340 --> 00:00:17,400 who is going to be talking about automated testing, so please take it away. 3 00:00:17,400 --> 00:00:25,710 Great. Thank you very much for the introduction. Hopefully, what you're now looking at is the slides. 4 00:00:25,710 --> 00:00:37,050 And yes, I will rely on Garrett to moderate the chat and the Q&A. 5 00:00:37,050 --> 00:00:38,700 I welcome questions at any point. 6 00:00:38,700 --> 00:00:48,870 So feel free to nudge when you've got a question, but I'm going to be focussing on my presenter notes and my slides, 7 00:00:48,870 --> 00:00:55,830 so I probably won't notice and I'll rely on him to let me know if anything's come up. 8 00:00:55,830 --> 00:01:07,110 So yes, as Karen said in the intro, I'm in the research software engineering group, which is based in the computer science department, 9 00:01:07,110 --> 00:01:12,060 although my background is in professional software engineering. 10 00:01:12,060 --> 00:01:22,590 So I did a degree in Oxford in physics back in 2004 and then moved into, like, commercial computing. 11 00:01:22,590 --> 00:01:32,520 I've worked at a bunch of companies, but I have always had software testing as a sort of thread running through my career, 12 00:01:32,520 --> 00:01:36,870 so a larger number of years ago than I care to think about, 13 00:01:36,870 --> 00:01:40,650 I wrote the book Test-Driven iOS Development, 14 00:01:40,650 --> 00:01:52,440 which is about how developers writing software apps for the iPhone and the iPad can test their software. 15 00:01:52,440 --> 00:02:01,800 I was the manager of an engineering group at Facebook who developed the mobile testing frameworks that Facebook used, 16 00:02:01,800 --> 00:02:07,800 and one of the things you have to ask in doing testing is what are the benefits?
17 00:02:07,800 --> 00:02:17,430 And obviously we hope that by having test coverage, by knowing something about the behaviour of our code, 18 00:02:17,430 --> 00:02:25,620 then we're going to have increased confidence in its correctness and in its function. 19 00:02:25,620 --> 00:02:36,890 But there are fringe benefits as well. And at Facebook, we actually reduced the time it took to release new updates to the Facebook mobile app, 20 00:02:36,890 --> 00:02:45,840 so the Facebook app for iOS and for Android, from four weeks to one and a half weeks, and it's probably even shorter now, 21 00:02:45,840 --> 00:02:55,470 just by taking some of the testing that was being done manually every time there was a new release candidate 22 00:02:55,470 --> 00:03:05,650 of the app and automating that so that we could very quickly get information on the quality of the software. 23 00:03:05,650 --> 00:03:17,650 And I also did some work at Apple last year documenting their test tools and the techniques for using software testing with that technology. 24 00:03:17,650 --> 00:03:26,200 And that's on the Apple developer website. So, you know, that's the bit where I just tell you what my CV is, 25 00:03:26,200 --> 00:03:32,410 so there's some legitimacy for me being the guy giving this talk. 26 00:03:32,410 --> 00:03:34,600 And then the question, you know, 27 00:03:34,600 --> 00:03:47,500 who is this Oxford Research Software Engineering group? We're effectively a sort of service or a facility in the 28 00:03:47,500 --> 00:03:57,310 university for helping researchers to achieve their goals using custom software and bespoke software development. 29 00:03:57,310 --> 00:04:08,170 That often means that we get involved in a project, like a grant-funded research project or a spin-out, actually writing software. 30 00:04:08,170 --> 00:04:17,140 But that's not the only thing we do. We also obviously do outreach, like these seminars, and we do teaching.
31 00:04:17,140 --> 00:04:25,780 And really, one of our main goals is to sort of bring up the standard of software development across the university. 32 00:04:25,780 --> 00:04:33,190 So rather than just being, you know, some sort of gatekeeper or central clearinghouse for software, 33 00:04:33,190 --> 00:04:39,650 we're actually sort of building a community of expertise and practice across the university. 34 00:04:39,650 --> 00:04:46,810 So that's really why, you know, I'm very happy to be given the opportunity to, 35 00:04:46,810 --> 00:04:52,420 yeah, to waffle on on a Friday afternoon about software testing, 36 00:04:52,420 --> 00:04:56,650 but less waffle, because there's only me between us and the weekend now. 37 00:04:56,650 --> 00:05:00,280 So let's get going. 38 00:05:00,280 --> 00:05:10,180 So this is really an introduction to the idea of software testing in a sort of scientific or, like, computational research context. 39 00:05:10,180 --> 00:05:17,380 So I'm going to try and stick mostly to sort of principles about how we think about testing, 40 00:05:17,380 --> 00:05:25,630 why we think about testing and how to sort of plan for creating tests for your software. 41 00:05:25,630 --> 00:05:30,940 I'm not going to go in-depth on any particular tools or technologies, 42 00:05:30,940 --> 00:05:38,770 partly because I think telling you how to use something before you've got motivation for using it is, 43 00:05:38,770 --> 00:05:42,790 you know, kind of off-putting and not relevant or useful. 44 00:05:42,790 --> 00:05:51,520 And partly because there's, you know, a wealth of different technologies out there and it depends on what you're trying to do. 45 00:05:51,520 --> 00:05:56,470 If you're writing some sort of data manipulation in R, 46 00:05:56,470 --> 00:06:03,130 you're going to have a very different experience from if you're writing a web application in JavaScript.
47 00:06:03,130 --> 00:06:08,890 And so, like, picking any one of those would lose a bunch of the audience and not 48 00:06:08,890 --> 00:06:13,270 necessarily even be useful for the people who are using that particular technology. 49 00:06:13,270 --> 00:06:18,290 So what are we doing when we test software? 50 00:06:18,290 --> 00:06:22,210 You know, what are we trying to get out of this thing? 51 00:06:22,210 --> 00:06:33,820 And I've come up with four sort of goals for testing, which I've called continuity, correctness, reproducibility and recovery. 52 00:06:33,820 --> 00:06:37,990 So let's take a look at those. By continuity, 53 00:06:37,990 --> 00:06:46,330 I mean that what the software does today should be, you know, somewhat related to what it's going to do tomorrow. 54 00:06:46,330 --> 00:06:52,990 We obviously do evolve software. We add new things, we fix bugs. 55 00:06:52,990 --> 00:07:04,510 The idea is any of these should be an improvement. It's very rare for us to deliberately remove capabilities from software. 56 00:07:04,510 --> 00:07:14,140 It does happen. Sometimes we realise that we're supporting an old platform that's no longer relevant, or we've 57 00:07:14,140 --> 00:07:23,290 got some old algorithm that the community has moved on from, and we don't need to have that algorithm anymore. 58 00:07:23,290 --> 00:07:29,580 But those are, like, specific events that we can plan for. What we really don't want 59 00:07:29,580 --> 00:07:37,830 are unplanned breakages or loss of functionality, which are called regressions in the software industry. 60 00:07:37,830 --> 00:07:47,400 You can imagine that if you've published research based on a code that performs a simulation or does some 61 00:07:47,400 --> 00:07:56,340 analysis of the data and someone comes along and wants to replicate that analysis or rerun that simulation, 62 00:07:56,340 --> 00:08:01,650 they may want to do it in a newer context. They may want to try new ideas.
63 00:08:01,650 --> 00:08:05,760 They may want to use newer techniques, 64 00:08:05,760 --> 00:08:14,970 but they want the thing to basically work, so they still want to get the results that they were able to get before. 65 00:08:14,970 --> 00:08:22,710 So one thing that having tests gives us is not only the knowledge that our software works now, 66 00:08:22,710 --> 00:08:31,260 it's knowledge about whether future versions of the software still have that earlier capability and that 67 00:08:31,260 --> 00:08:40,660 earlier behaviour, because we can always keep these tests and run them against new versions of the software. 68 00:08:40,660 --> 00:08:47,590 Correctness is perhaps the one that makes a lot of people doing scientific 69 00:08:47,590 --> 00:08:53,130 computation kind of stop and wonder whether testing is really relevant for them. 70 00:08:53,130 --> 00:08:59,920 You know, I'm doing research, I'm trying to find out the result 71 00:08:59,920 --> 00:09:05,890 to a question for which I don't know the answer. By definition, if I knew what the answer to the question was, 72 00:09:05,890 --> 00:09:11,530 it wouldn't be research. So how can I write a test when 73 00:09:11,530 --> 00:09:16,030 I don't know what the outcome is going to be? And that is a good question. 74 00:09:16,030 --> 00:09:28,480 It's an important question. You know, we could have some complex problem domain that we're trying to model and a new context to explore with that model. 75 00:09:28,480 --> 00:09:36,770 And while we may not know what the outcome is going to be in terms of the research problem, 76 00:09:36,770 --> 00:09:46,500 we want to at least have an idea that the model that we have come up with conceptually 77 00:09:46,500 --> 00:09:48,900 is correctly implemented in our code. 78 00:09:48,900 --> 00:09:58,170 So, you know, if I'm simulating this sort of thing, which looks like some many-body problem, maybe in gravitation.
79 00:09:58,170 --> 00:10:08,310 Well, we have models of many-body problems in gravitation and we know how a model like this behaves over time. 80 00:10:08,310 --> 00:10:17,730 We know that if we set it into some initial condition or some initial situation and then progress it by some amount, 81 00:10:17,730 --> 00:10:22,560 we know where everything should end up. And if we know where everything should end up, 82 00:10:22,560 --> 00:10:34,730 we also know that if it does end up there, it is correct, and if it doesn't end up there, then something has gone wrong. 83 00:10:34,730 --> 00:10:38,960 Obviously, we're not building physical models, we're building software models, 84 00:10:38,960 --> 00:10:47,810 but software models of complex gravitational problems still have the aspect 85 00:10:47,810 --> 00:10:57,080 that they are implementing some part of a simulation of a problem domain. 86 00:10:57,080 --> 00:11:10,160 And if we design the simulation, then we can know how that simulation behaves and we can validate that it is behaving in the way that we expect. 87 00:11:10,160 --> 00:11:14,990 And there are things that we can do to help that. 88 00:11:14,990 --> 00:11:28,700 So in a many-body problem, we know what happens when there are two bodies in a gravitational interaction. Or in a more complex system, 89 00:11:28,700 --> 00:11:37,670 it's easier to work in, say, the sort of low-velocity relativity domain, where, you know, space and time are 90 00:11:37,670 --> 00:11:45,150 basically constant and don't change, than it is in the sort of Einstein domain, 91 00:11:45,150 --> 00:11:49,730 with velocities approaching the speed of light. That's not necessarily the problem that we're trying to solve here. 92 00:11:49,730 --> 00:12:06,080 It's just an example.
And this brings us on to a principle that software testers use called equivalence partitioning. You know, 93 00:12:06,080 --> 00:12:14,100 another problem we may have is that the scientific problem we're trying to model could be incredibly large. 94 00:12:14,100 --> 00:12:17,690 Just to give a different example, 95 00:12:17,690 --> 00:12:28,970 Garrett and I were talking before the start of the seminar about the behaviour of particular proteins in a biological system. 96 00:12:28,970 --> 00:12:37,820 Now, one thing that we might want to do with a computer is simulate the structure of these proteins by saying, well, they're built 97 00:12:37,820 --> 00:12:44,570 of all of these components, all of these atoms organised in this particular way. 98 00:12:44,570 --> 00:12:50,680 What structure is that going to sort of collapse into, you know, 99 00:12:50,680 --> 00:13:01,550 when the various electromagnetic forces on the different ions and atoms in the thing are in a sort of stable state? 100 00:13:01,550 --> 00:13:07,100 And that is a common computational problem to solve. 101 00:13:07,100 --> 00:13:15,320 But, you know, when you're working with proteins, if you take something big like a virus and say, how is this going to fold? 102 00:13:15,320 --> 00:13:18,240 A, you may not know the answer in advance. 103 00:13:18,240 --> 00:13:27,830 B, it could take a very long time, you know, even on, like, a supercomputer cluster, to find out what the answer is. 104 00:13:27,830 --> 00:13:37,340 But let's take a simpler problem. We know what the angle between the two hydrogen-oxygen bonds in a water molecule is. 105 00:13:37,340 --> 00:13:40,430 Does our model get that right? 106 00:13:40,430 --> 00:13:47,690 If it, you know, if it doesn't, then we probably shouldn't be particularly confident in using it for any more complex problem. 107 00:13:47,690 --> 00:13:54,590 If it works for water, then let's try sugar.
Well, let's try a really simple protein and see whether it gets the right answer. 108 00:13:54,590 --> 00:13:58,190 We're not doing anything weird in a software context here. 109 00:13:58,190 --> 00:14:02,750 What you're saying is, for this problem, where I know what the conditions are, 110 00:14:02,750 --> 00:14:10,220 I also know what the outcome is, and I can run my software and then verify that I get the same outcome. 111 00:14:10,220 --> 00:14:23,490 If I run this multiple times with different inputs and always get the expected outcomes, then I increase my confidence that my software is correct. 112 00:14:23,490 --> 00:14:30,420 And then the two remaining goals are reproducibility and recovery. 113 00:14:30,420 --> 00:14:42,370 So reproducibility is obviously very important in research; that's why we have the reproducible research network. 114 00:14:42,370 --> 00:14:48,040 We want someone who's running our analysis with our code to get the same results, 115 00:14:48,040 --> 00:14:55,390 and that may mean that they're running it on our computer, or it may mean that they're running it on a different computer. 116 00:14:55,390 --> 00:15:00,040 It may just mean that they're doing exactly the same thing at a different time. 117 00:15:00,040 --> 00:15:04,810 But we would expect to get the same results in that context. 118 00:15:04,810 --> 00:15:15,040 Someone running our analysis, but with different data, should get consistent results in many circumstances. 119 00:15:15,040 --> 00:15:21,160 If you're running a simulation in different but related domains and the simulation 120 00:15:21,160 --> 00:15:28,550 correctly behaves and represents the outcome of the model in those particular domains, 121 00:15:28,550 --> 00:15:31,960 then the results should be comparable somehow.
122 00:15:31,960 --> 00:15:41,350 We would also like someone who takes our ideas, takes our model and re-codes it, to get consistent results, 123 00:15:41,350 --> 00:15:51,610 or at least if they don't get consistent results, then it's possible for us to investigate where the disparity comes from. 124 00:15:51,610 --> 00:15:57,010 And that's another thing that we're going to get from automated testing, that we'll look at later: a bit more fine-grained 125 00:15:57,010 --> 00:16:08,720 information about how the different parts of our software system interact and which bits of it are behaving in particular ways. 126 00:16:08,720 --> 00:16:19,010 And there's a really important thing to bear in mind when we're talking about, like, reusing software, reproducing the results 127 00:16:19,010 --> 00:16:26,750 we get out of software and recovering the behaviour of software that we used a long time ago: 128 00:16:26,750 --> 00:16:34,160 sometimes that poor person who is having to deal with the software is, like, me or you. 129 00:16:34,160 --> 00:16:41,840 It's the same person who wrote it. And you know, we get distracted by another project or, like, we get some teaching that we have to do for a term. 130 00:16:41,840 --> 00:16:49,250 And a few months later, we come back and we don't quite remember what we were doing. 131 00:16:49,250 --> 00:16:56,090 There was obviously some stroke of genius when we wrote that function there, but why did we write it that way? 132 00:16:56,090 --> 00:17:04,190 What does it do? When we've got a collection of tests that say, here's what this part of the software does in these circumstances,
133 00:17:04,190 --> 00:17:11,810 that's more documentation, that's more help, both to us to kind of recover our mental model of what 134 00:17:11,810 --> 00:17:19,700 the software does and for other people to reconstruct that mental model and get an idea of 135 00:17:19,700 --> 00:17:25,050 how this software works so that they can either reuse it or develop it. 136 00:17:25,050 --> 00:17:30,420 So, you know, how would testing help in these scenarios? 137 00:17:30,420 --> 00:17:35,880 So someone else wants to use our code and run it with the same data. 138 00:17:35,880 --> 00:17:46,230 If we've got a collection of tests that explain what the software does and how it should behave for particular inputs, 139 00:17:46,230 --> 00:17:56,400 then before someone else runs the simulation or runs the analysis and checks whether they get the same outcome, 140 00:17:56,400 --> 00:18:03,570 they can run those tests and see whether they all pass, to see whether all of our expectations are satisfied. 141 00:18:03,570 --> 00:18:08,260 And that's going to give some information. If any of those things fails, 142 00:18:08,260 --> 00:18:15,900 that's going to give some information about what the assumption is that isn't satisfied in this new context. 143 00:18:15,900 --> 00:18:20,280 Maybe the software expects some files to be present, 144 00:18:20,280 --> 00:18:29,370 like configuration files or inputs set up in a particular way, that I've got in my home directory and that definitely work for me. 145 00:18:29,370 --> 00:18:39,160 But someone else needs to know that information to set the same thing up in the same way if they want to get compatible results.
146 00:18:39,160 --> 00:18:44,590 Someone using different data wants to get consistent results. Well, again, 147 00:18:44,590 --> 00:18:55,630 if we know, if we can prove, that the software does correctly implement the model or the other sort of scientific concepts that we're trying to embody, 148 00:18:55,630 --> 00:19:06,340 then they can be somewhat confident that the results they get out from using it with their data are the result of our model being applied to 149 00:19:06,340 --> 00:19:16,240 that data and not the result of something weird going on with some code or with there being some mistakes in the behaviour somewhere. 150 00:19:16,240 --> 00:19:22,690 And then if someone else wants to take and, like, re-implement our model, 151 00:19:22,690 --> 00:19:33,490 be that just for a cross-check to make sure that they understand what the model is, or maybe because they're using a different context, 152 00:19:33,490 --> 00:19:36,940 like maybe their supercomputer or their cluster doesn't have the same 153 00:19:36,940 --> 00:19:41,920 libraries as ours and they want to build a version that is compatible with their setup, 154 00:19:41,920 --> 00:19:46,570 or maybe they're using a Mac and we were using Linux, or for whatever reason 155 00:19:46,570 --> 00:19:52,540 they want to rebuild it. If they can see the tests and they can see the expected results, 156 00:19:52,540 --> 00:19:59,170 they can compare the results they get from their implementation with the results that they get from our implementation. 157 00:19:59,170 --> 00:20:01,990 Then they know something about the compatibility of those without having to just, like, 158 00:20:01,990 --> 00:20:05,890 run the whole experiment and see what the outcome is at the end.
159 00:20:05,890 --> 00:20:12,010 So really, all of these things are about increasing confidence in the software and increasing 160 00:20:12,010 --> 00:20:21,160 the rate at which we get feedback that informs that confidence in that software. 161 00:20:21,160 --> 00:20:26,120 That was my strategic pause, just to check whether there were any questions. Obviously not right now, 162 00:20:26,120 --> 00:20:33,370 so I'll carry on. So how do we design a software test? 163 00:20:33,370 --> 00:20:44,620 You can think of the behaviour of any software as being a form of contract that you make with the user of the software, 164 00:20:44,620 --> 00:20:51,160 whether that's yourself or other people in your group or members of the public or whoever's using the software. 165 00:20:51,160 --> 00:20:57,040 You can think about a form of contract where you say: 166 00:20:57,040 --> 00:21:01,740 if you arrange for this collection of things to be true 167 00:21:01,740 --> 00:21:14,310 and then use the software, then I will make this guarantee about the outcome, or, if there is a calculation, about the result of using the software. 168 00:21:14,310 --> 00:21:21,210 And that sort of design principle, whatever tools you're using to write your tests, 169 00:21:21,210 --> 00:21:32,250 whatever sort of level (we'll come on later to discuss the different sort of levels of abstraction that exist in designing tests), 170 00:21:32,250 --> 00:21:37,050 this idea of a contract is universal. 171 00:21:37,050 --> 00:21:44,530 If you set the world up in this way and then use my software, I will do this thing as a result.
172 00:21:44,530 --> 00:21:54,870 And so, you know, we could think back to the many-body problem. I can say, if you have a point mass, 173 00:21:54,870 --> 00:22:05,410 a mass m1 here and a mass m2 over here, and the distance between them is r, 174 00:22:05,410 --> 00:22:14,230 and then ask my software to calculate the gravitational force on the first mass, 175 00:22:14,230 --> 00:22:19,850 it's going to say that result is G m1 m2 over r squared, which is, like, 176 00:22:19,850 --> 00:22:29,050 the Newtonian gravitational force equation, in the direction from the first point to the second point. 177 00:22:29,050 --> 00:22:34,210 If you set the mass of one of these things to be negative, 178 00:22:34,210 --> 00:22:39,370 then my software is going to generate an error because we haven't worked out how to do 179 00:22:39,370 --> 00:22:47,570 negative mass or we decided that negative mass isn't a problem we want to solve. 180 00:22:47,570 --> 00:22:58,160 If you have multiple masses in your simulation and you ask what is the force due to gravity over here, 181 00:22:58,160 --> 00:23:04,880 we're going to work out each of those individual contributions and sum them. 182 00:23:04,880 --> 00:23:11,720 And so, you know, you can see that this idea of the contract is coming into play: if you have done this, 183 00:23:11,720 --> 00:23:17,720 if there is a mass here and a mass there, and then you ask for the gravitational forces, 184 00:23:17,720 --> 00:23:23,480 then, you know, what the software is going to do is to give you this answer. 185 00:23:23,480 --> 00:23:34,640 And so a test takes that contract, takes that idea of the preconditions and then the action and then the postconditions. 186 00:23:34,640 --> 00:23:39,610 And what you do is you create a single concrete example of that.
187 00:23:39,610 --> 00:23:50,200 Where you know what the answer is for a given question. So if the mass of the other object is zero, then the gravitational force is zero. 188 00:23:50,200 --> 00:23:57,640 Super simple one, and that is a valid case, and we can write that as a test. 189 00:23:57,640 --> 00:24:09,740 If the mass is one kilogram and the distance is one metre, then, yeah, the force is just the gravitational constant G. 190 00:24:09,740 --> 00:24:16,670 Again, another example. And, you know, we start to think, well, 191 00:24:16,670 --> 00:24:25,340 aren't there an infinity of examples? Taking this sort of scenario of a many-body gravitational problem, 192 00:24:25,340 --> 00:24:31,790 I could have anywhere from zero to an infinite number of different masses, 193 00:24:31,790 --> 00:24:37,640 at any of an infinite number of points in space, 194 00:24:37,640 --> 00:24:47,510 and with infinite initial velocities. Do I really have to write that many infinities of different tests? 195 00:24:47,510 --> 00:24:55,250 And so, 196 00:24:55,250 --> 00:25:09,650 what software testers do is they look for what are the actually meaningfully distinct regions, so distinct domains in the problem space. 197 00:25:09,650 --> 00:25:17,090 And then they write tests that capture one example of each of those regions. 198 00:25:17,090 --> 00:25:28,190 So, you know, there's the trivially degenerate case that there are no masses in your many-body problem simulation; 199 00:25:28,190 --> 00:25:33,260 there's a very simple answer there. Having one mass, 200 00:25:33,260 --> 00:25:45,600 that's again a trivial situation of which there's one example. Having two masses is another similarly simple example.
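The concrete examples described above (zero mass gives zero force, one kilogram at one metre gives G, negative mass is an error, and contributions from several masses are summed) could be written as automated tests roughly as follows. This is only a sketch under stated assumptions: the function names `gravitational_force` and `net_force`, the one-dimensional layout, and the use of `ValueError` are hypothetical choices made for illustration and are not taken from the talk.

```python
import math

G = 6.674e-11  # gravitational constant in N m^2 kg^-2 (approximate value)


def gravitational_force(m1, m2, r):
    """Magnitude of the Newtonian force between two point masses a distance r apart."""
    if m1 < 0 or m2 < 0:
        # Contract from the talk: negative mass is outside the supported domain.
        raise ValueError("negative mass is not supported")
    if r <= 0:
        # Assumption for this sketch: zero or negative separation is also rejected.
        raise ValueError("distance must be positive")
    return G * m1 * m2 / r**2


def net_force(m, x, others):
    """Signed 1-D force on mass m at position x from a list of (mass, position) pairs."""
    total = 0.0
    for other_mass, other_x in others:
        magnitude = gravitational_force(m, other_mass, abs(other_x - x))
        # The force points from this mass towards the other one.
        total += math.copysign(magnitude, other_x - x)
    return total


def test_zero_mass_gives_zero_force():
    assert gravitational_force(5.0, 0.0, 2.0) == 0.0


def test_unit_masses_one_metre_apart_give_G():
    assert gravitational_force(1.0, 1.0, 1.0) == G


def test_negative_mass_is_an_error():
    try:
        gravitational_force(1.0, -1.0, 1.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for negative mass")


def test_no_other_masses_gives_zero_net_force():
    assert net_force(1.0, 0.0, []) == 0.0


def test_equal_masses_either_side_cancel():
    assert net_force(1.0, 0.0, [(3.0, -2.0), (3.0, 2.0)]) == 0.0
```

Each test picks one representative example from a region of the problem space. A test runner such as pytest would collect the `test_` functions automatically, but they can equally be called directly.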
201 00:25:45,600 --> 00:25:49,740 And then, you know, the idea that as you add more, 202 00:25:49,740 --> 00:25:58,680 what happens is that you sum the contributions to gravity from each mass, tells us that as soon as you've got more than two, 203 00:25:58,680 --> 00:26:01,260 if it works for any number more than two, 204 00:26:01,260 --> 00:26:06,840 it works for all numbers more than two because it's just got to do the same maths, but with more inputs. 205 00:26:06,840 --> 00:26:17,000 So a tester would write a test for zero, one, two and three masses and would then be entirely happy. 206 00:26:17,000 --> 00:26:24,080 Which works really well for sort of discrete variables like that. Where you have continuous variables, there's 207 00:26:24,080 --> 00:26:34,430 a related idea where, as well as there being different ranges or domains in the problem space, 208 00:26:34,430 --> 00:26:45,200 you also then do what's called boundary value analysis, where at the boundary between two of these regimes you say, 209 00:26:45,200 --> 00:26:49,070 you know, are the results effectively continuous through the boundary? 210 00:26:49,070 --> 00:26:53,420 Does it do the right thing as you move from one domain to the other? 211 00:26:53,420 --> 00:27:00,920 And so if we had some simulation that had, say, relativistic corrections, then at small velocities 212 00:27:00,920 --> 00:27:05,780 it just used, like, a normal linear space and time. 213 00:27:05,780 --> 00:27:12,110 And then when you went to higher velocities, it used the relativistic corrections.
214 00:27:12,110 --> 00:27:18,350 There would be some point in between where it started to use these corrections, 215 00:27:18,350 --> 00:27:24,440 and the tester would look at what happens just below this point, 216 00:27:24,440 --> 00:27:29,330 what happens on that point and what happens just after, and make sure that there are sort of 217 00:27:29,330 --> 00:27:36,030 three consistent values which make sure that the transition through the regime is smooth. 218 00:27:36,030 --> 00:27:45,150 And this is so common in software testing that there are a couple of little sort of mantras that are used, 219 00:27:45,150 --> 00:27:56,880 either in particular technologies or by particular communities, to sort of encapsulate this idea of the contract and of the preconditions, 220 00:27:56,880 --> 00:28:01,330 the action and the postconditions. One of them, 221 00:28:01,330 --> 00:28:14,110 which is very common in sort of communicating the meaning of a test between software developers and, say, problem domain 222 00:28:14,110 --> 00:28:24,190 experts, say researchers who are working on the software, is to use the phrase given, when, then, which encapsulates that idea. 223 00:28:24,190 --> 00:28:28,810 Given that this set of conditions were created, 224 00:28:28,810 --> 00:28:33,580 when the software does this, then this is the outcome. 225 00:28:33,580 --> 00:28:44,220 So again, we've got stuff that happened first: given this set of initial conditions. We've got the action: when this happens in the software. 226 00:28:44,220 --> 00:28:48,790 And we've got an outcome: then this will be the result. 227 00:28:48,790 --> 00:29:01,570 People using unit test frameworks, which are a way of testing small components as, like, little pieces of a bigger software system, 228 00:29:01,570 --> 00:29:07,180 use a phrase called Assemble, Act, Assert. 229 00:29:07,180 --> 00:29:14,650 And again, the precondition is you have assembled this thing in this state.
230 00:29:14,650 --> 00:29:25,200 The act is the action that the software takes, and then asserting is saying, I am telling you that the result of the software will be this. 231 00:29:25,200 --> 00:29:32,070 So a software test is always a binary outcome, 232 00:29:32,070 --> 00:29:37,410 and it's an assertion of what the correct behaviour should be, and a failure to satisfy that 233 00:29:37,410 --> 00:29:47,640 assertion means not having confidence in the software, means believing that something has gone wrong. 234 00:29:47,640 --> 00:29:54,300 So with that idea of how to build a test in mind, what's the best way to get started? 235 00:29:54,300 --> 00:30:01,050 The easiest thing to do is just take your existing software and think about this given, when, 236 00:30:01,050 --> 00:30:07,470 then, think about this idea of the contract, and apply it to running the software as a whole. 237 00:30:07,470 --> 00:30:17,970 And this is called an end-to-end or E2E test in the sort of jargon of professional software testing. 238 00:30:17,970 --> 00:30:30,080 If you've got, like, a big problem, like a machine learning training problem or a massive, like, supercomputer simulation, that's going to take 239 00:30:30,080 --> 00:30:36,830 a long time and a lot of resources to run, then, yeah, this is not necessarily the optimal thing to do. 240 00:30:36,830 --> 00:30:44,240 You may end up waiting a very long time or even, yeah, costing a large amount of money just to get the results of your tests. 241 00:30:44,240 --> 00:30:55,880 And this is why we look for sort of sample problems, toy problems, smaller datasets, something where not only do we know what the outcome is, 242 00:30:55,880 --> 00:31:00,410 but also the sort of computational effort in getting to that outcome is going 243 00:31:00,410 --> 00:31:07,670 to be small, because the less time it takes to run through your tests, 244 00:31:07,670 --> 00:31:14,180 the more frequently you're going to do it.
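As an illustration of a small end-to-end test on a toy problem, laid out in the given, when, then shape, here is a minimal sketch. Everything in it is an assumption made for the example rather than something from the talk: the `run_simulation` function, the one-dimensional setup, the semi-implicit Euler time stepping, and the chosen masses and step size. The expected outcome comes from symmetry: two equal masses released from rest should fall towards each other as mirror images.

```python
import math

G = 6.674e-11  # gravitational constant in N m^2 kg^-2 (approximate value)


def run_simulation(masses, positions, velocities, dt, steps):
    """Toy 1-D many-body simulation using semi-implicit Euler steps.

    Returns the final positions and velocities as new lists.
    """
    xs, vs = list(positions), list(velocities)
    for _ in range(steps):
        accelerations = []
        for i, xi in enumerate(xs):
            a = 0.0
            for j, xj in enumerate(xs):
                if i != j:
                    d = xj - xi
                    # Acceleration of body i due to body j, directed along d.
                    a += G * masses[j] * d / abs(d) ** 3
            accelerations.append(a)
        # Update velocities first, then positions with the new velocities.
        vs = [v + a * dt for v, a in zip(vs, accelerations)]
        xs = [x + v * dt for x, v in zip(xs, vs)]
    return xs, vs


def test_two_equal_masses_fall_towards_each_other_symmetrically():
    # Given: two equal masses at rest, one metre either side of the origin
    masses = [1.0e10, 1.0e10]
    positions = [-1.0, 1.0]
    velocities = [0.0, 0.0]

    # When: the whole simulation is run for one simulated second
    xs, vs = run_simulation(masses, positions, velocities, dt=0.01, steps=100)

    # Then: both have moved inwards, and the system is still symmetric
    assert xs[0] > -1.0 and xs[1] < 1.0
    assert math.isclose(xs[0] + xs[1], 0.0, abs_tol=1e-9)
    assert math.isclose(vs[0] + vs[1], 0.0, abs_tol=1e-9)
```

The problem is deliberately tiny, so the test exercises the whole simulation loop while staying cheap enough to run after every change.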
Yeah, that's just the way that people work. 245 00:31:14,180 --> 00:31:18,200 If it takes more than a few minutes to do something, 246 00:31:18,200 --> 00:31:25,190 we get distracted and we're going to look at something else. We go and check social media or our emails or go make a cup of coffee or whatever. 247 00:31:25,190 --> 00:31:31,670 And so we tend to, like, save this for, oh, it's lunchtime, 248 00:31:31,670 --> 00:31:37,990 I'm going to set off my tests and then go and get some lunch and then come back and see the result. 249 00:31:37,990 --> 00:31:43,540 But if I'm only running my tests every lunchtime, then if they passed on Monday lunchtime, 250 00:31:43,540 --> 00:31:47,500 they passed on Tuesday lunchtime and they fail on Wednesday lunchtime, 251 00:31:47,500 --> 00:31:54,670 the only thing I know is I did something on Tuesday afternoon or Wednesday morning that broke, 252 00:31:54,670 --> 00:31:59,210 that made this software behave in a way that I believe is incorrect. 253 00:31:59,210 --> 00:32:07,400 So I've now got to kind of go back through my entire set of changes I made over that day and try and understand what it was. 254 00:32:07,400 --> 00:32:16,340 If it takes me, like, a minute to run through the tests, then I might just do it every time I've changed the software. And then, 255 00:32:16,340 --> 00:32:22,430 if I run them and I find that something's failed, I've only got to go back to the thing I was doing a minute ago, 256 00:32:22,430 --> 00:32:27,560 which is still fresh in my head, and A, I know what the change 257 00:32:27,560 --> 00:32:32,560 was, and B, I know, like, what I was changing because, yeah, 258 00:32:32,560 --> 00:32:40,400 I know there's a limited amount of stuff you can do in that time and I know what I was trying to achieve. 259 00:32:40,400 --> 00:32:48,240 So I've got some idea of what I introduced that could make the thing go wrong.
260 00:32:48,240 --> 00:32:52,950 So we tend not to build massive batteries of end-to-end tests; 261 00:32:52,950 --> 00:33:02,180 we tend to build a small number of highly important tests that show that basic things work, 262 00:33:02,180 --> 00:33:05,940 and that our system is basically glued together the right way. 263 00:33:05,940 --> 00:33:15,210 So if I think back to the work I was doing at Facebook, we would have a smoke test that was: can I launch the iPhone app, 264 00:33:15,210 --> 00:33:22,970 log into Facebook and then post some text as a status to my newsfeed? 265 00:33:22,970 --> 00:33:29,990 And that would get run by every developer every time they made a change to the application. 266 00:33:29,990 --> 00:33:36,950 Most of these changes weren't going to break that behaviour, but as soon as someone did break that behaviour, 267 00:33:36,950 --> 00:33:42,920 you wanted to know about it, because if you had a version of the Facebook app where you couldn't post to your newsfeed, 268 00:33:42,920 --> 00:33:47,030 that wouldn't be useful to almost anybody using the application. 269 00:33:47,030 --> 00:33:57,000 So this is a very high value test, a very small, focussed piece of the functionality that we were exploring. 270 00:33:57,000 --> 00:34:05,640 And these tests typically don't need any changes to your software. If you manage to change your 271 00:34:05,640 --> 00:34:12,930 dataset or your problem specification so that you're running a very small problem, 272 00:34:12,930 --> 00:34:22,240 you're just using your existing software. There are no real design changes required and you can just run through these with.
273 00:34:22,240 --> 00:34:32,110 That can be with a script, if they're sort of simulation tools that you just run on the command line, or if you've got something that's got a user interface, 274 00:34:32,110 --> 00:34:40,680 you can find a tool for automating pressing buttons on the user interface that will just run your software as is. 275 00:34:40,680 --> 00:34:51,780 These tests are very useful because they tell you whether your software is kind of all plumbed together properly and is dealing with data 276 00:34:51,780 --> 00:35:00,490 as you would expect. But they're also very low signal, in that if it goes wrong, what you know is there's a problem in your software somewhere. 277 00:35:00,490 --> 00:35:05,920 You know, think about that Facebook example. Let's say that 278 00:35:05,920 --> 00:35:13,510 the ability to post to a feed didn't work. Is that because the little submit button in the UI is broken? 279 00:35:13,510 --> 00:35:23,080 Is it because the thing that sends the data to the network is broken, and how: is it broken because it can't connect to the network? 280 00:35:23,080 --> 00:35:27,280 Or is it because it isn't reading the data out of the UI? 281 00:35:27,280 --> 00:35:34,600 Or is it that that all got sent, and then the UI does not update to show the new results? 282 00:35:34,600 --> 00:35:38,440 Or is it that the server ignored this data coming in? 283 00:35:38,440 --> 00:35:48,190 You know, there are so many different ways in which this test could fail that all we know really is that the software is broken. 284 00:35:48,190 --> 00:35:56,230 We don't really have any information on what is broken, or any way to narrow down our investigation on how to fix it. 285 00:35:56,230 --> 00:36:06,820 So the community has this idea of the test pyramid. Going back to my example of Kerbal Space Programme:
286 00:36:06,820 --> 00:36:13,540 We certainly could test the Space Shuttle by building a space shuttle and then seeing whether the Space Shuttle works. 287 00:36:13,540 --> 00:36:19,840 But that's a really expensive and time consuming way to test the Space Shuttle. 288 00:36:19,840 --> 00:36:28,450 It's built out of all of these different components, right? One of the earliest things that NASA did when they were building, not the Kerbal shuttle, 289 00:36:28,450 --> 00:36:35,230 but the actual real Space Shuttle, was to build the bit that's got the delta wings. 290 00:36:35,230 --> 00:36:44,530 Don't bother putting any engines on it. Just strap that to the back of a jumbo jet, take off, let go of the thing and then try and land it. 291 00:36:44,530 --> 00:36:50,680 And that tells you whether the aerodynamics and the controls work without having to build these massive solid 292 00:36:50,680 --> 00:36:56,080 rocket boosters or the main engine, without having to assemble all of that and then stick it on a launch pad, 293 00:36:56,080 --> 00:37:06,070 fuel it up and send the thing up, without even having to build the little control engines that are on the back of the main body. 294 00:37:06,070 --> 00:37:14,380 So they took a component of this complete system, isolated that component, 295 00:37:14,380 --> 00:37:22,840 set that into some reasonable starting condition and then saw how that behaved once they initiated some action, 296 00:37:22,840 --> 00:37:30,440 which was landing the Space Shuttle. And we can do that kind of thing with software as well. 297 00:37:30,440 --> 00:37:43,070 So now we do get into the stage where we're having to think about the design of our software: which components are actually distinct? 298 00:37:43,070 --> 00:37:55,960 So, pieces of functionality or distinct behaviours in this software that are responsible for some subset of the overall system.
299 00:37:55,960 --> 00:38:02,440 How are those related to each other? If we were to take out all of the rest of the software, 300 00:38:02,440 --> 00:38:07,210 what would we have to supply for this thing to have enough information to be able to work? 301 00:38:07,210 --> 00:38:12,430 What would it expect to be able to do? Does it want to read from a file or write to a file? 302 00:38:12,430 --> 00:38:24,160 Does it expect a database to be present? Does it expect some variables in the programme that are outside of its control to be set? 303 00:38:24,160 --> 00:38:27,820 So we are now making changes to our software, 304 00:38:27,820 --> 00:38:32,080 but these changes are themselves potentially useful because what we're doing 305 00:38:32,080 --> 00:38:38,170 is taking each of these components and reusing it outside the domain of our, 306 00:38:38,170 --> 00:38:50,590 you know, our immediate science problem, and in the domain of a test. This means the changes we make are changes to the reusability of this module. 307 00:38:50,590 --> 00:38:53,830 We can now take this software, this component, 308 00:38:53,830 --> 00:39:04,540 and apply it to different contexts because we now know what we need to do and how to set this thing up so that we can use it elsewhere. 309 00:39:04,540 --> 00:39:10,870 And what we get from doing this is we get much, much more precise feedback when a test fails. 310 00:39:10,870 --> 00:39:18,460 We know there is a failure in this component. If I ran the landing the space shuttle test and my space shuttle didn't land, 311 00:39:18,460 --> 00:39:22,660 I wouldn't need to check the solid rocket boosters because I didn't use them. 312 00:39:22,660 --> 00:39:29,830 I only used the aerodynamic part, and you can imagine going even smaller.
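The questions above about what an isolated component needs to be supplied with can be sketched in Python. This is an illustrative assumption, not code from the talk: `column_mean` is a hypothetical unit that reads from any file-like object rather than opening a named file itself, so a test can hand it in-memory text instead of a real file on disk.

```python
import io


def column_mean(stream, column):
    """Hypothetical unit: mean of one column of comma-separated numbers.

    It accepts any file-like object, so it does not care whether the
    data comes from disk, from the network, or from a test.
    """
    values = [float(line.split(",")[column]) for line in stream if line.strip()]
    if not values:
        raise ValueError("no data rows")
    return sum(values) / len(values)


def test_column_mean_reads_from_any_file_like_object():
    # Given: in-memory text standing in for a data file
    data = io.StringIO("1.0,10.0\n2.0,20.0\n3.0,30.0\n")

    # When: the unit is exercised on its own
    result = column_mean(data, column=1)

    # Then: a failure here points at this function and nothing else
    assert result == 20.0
```

Passing the stream in, rather than a filename, is the kind of design change described above: it is made for the sake of the test, but it also makes the function reusable in other contexts.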
313 00:39:29,830 --> 00:39:35,710 So the cockpit windscreen on the front of the shuttle: 314 00:39:35,710 --> 00:39:45,520 you could test how impact resistant that is just by exposing it to a large force, like the equivalent of hitting it with a hammer. 315 00:39:45,520 --> 00:39:48,160 If it breaks, then you know that the problem is with the windscreen, 316 00:39:48,160 --> 00:39:53,320 not with the rest of the shuttle, and certainly not with all of the engines and other components. 317 00:39:53,320 --> 00:40:00,770 So we're getting much, much more localised and immediately actionable feedback from our test results. 318 00:40:00,770 --> 00:40:07,560 But what we're not doing is answering the question, does this software actually solve a problem 319 00:40:07,560 --> 00:40:22,790 I have? Yes, I need this unit, which is just a way of describing a class or a function, some very small part of software behaviour. 320 00:40:22,790 --> 00:40:28,970 Yes, I need this to work in particular ways, 321 00:40:28,970 --> 00:40:38,030 but it's only going to be providing a valuable contribution to solving my overall problem if that working in 322 00:40:38,030 --> 00:40:48,670 particular ways is then used by the rest of the units in a way that's enabling my problem to be solved. 323 00:40:48,670 --> 00:40:53,500 So it's very common in commercial software, for example, 324 00:40:53,500 --> 00:41:04,180 to find projects that have a large number of unit tests at a very high level of coverage, by which we mean the fraction of the 325 00:41:04,180 --> 00:41:19,190 statements or the different logic flows through the programme that are tested, so that you can find the tests at the unit level. 326 00:41:19,190 --> 00:41:21,700 That's the very small, separate component.
327 00:41:21,700 --> 00:41:31,070 Tests are very well specified in coverage because it's easy for a programmer to think, What do I need this function to do? 328 00:41:31,070 --> 00:41:36,140 But then there are gaps as you get further up the pyramid into the integration and the end 329 00:41:36,140 --> 00:41:41,420 to end levels, such that you actually don't know whether the programme works. 330 00:41:41,420 --> 00:41:46,220 You know that every function in it does what the programmer thought it needed to do, 331 00:41:46,220 --> 00:41:49,700 but you don't know, does this actually solve a problem that anybody has? 332 00:41:49,700 --> 00:42:02,030 So the sort of motivation of having this pyramid graphic is to say, here is a good idea for how you should spend your testing effort: 333 00:42:02,030 --> 00:42:10,610 lots at the small level, which gives you high fidelity, actionable feedback, or sorry, high precision, actionable feedback; 334 00:42:10,610 --> 00:42:17,180 and then some at the top level that say that you actually are able to achieve your goals using the software; 335 00:42:17,180 --> 00:42:26,570 and then some bits in the middle that sort of provide the impedance match between these separate functions work and this whole software works: 336 00:42:26,570 --> 00:42:33,150 these bits, when assembled together, are also correct. 337 00:42:33,150 --> 00:42:37,770 So as I said, this was really an introduction to the concepts of testing. 338 00:42:37,770 --> 00:42:46,350 Here are some specific tools you can look at that are relevant to using particular programming languages. 339 00:42:46,350 --> 00:42:53,850 I've tried to cover most of the things I've seen in the world of scientific computing. 340 00:42:53,850 --> 00:43:03,870 There may be others.
I apologise wholeheartedly to any Fortran programmers who feel left out at the moment, 341 00:43:03,870 --> 00:43:12,420 but I don't have experience with testing Fortran, so I didn't have any recommendations for tools to examine there. 342 00:43:12,420 --> 00:43:16,470 So, quick summary: 343 00:43:16,470 --> 00:43:27,000 the reason we want testing in a scientific context is partly to improve the confidence that we have in our software and partly to improve the 344 00:43:27,000 --> 00:43:37,410 reproducibility of results that we get with the software, because we know how the software acts and how it responds to particular inputs. 345 00:43:37,410 --> 00:43:42,150 Even if we don't know what our scientific outcomes are going to be, 346 00:43:42,150 --> 00:43:52,860 we should at least understand what conceptual model we're trying to express in our software and can say that we have correctly expressed this model, 347 00:43:52,860 --> 00:43:58,800 even if we can't say a priori what the scientific outcomes are going to be. 348 00:43:58,800 --> 00:44:10,920 And the way that we design tests is using the idea of a contract, that given, when, then idea that if I set things up in a particular way, 349 00:44:10,920 --> 00:44:18,840 then use my software, I will get this outcome. And that expression is an assertion. 350 00:44:18,840 --> 00:44:24,810 If it is satisfied, then the software is correct for that case. 351 00:44:24,810 --> 00:44:30,240 If it is not satisfied, then the software is incorrect. It fails to meet our expectations. 352 00:44:30,240 --> 00:44:36,920 The easiest way to get started is just to run your entire programme with a known input. 353 00:44:36,920 --> 00:44:47,520 That way, you know what outcome to expect. And of course, the RSE group can help, and we run these things called software surgeries, 354 00:44:47,520 --> 00:44:53,960 which are like a sort of half hour discussion with one or two research software engineers about your software projects.
355 00:44:53,960 --> 00:45:01,560 So if you want some help getting started with testing or finding out how to use software tests in a particular way, 356 00:45:01,560 --> 00:45:06,120 drop us a line. There's our email address; you can find that out. 357 00:45:06,120 --> 00:45:11,370 So that's the talk done. And I guess now it's time for some questions. All right. 358 00:45:11,370 --> 00:45:16,110 Thank you very much. It was wonderful. Thank you. OK. 359 00:45:16,110 --> 00:45:23,264 So I'm going to stop the recording and then people can freely ask questions.