Okay. Good morning, everybody. Assignment three is due now. I hope that was okay; it probably took a bit more work than the other ones. The solutions are out, and we're happy to talk about them later on if you have questions. I thought I'd show you a couple of things on the screen related to that. So let's see here.

My little code for the ellipse fitting problem was that we used the backslash, that is least squares, in order to fit an ellipse to some data. Not a true nonlinear ellipse fitting problem, but a linearised version of it. And let's see, my code is called a31c. So if I run that, then I can click; I hope you got this to work on your machine. Just put in a bunch of points, and then as soon as I click the right button, I get some kind of an ellipse. I hope that worked for you. Let me try one more; let's make it highly elongated. So you can see it's always an ellipse centred at the origin with this particular formulation. Obviously one could generalise that. So in this case it doesn't fit the data too well, because the data is mostly on one side of the origin.

Then the other problem was the SVD problem, where you're using the singular value decomposition to find some eigenvalues of a non-trivial problem: the modes of oscillation of a quarter ellipse. I hope that went all right for you. I thought I'd show you online that one of the Chebfun examples is a similar problem, the same mathematics applied to a different problem. So if you go to the Chebfun examples, under partial differential equations, one of them is called "Eigenvalues of a trapezoidal drum", and that's using exactly the same method you used for the ellipse. So there's a drum, and in order to find, again, the modes of it, you can do an expansion in Bessel functions, basically, which satisfy the eigenvalue equation but don't necessarily satisfy the boundary condition. So you find coefficients to make such an expansion zero on the boundary. In this case all the challenge is at that particular corner, because that's the one where the function isn't smooth; it's not analytic. So you use the right Bessel functions for that corner, and out of the same mathematics come pictures like what you saw. And then you fool around with your parameters, and you find a value of lambda for which the smallest singular value is zero, and that's an eigenvalue of your oscillating drum.
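For reference, here is a minimal MATLAB sketch of that kind of linearised, origin-centred ellipse fit by backslash. It is not the a31c code itself: the data points are made up, and the model fitted, a*x^2 + b*x*y + c*y^2 = 1, is just one plausible reading of the linearised formulation (the assignment's parametrisation may well have differed).

    % Fit an origin-centred conic a*x^2 + b*x*y + c*y^2 = 1 to data points
    % (x,y) by linear least squares, using backslash.
    x = [ 1.0  0.8 -0.3 -1.1  0.2  0.9]';     % made-up "clicked" points
    y = [ 0.1  0.6  1.0  0.2 -0.9 -0.5]';
    A = [x.^2  x.*y  y.^2];                   % one row per data point
    c = A \ ones(size(x));                    % least-squares coefficients
    % Plot the data and the level curve where the fitted form equals 1.
    F = @(X,Y) c(1)*X.^2 + c(2)*X.*Y + c(3)*Y.^2;
    [X,Y] = meshgrid(linspace(-2,2,200));
    contour(X, Y, F(X,Y), [1 1], 'r'), hold on
    plot(x, y, 'k.', 'MarkerSize', 12), axis equal, hold off

For generic data this level curve need not actually be an ellipse; the formulation in the lecture evidently constrains things so that it always is.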
Okay, so that's that. There's one more assignment, which we'll hand out on Friday, and then that's the end of this term. I guess that will be due in the usual place two weeks later, even though the lectures stop after this week.

Okay. So let me remind you that we've been beginning to talk about optimisation. This is a vast, complicated subject, but I like to start with its purest, cleanest version, which is Newton's method. Last lecture we talked, first of all, about a single equation, which everybody knows; then about a system of equations, which is completely analogous mathematically, except that you now have to solve a linear system of equations at each step of the nonlinear Newton iteration. Then we began to talk about minimising a function, which is closely related to finding a zero of the derivative, but not exactly that. Now you can probably guess that the next step is to minimise a function of several variables. So that's what I want to talk about now: Newton's method for minimising a function of several variables.

Mathematically, we can think of two or three variables, but of course in applications it might be in the millions, depending. Well, it's like the difference from here to here, only more so: mathematically we have what looks like an equivalent system, but in fact a lot more complexity comes up. Let me first of all remind you that we use lowercase f for minimising. So we're now going to suppose we have a function, lowercase f, which maps from R^n to R, and we're trying to minimise that. So we seek an x* in R^n such that f(x*) is a local minimum. And of course that will be related to having zero gradient, although that's not enough; but certainly if you're at a local minimum of a smooth function, the gradient will have to be zero there.

So let's talk about gradients, which are the first derivatives, and then also second derivatives. In general, if you have a function like that, the gradient is a vector of partial derivatives, as you all know, I'm sure. So the gradient of f at a point x in R^n is the vector which we would normally write as a column n-vector: the gradient of f is the column vector consisting of the partial of f with respect to x_1, then x_2, and so on down to x_n.

That's the gradient. And then to do Newton's method, we want to find zeros of the gradient, which means we need to differentiate the gradient in order to extrapolate at each step. So we now need the second derivatives of little f to get the first derivative of the gradient, and that's called the Hessian. So the Hessian of f is the Jacobian of the gradient. Remember, the Jacobian is a matrix of partial derivatives of a vector function.
Here's our vector function; we now make a matrix of partial derivatives. So I could write it like this: ∇²f(x), which is the gradient of f differentiated. And that's an n-by-n symmetric matrix of partial derivatives. So in the upper left we have ∂²f/∂x_1², and down in the lower right we have ∂²f/∂x_n²; down in the lower left, ∂²f/∂x_n∂x_1, and up in the upper right, ∂²f/∂x_1∂x_n. And as you know, it's a fact of mathematics that, at least if the function is smooth, it doesn't matter in which order you take these derivatives, so the partial with respect to x_n and then x_1 is the same as the partial with respect to x_1 and then x_n. So it's a symmetric matrix: the Hessian is a symmetric matrix.

And there already we see something that's a little bit different. When you solve a system of equations, you have a Jacobian which need not be symmetric. But when you're minimising, the Jacobian is a special one which is symmetric. So minimisation is not the same as zero-finding.

So let's write down Newton's method. You can do that without my help, but I'll do it. In its pure form, Newton's method for minimising this function of several variables goes like this: given an initial guess x_0, as always, which is now a vector, we go into our iteration for k = 0, 1, 2, and so on. First of all, we have to evaluate the gradient and the Hessian. And as you can imagine, as a general matter, those could be very serious challenges, both in terms of the amount of work involved and in terms of the accuracy: what method are you going to use for that? Then we have to solve our system of equations, which will involve the Hessian matrix. So here's the system: ∇²f(x_k), there's our matrix, times the step s_k, should be equal to minus the gradient, −∇f(x_k). We solve that symmetric system of equations for s_k, and then we do our update: we say x_{k+1} = x_k + s_k.

So there's the idea behind a lot of optimisation, but it's only the very, very beginning of the subject. You recognise it's just Newton's method for finding zeros of the gradient.
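For reference, here is a minimal MATLAB sketch of that pure iteration. It is not the lecture's m21 code: the two-variable objective below, its hand-computed gradient and Hessian, and the variable names are just an illustration.

    % Pure Newton's method for minimising f(x), x in R^2:
    % at each step solve  H(x) * s = -g(x)  and update  x <- x + s.
    f = @(x) (x(1)-1)^2 + x(1)^2*x(2)^2 + (x(2)+2)^4;           % example objective
    g = @(x) [ 2*(x(1)-1) + 2*x(1)*x(2)^2 ; ...                 % its gradient
               2*x(1)^2*x(2) + 4*(x(2)+2)^3 ];
    H = @(x) [ 2 + 2*x(2)^2 ,  4*x(1)*x(2) ; ...                % its (symmetric) Hessian
               4*x(1)*x(2) ,  2*x(1)^2 + 12*(x(2)+2)^2 ];
    x = [2; 2];                                                 % initial guess
    for k = 1:20
        s = -H(x) \ g(x);                     % solve the symmetric linear system
        x = x + s;                            % Newton update
        fprintf('k = %2d   f = %10.3e   ||grad|| = %8.2e\n', k, f(x), norm(g(x)))
        if norm(g(x)) < 1e-12, break, end     % stop once the gradient is tiny
    end

Note that in this pure form nothing guarantees descent: from a poor initial guess the iteration can wander off or diverge, which is exactly the fragility the demo below illustrates.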
So of course we'll play with that on the computer. We're going to have to bounce back and forth a little bit between machines; I think we start on the usual machine. So if you look at the code called m21, "pure Newton min", I've written down exactly this. I've taken a system of two equations and written explicitly the function f, the gradient g and the Hessian H. And by the way, those letters are very commonly used: we've talked about f, and people often say g for the gradient and H for the Hessian.

So let's go to that code; it's called m21. Let's try it. I hope I'm in the right mode here. Yes, okay. So there's the function f, and we're trying to find the black dot in 2D. It's pretty easy to see it on the screen, and of course we'll follow an iteration. What we need to do is pick our initial guess. So, for example, suppose I take (2, 2) as my initial guess. Then you can see the red dot, and at every step I'll take a step; and you can see that one has not done well. It's not converging; we're already seeing the fragility of Newton's method. You'd probably have figured that would be close enough to get to the black dot, but with Newton's method it's all about the curvature of those contour lines, and because they're curved sort of in the wrong direction, Newton's method, which is using that second-order information, is going the wrong way.

Let's try again. So let's try (1, 1) as our initial guess. That's got to work, right? Is it working? The first step is not impressive, is it? But then the second step is looking great. Now let's look at the numbers here, where we're expecting quadratic convergence, of course. So f is the thing we're trying to minimise, and then g is the gradient, which we're trying to set to zero. So let's see if that thing is going to zero. Take another step. It is smaller. Of course, we're not seeing the red dots anymore. The gradient is now zero to several digits, and now it's ten to the minus sixth, and now it's ten to the minus thirteenth. So sure enough, from that very good initial guess, Newton did fine.

And just because I can't resist, let's do a couple more initial guesses. So if I try, for example, (2, −2): does it work? Looks promising. Yep, very promising. But isn't it striking how the magic of Newton's method just doesn't appear until you get very close to the thing you're looking for? At the moment it just looks like some crude approximation. Look, it's taken the wrong direction. Oops, it's gone in totally the wrong direction. So you see how fragile Newton's method can be. We're going to have to take a better initial guess. Let's take 1.5 and −0.7.
Will it work? Nope, looks bad. It's wandering off. Oh, now we've stopped; that probably was about to converge, but it ran into the limit on the number of steps. So you can see that when computers came along and people started to optimise, Newton's method was only sort of an idea, a vision. Most of the work is that we're going to have to somehow get near that point in order to get our quadratic convergence.

Okay. So I want to say a word about the more practical side of optimisation, still without constraints; we're going to talk about constraints on Friday. We're now talking about the field of unconstrained optimisation. So let me say a few words about what it takes to get from this kind of Newton's method to practical things: from Newton's method to practical optimisation. And I guess I would say that there are two big issues that have to be addressed. In this little example we saw one of them, which was robustness; for bigger examples the other one is also important, namely speed. I'll turn to that in a moment.

But first I want to make a philosophical observation. The vision is quadratic convergence. Now, what does that mean? It means that every time you take a step, you double the number of accurate digits. So the error at one step is proportional to the square of the error at the previous step, which is wonderful, right? If that's happening, you're in great shape: the error at step k+1 is O of the error at step k, squared, that is e_{k+1} = O(e_k^2). I want to make a remark about the significance of that condition, and this is something that certainly puzzled me at one point; it may have puzzled a lot of people. Obviously it's good if you converge quickly, but what's so special about a square there? Newton's method, of course, gives you a square when things are functioning well. But why stop there? Why not have cubic convergence, or quartic convergence? You know, maybe that would have been hard in the 17th century, but now we've got computers, right? So why not develop something like Newton's method that converges at a higher order? Of course it'll be even less robust, that's a fact, so it will take some finicky care to make it really work. But it's certainly possible to develop such algorithms.

So the question is, why aren't these algorithms famous? And that puzzled me for a while, until I realised, or somebody told me, I guess, that actually a cubic algorithm is not any faster than a quadratic algorithm. Let me put it this way. Suppose we had a quartic algorithm:
e_{k+1} = O(e_k^4). So the number of correct digits quadruples at every step; that looks even better. And mathematically that too would be possible: you would need fourth derivatives, so you'd suddenly be dealing with tensors, not just matrices. But the point is, this is just like two steps of the quadratic method, right? If you've got quadratic convergence, then e_{k+2} = O(e_k^4). So the difference between second order and fourth order, or any other order bigger than one, is only a constant factor. All you could dream of gaining by going quartic is sort of halving the work, and of course you'd never in practice halve the work, because it would be such a huge effort to write down your quartic algorithm. So there's nothing to be gained in principle by going to different orders. So all superlinear convergence rates, as we call them, which means better than an exponent of one, are in principle equivalent up to constant factors.

I'm speaking very much in principle; of course there are all sorts of practical issues here. Superlinear means e_{k+1} = O(e_k^alpha) for some alpha bigger than one. Now, if alpha is equal to one, then it is genuinely slower: linear convergence is slower than superlinear convergence. But all superlinear rates are the same. I remember when I first became aware of this. There were some methods discovered for computing pi which were incredibly fast; they converged cubically, so at every iteration you'd get three times as many digits of pi. And I thought that was an amazing advance over the previous algorithms, which only had quadratic convergence. But then somebody pointed out to me this observation: it's only a constant factor at best.

Okay. Now, in practice, in real optimisation, we exploit this to the hilt, and very often we don't achieve quadratic convergence, but we get something with alpha bigger than one and less than two. So that's realistically what often happens in serious optimisation.
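To make the "only a constant factor" point concrete, here is a three-line experiment: starting from an error of 0.5, count the steps a quadratically convergent iteration needs to get below 1e-16, versus a quartically convergent one. It prints 6 versus 3, a factor of exactly two, just as the argument above predicts.

    % Quadratic vs quartic convergence: steps to reach 1e-16 starting from e = 0.5.
    e = 0.5; k = 0;
    while e > 1e-16, e = e^2; k = k+1; end    % digits double each step
    fprintf('quadratic: %d steps\n', k)       % prints 6
    e = 0.5; k = 0;
    while e > 1e-16, e = e^4; k = k+1; end    % digits quadruple each step
    fprintf('quartic:   %d steps\n', k)       % prints 3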
Okay. So now, having gone through the philosophy, let me say a word about the practice. I mentioned that there are two practical issues we want to talk about: one is speed and the other is robustness. So let me say a word about speed first. For our two-by-two problem it's not an issue, but if we look at the algorithm that we've written down here, what we see is the solution of a system of equations. So in principle there is certainly going to be a cost there: we have an n-by-n system, so n-cubed work would be the obvious count for how much work that takes. Then, in addition, we have to compute these objects, and in particular the Hessian matrix has n-squared elements in it, and each one is a derivative, or a second derivative. So that may be a very significant cost too. So the two issues of speed really are how to compute the derivatives, and then how to solve the system. I'll say a word about those.

In my code I just wrote down the derivatives, which I computed on paper, but of course in general you don't do that. Now, there are many ideas around for computing derivatives. The obvious one is some kind of finite difference approximation: you step forward, you step back, you take a difference, and that gives you an estimate of the derivative. So that is certainly a long-established idea.

But that's not always the best idea, and it's certainly not always the most interesting. A more interesting one in recent decades is what people call AD, which stands for automatic differentiation. The idea here is to use your software to compute the derivatives somehow. How many of you have done anything with AD? Okay. It's a very impressive tool that sort of came on board in the 1990s, I guess it was. It's not symbolic differentiation; of course, we all know what it means to write down a formula and differentiate it. AD is different: it actually takes a computer code and, line by line, sort of differentiates each line of the code, so that at the end you end up with something which often is essentially the exact numerically computed derivative. There are various ways to do this. The original idea was to take a code and then transform it to a different code which would produce the derivative. Then along came object-oriented programming, and a slicker idea was to overload the operations in your code so that everything has a new meaning. So I guess the operator-overloading version of automatic differentiation is one of the standards these days. Anyway, it started out with relatively small problems, and then people found ways to get bigger and bigger efficiency on bigger and bigger differentiation problems. Sparse matrices turned out to be a key issue there. So really the thing that made automatic differentiation take off was a careful treatment of the sparsity in your Hessian. This is now one of the major technologies for computing derivatives.
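For reference, here is a minimal sketch of the step-forward, step-back idea just described, as a central-difference approximation to the gradient. The function name and the step-size choices are mine, not the lecture's.

    % Central-difference approximation to the gradient of f at x (column vector).
    % For central differences a step of about eps^(1/3), scaled by x, is reasonable;
    % forward differences would use roughly sqrt(eps) instead.
    function g = fd_gradient(f, x)
        x = x(:);                                    % make sure x is a column
        n = numel(x);
        g = zeros(n, 1);
        h = eps^(1/3) * max(1, abs(x));              % componentwise step sizes
        for i = 1:n
            e = zeros(n, 1);  e(i) = h(i);
            g(i) = (f(x + e) - f(x - e)) / (2*h(i)); % central difference
        end
    end
    % Each column of the Hessian can be approximated the same way by
    % differencing this gradient, at a cost of O(n) gradient evaluations.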
So then, how do you solve the system of equations? Well, of course there's Gaussian elimination, but that's boring. Now, it's a symmetric matrix, so one would hope that you could exploit that. You would hope it's a symmetric positive definite matrix, because that means it has a minimum; if it's not positive definite, you're probably not converging very nicely. But if it is positive definite, then the symmetric positive definite version of Gaussian elimination is Cholesky factorisation. This, however, is not what people normally do for big problems. Just as finite differences are rather boring, so Gaussian elimination is relatively boring: one hopes to do nicer things, like conjugate gradients.

This is such a vast subject; there are corners of it where every possible combination is the optimal thing to do. A remark about conjugate gradients: when I described it to you, I described it as a minimisation procedure where you're searching in bigger and bigger spaces for a minimum of a quadratic form. Well, that was for a linear problem, a quadratic form; but of course we're now talking about nonlinear problems. So conjugate gradients actually has a very natural connection to the larger problem that you're trying to solve. So this is a very important aspect of things.

Now let's turn to robustness. Another thing I must mention first, though, is that you don't always compute derivatives accurately. Very often you have some kind of inexact Newton method, where you approximate derivatives. And the philosophical principle comes into play there: the exact derivative will give you quadratic convergence, but often a reasonable approximation will still give you superlinear convergence.

So then the other issue is the matter of robustness. What do we do about the fact that we're probably not close enough to the minimum when we start for Newton's method to converge? There are lots of ideas here, and one of them is called modified Newton. There are lots of variations on that theme, but the simplest one is: instead of using the Hessian H, add a multiple of the identity to it. Now, if you think about that, the identity is, of course, a nice positive definite matrix. So one thing to be said is that even though H might not be positive definite, if you add enough of the identity, eventually it will be positive definite. Think of some parabolic surface that might be rising in some directions and going down in others: if you add enough of the identity, you're raising everything up, and now it might be positive definite. Another way to think of it is that when that multiple is zero you have pure Newton's method, and as the multiple approaches infinity you approach a steepest descent method, which is a slower algorithm but a more robust one.
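Here is a minimal sketch of one common way to implement that modified-Newton idea, using the fact that Cholesky factorisation succeeds exactly when a symmetric matrix is positive definite: increase the multiple until chol is happy, then solve with the factor. The function name is just for illustration, and the update rule for the multiple is deliberately crude; real codes choose and adapt it much more carefully.

    % Modified Newton step: solve (H + lambda*I) s = -g, with lambda >= 0
    % chosen just large enough that H + lambda*I is positive definite.
    function s = modified_newton_step(H, g)
        n = size(H, 1);
        lambda = 0;
        while true
            [R, p] = chol(H + lambda*eye(n));  % p == 0 means positive definite
            if p == 0, break, end
            lambda = max(2*lambda, 1e-4);      % crude increase; real codes are smarter
        end
        s = -(R \ (R' \ g));                   % use the factor: (R'*R) s = -g
    end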
Another aspect of robustness is the idea of line searches, which I'm going to illustrate in a moment. Here the idea is that you don't take the full Newton step. We saw, for example, in the demo that's still up there, that sometimes the steps went too far; well, maybe not in this one, but sometimes they overshot. Most codes would not do that. They would have some kind of an exploration where you've picked a direction, but now you decide how far to go in that direction. As you converge to the minimum, the length of your step converges to the one that Newton's method has recommended; but when you're far from the minimum, you might take a different length of step, typically shorter.

And indeed that's related to the other method, maybe we should call it B-prime: trust regions. The idea of a trust region is, again, that you don't take a full Newton step, but it's fancier than that. At every stage in your iteration you have some sense of what part of the space you feel you have a reliable model of, and rather than trying to take a step anywhere in your whole search space, you confine yourself to a local approximation problem in some local set. So I'm here at x_k, and I'm trying to get to x*, but of course I don't know where it is. Maybe I've got this trust region, and I only allow myself to go as far as the boundary of the trust region in the next step. And then, of course, I'll update the trust region, and hopefully converge towards the thing I'm looking for.
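Here is a minimal sketch of the line-search idea, grafted onto the Newton (or modified-Newton) step from before: having chosen a direction s, backtrack along it until the function value has genuinely decreased (a sufficient-decrease, or Armijo, test). The function name is illustrative, and practical line searches also impose conditions on the gradient along the line; this is only the simplest version.

    % Damped Newton update with a backtracking (Armijo) line search.
    % f: objective, x: current point, g: gradient at x, s: search direction.
    % Assumes s is a descent direction (g'*s < 0), e.g. a modified Newton step.
    function x = linesearch_update(f, x, g, s)
        t = 1;                                        % try the full step first
        fx = f(x);
        while f(x + t*s) > fx + 1e-4 * t * (g' * s)   % sufficient decrease?
            t = t/2;                                  % no: shorten the step
            if t < 1e-12, break, end                  % give up rather than loop forever
        end
        x = x + t*s;                                  % accept the (possibly shortened) step
    end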
Let's take a look at the table of contents that I copied for you. This is one of the handouts; it's a messy copy, and I apologise for that. Anyway, it's an attempt to show you the table of contents of the wonderful book by Nocedal and Wright called Numerical Optimization. You can sort of make it out, although a lot of the page numbers are obscured, and Chapter 6, whose heading you can't see, is called Quasi-Newton Methods, which is in this realm of not using the true Hessian. I just want to give you a sense of some of the ideas on the table. Now, remember, optimisation has unconstrained and constrained, and we haven't talked yet about constrained; but just looking through the chapters: of course there's an introduction, and then fundamentals, which is more or less what I've been telling you. Then a big chapter on line search methods, followed by a big chapter on trust region methods. So those are the two parallel technologies for robustness. Then conjugate gradient methods, which are a way of speeding things up; quasi-Newton methods, Chapter 6, another way of speeding things up; and then large-scale problems, which have their own considerations. Chapter 8 is on calculating derivatives, and you can see that it begins by talking about finite differences for the gradient, the Jacobian and the Hessian, and how to exploit sparsity; and then Section 8.2 is about automatic differentiation.

Then there's this whole business of what to do if your function might not be very smooth, or if you don't want to take the trouble to estimate derivatives: there's a whole business of derivative-free optimisation, and I'm sure some of you know the fminsearch code in MATLAB is a derivative-free code. Then there are least-squares problems, which are a special case of optimisation of tremendous importance. Nonlinear equations, in Chapter 11, is alluding to finding zeros rather than minima. And then you can see we're halfway through the book, and then it turns to constraints: the back of the sheet of paper is all about how to deal with constraints.

I want to mention a few people, because Oxford is certainly a place with a lot of expertise in optimisation. There are two regular faculty members here, Coralia Cartis and Raphael Hauser, in the Numerical Analysis Group, and these are both optimisers, roughly speaking. She's nonconvex and he's convex, which is a big distinction in optimisation. Hauser, probably, is a bit more theoretical in character; she's also partly theoretical but involved with software. He's away this year, but she's a very good person to talk to for advice. But also there's a visiting professor who visits every week for a day, Nick Gould. So let's see: that's Cartis, Hauser, and Nick Gould. Gould is quite a senior guy and he knows pretty much everything; he visits one day a week and is easily found online, of course, from the Rutherford Appleton Lab. So if you are engaged in an optimisation problem, it's very easy to arrange to talk with one of these people. We also have visiting this term another expert from Belgium, Philippe Toint, another senior figure who knows everything about optimisation; he's visiting this term from the University of Namur. And I would very much recommend that if your work involves optimisation, it's worth talking to these people. It's such a highly developed subject; you need to know what people are doing.

Okay. Let's play with some more codes.
So now what I'm going to do is try to show you something about how real codes behave a little differently from the toy code that I showed you. And in order to do this, because I don't have fminunc on my laptop, I think we're going to have to switch to the other machine. Let's see if that works. Okay. So I'm now going to try the one called m22, fminunc. I actually haven't tried this here, so it may or may not work. That's promising; I guess it does have it. Oh yeah, that's okay.

Right. So we're in the same kind of mode as before: we're trying to find the minimum of this function, and I'm showing you what fminunc actually does. So let's start at (2, 2). There it is. Now, I've modified things so you can see what's happening. It's just taken a... sorry, I'm going to change something here; let me try to get it so you can see more. So there's the picture, and then over here I'll say format compact. Okay, let's start again. So my initial guess was (2, 2). Now, what it does is it begins by trying to estimate some derivatives. So it's just evaluated the function, but now it perturbs... wrong button... now it perturbs x by such a small amount that you can't even see it, which means I should say format long, shouldn't I? Try one more time. Okay, one more time, (2, 2). At the next step it perturbs x in order to compute a derivative, and notice it's perturbed it by about the square root of machine precision. So the red dot hasn't moved, because it's only been perturbed by ten to the minus eight. Now it perturbs y, so it's estimating the other partial derivative; the red dot is still the same. And then finally it takes a step. So you can see it took a pretty good step.

Now notice it's just perturbed x again to compute another derivative, and then it's perturbed y, and now it goes much too far, and it keeps doing these estimates of derivatives. So I always have to press the button three times for the red dot to move. Now notice something that's just happened: we went from that point to that point and then back to here, which means it hasn't chosen a new direction. It's doing this kind of line search thing, looking along a line to try to get an improvement. Now it's probably happy... wrong button... so now it goes off in a new direction, estimating derivatives; now another new direction; and now it'll probably be close to quadratic convergence. Okay. You can see, I don't know if you remember the answers, but it's clearly now converged to many digits. So that gives a little bit of a hint of how these things work.
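The m22 script itself isn't reproduced here, but a bare-bones call to fminunc along the following lines shows the same kind of behaviour. The Rosenbrock function is used purely as a stand-in objective, the helper name rosen_with_grad is mine, and the option names are the ones in recent MATLAB releases (older releases use optimset with 'GradObj','on' instead of SpecifyObjectiveGradient).

    % Minimise a smooth function with fminunc, letting it estimate the
    % gradient by finite differences (as the demo above shows it doing).
    rosen = @(x) (1 - x(1))^2 + 100*(x(2) - x(1)^2)^2;   % stand-in objective
    opts  = optimoptions('fminunc', 'Display', 'iter');  % print every iteration
    [xmin, fmin] = fminunc(rosen, [2; 2], opts)

    % To supply the gradient yourself instead, return it as a second output
    % and set the corresponding option:
    opts2 = optimoptions('fminunc', 'SpecifyObjectiveGradient', true, 'Display', 'iter');
    [xmin2, fmin2] = fminunc(@rosen_with_grad, [-1; 1], opts2)

    function [f, g] = rosen_with_grad(x)   % local function (its own file in older MATLAB)
        f = (1 - x(1))^2 + 100*(x(2) - x(1)^2)^2;
        g = [ -2*(1 - x(1)) - 400*x(1)*(x(2) - x(1)^2) ; ...
               200*(x(2) - x(1)^2) ];
    end

With 'Display','iter' you can watch the iteration and function-evaluation counts grow, much as pressing the button repeatedly does in the demo.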
Let me try another variation. m22b is a code that uses a harder function that we're trying to minimise, a famous example called Rosenbrock's function, and this time I'm giving it the gradient, not making it compute its own gradient. So let's try that: I'll say m22b. So this is the famous Rosenbrock function. We're trying to find that black dot, but if we get stuck in the valley way over there, it can take many steps for certain algorithms to find it. So let's give ourselves an initial guess that's stuck in the valley; I'll say (−1, 1). So there we are. What will happen now? We have the gradient exactly, I believe. It's fooling around here, and then its first attempt goes off in that direction. So it's got to do a line search, so it will realise it's done something stupid, I hope. Yep, see, this is the line search: it's going back, because it's not going to find a new direction until it has at least got some decrease in this direction. Keep going. So now it does have a decrease; that new red dot is slightly lower than the old one, so it now might find a new direction. Let's see. Yes, see, now it's going off in that direction, trying to find a decrease. And it's pretty happy now, I guess. Every time I press the button three times, it finds a new red dot.

So it's kind of striking that even a two-dimensional function like this, and it's not that exotic a function, can be a serious challenge. Of course, if we didn't have all the pause statements it would still find it lickety-split, but there would be a lot of iterations along the way. So optimisation can be amazingly hard.

Let's try m22c. Here we have a more complicated version of the same thing. Let's give it an initial guess of (−3, 1). Oops, I'm sorry, you can't see what I'm doing; I'm typing on the wrong computer. Sorry: m22c. There's the more complicated version of the same thing, and (−3, 1). So, what did it do there? It made a lucky guess. Very, very lucky guess. Okay, let's try again. Let's try (−2.9, 0.9). Okay. So now you can see: this is what software does. And even in two dimensions, that must have been about 100 steps that it took. I found myself intrigued by that lucky guess; I'm going to try one more.
Does anyone want to propose an initial condition? What's that? Ah, a point where the gradient was zero? I guess that's right, yes. Yeah, let's go very close. So let's say −2.99 and 0.99. What will that do? Ah, see, this time the initial guess wasn't so lucky, so it gave up on it. Interesting. Okay. So that's what optimisation is doing in 2D.

Now, for fun, of course, I can't resist showing you a bit of Chebfun, and I think for this I can return to the other computer. Right: Chebfun, of course, likes to deal with functions in one or two or three dimensions. So let me just show you the kind of thing we could do. I could say f equals, and then I make an anonymous function of two variables, the one that we saw a moment ago: one minus x, squared, plus 100 times something, plus a cosine of pi x. If you see a bug, tell me; I hope that's the Rosenbrock function. Nope, I've made a mistake: I want one minus x, squared, plus 100 times (y minus x squared) squared. What's that extra bit for? I don't want it, do I? Thank you, you're quick. How did you do that? Okay. And I'll say f equals chebfun2 of this anonymous function on the box minus three to three, minus three to three.

So chebfun2 now does its usual thing to construct a global representation of that function, and it then uses global methods to do its zero-finding. If we plot it, I could pick some contour levels, say from 0 up to 500, and say contour(f, levels). So there's the same function, and you can see that Chebfun has captured it somehow or other. So if I now say [minval, minpos] = min2(f), Chebfun is doing its thing, and... this function has its actual minimum at (1, 1), and you can see that it's got about half the digits of that. The minimal value is zero, and you can see it's got most of those digits. That's kind of typical: when you're at the bottom of a parabola, the minimal value you can hope to get to machine precision, but the minimum position, because of the parabola, you only hope to get to the square root of machine precision. So here's an example where the global nature of Chebfun has worked beautifully.
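For reference, here is roughly that Chebfun2 session typed out cleanly. It assumes Chebfun (chebfun.org) is on the MATLAB path; the contour levels are a guess at the ones used in the lecture, and the min2 output convention shown is the documented one (minimum value first, then its location).

    % Global minimisation of the Rosenbrock function with Chebfun2.
    f = @(x,y) (1 - x).^2 + 100*(y - x.^2).^2;   % minimum 0 at (1,1)
    F = chebfun2(f, [-3 3 -3 3]);                % global representation on the box
    levels = 0:2:500;                            % guessed contour levels
    contour(F, levels), hold on
    [minval, minpos] = min2(F)                   % global minimum value and location
    plot(minpos(1), minpos(2), 'k.', 'MarkerSize', 20), hold off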
And let's see here. Yeah. Okay. So we have a couple more minutes, and I want to show you something else online. If we go to my web page, or the course web page, you remember there's this thing called Numerical Software Tools, which I showed you earlier. Now that we're optimisers, let's go back to the optimisation section of that. So I'll go down to optimisation, and there isn't much there, but these things link to a lot. So NEOS is a repository of various things related to optimisation. It used to stand for Network-Enabled Optimization Software; I think now it doesn't stand for anything, it's just NEOS. But anyway, if I click on that, I find the NEOS server, and you can see it's a service for all kinds of things related to optimisation. It's hosted in Madison, Wisconsin, which is a big centre for these things. And you can see that it has both software and information: there's this optimisation guide, and then it points you in various ways to software. So let's look at the guide, for example. If I click on the NEOS Optimization Guide... then... whose phone is ringing? Is that mine? Drives you crazy, doesn't it? So you can see lots of information here by the best people in the business. Let's just click at random on optimisation algorithms. Is that somebody's phone? Whose phone is that? Okay. So you can see, under algorithms, all sorts of things, and we can pick one of these at random; let's go to Newton methods. Mm hmm. Yeah. Well, you can do this at home.

Let's just go up a few levels, and I'll show you the other things under optimisation. I'm not sure this decision tree is still there; let me see, the Decision Tree for Optimization Software. I guess it is still there. No, no, wait a minute, there used to be a tree; how do we get to the tree? No, I guess it's not in the form it used to be, so that hasn't been maintained so well.

And then the other one is this MATLAB Optimization Toolbox, just to remind you that MATLAB does have some optimisation built in. The reputation of this toolbox is actually not so great, so if you're serious about optimisation, I think prowling around online is probably a better strategy than becoming fluent in the MATLAB toolbox. But of course there is some stuff here for all sorts of things. You can see nonlinear is what we've been talking about; linear and quadratic programming is where constraints are very important, and we haven't yet talked about that; then mixed-integer linear programming, so when you mix together continuous and discrete variables, that's called mixed-integer programming. This wonderful book I keep telling you about doesn't have the discrete stuff;
it's all continuous. Multi-objective optimisation is when you're trying to minimise several things at once; then there's least squares and data fitting, nonlinear equations, and then parallel computing and derivatives. So those are some of the things that MATLAB offers. Okay. So I guess we have our last lecture on Friday, and I will be handing out the final assignment then.