Okay. Good morning, everybody. Assignment three is due now. I hope that was okay; it probably took a bit more work than the other ones. The solutions are out, and we're happy to talk about them later on if you have questions. I thought I'd show you a couple of things on the screen related to that. So let's see here.

My little code for the ellipse fitting problem was that we used the backslash, that is least squares, in order to fit an ellipse to some data. Not a true nonlinear ellipse fitting problem, but a linearised version of it. And let's see, my code is called a31c. So if I run that, then I can click; I hope you got this to work on your machine. Just put in a bunch of points, and then as soon as I click the right button, I get some kind of an ellipse. I hope that worked for you. Let me try one more; let's make it highly elongated. So you can see it's always an ellipse centred at the origin with this particular formulation. Obviously one could generalise that. So in this case it doesn't fit the data too well, because the data is mostly on one side of the origin.

Then the other problem was the SVD problem, where you're using the singular value decomposition to find some eigenvalues of a non-trivial problem: the modes of oscillation of a quarter ellipse. I hope that went all right for you. I thought I'd show you online that one of the Chebfun examples is a similar problem, the same mathematics applied to a different problem. So if you go to the Chebfun examples, under partial differential equations, one of them is called "Eigenvalues of a trapezoidal drum", and that's using exactly the same method you used for the ellipse. So there's a drum, and in order to find, again, the modes of it, you can do an expansion in Bessel functions, basically, which satisfy the eigenvalue equation but don't necessarily satisfy the boundary condition. So you find coefficients to make such an expansion zero on the boundary. In this case all the challenge is at that particular corner, because that's the one where the function isn't smooth; it's not analytic. So you use the right Bessel functions for that corner, and out of the same mathematics come pictures like what you saw. And then you fool around with your parameters, and you find a value of lambda for which the smallest singular value is zero, and that's an eigenvalue of your oscillating drum.
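For reference, here is a minimal MATLAB sketch of that kind of linearised, origin-centred ellipse fit by backslash. It is not the a31c code itself: the data points are made up, and the model fitted, a*x^2 + b*x*y + c*y^2 = 1, is just one plausible reading of the linearised formulation (the assignment's parametrisation may well have differed).

    % Fit an origin-centred conic a*x^2 + b*x*y + c*y^2 = 1 to data points
    % (x,y) by linear least squares, using backslash.
    x = [ 1.0  0.8 -0.3 -1.1  0.2  0.9]';     % made-up "clicked" points
    y = [ 0.1  0.6  1.0  0.2 -0.9 -0.5]';
    A = [x.^2  x.*y  y.^2];                   % one row per data point
    c = A \ ones(size(x));                    % least-squares coefficients
    % Plot the data and the level curve where the fitted form equals 1.
    F = @(X,Y) c(1)*X.^2 + c(2)*X.*Y + c(3)*Y.^2;
    [X,Y] = meshgrid(linspace(-2,2,200));
    contour(X, Y, F(X,Y), [1 1], 'r'), hold on
    plot(x, y, 'k.', 'MarkerSize', 12), axis equal, hold off

For generic data this level curve need not actually be an ellipse; the formulation in the lecture evidently constrains things so that it always is.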
Okay, so that's that. There's one more assignment, which we'll hand out on Friday, and then that's the end of this term. I guess that will be due in the usual place two weeks later, even though the lectures stop after this week.

Okay. So let me remind you that we've been beginning to talk about optimisation. This is a vast, complicated subject, but I like to start with its purest, cleanest version, which is Newton's method. Last lecture we talked, first of all, about a single equation, which everybody knows; then about a system of equations, which is completely analogous mathematically, except that you now have to solve a linear system of equations at each step of the nonlinear Newton iteration. Then we began to talk about minimising a function, which is closely related to finding a zero of the derivative, but not exactly that. Now you can probably guess that the next step is to minimise a function of several variables. So that's what I want to talk about now: Newton's method for minimising a function of several variables.

Mathematically, we can think of two or three variables, but of course in applications it might be in the millions, depending. Well, it's like the difference from here to here, only more so: mathematically we have what looks like an equivalent system, but in fact a lot more complexity comes up. Let me first of all remind you that we use lowercase f for minimising. So we're now going to suppose we have a function, lowercase f, which maps from R^n to R, and we're trying to minimise that. So we seek an x* in R^n such that f(x*) is a local minimum. And of course that will be related to having zero gradient, although that's not enough; but certainly if you're at a local minimum of a smooth function, the gradient will have to be zero there.

So let's talk about gradients, which are the first derivatives, and then also second derivatives. In general, if you have a function like that, the gradient is a vector of partial derivatives, as you all know, I'm sure. So the gradient of f at a point x in R^n is the vector which we would normally write as a column n-vector: the gradient of f is the column vector consisting of the partial of f with respect to x_1, then x_2, and so on down to x_n.

That's the gradient. And then to do Newton's method, we want to find zeros of the gradient, which means we need to differentiate the gradient in order to extrapolate at each step. So we now need the second derivatives of little f to get the first derivative of the gradient, and that's called the Hessian. So the Hessian of f is the Jacobian of the gradient. Remember, the Jacobian is a matrix of partial derivatives of a vector function.
Here's our vector function; we now make a matrix of partial derivatives. So I could write it like this: ∇²f(x), which is the gradient of f differentiated. And that's an n-by-n symmetric matrix of partial derivatives. So in the upper left we have ∂²f/∂x_1², and down in the lower right we have ∂²f/∂x_n²; down in the lower left, ∂²f/∂x_n∂x_1, and up in the upper right, ∂²f/∂x_1∂x_n. And as you know, it's a fact of mathematics that, at least if the function is smooth, it doesn't matter in which order you take these derivatives, so the partial with respect to x_n and then x_1 is the same as the partial with respect to x_1 and then x_n. So it's a symmetric matrix: the Hessian is a symmetric matrix.

And there already we see something that's a little bit different. When you solve a system of equations, you have a Jacobian which need not be symmetric. But when you're minimising, the Jacobian is a special one which is symmetric. So minimisation is not the same as zero-finding.

So let's write down Newton's method. You can do that without my help, but I'll do it. In its pure form, Newton's method for minimising this function of several variables goes like this: given an initial guess x_0, as always, which is now a vector, we go into our iteration for k = 0, 1, 2, and so on. First of all, we have to evaluate the gradient and the Hessian. And as you can imagine, as a general matter, those could be very serious challenges, both in terms of the amount of work involved and in terms of the accuracy: what method are you going to use for that? Then we have to solve our system of equations, which will involve the Hessian matrix. So here's the system: ∇²f(x_k), there's our matrix, times the step s_k, should be equal to minus the gradient, −∇f(x_k). We solve that symmetric system of equations for s_k, and then we do our update: we say x_{k+1} = x_k + s_k.

So there's the idea behind a lot of optimisation, but it's only the very, very beginning of the subject. You recognise it's just Newton's method for finding zeros of the gradient.
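For reference, here is a minimal MATLAB sketch of that pure iteration. It is not the lecture's m21 code: the two-variable objective below, its hand-computed gradient and Hessian, and the variable names are just an illustration.

    % Pure Newton's method for minimising f(x), x in R^2:
    % at each step solve  H(x) * s = -g(x)  and update  x <- x + s.
    f = @(x) (x(1)-1)^2 + x(1)^2*x(2)^2 + (x(2)+2)^4;           % example objective
    g = @(x) [ 2*(x(1)-1) + 2*x(1)*x(2)^2 ; ...                 % its gradient
               2*x(1)^2*x(2) + 4*(x(2)+2)^3 ];
    H = @(x) [ 2 + 2*x(2)^2 ,  4*x(1)*x(2) ; ...                % its (symmetric) Hessian
               4*x(1)*x(2) ,  2*x(1)^2 + 12*(x(2)+2)^2 ];
    x = [2; 2];                                                 % initial guess
    for k = 1:20
        s = -H(x) \ g(x);                     % solve the symmetric linear system
        x = x + s;                            % Newton update
        fprintf('k = %2d   f = %10.3e   ||grad|| = %8.2e\n', k, f(x), norm(g(x)))
        if norm(g(x)) < 1e-12, break, end     % stop once the gradient is tiny
    end

Note that in this pure form nothing guarantees descent: from a poor initial guess the iteration can wander off or diverge, which is exactly the fragility the demo below illustrates.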
So of course we'll play with that on the computer. We're going to have to bounce back and forth a little bit between machines; I think we start on the usual machine. So if you look at the code called m21, "pure Newton min", I've written down exactly this. I've taken a system of two equations and written explicitly the function f, the gradient g and the Hessian H. And by the way, those letters are very commonly used: we've talked about f, and people often say g for the gradient and H for the Hessian.

So let's go to that code; it's called m21. Let's try it. I hope I'm in the right mode here. Yes, okay. So there's the function f, and we're trying to find the black dot in 2D. It's pretty easy to see it on the screen, and of course we'll follow an iteration. What we need to do is pick our initial guess. So, for example, suppose I take (2, 2) as my initial guess. Then you can see the red dot, and at every step I'll take a step; and you can see that one has not done well. It's not converging; we're already seeing the fragility of Newton's method. You'd probably have figured that would be close enough to get to the black dot, but with Newton's method it's all about the curvature of those contour lines, and because they're curved sort of in the wrong direction, Newton's method, which is using that second-order information, is going the wrong way.

Let's try again. So let's try (1, 1) as our initial guess. That's got to work, right? Is it working? The first step is not impressive, is it? But then the second step is looking great. Now let's look at the numbers here, where we're expecting quadratic convergence, of course. So f is the thing we're trying to minimise, and then g is the gradient, which we're trying to set to zero. So let's see if that thing is going to zero. Take another step. It is smaller. Of course, we're not seeing the red dots anymore. The gradient is now zero to several digits, and now it's ten to the minus sixth, and now it's ten to the minus thirteenth. So sure enough, from that very good initial guess, Newton did fine.

And just because I can't resist, let's do a couple more initial guesses. So if I try, for example, (2, −2): does it work? Looks promising. Yep, very promising. But isn't it striking how the magic of Newton's method just doesn't appear until you get very close to the thing you're looking for? At the moment it just looks like some crude approximation. Look, it's taken the wrong direction. Oops, it's gone in totally the wrong direction. So you see how fragile Newton's method can be. We're going to have to take a better initial guess. Let's take 1.5 and −0.7.
Will it work? Nope, looks bad. It's wandering off. Oh, now we've stopped; that probably was about to converge, but it ran into the limit on the number of steps. So you can see that when computers came along and people started to optimise, Newton's method was only sort of an idea, a vision. Most of the work is that we're going to have to somehow get near that point in order to get our quadratic convergence.

Okay. So I want to say a word about the more practical side of optimisation, still without constraints; we're going to talk about constraints on Friday. We're now talking about the field of unconstrained optimisation. So let me say a few words about what it takes to get from this kind of Newton's method to practical things: from Newton's method to practical optimisation. And I guess I would say that there are two big issues that have to be addressed. In this little example we saw one of them, which was robustness; for bigger examples the other one is also important, namely speed. I'll turn to that in a moment.

But first I want to make a philosophical observation. The vision is quadratic convergence. Now, what does that mean? It means that every time you take a step, you double the number of accurate digits. So the error at one step is proportional to the square of the error at the previous step, which is wonderful, right? If that's happening, you're in great shape: the error at step k+1 is O of the error at step k, squared, that is e_{k+1} = O(e_k^2). I want to make a remark about the significance of that condition, and this is something that certainly puzzled me at one point; it may have puzzled a lot of people. Obviously it's good if you converge quickly, but what's so special about a square there? Newton's method, of course, gives you a square when things are functioning well. But why stop there? Why not have cubic convergence, or quartic convergence? You know, maybe that would have been hard in the 17th century, but now we've got computers, right? So why not develop something like Newton's method that converges at a higher order? Of course it'll be even less robust, that's a fact, so it will take some finicky care to make it really work. But it's certainly possible to develop such algorithms.

So the question is, why aren't these algorithms famous? And that puzzled me for a while, until I realised, or somebody told me, I guess, that actually a cubic algorithm is not any faster than a quadratic algorithm. Let me put it this way. Suppose we had a quartic algorithm:
e_{k+1} = O(e_k^4). So the number of correct digits quadruples at every step; that looks even better. And mathematically that too would be possible: you would need fourth derivatives, so you'd suddenly be dealing with tensors, not just matrices. But the point is, this is just like two steps of the quadratic method, right? If you've got quadratic convergence, then e_{k+2} = O(e_k^4). So the difference between second order and fourth order, or any other order bigger than one, is only a constant factor. All you could dream of gaining by going quartic is sort of halving the work, and of course you'd never in practice halve the work, because it would be such a huge effort to write down your quartic algorithm. So there's nothing to be gained in principle by going to different orders. So all superlinear convergence rates, as we call them, which means better than an exponent of one, are in principle equivalent up to constant factors.

I'm speaking very much in principle; of course there are all sorts of practical issues here. Superlinear means e_{k+1} = O(e_k^alpha) for some alpha bigger than one. Now, if alpha is equal to one, then it is genuinely slower: linear convergence is slower than superlinear convergence. But all superlinear rates are the same. I remember when I first became aware of this. There were some methods discovered for computing pi which were incredibly fast; they converged cubically, so at every iteration you'd get three times as many digits of pi. And I thought that was an amazing advance over the previous algorithms, which only had quadratic convergence. But then somebody pointed out to me this observation: it's only a constant factor at best.

Okay. Now, in practice, in real optimisation, we exploit this to the hilt, and very often we don't achieve quadratic convergence, but we get something with alpha bigger than one and less than two. So that's realistically what often happens in serious optimisation.
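To make the "only a constant factor" point concrete, here is a three-line experiment: starting from an error of 0.5, count the steps a quadratically convergent iteration needs to get below 1e-16, versus a quartically convergent one. It prints 6 versus 3, a factor of exactly two, just as the argument above predicts.

    % Quadratic vs quartic convergence: steps to reach 1e-16 starting from e = 0.5.
    e = 0.5; k = 0;
    while e > 1e-16, e = e^2; k = k+1; end    % digits double each step
    fprintf('quadratic: %d steps\n', k)       % prints 6
    e = 0.5; k = 0;
    while e > 1e-16, e = e^4; k = k+1; end    % digits quadruple each step
    fprintf('quartic:   %d steps\n', k)       % prints 3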
Okay. So now, having gone through the philosophy, let me say a word about the practice. I mentioned that there are two practical issues we want to talk about: one is speed and the other is robustness. So let me say a word about speed first. For our two-by-two problem it's not an issue, but if we look at the algorithm that we've written down here, what we see is the solution of a system of equations. So in principle there is certainly going to be a cost there: we have an n-by-n system, so n-cubed work would be the obvious count for how much work that takes. Then, in addition, we have to compute these objects, and in particular the Hessian matrix has n-squared elements in it, and each one is a derivative, or a second derivative. So that may be a very significant cost too. So the two issues of speed really are how to compute the derivatives, and then how to solve the system. I'll say a word about those.

In my code I just wrote down the derivatives, which I computed on paper, but of course in general you don't do that. Now, there are many ideas around for computing derivatives. The obvious one is some kind of finite difference approximation: you step forward, you step back, you take a difference, and that gives you an estimate of the derivative. So that is certainly a long-established idea.

But that's not always the best idea, and it's certainly not always the most interesting. A more interesting one in recent decades is what people call AD, which stands for automatic differentiation. The idea here is to use your software to compute the derivatives somehow. How many of you have done anything with AD? Okay. It's a very impressive tool that sort of came on board in the 1990s, I guess it was. It's not symbolic differentiation; of course, we all know what it means to write down a formula and differentiate it. AD is different: it actually takes a computer code and, line by line, sort of differentiates each line of the code, so that at the end you end up with something which often is essentially the exact numerically computed derivative. There are various ways to do this. The original idea was to take a code and then transform it to a different code which would produce the derivative. Then along came object-oriented programming, and a slicker idea was to overload the operations in your code so that everything has a new meaning. So I guess the operator-overloading version of automatic differentiation is one of the standards these days. Anyway, it started out with relatively small problems, and then people found ways to get bigger and bigger efficiency on bigger and bigger differentiation problems. Sparse matrices turned out to be a key issue there. So really the thing that made automatic differentiation take off was a careful treatment of the sparsity in your Hessian. This is now one of the major technologies for computing derivatives.
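For reference, here is a minimal sketch of the step-forward, step-back idea just described, as a central-difference approximation to the gradient. The function name and the step-size choices are mine, not the lecture's.

    % Central-difference approximation to the gradient of f at x (column vector).
    % For central differences a step of about eps^(1/3), scaled by x, is reasonable;
    % forward differences would use roughly sqrt(eps) instead.
    function g = fd_gradient(f, x)
        x = x(:);                                    % make sure x is a column
        n = numel(x);
        g = zeros(n, 1);
        h = eps^(1/3) * max(1, abs(x));              % componentwise step sizes
        for i = 1:n
            e = zeros(n, 1);  e(i) = h(i);
            g(i) = (f(x + e) - f(x - e)) / (2*h(i)); % central difference
        end
    end
    % Each column of the Hessian can be approximated the same way by
    % differencing this gradient, at a cost of O(n) gradient evaluations.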
So then, how do you solve the system of equations? Well, of course there's Gaussian elimination, but that's boring. Now, it's a symmetric matrix, so one would hope that you could exploit that. You would hope it's a symmetric positive definite matrix, because that means it has a minimum; if it's not positive definite, you're probably not converging very nicely. But if it is positive definite, then the symmetric positive definite version of Gaussian elimination is Cholesky factorisation. This, however, is not what people normally do for big problems. Just as finite differences are rather boring, so Gaussian elimination is relatively boring: one hopes to do nicer things, like conjugate gradients.

This is such a vast subject; there are corners of it where every possible combination is the optimal thing to do. A remark about conjugate gradients: when I described it to you, I described it as a minimisation procedure where you're searching in bigger and bigger spaces for a minimum of a quadratic form. Well, that was for a linear problem, a quadratic form; but of course we're now talking about nonlinear problems. So conjugate gradients actually has a very natural connection to the larger problem that you're trying to solve. So this is a very important aspect of things.

Now let's turn to robustness. Another thing I must mention first, though, is that you don't always compute derivatives accurately. Very often you have some kind of inexact Newton method, where you approximate derivatives. And the philosophical principle comes into play there: the exact derivative will give you quadratic convergence, but often a reasonable approximation will still give you superlinear convergence.

So then the other issue is the matter of robustness. What do we do about the fact that we're probably not close enough to the minimum when we start for Newton's method to converge? There are lots of ideas here, and one of them is called modified Newton. There are lots of variations on that theme, but the simplest one is: instead of using the Hessian H, add a multiple of the identity to it. Now, if you think about that, the identity is, of course, a nice positive definite matrix. So one thing to be said is that even though H might not be positive definite, if you add enough of the identity, eventually it will be positive definite. Think of some parabolic surface that might be rising in some directions and going down in others: if you add enough of the identity, you're raising everything up, and now it might be positive definite. Another way to think of it is that when that multiple is zero you have pure Newton's method, and as the multiple approaches infinity you approach a steepest descent method, which is a slower algorithm but a more robust one.
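Here is a minimal sketch of one common way to implement that modified-Newton idea, using the fact that Cholesky factorisation succeeds exactly when a symmetric matrix is positive definite: increase the multiple until chol is happy, then solve with the factor. The function name is just for illustration, and the update rule for the multiple is deliberately crude; real codes choose and adapt it much more carefully.

    % Modified Newton step: solve (H + lambda*I) s = -g, with lambda >= 0
    % chosen just large enough that H + lambda*I is positive definite.
    function s = modified_newton_step(H, g)
        n = size(H, 1);
        lambda = 0;
        while true
            [R, p] = chol(H + lambda*eye(n));  % p == 0 means positive definite
            if p == 0, break, end
            lambda = max(2*lambda, 1e-4);      % crude increase; real codes are smarter
        end
        s = -(R \ (R' \ g));                   % use the factor: (R'*R) s = -g
    end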
Another aspect of robustness is the idea of line searches, which I'm going to illustrate in a moment. Here the idea is that you don't take the full Newton step. We saw, for example, in the demo that's still up there, that sometimes the steps went too far; well, maybe not in this one, but sometimes they overshot. Most codes would not do that. They would have some kind of an exploration where you've picked a direction, but now you decide how far to go in that direction. As you converge to the minimum, the length of your step converges to the one that Newton's method has recommended; but when you're far from the minimum, you might take a different length of step, typically shorter.

And indeed that's related to the other method, maybe we should call it B-prime: trust regions. The idea of a trust region is, again, that you don't take a full Newton step, but it's fancier than that. At every stage in your iteration you have some sense of what part of the space you feel you have a reliable model of, and rather than trying to take a step anywhere in your whole search space, you confine yourself to a local approximation problem in some local set. So I'm here at x_k, and I'm trying to get to x*, but of course I don't know where it is. Maybe I've got this trust region, and I only allow myself to go as far as the boundary of the trust region in the next step. And then, of course, I'll update the trust region, and hopefully converge towards the thing I'm looking for.
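Here is a minimal sketch of the line-search idea, grafted onto the Newton (or modified-Newton) step from before: having chosen a direction s, backtrack along it until the function value has genuinely decreased (a sufficient-decrease, or Armijo, test). The function name is illustrative, and practical line searches also impose conditions on the gradient along the line; this is only the simplest version.

    % Damped Newton update with a backtracking (Armijo) line search.
    % f: objective, x: current point, g: gradient at x, s: search direction.
    % Assumes s is a descent direction (g'*s < 0), e.g. a modified Newton step.
    function x = linesearch_update(f, x, g, s)
        t = 1;                                        % try the full step first
        fx = f(x);
        while f(x + t*s) > fx + 1e-4 * t * (g' * s)   % sufficient decrease?
            t = t/2;                                  % no: shorten the step
            if t < 1e-12, break, end                  % give up rather than loop forever
        end
        x = x + t*s;                                  % accept the (possibly shortened) step
    end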
Let's take a look at the table of contents that I copied for you. This is one of the handouts; it's a messy copy, and I apologise for that. Anyway, it's an attempt to show you the table of contents of the wonderful book by Nocedal and Wright called Numerical Optimization. You can sort of make it out, although a lot of the page numbers are obscured, and Chapter 6, whose heading you can't see, is called Quasi-Newton Methods, which is in this realm of not using the true Hessian. I just want to give you a sense of some of the ideas on the table. Now, remember, optimisation has unconstrained and constrained, and we haven't talked yet about constrained; but just looking through the chapters: of course there's an introduction, and then fundamentals, which is more or less what I've been telling you. Then a big chapter on line search methods, followed by a big chapter on trust region methods. So those are the two parallel technologies for robustness. Then conjugate gradient methods, which are a way of speeding things up; quasi-Newton methods, Chapter 6, another way of speeding things up; and then large-scale problems, which have their own considerations. Chapter 8 is on calculating derivatives, and you can see that it begins by talking about finite differences for the gradient, the Jacobian and the Hessian, and how to exploit sparsity; and then Section 8.2 is about automatic differentiation.

Then there's this whole business of what to do if your function might not be very smooth, or if you don't want to take the trouble to estimate derivatives: there's a whole business of derivative-free optimisation, and I'm sure some of you know the fminsearch code in MATLAB is a derivative-free code. Then there are least-squares problems, which are a special case of optimisation of tremendous importance. Nonlinear equations, in Chapter 11, is alluding to finding zeros rather than minima. And then you can see we're halfway through the book, and then it turns to constraints: the back of the sheet of paper is all about how to deal with constraints.

I want to mention a few people, because Oxford is certainly a place with a lot of expertise in optimisation. There are two regular faculty members here, Coralia Cartis and Raphael Hauser, in the Numerical Analysis Group, and these are both optimisers, roughly speaking. She's nonconvex and he's convex, which is a big distinction in optimisation. Hauser, probably, is a bit more theoretical in character; she's also partly theoretical but involved with software. He's away this year, but she's a very good person to talk to for advice. But also there's a visiting professor who visits every week for a day, Nick Gould. So let's see: that's Cartis, Hauser, and Nick Gould. Gould is quite a senior guy and he knows pretty much everything; he visits one day a week and is easily found online, of course, from the Rutherford Appleton Lab. So if you are engaged in an optimisation problem, it's very easy to arrange to talk with one of these people. We also have visiting this term another expert from Belgium, Philippe Toint, another senior figure who knows everything about optimisation; he's visiting this term from the University of Namur. And I would very much recommend that if your work involves optimisation, it's worth talking to these people. It's such a highly developed subject; you need to know what people are doing.

Okay. Let's play with some more codes.
So now what I'm going to do is try to show you something about how real codes behave a little differently from the toy code that I showed you. And in order to do this, because I don't have fminunc on my laptop, I think we're going to have to switch to the other machine. Let's see if that works. Okay. So I'm now going to try the one called m22, fminunc. I actually haven't tried this here, so it may or may not work. That's promising; I guess it does have it. Oh yeah, that's okay.

Right. So we're in the same kind of mode as before: we're trying to find the minimum of this function, and I'm showing you what fminunc actually does. So let's start at (2, 2). There it is. Now, I've modified things so you can see what's happening. It's just taken a... sorry, I'm going to change something here; let me try to get it so you can see more. So there's the picture, and then over here I'll say format compact. Okay, let's start again. So my initial guess was (2, 2). Now, what it does is it begins by trying to estimate some derivatives. So it's just evaluated the function, but now it perturbs... wrong button... now it perturbs x by such a small amount that you can't even see it, which means I should say format long, shouldn't I? Try one more time. Okay, one more time, (2, 2). At the next step it perturbs x in order to compute a derivative, and notice it's perturbed it by about the square root of machine precision. So the red dot hasn't moved, because it's only been perturbed by ten to the minus eight. Now it perturbs y, so it's estimating the other partial derivative; the red dot is still the same. And then finally it takes a step. So you can see it took a pretty good step.

Now notice it's just perturbed x again to compute another derivative, and then it's perturbed y, and now it goes much too far, and it keeps doing these estimates of derivatives. So I always have to press the button three times for the red dot to move. Now notice something that's just happened: we went from that point to that point and then back to here, which means it hasn't chosen a new direction. It's doing this kind of line search thing, looking along a line to try to get an improvement. Now it's probably happy... wrong button... so now it goes off in a new direction, estimating derivatives; now another new direction; and now it'll probably be close to quadratic convergence. Okay. You can see, I don't know if you remember the answers, but it's clearly now converged to many digits. So that gives a little bit of a hint of how these things work.
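The m22 script itself isn't reproduced here, but a bare-bones call to fminunc along the following lines shows the same kind of behaviour. The Rosenbrock function is used purely as a stand-in objective, the helper name rosen_with_grad is mine, and the option names are the ones in recent MATLAB releases (older releases use optimset with 'GradObj','on' instead of SpecifyObjectiveGradient).

    % Minimise a smooth function with fminunc, letting it estimate the
    % gradient by finite differences (as the demo above shows it doing).
    rosen = @(x) (1 - x(1))^2 + 100*(x(2) - x(1)^2)^2;   % stand-in objective
    opts  = optimoptions('fminunc', 'Display', 'iter');  % print every iteration
    [xmin, fmin] = fminunc(rosen, [2; 2], opts)

    % To supply the gradient yourself instead, return it as a second output
    % and set the corresponding option:
    opts2 = optimoptions('fminunc', 'SpecifyObjectiveGradient', true, 'Display', 'iter');
    [xmin2, fmin2] = fminunc(@rosen_with_grad, [-1; 1], opts2)

    function [f, g] = rosen_with_grad(x)   % local function (its own file in older MATLAB)
        f = (1 - x(1))^2 + 100*(x(2) - x(1)^2)^2;
        g = [ -2*(1 - x(1)) - 400*x(1)*(x(2) - x(1)^2) ; ...
               200*(x(2) - x(1)^2) ];
    end

With 'Display','iter' you can watch the iteration and function-evaluation counts grow, much as pressing the button repeatedly does in the demo.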
Let me try another variation. m22b is a code that uses a harder function that we're trying to minimise, a famous example called Rosenbrock's function, and this time I'm giving it the gradient, not making it compute its own gradient. So let's try that: I'll say m22b. So this is the famous Rosenbrock function. We're trying to find that black dot, but if we get stuck in the valley way over there, it can take many steps for certain algorithms to find it. So let's give ourselves an initial guess that's stuck in the valley; I'll say (−1, 1). So there we are. What will happen now? We have the gradient exactly, I believe. It's fooling around here, and then its first attempt goes off in that direction. So it's got to do a line search, so it will realise it's done something stupid, I hope. Yep, see, this is the line search: it's going back, because it's not going to find a new direction until it has at least got some decrease in this direction. Keep going. So now it does have a decrease; that new red dot is slightly lower than the old one, so it now might find a new direction. Let's see. Yes, see, now it's going off in that direction, trying to find a decrease. And it's pretty happy now, I guess. Every time I press the button three times, it finds a new red dot.

So it's kind of striking that even a two-dimensional function like this, and it's not that exotic a function, can be a serious challenge. Of course, if we didn't have all the pause statements it would still find it lickety-split, but there would be a lot of iterations along the way. So optimisation can be amazingly hard.

Let's try m22c. Here we have a more complicated version of the same thing. Let's give it an initial guess of (−3, 1). Oops, I'm sorry, you can't see what I'm doing; I'm typing on the wrong computer. Sorry: m22c. There's the more complicated version of the same thing, and (−3, 1). So, what did it do there? It made a lucky guess. Very, very lucky guess. Okay, let's try again. Let's try (−2.9, 0.9). Okay. So now you can see: this is what software does. And even in two dimensions, that must have been about 100 steps that it took. I found myself intrigued by that lucky guess; I'm going to try one more.
Does anyone want to propose an initial condition? What's that? Ah, a point where the gradient was zero? I guess that's right, yes. Yeah, let's go very close. So let's say −2.99 and 0.99. What will that do? Ah, see, this time the initial guess wasn't so lucky, so it gave up on it. Interesting. Okay. So that's what optimisation is doing in 2D.

Now, for fun, of course, I can't resist showing you a bit of Chebfun, and I think for this I can return to the other computer. Right: Chebfun, of course, likes to deal with functions in one or two or three dimensions. So let me just show you the kind of thing we could do. I could say f equals, and then I make an anonymous function of two variables, the one that we saw a moment ago: one minus x, squared, plus 100 times something, plus a cosine of pi x. If you see a bug, tell me; I hope that's the Rosenbrock function. Nope, I've made a mistake: I want one minus x, squared, plus 100 times (y minus x squared) squared. What's that extra bit for? I don't want it, do I? Thank you, you're quick. How did you do that? Okay. And I'll say f equals chebfun2 of this anonymous function on the box minus three to three, minus three to three.

So chebfun2 now does its usual thing to construct a global representation of that function, and it then uses global methods to do its zero-finding. If we plot it, I could pick some contour levels, say from 0 up to 500, and say contour(f, levels). So there's the same function, and you can see that Chebfun has captured it somehow or other. So if I now say [minval, minpos] = min2(f), Chebfun is doing its thing, and... this function has its actual minimum at (1, 1), and you can see that it's got about half the digits of that. The minimal value is zero, and you can see it's got most of those digits. That's kind of typical: when you're at the bottom of a parabola, the minimal value you can hope to get to machine precision, but the minimum position, because of the parabola, you only hope to get to the square root of machine precision. So here's an example where the global nature of Chebfun has worked beautifully.
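For reference, here is roughly that Chebfun2 session typed out cleanly. It assumes Chebfun (chebfun.org) is on the MATLAB path; the contour levels are a guess at the ones used in the lecture, and the min2 output convention shown is the documented one (minimum value first, then its location).

    % Global minimisation of the Rosenbrock function with Chebfun2.
    f = @(x,y) (1 - x).^2 + 100*(y - x.^2).^2;   % minimum 0 at (1,1)
    F = chebfun2(f, [-3 3 -3 3]);                % global representation on the box
    levels = 0:2:500;                            % guessed contour levels
    contour(F, levels), hold on
    [minval, minpos] = min2(F)                   % global minimum value and location
    plot(minpos(1), minpos(2), 'k.', 'MarkerSize', 20), hold off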
And let's see here. Yeah. Okay. So we have a couple more minutes, and I want to show you something else online. If we go to my web page, or the course web page, you remember there's this thing called Numerical Software Tools, which I showed you earlier. Now that we're optimisers, let's go back to the optimisation section of that. So I'll go down to optimisation, and there isn't much there, but these things link to a lot. So NEOS is a repository of various things related to optimisation. It used to stand for Network-Enabled Optimization Software; I think now it doesn't stand for anything, it's just NEOS. But anyway, if I click on that, I find the NEOS server, and you can see it's a service for all kinds of things related to optimisation. It's hosted in Madison, Wisconsin, which is a big centre for these things. And you can see that it has both software and information: there's this optimisation guide, and then it points you in various ways to software. So let's look at the guide, for example. If I click on the NEOS Optimization Guide... then... whose phone is ringing? Is that mine? Drives you crazy, doesn't it? So you can see lots of information here by the best people in the business. Let's just click at random on optimisation algorithms. Is that somebody's phone? Whose phone is that? Okay. So you can see, under algorithms, all sorts of things, and we can pick one of these at random; let's go to Newton methods. Mm hmm. Yeah. Well, you can do this at home.

Let's just go up a few levels, and I'll show you the other things under optimisation. I'm not sure this decision tree is still there; let me see, the Decision Tree for Optimization Software. I guess it is still there. No, no, wait a minute, there used to be a tree; how do we get to the tree? No, I guess it's not in the form it used to be, so that hasn't been maintained so well.

And then the other one is this MATLAB Optimization Toolbox, just to remind you that MATLAB does have some optimisation built in. The reputation of this toolbox is actually not so great, so if you're serious about optimisation, I think prowling around online is probably a better strategy than becoming fluent in the MATLAB toolbox. But of course there is some stuff here for all sorts of things. You can see nonlinear is what we've been talking about; linear and quadratic programming is where constraints are very important, and we haven't yet talked about that; then mixed-integer linear programming, so when you mix together continuous and discrete variables, that's called mixed-integer programming. This wonderful book I keep telling you about doesn't have the discrete stuff;
it's all continuous. Multi-objective optimisation is when you're trying to minimise several things at once; then there's least squares and data fitting, nonlinear equations, and then parallel computing and derivatives. So those are some of the things that MATLAB offers. Okay. So I guess we have our last lecture on Friday, and I will be handing out the final assignment then.