1 00:00:03,590 --> 00:00:11,809 Okay. Good morning, everybody. I'm not sure if there's anybody new here, but this is Scientific Computing for DPhil Students, 2 00:00:11,810 --> 00:00:15,770 the second of 12 lectures in this term and then 12 more in the next term. 3 00:00:16,970 --> 00:00:19,100 Let me remind you of a few things. 4 00:00:19,130 --> 00:00:27,710 One is that there's an assignment due next Tuesday, which certainly requires you to make sure you've got MATLAB working. 5 00:00:28,070 --> 00:00:31,160 Once you've got MATLAB working, it's not a very large assignment. 6 00:00:32,570 --> 00:00:35,840 And you turn them in in the boxes across the atrium here, 7 00:00:36,140 --> 00:00:41,830 160 or 161. Let me also remind you that we have a course web page. 8 00:00:41,840 --> 00:00:47,540 So there is the web page, and I hope you'll get in the habit of looking at it in particular. 9 00:00:48,230 --> 00:00:55,040 It's worthwhile, just after a lecture if you like, since the notes are usually not posted long in advance, 10 00:00:55,040 --> 00:01:02,990 to go to the lecture page and spend a few minutes looking over the fairly terse notes that should be there. 11 00:01:03,560 --> 00:01:07,790 I hope it works. Not working. Well, I think it will work eventually. 12 00:01:11,140 --> 00:01:14,470 My laptop has a funny viewer. Let's make sure it's there. 13 00:01:23,770 --> 00:01:30,620 Well, it's there somewhere. So you'll find that the lecture notes are fairly terse. 14 00:01:31,780 --> 00:01:35,860 They're typically just five or six pages of outline like this, 15 00:01:36,970 --> 00:01:43,210 but it's certainly worth spending some minutes looking that over to make sure you're in touch with things. 16 00:01:44,110 --> 00:01:47,110 Now let's see. Today there are three handouts, three new handouts. 17 00:01:47,470 --> 00:01:51,549 The one with the blue is just two more copies of what we handed out last time. 18 00:01:51,550 --> 00:01:57,940 But there are three new pages. Let's take a moment and look at the one called Journals and Books. 19 00:01:58,800 --> 00:02:05,740 So this is just a list of some journals and books that are well known to numerical analysts like me. 20 00:02:07,030 --> 00:02:14,019 You remember, one of my goals in this course is to immerse you a little bit in a field that isn't your own, 21 00:02:14,020 --> 00:02:17,170 so you get a sense of how we think and what our points of reference are. 22 00:02:17,350 --> 00:02:23,350 Well, these are some of our points of reference. So I've listed a dozen of the journals that are well known in numerical analysis 23 00:02:23,650 --> 00:02:27,880 and a couple of dozen books that are relevant to this term of the course. 24 00:02:28,480 --> 00:02:35,170 Most of this stuff is online. In fact, this is a copy of an HTML file that I'll probably put up on the website. 25 00:02:36,220 --> 00:02:39,820 So most of these things are freely available in one form or another. 26 00:02:40,060 --> 00:02:47,020 For example, let me just mention one or two things on this list of books that are particularly interesting. 27 00:02:47,830 --> 00:02:54,160 There are the two books called Templates, for eigenvalue problems and for systems of equations, 28 00:02:54,170 --> 00:03:00,160 and those are kind of a fun way to see those problems through a software window of a kind. 29 00:03:01,030 --> 00:03:05,920 The book by Björck is a standard book on least squares problems with many interesting things in it.
30 00:03:06,280 --> 00:03:14,200 The book by Davis is an important book on direct methods for big, sparse linear systems of equations. 31 00:03:14,980 --> 00:03:18,910 The book by Golub and Van Loan is sort of the Bible of numerical linear algebra. 32 00:03:20,570 --> 00:03:28,790 The book by Nocedal and Wright, Numerical Optimisation, is absolutely the standard text worldwide for numerical optimisation. 33 00:03:30,020 --> 00:03:35,209 And then you'll notice books by Saad and van der Vorst. Those are two of the leaders in iterative methods, 34 00:03:35,210 --> 00:03:41,270 and I think you'll find both of their books online, for linear equations and also eigenvalue problems. 35 00:03:41,270 --> 00:03:45,530 So Saad and van der Vorst are two very big names in this area. 36 00:03:47,930 --> 00:03:51,020 Okay. Let's take a look at the quiz. I hope you've all done the quiz. 37 00:03:53,330 --> 00:03:57,500 Problem one: what is the inverse of sparse? What's the answer? Full. 38 00:03:57,500 --> 00:04:05,130 That's right. So let's see here, the quiz. The answer to one is C. 39 00:04:06,800 --> 00:04:10,840 What's the inverse of N? It's enough. 40 00:04:11,020 --> 00:04:18,040 That's right. So the answer to two is D. Which inequality is invalid? 41 00:04:24,690 --> 00:04:30,390 What's that? The second one. That's right, it's the second one. That should be a backslash; 42 00:04:30,570 --> 00:04:36,030 if it were a backslash, then it would in principle be valid, but it's a forward slash, which is something else entirely. 43 00:04:36,390 --> 00:04:42,900 So that's the invalid one. And finally, did anyone compute the output of the program in number four? 44 00:04:49,280 --> 00:04:57,380 You all did, I'm sure. But let me tell you, it's a 2-by-2 matrix containing the number 2 in all four positions. 45 00:05:00,550 --> 00:05:01,060 Okay. 46 00:05:03,100 --> 00:05:11,409 So today we're talking about conjugate gradients, which is one of the greatest hits of, well, really of scientific computing, of numerical analysis. 47 00:05:11,410 --> 00:05:18,370 It's really one of the great algorithms of all time, whose impact is absolutely huge: conjugate gradients. 48 00:05:19,270 --> 00:05:33,580 So we call this section 1.4. It's an idea that had a very clear beginning at the beginning of the 1950s. 49 00:05:34,630 --> 00:05:43,180 There's an article, getting on in age now, on the history of conjugate gradients that's kind of interesting. 50 00:05:43,180 --> 00:05:56,360 And that's by Golub and O'Leary, in 1989 I guess it is, Golub and O'Leary in SIAM Review in 1989. 51 00:06:00,890 --> 00:06:09,330 Anyway, the starting point of the algorithm was two people, Hestenes and Stiefel, who independently had more or less the same idea. 52 00:06:09,560 --> 00:06:17,660 In fact, there were three key people. Hestenes, Stiefel and Lanczos were all doing very closely related things at the beginning of the 1950s, 53 00:06:17,990 --> 00:06:25,700 and in particular, Hestenes and Stiefel communicated in time that they ended up writing the main paper together. 54 00:06:26,240 --> 00:06:32,000 So this is the great paper, Hestenes and Stiefel in 1952. 55 00:06:34,130 --> 00:06:39,920 And certainly by any standard, that's one of the ten most important numerical analysis papers ever written. 56 00:06:41,030 --> 00:06:45,350 Let's see, there's a handout which shows you a bit of the flavour of that first paper. 57 00:06:45,620 --> 00:06:49,010 So, Methods of Conjugate Gradients for Solving Linear Systems.
58 00:06:49,280 --> 00:06:58,000 I like to give you some flavour of the past. It's a remarkable paper, partly because the two of them collaborated on it. 59 00:06:58,010 --> 00:07:00,410 It's amazing how much is in this paper. 60 00:07:00,980 --> 00:07:07,730 People keep rediscovering parts of what's in this paper, because, of course, not so many people read the actual paper anymore. 61 00:07:08,600 --> 00:07:13,549 It's a remarkable paper, but it also has an oddity, 62 00:07:13,550 --> 00:07:20,150 which is that it can't quite decide whether conjugate gradients is an iterative method or a direct method. 63 00:07:20,330 --> 00:07:23,540 And that's a flavour we'll be talking about a lot in the next few weeks, 64 00:07:23,540 --> 00:07:31,220 and today in particular. A direct method is an algorithm where you do a certain amount of computation and then you're done. 65 00:07:31,370 --> 00:07:37,909 You've solved the problem. An iterative method is one where, as you compute, you get closer and closer to the solution. 66 00:07:37,910 --> 00:07:41,450 And once you've got 6 or 16 digits, then you stop. 67 00:07:41,630 --> 00:07:49,100 Very different style. So a direct method, you sort of think of it as going like this; an iterative method you think of as going like this. 68 00:07:50,480 --> 00:07:55,460 The paradox of the conjugate gradient algorithm is that in principle it's both. 69 00:07:56,210 --> 00:08:04,850 If you do enough work, you solve a matrix problem Ax = b, but also, as you go, you get closer and closer to the solution. 70 00:08:05,330 --> 00:08:10,460 So Hestenes and Stiefel were very proud of inventing an algorithm that could do both of these things. 71 00:08:10,760 --> 00:08:16,820 And as a result, they wrote a paper which didn't focus on the important one, which is the iterative side. 72 00:08:17,270 --> 00:08:23,320 So if you just look at this first page that I've copied, the second word in the abstract is iterative. 73 00:08:23,330 --> 00:08:27,540 So obviously they know it has an iterative flavour. 74 00:08:28,830 --> 00:08:32,160 On the other hand, if you look at the second column and you look at item A, 75 00:08:32,820 --> 00:08:38,250 it's pointing out that, like Gaussian elimination, the method gives the exact solution after N steps. 76 00:08:38,790 --> 00:08:45,629 So they're hedging their bets. It's iterative, it's direct. And it was funny what happened historically. 77 00:08:45,630 --> 00:08:53,010 They wrote this great paper. People didn't focus very much on the iterative side of conjugate gradients. 78 00:08:53,010 --> 00:08:58,020 They focussed on it as a direct algorithm. And the fact is, it's not a good direct algorithm. 79 00:08:58,020 --> 00:09:01,410 It's nowhere near as reliable as Gaussian elimination. 80 00:09:01,710 --> 00:09:05,430 So it sort of lost the battle in the fifties and sort of got put aside. 81 00:09:05,760 --> 00:09:08,550 Everybody knew about it, but people didn't use it much. 82 00:09:08,790 --> 00:09:15,510 And it really took 20 years for conjugate gradients to be kind of rediscovered as an iterative algorithm. 83 00:09:15,840 --> 00:09:20,850 I'd like to draw you a plot to indicate why that happened. 84 00:09:22,110 --> 00:09:28,200 It wasn't just that people were being dumb. It has to do with how things evolve with time. 85 00:09:28,200 --> 00:09:35,490 So this is time, but it's also the dimension n of a problem, 86 00:09:36,420 --> 00:09:41,070 because, of course, as computers get more powerful, you can solve bigger problems.
87 00:09:41,880 --> 00:09:45,360 So this is the 1950s, this is 2015. 88 00:09:45,360 --> 00:09:47,700 And of course, n gets bigger with time. 89 00:09:47,700 --> 00:09:56,880 As we discussed last lecture, this axis is the amount of work involved in solving an n-by-n system of equations. 90 00:09:57,120 --> 00:10:03,090 So we're solving Ax = b, and it's n by n. 91 00:10:04,710 --> 00:10:08,940 Now, if you do this with the obvious method of Gaussian elimination, 92 00:10:12,430 --> 00:10:16,090 which is too familiar for me to want to talk about it, 93 00:10:16,750 --> 00:10:23,290 the amount of work is O(n^3), basically. So that's a fine algorithm. 94 00:10:23,290 --> 00:10:32,290 And back in 1950, let's say, n might have been ten; in 2015, n might be ten to the eighth or something. 95 00:10:33,670 --> 00:10:41,080 Now, conjugate gradients depends on the matrix, but loosely speaking, it's got a picture that looks more like this. 96 00:10:41,980 --> 00:10:47,200 Let me put it in quotes. I'll say CG, and I'll put in quotes 'O(n^2)'. 97 00:10:47,740 --> 00:10:53,379 That's not a general figure. There's no sense in which it is generally of complexity n squared, 98 00:10:53,380 --> 00:10:59,410 but maybe that's a typical amount of work involved when things are going pretty well with conjugate gradients. 99 00:10:59,800 --> 00:11:07,480 On the other hand, the constant is worse. If you have a small matrix of dimension five, ten or 20 or 30, you would never use conjugate gradients. 100 00:11:07,750 --> 00:11:13,900 So it really didn't make sense in those early days for people to use conjugate gradients. 101 00:11:14,200 --> 00:11:19,240 Somewhere in the early 1970s, and let's, to be precise, say 1971, 102 00:11:20,290 --> 00:11:29,770 these two curves crossed over, and people started to notice that for big problems, at least, conjugate gradients was a very powerful idea. 103 00:11:30,070 --> 00:11:33,710 So it was a remarkable change of perspective there. 104 00:11:33,730 --> 00:11:39,850 The person who made that observation was John Reid, who is still around. 105 00:11:39,850 --> 00:11:44,170 He's at the Rutherford Appleton Laboratory, about ten miles south of here. 106 00:11:44,440 --> 00:11:49,000 He wrote a paper which pointed out how powerful conjugate gradients was. 107 00:11:49,150 --> 00:11:56,230 And the funny thing is, all the foundations were already there in Hestenes and Stiefel in 1952, but didn't get focussed on. 108 00:11:56,410 --> 00:12:06,670 It's a very curious bit of history. In fact, one reason I'm so interested in this is that you'll often hear people say that numerical computation is, 109 00:12:06,790 --> 00:12:10,060 let's say, ten to the 12th times faster than it was way back when. 110 00:12:10,330 --> 00:12:16,240 And then they'll say six of those orders of magnitude come from computers and the other six come from algorithms. 111 00:12:16,510 --> 00:12:21,550 And in some sense that's true. The algorithms are much faster and the computers are much faster. 112 00:12:22,270 --> 00:12:26,470 But still, there's an asymmetry, because the computers drive the algorithms, you know. 113 00:12:27,490 --> 00:12:33,430 Yes, it's true that we didn't focus on conjugate gradients back here, but that's not really because we were stupid. 114 00:12:33,430 --> 00:12:42,910 It's because it didn't make sense to focus on it back here. So in some sense, as the machines get faster, that affects what the right algorithms are.
115 00:12:43,120 --> 00:12:51,730 And in the end, you get your tremendous improvements, you get a curve in the plot like that, because of the interaction of machines and algorithms. 116 00:12:52,840 --> 00:13:00,970 Anyway, conjugate gradients is a huge thing. I checked on the web: if you Google search it, you get 573,000 hits today. 117 00:13:02,020 --> 00:13:05,799 The third hit is a very popular one because it's a paper with a nice title. 118 00:13:05,800 --> 00:13:13,420 It's a paper by Jonathan Shewchuk called An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. 119 00:13:17,270 --> 00:13:23,870 Shewchuk, by the way, is a well-known guy at Berkeley who won the Wilkinson Software Prize for his package called Triangle. 120 00:13:25,280 --> 00:13:32,599 Okay. So what I want to do today is show you the conjugate gradient method and derive a 121 00:13:32,600 --> 00:13:37,580 certain part of the theory that explains why it's so spectacularly fast sometimes. 122 00:13:39,080 --> 00:13:42,970 So it applies to symmetric positive definite matrices. 123 00:13:42,980 --> 00:13:48,320 So let A, which is n by n, be, 124 00:13:48,890 --> 00:13:56,030 well, we always say SPD, which means symmetric positive definite. 125 00:14:01,050 --> 00:14:05,010 Symmetric means if you transpose it, you get the same matrix. 126 00:14:06,060 --> 00:14:11,910 So symmetric means A equals A transpose, and positive definite, 127 00:14:11,920 --> 00:14:18,750 well, you can define that in various ways. One way to define it is to say that all the eigenvalues are positive. 128 00:14:22,600 --> 00:14:27,280 A real symmetric matrix always has real eigenvalues, but they might be zero or negative. 129 00:14:27,550 --> 00:14:32,320 A real symmetric positive definite one has positive eigenvalues. 130 00:14:33,610 --> 00:14:41,440 Another way to say it is that x transpose A x is always greater than zero 131 00:14:41,950 --> 00:14:50,590 if x is a non-zero vector. So x^T A x is a very standard way that we would write a thing like this. 132 00:14:50,860 --> 00:14:54,700 You should think of that as a picture of a row vector 133 00:14:55,030 --> 00:14:59,050 times a matrix times a column vector. 134 00:15:00,430 --> 00:15:06,909 A lot of the notations we use are equivalent to, though different from, what physicists in quantum mechanics use. 135 00:15:06,910 --> 00:15:12,190 So physicists like to talk about bras and kets, which have to do with sort of left and right vectors, 136 00:15:12,190 --> 00:15:16,000 and they would sort of split this operator into a left half and a right half. 137 00:15:16,600 --> 00:15:20,200 This is the way numerical analysts like to talk about such operations. 138 00:15:22,500 --> 00:15:27,600 Anyway. In applications, symmetric positive definite problems come up all the time, 139 00:15:27,810 --> 00:15:32,450 essentially because Hermitian differential equations come up all the time, 140 00:15:32,460 --> 00:15:40,440 self-adjoint differential equations. When you have a self-adjoint differential equation, then almost certainly you can find a way to discretise it 141 00:15:40,650 --> 00:15:43,770 which turns it into a symmetric positive definite matrix. 142 00:15:45,020 --> 00:15:54,440 So we want to solve Ax = b, so let b be an n-vector. 143 00:15:58,080 --> 00:16:13,720 And of course, we're interested in the case where n is pretty big. So we want x*, as I'll call it, such that A x* = b.
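To make the SPD definition concrete, here is a small MATLAB check; this is just a sketch with a made-up 2-by-2 example, not anything from the course handouts. A matrix is SPD if it equals its transpose and all its eigenvalues are positive, and then x'*A*x comes out positive for any nonzero x.

    % Sketch: checking the SPD property for a small made-up example matrix.
    A = [4 1; 1 3];                    % example matrix (assumed for illustration)
    isSym = isequal(A, A');            % symmetric: A equals its transpose
    lambda = eig(A);                   % real eigenvalues, since A is symmetric
    isSPD = isSym && all(lambda > 0);  % positive definite: all eigenvalues > 0
    x = randn(2,1);                    % any nonzero vector
    disp(x'*A*x)                       % x'*A*x should be positive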
144 00:16:14,290 --> 00:16:18,999 That solution exists and is unique because of the assumption that the matrix is positive 145 00:16:19,000 --> 00:16:25,450 definite. That implies that it's non-singular: all the eigenvalues are positive, so it's a non-singular matrix. 146 00:16:25,750 --> 00:16:29,320 There's a unique solution. So x* is our unique solution. 147 00:16:29,470 --> 00:16:38,140 And the way conjugate gradients works is, it starts from an initial guess x_0 and it iteratively gets closer to x*. 148 00:16:38,410 --> 00:16:45,490 So CG finds x* iteratively, 149 00:16:48,990 --> 00:16:53,010 starting from some vector which I'll call x_0. 150 00:16:56,680 --> 00:17:00,130 Now, it could be anything you like, but we're going to take it to be zero. 151 00:17:00,760 --> 00:17:03,820 It doesn't have to be zero, except it simplifies the formulas. 152 00:17:04,360 --> 00:17:13,060 So without, well, with only slight loss of generality, let's say that x_0 is the zero vector, the N-by-1 zero vector. 153 00:17:13,720 --> 00:17:22,870 And the way conjugate gradients does this is, it finds x_1 and then x_2 and x_3, and these iterates x_n 154 00:17:22,870 --> 00:17:34,090 belong to very interesting subspaces. So x_n belongs to the so-called Krylov space, 155 00:17:37,240 --> 00:17:41,230 which is a subspace of the space of capital-N-dimensional vectors. 156 00:17:42,600 --> 00:17:49,559 And the way we write it is, we'll write K_n as the space spanned by the 157 00:17:49,560 --> 00:17:55,290 right-hand side vector b, and A times the right-hand side vector, dot dot dot, up to 158 00:17:56,450 --> 00:18:01,220 A raised to the n-minus-one power times the right-hand side vector: K_n = <b, Ab, ..., A^(n-1) b>. 159 00:18:02,660 --> 00:18:09,590 So that idea of a Krylov subspace is absolutely at the heart, not just of conjugate gradients, 160 00:18:09,590 --> 00:18:19,310 but of all sorts of iterative algorithms for solving systems of equations and for finding eigenvalues and singular values, all sorts of things. 161 00:18:21,570 --> 00:18:31,010 So let me say a little more about that. This notation means the set of vectors spanned by these vectors, linear combinations. 162 00:18:31,250 --> 00:18:35,660 So if I just have one of them, the first Krylov subspace, that's a one-dimensional space. 163 00:18:35,870 --> 00:18:39,320 If I have two of them, that's probably a two-dimensional space, and so on. 164 00:18:39,650 --> 00:18:48,260 So this is generically an n-dimensional space, which is contained in the set of all N-vectors. 165 00:18:50,450 --> 00:18:56,750 And as little n increases, you get a bigger and bigger subspace of your giant space. 166 00:18:57,890 --> 00:19:05,060 So conjugate gradients is searching in bigger and bigger spaces for a good approximation to the solution. 167 00:19:06,260 --> 00:19:08,150 Now let me show you some of its properties. 168 00:19:14,240 --> 00:19:23,340 So first of all, we define the so-called A-norm of a vector to be the size of the vector measured in a special way. 169 00:19:23,360 --> 00:19:33,500 We say that the A-norm of a vector x is equal to the square root of x transpose A x. 170 00:19:34,970 --> 00:19:38,840 So that's a number, and it's a positive number if x is non-zero, 171 00:19:42,160 --> 00:19:52,390 because by definition A is symmetric positive definite, and therefore this thing is a positive number, so we can square-root it. 172 00:19:52,690 --> 00:20:03,190 That's going to be the measure of our error.
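As a small illustration of these two definitions, here is a sketch that builds the vectors spanning K_n and computes an A-norm directly; it is not the course's code, and the matrix, right-hand side and dimensions are made-up examples.

    % Sketch: the Krylov vectors b, A*b, ..., A^(n-1)*b, and the A-norm.
    N = 10;                            % dimension of the problem (assumed)
    n = 4;                             % dimension of the Krylov space (assumed)
    A = full(gallery('tridiag', N));   % a standard SPD test matrix
    b = ones(N,1);
    K = zeros(N,n);                    % columns will span K_n
    K(:,1) = b;
    for j = 2:n
        K(:,j) = A*K(:,j-1);           % next Krylov vector: A times the previous one
    end
    x = randn(N,1);
    anormx = sqrt(x'*A*x);             % the A-norm ||x||_A = sqrt(x'*A*x)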
And let's define the error at each step like this. 173 00:20:03,190 --> 00:20:09,760 I'll say that the error at step n, I'll define to be e_n, 174 00:20:11,220 --> 00:20:25,980 and that will be the exact solution x* minus the guess x_n. And the residual, as we call it, is what you get if you apply A to that. 175 00:20:26,250 --> 00:20:32,310 So it's r_n, our residual, equals A times e_n. 176 00:20:34,410 --> 00:20:40,020 Now A times e_n is A x* minus A x_n, which is b minus A times x_n. 177 00:20:43,740 --> 00:20:52,110 So throughout numerical analysis you always have these two ideas of an error and a residual, and the residual is a kind of dual of the error. 178 00:20:52,110 --> 00:20:57,809 It's what you get when you apply, to the error, the problem you're trying to solve, 179 00:20:57,810 --> 00:21:05,130 the operator you're working with. As a rule, in principle you would like a small error, but the thing you can measure is very often the residual. 180 00:21:05,140 --> 00:21:11,400 So you minimise the residual as a means of minimising the error, all over numerical analysis. 181 00:21:14,730 --> 00:21:23,700 Okay, so let me move over here and write down an amazing property, that I'm not going to prove, of conjugate gradients. 182 00:21:24,780 --> 00:21:34,620 So I'll call this Theorem 1. And the amazing property is that CG, conjugate gradients, magically finds the very 183 00:21:34,620 --> 00:21:40,440 best vector in this space, in the sense of minimising the norm of the error. 184 00:21:41,340 --> 00:21:45,720 All right. So, the span? Yes, 185 00:21:46,020 --> 00:21:49,080 the set of all linear combinations of these vectors. 186 00:21:50,660 --> 00:21:55,530 Other questions? How do I come up with x_n? 187 00:21:55,560 --> 00:22:01,440 I haven't told you yet. Yeah, I'm going to tell you. It's amazingly slick and obscure. 188 00:22:01,440 --> 00:22:06,660 It's not at all obvious when you see the formulas how it works. But I'm about to tell you, of course. 189 00:22:07,050 --> 00:22:19,830 But first, let me tell you the theorem, which is that among all vectors in this Krylov space, I guess I should have given it a subscript, 190 00:22:19,830 --> 00:22:23,130 so K_n is the n-dimensional Krylov space, 191 00:22:26,220 --> 00:22:36,900 the one that conjugate gradients is going to give us when I tell you the formulas is the unique one that minimises the norm of the error 192 00:22:44,510 --> 00:22:48,140 e_n, measured in the A-norm. That's the theorem. 193 00:22:48,740 --> 00:22:53,240 So this is the optimality property, which Hestenes and Stiefel, of course, knew about. 194 00:22:57,090 --> 00:23:04,440 So at every step, conjugate gradients finds you the best approximation, in a certain measure, in a space. 195 00:23:04,440 --> 00:23:07,110 And those spaces are growing. They're getting bigger and bigger, 196 00:23:07,830 --> 00:23:17,379 which means that at some rate we must be approaching the correct solution, because after capital-N steps we'll have a capital-N-dimensional subspace of this. 197 00:23:17,380 --> 00:23:22,830 So we will have it all. So if we really take capital-N steps, we've got to solve the problem. 198 00:23:23,190 --> 00:23:29,129 Nobody would ever take capital-N steps in practice anymore.
So indeed, 199 00:23:29,130 --> 00:23:35,520 we have immediately from this theorem the beautiful corollary that the initial error 200 00:23:36,720 --> 00:23:41,790 measured in the A-norm is greater than or equal to the first error in the A-norm, 201 00:23:41,790 --> 00:23:53,760 and so on: ||e_0||_A >= ||e_1||_A >= ||e_2||_A >= ... So the errors keep going down monotonically, at least in the A-norm. 202 00:23:55,300 --> 00:24:03,970 Very beautiful stuff. I think they also go down in the 2-norm, but that's less obvious, if I remember. 203 00:24:03,970 --> 00:24:12,510 But this is the fundamental property. So I'm going to now give you the formulas. 204 00:24:13,830 --> 00:24:19,230 And as I do that, let me tell you the context in which I like to present them. 205 00:24:20,760 --> 00:24:26,100 The way most people think of conjugate gradients is as a kind of an optimisation search. 206 00:24:26,370 --> 00:24:29,489 You're in this giant 10,000-dimensional space, 207 00:24:29,490 --> 00:24:33,750 let's say. You're searching for a particular vector in this space, 208 00:24:34,080 --> 00:24:40,950 and you do that by starting with one vector and then moving along in some direction to improve the guess, and then moving in another direction. 209 00:24:41,250 --> 00:24:51,000 So you're searching around in this huge-dimensional space, and conjugate gradients can be described very compactly in terms of how it does that search. 210 00:24:51,810 --> 00:25:01,440 So let me be specific. So I'll call it the CG iteration, and let's in particular relate it to the idea 211 00:25:01,440 --> 00:25:10,440 of minimising something, which we always call optimisation: we want to minimise 212 00:25:12,120 --> 00:25:18,419 a quantity, and that's equivalent to finding a place where a certain gradient is zero. 213 00:25:18,420 --> 00:25:24,780 Right. You imagine a surface, you're trying to minimise it. That amounts to making some gradient equal to zero. 214 00:25:24,780 --> 00:25:38,820 Let's be specific. Let's define f of any vector x to be one half times x transpose A x, minus x transpose b: f(x) = (1/2) x^T A x - x^T b. 215 00:25:41,740 --> 00:25:47,950 That's a scalar. We already saw this: this term has the structure of row vector times matrix times 216 00:25:47,950 --> 00:25:53,230 column vector, and x transpose b has the structure of row vector times column vector. 217 00:25:54,700 --> 00:26:11,020 So this is a scalar, and you can compute that the gradient of that scalar function of x is equal to Ax minus b. 218 00:26:12,340 --> 00:26:19,270 How do you do that? I won't go through that. But because x appears twice, you get two terms; you add them up, they're the same, 219 00:26:19,270 --> 00:26:28,540 so the one half gets doubled and turns into a one. And then the gradient of this term is b. So you have this 10,000-dimensional space, 220 00:26:28,540 --> 00:26:38,140 you have the scalar function of this vector in that space, the gradient of that function is a vector, and this is that vector. 221 00:26:39,790 --> 00:26:44,559 Now we want Ax to be equal to b, which means we want the gradient to be zero, right? 222 00:26:44,560 --> 00:26:47,770 So our goal is to make that zero, 223 00:26:47,770 --> 00:26:52,660 in other words, to make the gradient zero, in other words, to minimise f(x). 224 00:26:53,530 --> 00:26:59,650 So f takes a minimum uniquely 225 00:27:05,470 --> 00:27:08,590 at the point x*. 226 00:27:09,790 --> 00:27:14,710 I'm not very good at drawing pictures in a thousand or 10,000 dimensions, but in two dimensions anyway.
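The gradient computation he skips over takes only a couple of lines; written out, using the symmetry a_{ij} = a_{ji}, it reads:

\[
\frac{\partial f}{\partial x_i}
= \frac{\partial}{\partial x_i}\Big(\tfrac12 \sum_{j,k} x_j a_{jk} x_k \;-\; \sum_j x_j b_j\Big)
= \tfrac12\Big(\sum_k a_{ik} x_k + \sum_j x_j a_{ji}\Big) - b_i
= \sum_k a_{ik} x_k - b_i,
\]

so \(\nabla f(x) = Ax - b\), which vanishes exactly when \(Ax = b\).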
227 00:27:15,130 --> 00:27:18,430 Of course, you imagine you've got some kind of a parabola here. 228 00:27:19,210 --> 00:27:22,540 Well, that's one dimension, I guess you'd say. And we're trying to find 229 00:27:23,600 --> 00:27:28,850 that point x* at which the parabola is minimised. 230 00:27:29,360 --> 00:27:38,750 More generally, it's a parabolic surface in capital-N dimensions, so the zero gradient gives us the minimum. 231 00:27:45,500 --> 00:27:51,680 So what conjugate gradients does is to cleverly search to minimise that function. 232 00:27:51,980 --> 00:27:57,890 And now I write down the formulas. So here they are. 233 00:27:59,330 --> 00:28:00,970 And we're going to treat them as magic. 234 00:28:00,980 --> 00:28:06,740 I'm not going to tell you where they come from, but I will tell you how to interpret each line of the algorithm. 235 00:28:06,950 --> 00:28:13,100 So here's the algorithm. We start with an initial guess, which, as I say, we're going to take to be the zero vector. 236 00:28:14,690 --> 00:28:23,980 Corresponding to that is a residual, and the residual would be b minus A times the initial vector, but the initial vector is zero, 237 00:28:23,990 --> 00:28:28,850 so it's b; the initial residual is r_0 = b. 238 00:28:29,540 --> 00:28:37,250 And then we're also going to work with a search direction vector, which we'll take to be the initial residual, p_0 = r_0. 239 00:28:37,430 --> 00:28:41,630 So in our thousand-dimensional space, we're going to be searching in all sorts of directions, 240 00:28:42,170 --> 00:28:45,290 and this is a vector that will define those search directions. 241 00:28:46,520 --> 00:28:52,730 Now, here's how the algorithm goes: for little n = 1, 2, 3, and so on, until we're happy. 242 00:28:55,480 --> 00:29:00,130 We'll define alpha_n to be the quotient 243 00:29:01,360 --> 00:29:15,250 of r_{n-1} transpose r_{n-1} divided by p_{n-1} transpose A p_{n-1}: alpha_n = (r_{n-1}^T r_{n-1}) / (p_{n-1}^T A p_{n-1}). 244 00:29:16,930 --> 00:29:20,260 So that's a scalar divided by a scalar, 245 00:29:21,390 --> 00:29:26,850 so alpha is a scalar. I'm going to interpret each of these expressions in a moment. 246 00:29:27,810 --> 00:29:37,800 x_n, well, of course you know what the interpretation of that is: it's the next guess, and that's x_n = x_{n-1} + alpha_n p_{n-1}. 247 00:29:41,130 --> 00:29:55,230 And then r_n, the next residual, is equal to r_n = r_{n-1} - alpha_n A p_{n-1}. 248 00:29:57,180 --> 00:30:01,380 And then there's another scalar called beta, or 'beeta' if you're English, 249 00:30:01,950 --> 00:30:14,670 which is the quotient of r_n transpose r_n divided by r_{n-1} transpose r_{n-1}: beta_n = (r_n^T r_n) / (r_{n-1}^T r_{n-1}). 250 00:30:15,210 --> 00:30:22,000 And then the final line is, we update the vector p by the formula 251 00:30:22,020 --> 00:30:27,210 p_n = r_n + beta_n p_{n-1}. 252 00:30:33,590 --> 00:30:38,240 And so those are just magic formulas. 253 00:30:38,690 --> 00:30:42,960 It's not at all obvious why they work. 254 00:30:43,340 --> 00:30:50,150 But let me at least tell you what they're doing, so you can get the optimisation interpretation here. 255 00:30:51,140 --> 00:30:58,460 You already know that we're in a big-dimensional space and we're trying to get better and better approximations to the point x*. 256 00:30:59,870 --> 00:31:11,330 Well, alpha, being a scalar, is the step length, which means it's how far we're going along a search direction to improve our current guess.
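Written out as a bare MATLAB loop, the iteration just put on the board looks something like the sketch below. It is only a sketch in the spirit of the handout's cg code, not the handout itself; the stopping tolerance and the cap of N steps are assumptions added for completeness.

    function x = cg_sketch(A, b)
    % Sketch of the conjugate gradient iteration as written on the board.
    % A is assumed symmetric positive definite; x0 = 0, so r0 = b and p0 = r0.
    x = zeros(size(b));
    r = b;                               % initial residual
    p = r;                               % initial search direction
    for n = 1:length(b)                  % at most N steps (the exact-arithmetic view)
        Ap = A*p;                        % the one matrix-vector product per step
        alpha = (r'*r)/(p'*Ap);          % step length
        x = x + alpha*p;                 % new approximate solution
        rnew = r - alpha*Ap;             % new residual
        beta = (rnew'*rnew)/(r'*r);      % improvement ratio
        p = rnew + beta*p;               % new search direction
        r = rnew;
        if norm(r) < 1e-10*norm(b)       % assumed stopping criterion
            break
        end
    end
    end

Each pass through the loop touches A only once, in the product A*p, which is the point made below about where the work lies.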
257 00:31:12,200 --> 00:31:16,160 x_n is of course the approximate solution at step n. 258 00:31:19,150 --> 00:31:22,540 r_n is of course the residual corresponding to that. 259 00:31:27,770 --> 00:31:30,960 Well, it's obvious from this formula how we can interpret 260 00:31:31,010 --> 00:31:39,250 beta_n. It's the improvement: this is the square of the residual at the last step, and this is the square of the residual at the current step. 261 00:31:39,260 --> 00:31:44,120 So this measures, in some sense, the improvement in the step we're now taking, 262 00:31:45,750 --> 00:31:52,950 a scalar. And finally, this is a search direction. 263 00:31:53,160 --> 00:32:01,800 So we're in this high-dimensional space, and at every step we take a step from the current position to our new position along this direction. 264 00:32:02,190 --> 00:32:05,010 So this is a search direction. 265 00:32:10,760 --> 00:32:22,810 And maybe the most interesting formula to look at in this magic algorithm is this one, where you can see that at every step, to get the new estimate x_n, 266 00:32:22,820 --> 00:32:29,480 we take the old one and then we take a step in the direction p_{n-1} of length alpha_n. 267 00:32:29,530 --> 00:32:40,190 And so it's crazy and magical, but this has the property that Theorem 1 says it has, of giving you an optimal solution in a Krylov space. 268 00:32:42,320 --> 00:32:49,580 So I want to show you that a little bit numerically, and then I want to say a bit about what that implies about convergence rates. 269 00:33:03,600 --> 00:33:10,800 So we're now at the MATLAB handout. There's a sheet of paper there with cg and m3 on it. 270 00:33:19,570 --> 00:33:20,229 Now of course, 271 00:33:20,230 --> 00:33:27,250 real software for conjugate gradients would be much fancier than this, with all sorts of bells and whistles and convergence criteria and so on. 272 00:33:27,550 --> 00:33:34,720 But this little code, cg, is designed to show you that iteration in its simplest form, and you can see it has exactly the structure 273 00:33:34,720 --> 00:33:41,890 I just wrote down: an iteration involving five or six update formulas at each step. 274 00:33:42,310 --> 00:33:46,030 We should say a word about where the work is involved in this iteration. 275 00:33:48,160 --> 00:33:50,530 Most of this is just manipulating vectors. 276 00:33:51,250 --> 00:34:00,910 The only part of the CG operation that is more than just a vector operation is the multiplication of the matrix A by the vector p_{n-1}. 277 00:34:01,180 --> 00:34:05,310 So that's a matrix-vector product. And it's the same here and here, 278 00:34:05,320 --> 00:34:13,150 so you only do it once at each step. So every step of conjugate gradients involves a matrix-vector product. 279 00:34:13,360 --> 00:34:19,300 If that were a dense matrix, that would be N-squared work. If it's a sparse matrix, it's less than that. 280 00:34:21,360 --> 00:34:27,209 Now, for this particular example I'm going to run the first half of m3, on convergence. 281 00:34:27,210 --> 00:34:37,290 So let's take A to be a diagonal matrix of dimension 100 with entries going from 100 to 199 along the diagonal. 282 00:34:37,530 --> 00:34:43,620 So it's a trivial matrix mathematically; it's just the diagonal, 100 up to 199. 283 00:34:44,940 --> 00:34:53,410 So the size of A is 100 by 100. And let's take a trivial right-hand side to be the vector, 284 00:34:55,620 --> 00:35:03,110 what's happening here, the vector of length 100 of all ones.
285 00:35:03,830 --> 00:35:07,230 And now let's say x = cg(A, b). 286 00:35:07,400 --> 00:35:17,810 So I'm just running this code. Now, what you see is, these are the residuals at each step, and you see they're beautifully converging. 287 00:35:17,820 --> 00:35:24,900 Now, what does that mean? We start with an error of one, and after 20 steps or so, we've got about ten digits of improvement. 288 00:35:25,560 --> 00:35:32,190 It's a 100-by-100 matrix, and yet it only took us ten or 20 steps to get this very nice improvement. 289 00:35:32,370 --> 00:35:35,520 So the amount of work we've done is small. 290 00:35:35,820 --> 00:35:39,960 Of course it's a trivial problem, we could have done it on paper, but you see the idea. 291 00:35:42,550 --> 00:35:46,480 And if I look at the solution, of course, it's accurate to ten digits or so. 292 00:35:46,690 --> 00:35:52,090 Let's look at the first four entries. The diagonal of the matrix was 100, 101, 102, 293 00:35:52,390 --> 00:36:01,420 so the first entries are 1/100, 1/101, 1/102, and so on, to ten digits. 294 00:36:03,330 --> 00:36:09,420 So there we just used conjugate gradients to solve a rather trivial 100-by-100 problem. 295 00:36:11,240 --> 00:36:22,700 Let's do another one here. So let's say A equals the diagonal matrix consisting of the vector from, let's see, I'm going to put in six zeros there, 296 00:36:22,700 --> 00:36:27,800 1000000, up to 1999999. 297 00:36:29,360 --> 00:36:37,040 Hope that's right. So I've now got a million-by-million matrix, represented sparsely. 298 00:36:37,400 --> 00:36:41,630 Of course, again, this is a problem we can do on paper because it's just a diagonal matrix. 299 00:36:41,900 --> 00:36:48,120 But it illustrates the idea. Let's take the right-hand side to be the trivial vector of length 300 00:36:48,120 --> 00:36:56,239 a million, all ones. And now let's again do x = cg(A, b). 301 00:36:56,240 --> 00:37:04,400 So this is a million-by-million system of equations. And you see it chugs along with essentially the same performance. 302 00:37:05,270 --> 00:37:09,920 So this gives you a hint of how powerful conjugate gradients can be when things are working. 303 00:37:12,860 --> 00:37:17,240 So we just solved a million-by-million system of equations. Admittedly, it was diagonal. 304 00:37:20,960 --> 00:37:24,650 Okay, so I want to say a word about convergence. 305 00:37:29,830 --> 00:37:38,140 The convergence of conjugate gradients is an example where some really beautiful mathematics has some very concrete consequences. 306 00:37:38,770 --> 00:37:45,550 And the part of mathematics we're talking about is approximation theory: approximation by polynomials of certain things. 307 00:37:46,540 --> 00:37:55,720 So let's call this section 1.5. This is not, well, in this course 308 00:37:55,730 --> 00:38:01,010 everything we do is mathematics in some sense, but mostly it's not really a mathematics course very much. 309 00:38:01,340 --> 00:38:05,060 Maybe the next 10 minutes are a little more mathematical than some. 310 00:38:09,470 --> 00:38:15,200 So it turns out that just from the optimality property of Theorem 1 311 00:38:15,200 --> 00:38:20,060 you can get everything. I haven't proved that theorem, but from it everything follows. 312 00:38:20,270 --> 00:38:29,569 And let me remind you that the initial error e_0 is x* minus x_0, which we've taken to be zero, 313 00:38:29,570 --> 00:38:44,810 so that's x*.
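The two diagonal demos can be reproduced along these lines; this is a sketch using the cg_sketch function above in place of the handout's cg, with the matrices built as described in the lecture.

    % Sketch: the two diagonal examples from the demo.
    A = diag(100:199);                    % 100-by-100 diagonal matrix, entries 100..199
    b = ones(100,1);
    x = cg_sketch(A, b);
    disp(x(1:4))                          % approximately 1/100, 1/101, 1/102, 1/103

    % The million-by-million version, stored sparsely:
    d = (1000000:1999999)';
    A2 = spdiags(d, 0, 1000000, 1000000); % sparse diagonal matrix
    b2 = ones(1000000,1);
    x2 = cg_sketch(A2, b2);               % each step is just a sparse matrix-vector product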
Similarly, the error e_n would be x* minus x_n, and so that would be e_0 minus x_n. 314 00:38:47,780 --> 00:38:51,220 And I want to say something about how small e_n is. 315 00:38:52,550 --> 00:38:58,210 So, using our Theorem 1 optimality, 316 00:38:58,570 --> 00:39:04,209 we note that x_n, as I've told you, is a member of this Krylov space. 317 00:39:04,210 --> 00:39:20,440 So it's in the space <b, Ab, ..., A^(n-1) b>. But b is A times x*, A times e_0. So 318 00:39:20,440 --> 00:39:38,110 another way to write that is that it's the space <A e_0, A^2 e_0, ..., A^n e_0>, because A times e_0 is b. 319 00:39:38,620 --> 00:39:43,390 So these are just two different ways to write the same little-n-dimensional space. 320 00:39:45,110 --> 00:39:50,150 So x_n, our guess at the nth step, is in this space. 321 00:39:51,480 --> 00:39:58,380 In other words, it's a linear combination of these vectors. And now there's really only one central idea to this analysis. 322 00:39:59,220 --> 00:40:07,410 A linear combination of these vectors can be interpreted as a polynomial in A applied to e_0. 323 00:40:08,510 --> 00:40:12,319 In fact, the set of all linear combinations of these vectors is precisely, well, 324 00:40:12,320 --> 00:40:21,500 it's the set of all polynomials of degree little n in A applied to e_0, with no constant term in that polynomial. 325 00:40:22,310 --> 00:40:28,320 So in conjugate gradients we're implicitly dealing with polynomials of A times the initial error. 326 00:40:29,000 --> 00:40:43,100 And that's why the analysis turns out so beautifully. So this tells us that e_n is e_0 minus x_n, and x_n is in this space. 327 00:40:43,970 --> 00:40:47,299 So e_n is equal to a polynomial of degree 328 00:40:47,300 --> 00:40:51,800 n in A times e_0: e_n = p_n(A) e_0, where 329 00:40:54,220 --> 00:41:11,070 p_n is a polynomial of degree n, or at most n, because of this linear combination of these terms. 330 00:41:11,520 --> 00:41:19,050 But we know one other thing about that polynomial, which is that the degree-zero term, the constant term, just comes from there. So it's one. 331 00:41:19,350 --> 00:41:23,249 So it's a polynomial of degree 332 00:41:23,250 --> 00:41:28,590 n with the property that p_n(0) = 1. 333 00:41:31,890 --> 00:41:38,640 So I've now told you really the crucial things about conjugate gradients; the rest is just working out the consequences. 334 00:41:43,310 --> 00:41:52,250 The crucial things are two. One is this theorem, that at each step we're minimising the norm over a certain space. 335 00:41:53,240 --> 00:42:01,730 And the other is this bit of algebra, that being in that space amounts to the error being equal to a polynomial in A times the initial error. 336 00:42:02,270 --> 00:42:05,510 All the rest is fiddling with polynomials. 337 00:42:07,580 --> 00:42:10,850 So that's really a beautiful, elegant property, I think. 338 00:42:13,040 --> 00:42:16,370 So let me restate Theorem 1. 339 00:42:17,540 --> 00:42:26,510 I'll call it Theorem 1'. It's the same theorem, just restated: e_n is equal to some polynomial in A times e_0, e_n = p_n(A) e_0. 340 00:42:30,150 --> 00:42:38,400 And it's the one that's optimal. That is to say, the one for which the A-norm of p_n(A) e_0 is minimal. 341 00:42:41,370 --> 00:42:55,530 And we have to say, of course, it's minimal among polynomials p of degree n with p(0) equal to one.
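Collecting the pieces of Theorem 1' into one formula, in the notation above (this is just a consolidated restatement of what is on the board, not an additional result):

\[
e_n = p_n(A)\, e_0, \qquad
\|e_n\|_A \;=\; \min_{\substack{p \;\text{of degree} \,\le\, n \\ p(0) = 1}} \big\| p(A)\, e_0 \big\|_A .
\]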
342 00:42:57,810 --> 00:43:01,380 So that's what conjugate gradients is magically doing at every step. 343 00:43:41,490 --> 00:43:48,810 Now there's one other piece I have to bring in, in order to really show you how it's an approximation problem that determines the convergence rate. 344 00:43:49,170 --> 00:43:54,719 And that other piece is the eigenvalues of the matrix. So I hope you all know about eigenvalues. 345 00:43:54,720 --> 00:43:57,300 I'm just going to use them without really talking about them much. 346 00:43:57,720 --> 00:44:05,130 But in order to get quantitative about what this means, we need to look at the eigenvalues of A and the eigenvectors of A. 347 00:44:05,280 --> 00:44:06,930 Now, A is a symmetric matrix. 348 00:44:07,410 --> 00:44:17,459 A symmetric matrix always has real eigenvalues and orthogonal eigenvectors, so the eigenvectors provide an orthogonal basis of the space R^N. 349 00:44:17,460 --> 00:44:21,570 So that's what we do now. 350 00:44:21,570 --> 00:44:29,700 We say, let A have eigenvalues. I always write that 'ew'; it's German, Eigenwert, 351 00:44:29,970 --> 00:44:36,960 for eigenvalue, and eigenvectors, 'ev'. So let A have eigenvalues, 352 00:44:37,080 --> 00:44:50,640 which, by the way, we know are going to be positive, lambda_1 through lambda_N, and eigenvectors, which are orthogonal. 353 00:44:51,030 --> 00:44:59,530 In fact, let's make them orthonormal: v_1 through v_N. 354 00:45:05,500 --> 00:45:12,160 Now the thing to do is expand the vectors we care about in this basis of orthonormal eigenvectors. 355 00:45:12,850 --> 00:45:16,900 So we then expand, and we say that e_0, 356 00:45:19,440 --> 00:45:30,510 well, any vector can be expanded in this basis, so we're going to write e_0 as the sum from k = 1 to capital N of some number a_k times v_k: e_0 = sum_{k=1}^{N} a_k v_k. 357 00:45:35,710 --> 00:45:39,880 It's a basis; you can always expand in the eigenvector basis. 358 00:45:41,320 --> 00:45:52,630 And now we look at the polynomial of A applied to e_0, and we find that e_n is equal to the polynomial of A times e_0. 359 00:45:56,520 --> 00:46:03,150 Now, the whole definition of an eigenvector is that A times v_k equals lambda_k v_k. 360 00:46:04,400 --> 00:46:12,770 That generalises: a polynomial in A applied to v_k is that polynomial of lambda_k multiplied by v_k. 361 00:46:13,760 --> 00:46:24,140 So when I do this I get the sum from 1 to capital N of a_k times 362 00:46:24,320 --> 00:46:30,380 p_n of lambda_k applied to v_k: e_n = sum_{k=1}^{N} a_k p_n(lambda_k) v_k. 363 00:46:34,260 --> 00:46:42,750 So now we've made the connection between a polynomial and a particular basis. 364 00:46:43,260 --> 00:46:49,980 We want this to be small, right? Well, how could this thing be small? If all of these numbers p_n(lambda_k) are small. 365 00:46:50,760 --> 00:46:53,820 So that's basically where the approximation theory comes in. 366 00:46:54,150 --> 00:47:03,750 Conjugate gradients will converge quickly if there exist polynomials that go to zero quickly at the eigenvalues of the matrix. 367 00:47:04,320 --> 00:47:08,280 And thanks to the miracle of Theorem 1, it's always finding the perfect polynomial. 368 00:47:08,280 --> 00:47:14,580 It's always optimising. So if there exist good polynomials, the algorithm will find them. 369 00:47:16,380 --> 00:47:19,560 So that's our conclusion: 370 00:47:20,700 --> 00:47:30,630 if there exist polynomials that are very small at the eigenvalues, 371 00:47:37,990 --> 00:47:42,550 then CG will converge quickly.
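A quick numerical check of the fact being used here, that a polynomial in A acts on an eigenvector v_k simply by multiplying it by the same polynomial of lambda_k; the matrix and the polynomial below are arbitrary illustrative choices, not from the handouts.

    % Sketch: p(A)*v_k = p(lambda_k)*v_k for a symmetric matrix A.
    N = 6;
    A = full(gallery('tridiag', N));   % a small symmetric positive definite example
    [V, D] = eig(A);                   % orthonormal eigenvectors and eigenvalues
    k = 3;  vk = V(:,k);  lk = D(k,k);
    pA = 2*eye(N) - 3*A + A^2;         % an arbitrary polynomial p(z) = 2 - 3z + z^2, evaluated at A
    lhs = pA*vk;                       % the polynomial in A applied to the eigenvector
    rhs = (2 - 3*lk + lk^2)*vk;        % p(lambda_k) times the eigenvector
    disp(norm(lhs - rhs))              % should be at the level of rounding error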
372 00:47:49,100 --> 00:47:55,310 And this is where you get the kind of picture that is often drawn in write-ups of conjugate gradients. 373 00:47:55,700 --> 00:48:00,200 People like to draw a picture of a polynomial, let's say p(lambda). 374 00:48:02,080 --> 00:48:05,590 It always has to go through one at the origin, 375 00:48:07,650 --> 00:48:16,500 but we want it to be small at the eigenvalues. So A is a big matrix with a thousand or a million eigenvalues, and they're all positive, 376 00:48:16,500 --> 00:48:24,320 so they're on the real axis somewhere, and there's just a lot of them. 377 00:48:24,650 --> 00:48:30,170 The question we ask in considering the convergence of conjugate gradients is, does 378 00:48:30,170 --> 00:48:35,450 there exist a polynomial that passes through this point and yet is very small 379 00:48:37,890 --> 00:48:48,900 at all of these eigenvalues? And of course, I've been writing these things down quickly, but I've sort of told you the whole logic of the argument: 380 00:48:49,410 --> 00:48:52,830 if that polynomial exists, then thanks to Theorem 1' (or 1), 381 00:48:53,700 --> 00:48:58,620 we know that there will be a small error, because conjugate gradients will find that polynomial. 382 00:49:00,360 --> 00:49:08,130 Okay, so that's what it all comes down to. And I just want to show you a corollary and one more computation, and then we'd better stop. 383 00:49:09,870 --> 00:49:17,510 So what this all leads to, I'll call it Theorem 2, 384 00:49:19,350 --> 00:49:25,800 is that the ratio of the error at the nth step to the initial error, 385 00:49:26,430 --> 00:49:34,259 ||e_n||_A / ||e_0||_A, when you do conjugate gradients, is bounded by the minimum over all polynomials p_n, with the usual 386 00:49:34,260 --> 00:49:44,010 restriction, of the maximum over all eigenvalues lambda_k of the matrix of |p_n(lambda_k)|. 387 00:49:47,010 --> 00:49:55,320 Okay. So as always, p_n(0) has to be one. 388 00:49:55,980 --> 00:50:00,480 Now there's more to say, but I don't really have time to say it. Let me show you a final demo. 389 00:50:02,700 --> 00:50:06,630 So let's turn to the second half of the MATLAB page, 390 00:50:14,940 --> 00:50:20,580 just to give you a sense of how, when the eigenvalues of a matrix behave, you can get beautiful behaviour. 391 00:50:20,760 --> 00:50:24,180 So suppose I say A = randn(500). 392 00:50:24,540 --> 00:50:27,900 That's now a 500-by-500 random matrix. 393 00:50:28,080 --> 00:50:36,210 It's not symmetric positive definite. But if I say A = A'*A, then it is symmetric positive definite. 394 00:50:37,200 --> 00:50:41,370 So I've got a 500-by-500 symmetric positive definite matrix. 395 00:50:42,360 --> 00:50:53,030 If I make the usual right-hand side, what we'll find here is that the eigenvalues of A are not so well behaved. 396 00:50:53,050 --> 00:50:57,730 So if I take the eigenvalues of A and I look at the smallest eigenvalue, 397 00:50:59,140 --> 00:51:07,060 it's very small, and the largest eigenvalue is of order one, or of order a thousand, actually. 398 00:51:07,420 --> 00:51:12,130 So that's bad. The eigenvalues are badly distributed for this matrix. 399 00:51:12,370 --> 00:51:15,460 If I try to run conjugate gradients, 400 00:51:17,560 --> 00:51:20,680 it eventually gets there, but it doesn't do well. 401 00:51:20,680 --> 00:51:29,140 It's very slow convergence. On the other hand, suppose I shift the matrix to make it a better-behaved matrix.
402 00:51:30,010 --> 00:51:34,780 Suppose it happens to have eigenvalues that are bigger by 100. 403 00:51:36,160 --> 00:51:42,310 So I've now taken my badly behaved matrix and moved it over by 100 times the identity. 404 00:51:45,360 --> 00:51:54,450 If I define lmin to be the minimal eigenvalue, it's now about a hundred, and lmax is 100 more than what it was. 405 00:51:54,630 --> 00:51:59,610 We now have a good set of eigenvalues which are well separated from zero. 406 00:52:00,270 --> 00:52:05,850 So there exist polynomials that are very small on the eigenvalues, and now the convergence is completely different. 407 00:52:06,150 --> 00:52:12,930 If I say cg, so, let's put 'more on', you know about that in MATLAB: 408 00:52:12,930 --> 00:52:16,380 if you say 'more on', it pauses after each screenful. So let's do that again. 409 00:52:17,610 --> 00:52:23,159 So you see, I've got this 500-by-500, but it could be 5000-by-5000, matrix, 410 00:52:23,160 --> 00:52:27,239 and we're converging at a reasonable rate: after 20 or 30 steps 411 00:52:27,240 --> 00:52:32,040 we have a few digits, and after 20 or 30 more steps, we have six digits. 412 00:52:32,040 --> 00:52:35,459 So this is working just fine. Okay, I guess it's time. 413 00:52:35,460 --> 00:52:36,180 We'd better stop.
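For reference, this last experiment can be reconstructed roughly as follows, again with the cg_sketch function above standing in for the handout's cg. The shift of 100 is the one used in the demo; the random seed is an assumption added only for repeatability.

    % Sketch: CG on a random SPD matrix, before and after shifting the spectrum.
    rng(1)                              % assumed seed, just for repeatability
    A = randn(500);
    A = A'*A;                           % symmetric positive definite, but badly conditioned
    b = ones(500,1);
    x1 = cg_sketch(A, b);               % very slow convergence

    A2 = A + 100*eye(500);              % eigenvalues now well separated from zero
    lmin = min(eig(A2));                % about 100
    lmax = max(eig(A2));                % about 100 more than before
    x2 = cg_sketch(A2, b);              % a few digits after 20 or 30 steps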