1 00:00:03,590 --> 00:00:11,809 Okay. Good morning, everybody. I'm not sure if there's anybody new here, but this is Scientific Computing for DPhil Students, 2 00:00:11,810 --> 00:00:15,770 the second of 12 lectures in this term and then 12 more in the next term. 3 00:00:16,970 --> 00:00:19,100 Let me remind you of a few things. 4 00:00:19,130 --> 00:00:27,710 One is that there's an assignment due next Tuesday, which certainly requires you to make sure you've got MATLAB working. 5 00:00:28,070 --> 00:00:31,160 Once you've got MATLAB working, it's not a very large assignment. 6 00:00:32,570 --> 00:00:35,840 And you turn them in in the boxes across the atrium here, 7 00:00:36,140 --> 00:00:41,830 160 or 161. Let me also remind you that we have a course web page. 8 00:00:41,840 --> 00:00:47,540 So there is the web page, and I hope you'll get in the habit of looking at it in particular. 9 00:00:48,230 --> 00:00:55,040 It's worthwhile, just after a lecture if you like, since the notes are usually not posted long in advance, 10 00:00:55,040 --> 00:01:02,990 to go to the lecture page and spend a few minutes looking over the fairly terse notes that should be there. 11 00:01:03,560 --> 00:01:07,790 I hope it works. Not working. Well, I think it will work eventually. 12 00:01:11,140 --> 00:01:14,470 My laptop has a funny viewer. Let's make sure it's there. 13 00:01:23,770 --> 00:01:30,620 Well, it's there somewhere. So you'll find that the lecture notes are fairly terse. 14 00:01:31,780 --> 00:01:35,860 They're typically just five or six pages of outline like this, 15 00:01:36,970 --> 00:01:43,210 but it's certainly worth spending some minutes looking that over to make sure you're in touch with things. 16 00:01:44,110 --> 00:01:47,110 Now let's see. Today there are three handouts, three new handouts. 17 00:01:47,470 --> 00:01:51,549 The one with the blue is just two more copies of what we handed out last time. 18 00:01:51,550 --> 00:01:57,940 But there are three new pages. Let's take a moment and look at the one called Journals and Books. 19 00:01:58,800 --> 00:02:05,740 So this is just a list of some journals and books that are well known to numerical analysts like me. 20 00:02:07,030 --> 00:02:14,019 You remember, one of my goals in this course is to immerse you a little bit in a field that isn't your own, 21 00:02:14,020 --> 00:02:17,170 so you get a sense of how we think and what our points of reference are. 22 00:02:17,350 --> 00:02:23,350 Well, these are some of our points of reference. So I've listed a dozen of the journals that are well known in numerical analysis 23 00:02:23,650 --> 00:02:27,880 and a couple of dozen books that are relevant to this term of the course. 24 00:02:28,480 --> 00:02:35,170 Most of this stuff is online. In fact, this is a copy of an HTML file that I'll probably put up on the website. 25 00:02:36,220 --> 00:02:39,820 So most of these things are freely available in one form or another. 26 00:02:40,060 --> 00:02:47,020 For example, let me just mention one or two things on this list of books that are particularly interesting. 27 00:02:47,830 --> 00:02:54,160 There are the two books called Templates, for eigenvalue problems and for systems of equations, 28 00:02:54,170 --> 00:03:00,160 and those are kind of a fun way to see those problems through a software window of a kind. 29 00:03:01,030 --> 00:03:05,920 The book by Björck is a standard book on least squares problems with many interesting things in it.
30 00:03:06,280 --> 00:03:14,200 The book by Davis is an important book on direct methods for big, sparse linear systems of equations. 31 00:03:14,980 --> 00:03:18,910 The book by Golub and Van Loan is sort of the Bible of numerical linear algebra. 32 00:03:20,570 --> 00:03:28,790 The book by Nocedal and Wright, Numerical Optimisation, is absolutely the standard text worldwide for numerical optimisation. 33 00:03:30,020 --> 00:03:35,209 And then you'll notice books by Saad and van der Vorst. Those are two of the leaders in iterative methods, 34 00:03:35,210 --> 00:03:41,270 and I think you'll find both of their books online, for linear equations and also eigenvalue problems. 35 00:03:41,270 --> 00:03:45,530 So Saad and van der Vorst are two very big names in this area. 36 00:03:47,930 --> 00:03:51,020 Okay. Let's take a look at the quiz. I hope you've all done the quiz. 37 00:03:53,330 --> 00:03:57,500 Problem one: what is the inverse of sparse? What's the answer? Full. 38 00:03:57,500 --> 00:04:05,130 That's right. So let's see here, the quiz. The answer to one is C. 39 00:04:06,800 --> 00:04:10,840 What's the inverse of N? It's enough. 40 00:04:11,020 --> 00:04:18,040 That's right. So the answer to two is D. Which inequality is invalid? 41 00:04:24,690 --> 00:04:30,390 What's that? The second one. That's right, it's the second one. That should be a backslash; 42 00:04:30,570 --> 00:04:36,030 if it were a backslash, then it would in principle be valid, but it's a forward slash, which is something else entirely. 43 00:04:36,390 --> 00:04:42,900 So that's the invalid one. And finally, did anyone compute the output of the program in number four? 44 00:04:49,280 --> 00:04:57,380 You all did, I'm sure. But let me tell you, it's a 2-by-2 matrix containing the number 2 in all four positions. 45 00:05:00,550 --> 00:05:01,060 Okay. 46 00:05:03,100 --> 00:05:11,409 So today we're talking about conjugate gradients, which is one of the greatest hits of, well, really of scientific computing, of numerical analysis. 47 00:05:11,410 --> 00:05:18,370 It's really one of the great algorithms of all time, whose impact is absolutely huge: conjugate gradients. 48 00:05:19,270 --> 00:05:33,580 So we call this section 1.4. It's an idea that had a very clear beginning at the beginning of the 1950s. 49 00:05:34,630 --> 00:05:43,180 There's an article, getting on in age now, on the history of conjugate gradients that's kind of interesting. 50 00:05:43,180 --> 00:05:56,360 And that's by Golub and O'Leary, in 1989 I guess it is, Golub and O'Leary in SIAM Review in 1989. 51 00:06:00,890 --> 00:06:09,330 Anyway, the starting point of the algorithm was two people, Hestenes and Stiefel, who independently had more or less the same idea. 52 00:06:09,560 --> 00:06:17,660 In fact, there were three key people. Hestenes, Stiefel and Lanczos were all doing very closely related things at the beginning of the 1950s, 53 00:06:17,990 --> 00:06:25,700 and in particular, Hestenes and Stiefel communicated in time that they ended up writing the main paper together. 54 00:06:26,240 --> 00:06:32,000 So this is the great paper, Hestenes and Stiefel in 1952. 55 00:06:34,130 --> 00:06:39,920 And certainly by any standard, that's one of the ten most important numerical analysis papers ever written. 56 00:06:41,030 --> 00:06:45,350 Let's see, there's a handout which shows you a bit of the flavour of that first paper. 57 00:06:45,620 --> 00:06:49,010 So, Methods of Conjugate Gradients for Solving Linear Systems.
58 00:06:49,280 --> 00:06:58,000 I like to give you some flavour of the past. It's a remarkable paper, partly because the two of them collaborated on it. 59 00:06:58,010 --> 00:07:00,410 It's amazing how much is in this paper. 60 00:07:00,980 --> 00:07:07,730 People keep rediscovering parts of what's in this paper, because, of course, not so many people read the actual paper anymore. 61 00:07:08,600 --> 00:07:13,549 It's a remarkable paper, but it also has an oddity, 62 00:07:13,550 --> 00:07:20,150 which is that it can't quite decide whether conjugate gradients is an iterative method or a direct method. 63 00:07:20,330 --> 00:07:23,540 And that's a flavour we'll be talking about a lot in the next few weeks, 64 00:07:23,540 --> 00:07:31,220 and today in particular. A direct method is an algorithm where you do a certain amount of computation and then you're done. 65 00:07:31,370 --> 00:07:37,909 You've solved the problem. An iterative method is one where, as you compute, you get closer and closer to the solution. 66 00:07:37,910 --> 00:07:41,450 And once you've got 6 or 16 digits, then you stop. 67 00:07:41,630 --> 00:07:49,100 Very different style. So a direct method, you sort of think of it as going like this; an iterative method you think of as going like this. 68 00:07:50,480 --> 00:07:55,460 The paradox of the conjugate gradient algorithm is that in principle it's both. 69 00:07:56,210 --> 00:08:04,850 If you do enough work, you solve a matrix problem Ax = b, but also, as you go, you get closer and closer to the solution. 70 00:08:05,330 --> 00:08:10,460 So Hestenes and Stiefel were very proud of inventing an algorithm that could do both of these things. 71 00:08:10,760 --> 00:08:16,820 And as a result, they wrote a paper which didn't focus on the important one, which is the iterative side. 72 00:08:17,270 --> 00:08:23,320 So if you just look at this first page that I've copied, the second word in the abstract is iterative. 73 00:08:23,330 --> 00:08:27,540 So obviously they know it has an iterative flavour. 74 00:08:28,830 --> 00:08:32,160 On the other hand, if you look at the second column and you look at item A, 75 00:08:32,820 --> 00:08:38,250 it's pointing out that, like Gaussian elimination, the method gives the exact solution after N steps. 76 00:08:38,790 --> 00:08:45,629 So they're hedging their bets. It's iterative, it's direct. And it was funny what happened historically. 77 00:08:45,630 --> 00:08:53,010 They wrote this great paper. People didn't focus very much on the iterative side of conjugate gradients. 78 00:08:53,010 --> 00:08:58,020 They focussed on it as a direct algorithm. And the fact is, it's not a good direct algorithm. 79 00:08:58,020 --> 00:09:01,410 It's nowhere near as reliable as Gaussian elimination. 80 00:09:01,710 --> 00:09:05,430 So it sort of lost the battle in the fifties and sort of got put aside. 81 00:09:05,760 --> 00:09:08,550 Everybody knew about it, but people didn't use it much. 82 00:09:08,790 --> 00:09:15,510 And it really took 20 years for conjugate gradients to be kind of rediscovered as an iterative algorithm. 83 00:09:15,840 --> 00:09:20,850 I'd like to draw you a plot to indicate why that happened. 84 00:09:22,110 --> 00:09:28,200 It wasn't just that people were being dumb. It has to do with how things evolve with time. 85 00:09:28,200 --> 00:09:35,490 So this is time, but it's also the dimension n of a problem, 86 00:09:36,420 --> 00:09:41,070 because, of course, as computers get more powerful, you can solve bigger problems.
87 00:09:41,880 --> 00:09:45,360 So this is the 1950s, this is 2015. 88 00:09:45,360 --> 00:09:47,700 And of course, n gets bigger with time. 89 00:09:47,700 --> 00:09:56,880 As we discussed last lecture, this axis is the amount of work involved in solving an n-by-n system of equations. 90 00:09:57,120 --> 00:10:03,090 So we're solving Ax = b, and it's n by n. 91 00:10:04,710 --> 00:10:08,940 Now, if you do this with the obvious method of Gaussian elimination, 92 00:10:12,430 --> 00:10:16,090 which is too familiar for me to want to talk about it, 93 00:10:16,750 --> 00:10:23,290 the amount of work is O(n^3), basically. So that's a fine algorithm. 94 00:10:23,290 --> 00:10:32,290 And back in 1950, let's say, n might have been ten; in 2015, n might be ten to the eighth or something. 95 00:10:33,670 --> 00:10:41,080 Now, conjugate gradients depends on the matrix, but loosely speaking, it's got a picture that looks more like this. 96 00:10:41,980 --> 00:10:47,200 Let me put it in quotes. I'll say CG, and I'll put in quotes 'O(n^2)'. 97 00:10:47,740 --> 00:10:53,379 That's not a general figure. There's no sense in which it is generally of complexity n squared, 98 00:10:53,380 --> 00:10:59,410 but maybe that's a typical amount of work involved when things are going pretty well with conjugate gradients. 99 00:10:59,800 --> 00:11:07,480 On the other hand, the constant is worse. If you have a small matrix of dimension five, ten or 20 or 30, you would never use conjugate gradients. 100 00:11:07,750 --> 00:11:13,900 So it really didn't make sense in those early days for people to use conjugate gradients. 101 00:11:14,200 --> 00:11:19,240 Somewhere in the early 1970s, and let's, to be precise, say 1971, 102 00:11:20,290 --> 00:11:29,770 these two curves crossed over, and people started to notice that for big problems, at least, conjugate gradients was a very powerful idea. 103 00:11:30,070 --> 00:11:33,710 So it was a remarkable change of perspective there. 104 00:11:33,730 --> 00:11:39,850 The person who made that observation was John Reid, who is still around. 105 00:11:39,850 --> 00:11:44,170 He's at the Rutherford Appleton Laboratory, about ten miles south of here. 106 00:11:44,440 --> 00:11:49,000 He wrote a paper which pointed out how powerful conjugate gradients was. 107 00:11:49,150 --> 00:11:56,230 And the funny thing is, all the foundations were already there in Hestenes and Stiefel in 1952, but didn't get focussed on. 108 00:11:56,410 --> 00:12:06,670 It's a very curious bit of history. In fact, one reason I'm so interested in this is that you'll often hear people say that numerical computation is, 109 00:12:06,790 --> 00:12:10,060 let's say, ten to the 12th times faster than it was way back when. 110 00:12:10,330 --> 00:12:16,240 And then they'll say six of those orders of magnitude come from computers and the other six come from algorithms. 111 00:12:16,510 --> 00:12:21,550 And in some sense that's true. The algorithms are much faster and the computers are much faster. 112 00:12:22,270 --> 00:12:26,470 But still, there's an asymmetry, because the computers drive the algorithms, you know. 113 00:12:27,490 --> 00:12:33,430 Yes, it's true that we didn't focus on conjugate gradients back here, but that's not really because we were stupid. 114 00:12:33,430 --> 00:12:42,910 It's because it didn't make sense to focus on it back here. So in some sense, as the machines get faster, that affects what the right algorithms are.
115 00:12:43,120 --> 00:12:51,730 And in the end, you get your tremendous improvements, you get a curve in the plot like that, because of the interaction of machines and algorithms. 116 00:12:52,840 --> 00:13:00,970 Anyway, conjugate gradients is a huge thing. I checked on the web: if you Google search it, you get 573,000 hits today. 117 00:13:02,020 --> 00:13:05,799 The third hit is a very popular one because it's a paper with a nice title. 118 00:13:05,800 --> 00:13:13,420 It's a paper by Jonathan Shewchuk called An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. 119 00:13:17,270 --> 00:13:23,870 Shewchuk, by the way, is a well-known guy at Berkeley who won the Wilkinson Software Prize for his package called Triangle. 120 00:13:25,280 --> 00:13:32,599 Okay. So what I want to do today is show you the conjugate gradient method and derive a 121 00:13:32,600 --> 00:13:37,580 certain part of the theory that explains why it's so spectacularly fast sometimes. 122 00:13:39,080 --> 00:13:42,970 So it applies to symmetric positive definite matrices. 123 00:13:42,980 --> 00:13:48,320 So let A, which is n by n, be, 124 00:13:48,890 --> 00:13:56,030 well, we always say SPD, which means symmetric positive definite. 125 00:14:01,050 --> 00:14:05,010 Symmetric means if you transpose it, you get the same matrix. 126 00:14:06,060 --> 00:14:11,910 So symmetric means A equals A transpose, and positive definite, 127 00:14:11,920 --> 00:14:18,750 well, you can define that in various ways. One way to define it is to say that all the eigenvalues are positive. 128 00:14:22,600 --> 00:14:27,280 A real symmetric matrix always has real eigenvalues, but they might be zero or negative. 129 00:14:27,550 --> 00:14:32,320 A real symmetric positive definite one has positive eigenvalues. 130 00:14:33,610 --> 00:14:41,440 Another way to say it is that x transpose A x is always greater than zero 131 00:14:41,950 --> 00:14:50,590 if x is a non-zero vector. So x^T A x is a very standard way that we would write a thing like this. 132 00:14:50,860 --> 00:14:54,700 You should think of that as a picture of a row vector 133 00:14:55,030 --> 00:14:59,050 times a matrix times a column vector. 134 00:15:00,430 --> 00:15:06,909 A lot of the notations we use are equivalent to, though different from, what physicists in quantum mechanics use. 135 00:15:06,910 --> 00:15:12,190 So physicists like to talk about bras and kets, which have to do with sort of left and right vectors, 136 00:15:12,190 --> 00:15:16,000 and they would sort of split this operator into a left half and a right half. 137 00:15:16,600 --> 00:15:20,200 This is the way numerical analysts like to talk about such operations. 138 00:15:22,500 --> 00:15:27,600 Anyway. In applications, symmetric positive definite problems come up all the time, 139 00:15:27,810 --> 00:15:32,450 essentially because Hermitian differential equations come up all the time, 140 00:15:32,460 --> 00:15:40,440 self-adjoint differential equations. When you have a self-adjoint differential equation, then almost certainly you can find a way to discretise it 141 00:15:40,650 --> 00:15:43,770 which turns it into a symmetric positive definite matrix. 142 00:15:45,020 --> 00:15:54,440 So we want to solve Ax = b, so let b be an n-vector. 143 00:15:58,080 --> 00:16:13,720 And of course, we're interested in the case where n is pretty big. So we want x*, as I'll call it, such that A x* = b.
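To make the SPD definition concrete, here is a small MATLAB check; this is just a sketch with a made-up 2-by-2 example, not anything from the course handouts. A matrix is SPD if it equals its transpose and all its eigenvalues are positive, and then x'*A*x comes out positive for any nonzero x.

    % Sketch: checking the SPD property for a small made-up example matrix.
    A = [4 1; 1 3];                    % example matrix (assumed for illustration)
    isSym = isequal(A, A');            % symmetric: A equals its transpose
    lambda = eig(A);                   % real eigenvalues, since A is symmetric
    isSPD = isSym && all(lambda > 0);  % positive definite: all eigenvalues > 0
    x = randn(2,1);                    % any nonzero vector
    disp(x'*A*x)                       % x'*A*x should be positive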
144 00:16:14,290 --> 00:16:18,999 That solution exists and is unique because of the assumption that the matrix is positive 145 00:16:19,000 --> 00:16:25,450 definite. That implies that it's non-singular: all the eigenvalues are positive, so it's a non-singular matrix. 146 00:16:25,750 --> 00:16:29,320 There's a unique solution. So x* is our unique solution. 147 00:16:29,470 --> 00:16:38,140 And the way conjugate gradients works is, it starts from an initial guess x_0 and it iteratively gets closer to x*. 148 00:16:38,410 --> 00:16:45,490 So CG finds x* iteratively, 149 00:16:48,990 --> 00:16:53,010 starting from some vector which I'll call x_0. 150 00:16:56,680 --> 00:17:00,130 Now, it could be anything you like, but we're going to take it to be zero. 151 00:17:00,760 --> 00:17:03,820 It doesn't have to be zero, except it simplifies the formulas. 152 00:17:04,360 --> 00:17:13,060 So without, well, with only slight loss of generality, let's say that x_0 is the zero vector, the N-by-1 zero vector. 153 00:17:13,720 --> 00:17:22,870 And the way conjugate gradients does this is, it finds x_1 and then x_2 and x_3, and these iterates x_n 154 00:17:22,870 --> 00:17:34,090 belong to very interesting subspaces. So x_n belongs to the so-called Krylov space, 155 00:17:37,240 --> 00:17:41,230 which is a subspace of the space of capital-N-dimensional vectors. 156 00:17:42,600 --> 00:17:49,559 And the way we write it is, we'll write K_n as the space spanned by the 157 00:17:49,560 --> 00:17:55,290 right-hand side vector b, and A times the right-hand side vector, dot dot dot, up to 158 00:17:56,450 --> 00:18:01,220 A raised to the n-minus-one power times the right-hand side vector: K_n = <b, Ab, ..., A^(n-1) b>. 159 00:18:02,660 --> 00:18:09,590 So that idea of a Krylov subspace is absolutely at the heart, not just of conjugate gradients, 160 00:18:09,590 --> 00:18:19,310 but of all sorts of iterative algorithms for solving systems of equations and for finding eigenvalues and singular values, all sorts of things. 161 00:18:21,570 --> 00:18:31,010 So let me say a little more about that. This notation means the set of vectors spanned by these vectors, linear combinations. 162 00:18:31,250 --> 00:18:35,660 So if I just have one of them, the first Krylov subspace, that's a one-dimensional space. 163 00:18:35,870 --> 00:18:39,320 If I have two of them, that's probably a two-dimensional space, and so on. 164 00:18:39,650 --> 00:18:48,260 So this is generically an n-dimensional space, which is contained in the set of all N-vectors. 165 00:18:50,450 --> 00:18:56,750 And as little n increases, you get a bigger and bigger subspace of your giant space. 166 00:18:57,890 --> 00:19:05,060 So conjugate gradients is searching in bigger and bigger spaces for a good approximation to the solution. 167 00:19:06,260 --> 00:19:08,150 Now let me show you some of its properties. 168 00:19:14,240 --> 00:19:23,340 So first of all, we define the so-called A-norm of a vector to be the size of the vector measured in a special way. 169 00:19:23,360 --> 00:19:33,500 We say that the A-norm of a vector x is equal to the square root of x transpose A x. 170 00:19:34,970 --> 00:19:38,840 So that's a number, and it's a positive number if x is non-zero, 171 00:19:42,160 --> 00:19:52,390 because by definition A is symmetric positive definite, and therefore this thing is a positive number, so we can square-root it. 172 00:19:52,690 --> 00:20:03,190 That's going to be the measure of our error.
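As a small illustration of these two definitions, here is a sketch that builds the vectors spanning K_n and computes an A-norm directly; it is not the course's code, and the matrix, right-hand side and dimensions are made-up examples.

    % Sketch: the Krylov vectors b, A*b, ..., A^(n-1)*b, and the A-norm.
    N = 10;                            % dimension of the problem (assumed)
    n = 4;                             % dimension of the Krylov space (assumed)
    A = full(gallery('tridiag', N));   % a standard SPD test matrix
    b = ones(N,1);
    K = zeros(N,n);                    % columns will span K_n
    K(:,1) = b;
    for j = 2:n
        K(:,j) = A*K(:,j-1);           % next Krylov vector: A times the previous one
    end
    x = randn(N,1);
    anormx = sqrt(x'*A*x);             % the A-norm ||x||_A = sqrt(x'*A*x)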
And let's define the error at each step like this. 173 00:20:03,190 --> 00:20:09,760 I'll say that the error at step n, I'll define to be e_n, 174 00:20:11,220 --> 00:20:25,980 and that will be the exact solution x* minus the guess x_n. And the residual, as we call it, is what you get if you apply A to that. 175 00:20:26,250 --> 00:20:32,310 So it's r_n, our residual, equals A times e_n. 176 00:20:34,410 --> 00:20:40,020 Now A times e_n is A x* minus A x_n, which is b minus A times x_n. 177 00:20:43,740 --> 00:20:52,110 So throughout numerical analysis you always have these two ideas of an error and a residual, and the residual is a kind of dual of the error. 178 00:20:52,110 --> 00:20:57,809 It's what you get when you apply, to the error, the problem you're trying to solve, 179 00:20:57,810 --> 00:21:05,130 the operator you're working with. As a rule, in principle you would like a small error, but the thing you can measure is very often the residual. 180 00:21:05,140 --> 00:21:11,400 So you minimise the residual as a means of minimising the error, all over numerical analysis. 181 00:21:14,730 --> 00:21:23,700 Okay, so let me move over here and write down an amazing property, that I'm not going to prove, of conjugate gradients. 182 00:21:24,780 --> 00:21:34,620 So I'll call this Theorem 1. And the amazing property is that CG, conjugate gradients, magically finds the very 183 00:21:34,620 --> 00:21:40,440 best vector in this space, in the sense of minimising the norm of the error. 184 00:21:41,340 --> 00:21:45,720 All right. So, the span? Yes, 185 00:21:46,020 --> 00:21:49,080 the set of all linear combinations of these vectors. 186 00:21:50,660 --> 00:21:55,530 Other questions? How do I come up with x_n? 187 00:21:55,560 --> 00:22:01,440 I haven't told you yet. Yeah, I'm going to tell you. It's amazingly slick and obscure. 188 00:22:01,440 --> 00:22:06,660 It's not at all obvious when you see the formulas how it works. But I'm about to tell you, of course. 189 00:22:07,050 --> 00:22:19,830 But first, let me tell you the theorem, which is that among all vectors in this Krylov space, I guess I should have given it a subscript, 190 00:22:19,830 --> 00:22:23,130 so K_n is the n-dimensional Krylov space, 191 00:22:26,220 --> 00:22:36,900 the one that conjugate gradients is going to give us when I tell you the formulas is the unique one that minimises the norm of the error 192 00:22:44,510 --> 00:22:48,140 e_n, measured in the A-norm. That's the theorem. 193 00:22:48,740 --> 00:22:53,240 So this is the optimality property, which Hestenes and Stiefel, of course, knew about. 194 00:22:57,090 --> 00:23:04,440 So at every step, conjugate gradients finds you the best approximation, in a certain measure, in a space. 195 00:23:04,440 --> 00:23:07,110 And those spaces are growing. They're getting bigger and bigger, 196 00:23:07,830 --> 00:23:17,379 which means that at some rate we must be approaching the correct solution, because after capital-N steps we'll have a capital-N-dimensional subspace of this. 197 00:23:17,380 --> 00:23:22,830 So we will have it all. So if we really take capital-N steps, we've got to solve the problem. 198 00:23:23,190 --> 00:23:29,129 Nobody would ever take capital-N steps in practice anymore.
So indeed, 199 00:23:29,130 --> 00:23:35,520 we have immediately from this theorem the beautiful corollary that the initial error 200 00:23:36,720 --> 00:23:41,790 measured in the A-norm is greater than or equal to the first error in the A-norm, 201 00:23:41,790 --> 00:23:53,760 and so on: ||e_0||_A >= ||e_1||_A >= ||e_2||_A >= ... So the errors keep going down monotonically, at least in the A-norm. 202 00:23:55,300 --> 00:24:03,970 Very beautiful stuff. I think they also go down in the 2-norm, but that's less obvious, if I remember. 203 00:24:03,970 --> 00:24:12,510 But this is the fundamental property. So I'm going to now give you the formulas. 204 00:24:13,830 --> 00:24:19,230 And as I do that, let me tell you the context in which I like to present them. 205 00:24:20,760 --> 00:24:26,100 The way most people think of conjugate gradients is as a kind of an optimisation search. 206 00:24:26,370 --> 00:24:29,489 You're in this giant 10,000-dimensional space, 207 00:24:29,490 --> 00:24:33,750 let's say. You're searching for a particular vector in this space, 208 00:24:34,080 --> 00:24:40,950 and you do that by starting with one vector and then moving along in some direction to improve the guess, and then moving in another direction. 209 00:24:41,250 --> 00:24:51,000 So you're searching around in this huge-dimensional space, and conjugate gradients can be described very compactly in terms of how it does that search. 210 00:24:51,810 --> 00:25:01,440 So let me be specific. So I'll call it the CG iteration, and let's in particular relate it to the idea 211 00:25:01,440 --> 00:25:10,440 of minimising something, which we always call optimisation: we want to minimise 212 00:25:12,120 --> 00:25:18,419 a quantity, and that's equivalent to finding a place where a certain gradient is zero. 213 00:25:18,420 --> 00:25:24,780 Right. You imagine a surface, you're trying to minimise it. That amounts to making some gradient equal to zero. 214 00:25:24,780 --> 00:25:38,820 Let's be specific. Let's define f of any vector x to be one half times x transpose A x, minus x transpose b: f(x) = (1/2) x^T A x - x^T b. 215 00:25:41,740 --> 00:25:47,950 That's a scalar. We already saw this: this term has the structure of row vector times matrix times 216 00:25:47,950 --> 00:25:53,230 column vector, and x transpose b has the structure of row vector times column vector. 217 00:25:54,700 --> 00:26:11,020 So this is a scalar, and you can compute that the gradient of that scalar function of x is equal to Ax minus b. 218 00:26:12,340 --> 00:26:19,270 How do you do that? I won't go through that. But because x appears twice, you get two terms; you add them up, they're the same, 219 00:26:19,270 --> 00:26:28,540 so the one half gets doubled and turns into a one. And then the gradient of this term is b. So you have this 10,000-dimensional space, 220 00:26:28,540 --> 00:26:38,140 you have the scalar function of this vector in that space, the gradient of that function is a vector, and this is that vector. 221 00:26:39,790 --> 00:26:44,559 Now we want Ax to be equal to b, which means we want the gradient to be zero, right? 222 00:26:44,560 --> 00:26:47,770 So our goal is to make that zero, 223 00:26:47,770 --> 00:26:52,660 in other words, to make the gradient zero, in other words, to minimise f(x). 224 00:26:53,530 --> 00:26:59,650 So f takes a minimum uniquely 225 00:27:05,470 --> 00:27:08,590 at the point x*. 226 00:27:09,790 --> 00:27:14,710 I'm not very good at drawing pictures in a thousand or 10,000 dimensions, but in two dimensions anyway.
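The gradient computation he skips over takes only a couple of lines; written out, using the symmetry a_{ij} = a_{ji}, it reads:

\[
\frac{\partial f}{\partial x_i}
= \frac{\partial}{\partial x_i}\Big(\tfrac12 \sum_{j,k} x_j a_{jk} x_k \;-\; \sum_j x_j b_j\Big)
= \tfrac12\Big(\sum_k a_{ik} x_k + \sum_j x_j a_{ji}\Big) - b_i
= \sum_k a_{ik} x_k - b_i,
\]

so \(\nabla f(x) = Ax - b\), which vanishes exactly when \(Ax = b\).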
227 00:27:15,130 --> 00:27:18,430 Of course, you imagine you've got some kind of a parabola here. 228 00:27:19,210 --> 00:27:22,540 Well, that's one dimension, I guess you'd say. And we're trying to find 229 00:27:23,600 --> 00:27:28,850 that point x* at which the parabola is minimised. 230 00:27:29,360 --> 00:27:38,750 More generally, it's a parabolic surface in capital-N dimensions, so the zero gradient gives us the minimum. 231 00:27:45,500 --> 00:27:51,680 So what conjugate gradients does is to cleverly search to minimise that function. 232 00:27:51,980 --> 00:27:57,890 And now I write down the formulas. So here they are. 233 00:27:59,330 --> 00:28:00,970 And we're going to treat them as magic. 234 00:28:00,980 --> 00:28:06,740 I'm not going to tell you where they come from, but I will tell you how to interpret each line of the algorithm. 235 00:28:06,950 --> 00:28:13,100 So here's the algorithm. We start with an initial guess, which, as I say, we're going to take to be the zero vector. 236 00:28:14,690 --> 00:28:23,980 Corresponding to that is a residual, and the residual would be b minus A times the initial vector, but the initial vector is zero, 237 00:28:23,990 --> 00:28:28,850 so it's b; the initial residual is r_0 = b. 238 00:28:29,540 --> 00:28:37,250 And then we're also going to work with a search direction vector, which we'll take to be the initial residual, p_0 = r_0. 239 00:28:37,430 --> 00:28:41,630 So in our thousand-dimensional space, we're going to be searching in all sorts of directions, 240 00:28:42,170 --> 00:28:45,290 and this is a vector that will define those search directions. 241 00:28:46,520 --> 00:28:52,730 Now, here's how the algorithm goes: for little n = 1, 2, 3, and so on, until we're happy. 242 00:28:55,480 --> 00:29:00,130 We'll define alpha_n to be the quotient 243 00:29:01,360 --> 00:29:15,250 of r_{n-1} transpose r_{n-1} divided by p_{n-1} transpose A p_{n-1}: alpha_n = (r_{n-1}^T r_{n-1}) / (p_{n-1}^T A p_{n-1}). 244 00:29:16,930 --> 00:29:20,260 So that's a scalar divided by a scalar, 245 00:29:21,390 --> 00:29:26,850 so alpha is a scalar. I'm going to interpret each of these expressions in a moment. 246 00:29:27,810 --> 00:29:37,800 x_n, well, of course you know what the interpretation of that is: it's the next guess, and that's x_n = x_{n-1} + alpha_n p_{n-1}. 247 00:29:41,130 --> 00:29:55,230 And then r_n, the next residual, is equal to r_n = r_{n-1} - alpha_n A p_{n-1}. 248 00:29:57,180 --> 00:30:01,380 And then there's another scalar called beta, or 'beeta' if you're English, 249 00:30:01,950 --> 00:30:14,670 which is the quotient of r_n transpose r_n divided by r_{n-1} transpose r_{n-1}: beta_n = (r_n^T r_n) / (r_{n-1}^T r_{n-1}). 250 00:30:15,210 --> 00:30:22,000 And then the final line is, we update the vector p by the formula 251 00:30:22,020 --> 00:30:27,210 p_n = r_n + beta_n p_{n-1}. 252 00:30:33,590 --> 00:30:38,240 And so those are just magic formulas. 253 00:30:38,690 --> 00:30:42,960 It's not at all obvious why they work. 254 00:30:43,340 --> 00:30:50,150 But let me at least tell you what they're doing, so you can get the optimisation interpretation here. 255 00:30:51,140 --> 00:30:58,460 You already know that we're in a big-dimensional space and we're trying to get better and better approximations to the point x*. 256 00:30:59,870 --> 00:31:11,330 Well, alpha, being a scalar, is the step length, which means it's how far we're going along a search direction to improve our current guess.
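Written out as a bare MATLAB loop, the iteration just put on the board looks something like the sketch below. It is only a sketch in the spirit of the handout's cg code, not the handout itself; the stopping tolerance and the cap of N steps are assumptions added for completeness.

    function x = cg_sketch(A, b)
    % Sketch of the conjugate gradient iteration as written on the board.
    % A is assumed symmetric positive definite; x0 = 0, so r0 = b and p0 = r0.
    x = zeros(size(b));
    r = b;                               % initial residual
    p = r;                               % initial search direction
    for n = 1:length(b)                  % at most N steps (the exact-arithmetic view)
        Ap = A*p;                        % the one matrix-vector product per step
        alpha = (r'*r)/(p'*Ap);          % step length
        x = x + alpha*p;                 % new approximate solution
        rnew = r - alpha*Ap;             % new residual
        beta = (rnew'*rnew)/(r'*r);      % improvement ratio
        p = rnew + beta*p;               % new search direction
        r = rnew;
        if norm(r) < 1e-10*norm(b)       % assumed stopping criterion
            break
        end
    end
    end

Each pass through the loop touches A only once, in the product A*p, which is the point made below about where the work lies.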
257 00:31:12,200 --> 00:31:16,160 x_n is of course the approximate solution at step n. 258 00:31:19,150 --> 00:31:22,540 r_n is of course the residual corresponding to that. 259 00:31:27,770 --> 00:31:30,960 Well, it's obvious from this formula how we can interpret 260 00:31:31,010 --> 00:31:39,250 beta_n. It's the improvement: this is the square of the residual at the last step, and this is the square of the residual at the current step. 261 00:31:39,260 --> 00:31:44,120 So this measures, in some sense, the improvement in the step we're now taking, 262 00:31:45,750 --> 00:31:52,950 a scalar. And finally, this is a search direction. 263 00:31:53,160 --> 00:32:01,800 So we're in this high-dimensional space, and at every step we take a step from the current position to our new position along this direction. 264 00:32:02,190 --> 00:32:05,010 So this is a search direction. 265 00:32:10,760 --> 00:32:22,810 And maybe the most interesting formula to look at in this magic algorithm is this one, where you can see that at every step, to get the new estimate x_n, 266 00:32:22,820 --> 00:32:29,480 we take the old one and then we take a step in the direction p_{n-1} of length alpha_n. 267 00:32:29,530 --> 00:32:40,190 And so it's crazy and magical, but this has the property that Theorem 1 says it has, of giving you an optimal solution in a Krylov space. 268 00:32:42,320 --> 00:32:49,580 So I want to show you that a little bit numerically, and then I want to say a bit about what that implies about convergence rates. 269 00:33:03,600 --> 00:33:10,800 So we're now at the MATLAB handout. There's a sheet of paper there with cg and m3 on it. 270 00:33:19,570 --> 00:33:20,229 Now of course, 271 00:33:20,230 --> 00:33:27,250 real software for conjugate gradients would be much fancier than this, with all sorts of bells and whistles and convergence criteria and so on. 272 00:33:27,550 --> 00:33:34,720 But this little code, cg, is designed to show you that iteration in its simplest form, and you can see it has exactly the structure 273 00:33:34,720 --> 00:33:41,890 I just wrote down: an iteration involving five or six update formulas at each step. 274 00:33:42,310 --> 00:33:46,030 We should say a word about where the work is involved in this iteration. 275 00:33:48,160 --> 00:33:50,530 Most of this is just manipulating vectors. 276 00:33:51,250 --> 00:34:00,910 The only part of the CG operation that is more than just a vector operation is the multiplication of the matrix A by the vector p_{n-1}. 277 00:34:01,180 --> 00:34:05,310 So that's a matrix-vector product. And it's the same here and here, 278 00:34:05,320 --> 00:34:13,150 so you only do it once at each step. So every step of conjugate gradients involves a matrix-vector product. 279 00:34:13,360 --> 00:34:19,300 If that were a dense matrix, that would be N-squared work. If it's a sparse matrix, it's less than that. 280 00:34:21,360 --> 00:34:27,209 Now, for this particular example I'm going to run the first half of m3, on convergence. 281 00:34:27,210 --> 00:34:37,290 So let's take A to be a diagonal matrix of dimension 100 with entries going from 100 to 199 along the diagonal. 282 00:34:37,530 --> 00:34:43,620 So it's a trivial matrix mathematically; it's just the diagonal, 100 up to 199. 283 00:34:44,940 --> 00:34:53,410 So the size of A is 100 by 100. And let's take a trivial right-hand side to be the vector, 284 00:34:55,620 --> 00:35:03,110 what's happening here, the vector of length 100 of all ones.
285 00:35:03,830 --> 00:35:07,230 And now let's say x = cg(A, b). 286 00:35:07,400 --> 00:35:17,810 So I'm just running this code. Now, what you see is, these are the residuals at each step, and you see they're beautifully converging. 287 00:35:17,820 --> 00:35:24,900 Now, what does that mean? We start with an error of one, and after 20 steps or so, we've got about ten digits of improvement. 288 00:35:25,560 --> 00:35:32,190 It's a 100-by-100 matrix, and yet it only took us ten or 20 steps to get this very nice improvement. 289 00:35:32,370 --> 00:35:35,520 So the amount of work we've done is small. 290 00:35:35,820 --> 00:35:39,960 Of course it's a trivial problem, we could have done it on paper, but you see the idea. 291 00:35:42,550 --> 00:35:46,480 And if I look at the solution, of course, it's accurate to ten digits or so. 292 00:35:46,690 --> 00:35:52,090 Let's look at the first four entries. The diagonal of the matrix was 100, 101, 102, 293 00:35:52,390 --> 00:36:01,420 so the first entries are 1/100, 1/101, 1/102, and so on, to ten digits. 294 00:36:03,330 --> 00:36:09,420 So there we just used conjugate gradients to solve a rather trivial 100-by-100 problem. 295 00:36:11,240 --> 00:36:22,700 Let's do another one here. So let's say A equals the diagonal matrix consisting of the vector from, let's see, I'm going to put in six zeros there, 296 00:36:22,700 --> 00:36:27,800 1000000, up to 1999999. 297 00:36:29,360 --> 00:36:37,040 Hope that's right. So I've now got a million-by-million matrix, represented sparsely. 298 00:36:37,400 --> 00:36:41,630 Of course, again, this is a problem we can do on paper because it's just a diagonal matrix. 299 00:36:41,900 --> 00:36:48,120 But it illustrates the idea. Let's take the right-hand side to be the trivial vector of length 300 00:36:48,120 --> 00:36:56,239 a million, all ones. And now let's again do x = cg(A, b). 301 00:36:56,240 --> 00:37:04,400 So this is a million-by-million system of equations. And you see it chugs along with essentially the same performance. 302 00:37:05,270 --> 00:37:09,920 So this gives you a hint of how powerful conjugate gradients can be when things are working. 303 00:37:12,860 --> 00:37:17,240 So we just solved a million-by-million system of equations. Admittedly, it was diagonal. 304 00:37:20,960 --> 00:37:24,650 Okay, so I want to say a word about convergence. 305 00:37:29,830 --> 00:37:38,140 The convergence of conjugate gradients is an example where some really beautiful mathematics has some very concrete consequences. 306 00:37:38,770 --> 00:37:45,550 And the part of mathematics we're talking about is approximation theory: approximation by polynomials of certain things. 307 00:37:46,540 --> 00:37:55,720 So let's call this section 1.5. This is not, well, in this course 308 00:37:55,730 --> 00:38:01,010 everything we do is mathematics in some sense, but mostly it's not really a mathematics course very much. 309 00:38:01,340 --> 00:38:05,060 Maybe the next 10 minutes are a little more mathematical than some. 310 00:38:09,470 --> 00:38:15,200 So it turns out that just from the optimality property of Theorem 1 311 00:38:15,200 --> 00:38:20,060 you can get everything. I haven't proved that theorem, but from it everything follows. 312 00:38:20,270 --> 00:38:29,569 And let me remind you that the initial error e_0 is x* minus x_0, which we've taken to be zero, 313 00:38:29,570 --> 00:38:44,810 so that's x*.
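The two diagonal demos can be reproduced along these lines; this is a sketch using the cg_sketch function above in place of the handout's cg, with the matrices built as described in the lecture.

    % Sketch: the two diagonal examples from the demo.
    A = diag(100:199);                    % 100-by-100 diagonal matrix, entries 100..199
    b = ones(100,1);
    x = cg_sketch(A, b);
    disp(x(1:4))                          % approximately 1/100, 1/101, 1/102, 1/103

    % The million-by-million version, stored sparsely:
    d = (1000000:1999999)';
    A2 = spdiags(d, 0, 1000000, 1000000); % sparse diagonal matrix
    b2 = ones(1000000,1);
    x2 = cg_sketch(A2, b2);               % each step is just a sparse matrix-vector product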
Similarly, the error e_n would be x* minus x_n, and so that would be e_0 minus x_n. 314 00:38:47,780 --> 00:38:51,220 And I want to say something about how small e_n is. 315 00:38:52,550 --> 00:38:58,210 So, using our Theorem 1 optimality, 316 00:38:58,570 --> 00:39:04,209 we note that x_n, as I've told you, is a member of this Krylov space. 317 00:39:04,210 --> 00:39:20,440 So it's in the space <b, Ab, ..., A^(n-1) b>. But b is A times x*, A times e_0. So 318 00:39:20,440 --> 00:39:38,110 another way to write that is that it's the space <A e_0, A^2 e_0, ..., A^n e_0>, because A times e_0 is b. 319 00:39:38,620 --> 00:39:43,390 So these are just two different ways to write the same little-n-dimensional space. 320 00:39:45,110 --> 00:39:50,150 So x_n, our guess at the nth step, is in this space. 321 00:39:51,480 --> 00:39:58,380 In other words, it's a linear combination of these vectors. And now there's really only one central idea to this analysis. 322 00:39:59,220 --> 00:40:07,410 A linear combination of these vectors can be interpreted as a polynomial in A applied to e_0. 323 00:40:08,510 --> 00:40:12,319 In fact, the set of all linear combinations of these vectors is precisely, well, 324 00:40:12,320 --> 00:40:21,500 it's the set of all polynomials of degree little n in A applied to e_0, with no constant term in that polynomial. 325 00:40:22,310 --> 00:40:28,320 So in conjugate gradients we're implicitly dealing with polynomials of A times the initial error. 326 00:40:29,000 --> 00:40:43,100 And that's why the analysis turns out so beautifully. So this tells us that e_n is e_0 minus x_n, and x_n is in this space. 327 00:40:43,970 --> 00:40:47,299 So e_n is equal to a polynomial of degree 328 00:40:47,300 --> 00:40:51,800 n in A times e_0: e_n = p_n(A) e_0, where 329 00:40:54,220 --> 00:41:11,070 p_n is a polynomial of degree n, or at most n, because of this linear combination of these terms. 330 00:41:11,520 --> 00:41:19,050 But we know one other thing about that polynomial, which is that the degree-zero term, the constant term, just comes from there. So it's one. 331 00:41:19,350 --> 00:41:23,249 So it's a polynomial of degree 332 00:41:23,250 --> 00:41:28,590 n with the property that p_n(0) = 1. 333 00:41:31,890 --> 00:41:38,640 So I've now told you really the crucial things about conjugate gradients; the rest is just working out the consequences. 334 00:41:43,310 --> 00:41:52,250 The crucial things are two. One is this theorem, that at each step we're minimising the norm over a certain space. 335 00:41:53,240 --> 00:42:01,730 And the other is this bit of algebra, that being in that space amounts to the error being equal to a polynomial in A times the initial error. 336 00:42:02,270 --> 00:42:05,510 All the rest is fiddling with polynomials. 337 00:42:07,580 --> 00:42:10,850 So that's really a beautiful, elegant property, I think. 338 00:42:13,040 --> 00:42:16,370 So let me restate Theorem 1. 339 00:42:17,540 --> 00:42:26,510 I'll call it Theorem 1'. It's the same theorem, just restated: e_n is equal to some polynomial in A times e_0, e_n = p_n(A) e_0. 340 00:42:30,150 --> 00:42:38,400 And it's the one that's optimal. That is to say, the one for which the A-norm of p_n(A) e_0 is minimal. 341 00:42:41,370 --> 00:42:55,530 And we have to say, of course, it's minimal among polynomials p of degree n with p(0) equal to one.
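Collecting the pieces of Theorem 1' into one formula, in the notation above (this is just a consolidated restatement of what is on the board, not an additional result):

\[
e_n = p_n(A)\, e_0, \qquad
\|e_n\|_A \;=\; \min_{\substack{p \;\text{of degree} \,\le\, n \\ p(0) = 1}} \big\| p(A)\, e_0 \big\|_A .
\]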
342 00:42:57,810 --> 00:43:01,380 So that's what conjugate gradients is magically doing at every step. 343 00:43:41,490 --> 00:43:48,810 Now there's one other piece I have to bring in, in order to really show you how it's an approximation problem that determines the convergence rate. 344 00:43:49,170 --> 00:43:54,719 And that other piece is the eigenvalues of the matrix. So I hope you all know about eigenvalues. 345 00:43:54,720 --> 00:43:57,300 I'm just going to use them without really talking about them much. 346 00:43:57,720 --> 00:44:05,130 But in order to get quantitative about what this means, we need to look at the eigenvalues of A and the eigenvectors of A. 347 00:44:05,280 --> 00:44:06,930 Now, A is a symmetric matrix. 348 00:44:07,410 --> 00:44:17,459 A symmetric matrix always has real eigenvalues and orthogonal eigenvectors, so the eigenvectors provide an orthogonal basis of the space R^N. 349 00:44:17,460 --> 00:44:21,570 So that's what we do now. 350 00:44:21,570 --> 00:44:29,700 We say, let A have eigenvalues. I always write that 'ew'; it's German, Eigenwert, 351 00:44:29,970 --> 00:44:36,960 for eigenvalue, and eigenvectors, 'ev'. So let A have eigenvalues, 352 00:44:37,080 --> 00:44:50,640 which, by the way, we know are going to be positive, lambda_1 through lambda_N, and eigenvectors, which are orthogonal. 353 00:44:51,030 --> 00:44:59,530 In fact, let's make them orthonormal: v_1 through v_N. 354 00:45:05,500 --> 00:45:12,160 Now the thing to do is expand the vectors we care about in this basis of orthonormal eigenvectors. 355 00:45:12,850 --> 00:45:16,900 So we then expand, and we say that e_0, 356 00:45:19,440 --> 00:45:30,510 well, any vector can be expanded in this basis, so we're going to write e_0 as the sum from k = 1 to capital N of some number a_k times v_k: e_0 = sum_{k=1}^{N} a_k v_k. 357 00:45:35,710 --> 00:45:39,880 It's a basis; you can always expand in the eigenvector basis. 358 00:45:41,320 --> 00:45:52,630 And now we look at the polynomial of A applied to e_0, and we find that e_n is equal to the polynomial of A times e_0. 359 00:45:56,520 --> 00:46:03,150 Now, the whole definition of an eigenvector is that A times v_k equals lambda_k v_k. 360 00:46:04,400 --> 00:46:12,770 That generalises: a polynomial in A applied to v_k is that polynomial of lambda_k multiplied by v_k. 361 00:46:13,760 --> 00:46:24,140 So when I do this I get the sum from 1 to capital N of a_k times 362 00:46:24,320 --> 00:46:30,380 p_n of lambda_k applied to v_k: e_n = sum_{k=1}^{N} a_k p_n(lambda_k) v_k. 363 00:46:34,260 --> 00:46:42,750 So now we've made the connection between a polynomial and a particular basis. 364 00:46:43,260 --> 00:46:49,980 We want this to be small, right? Well, how could this thing be small? If all of these numbers p_n(lambda_k) are small. 365 00:46:50,760 --> 00:46:53,820 So that's basically where the approximation theory comes in. 366 00:46:54,150 --> 00:47:03,750 Conjugate gradients will converge quickly if there exist polynomials that go to zero quickly at the eigenvalues of the matrix. 367 00:47:04,320 --> 00:47:08,280 And thanks to the miracle of Theorem 1, it's always finding the perfect polynomial. 368 00:47:08,280 --> 00:47:14,580 It's always optimising. So if there exist good polynomials, the algorithm will find them. 369 00:47:16,380 --> 00:47:19,560 So that's our conclusion: 370 00:47:20,700 --> 00:47:30,630 if there exist polynomials that are very small at the eigenvalues, 371 00:47:37,990 --> 00:47:42,550 then CG will converge quickly.
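A quick numerical check of the fact being used here, that a polynomial in A acts on an eigenvector v_k simply by multiplying it by the same polynomial of lambda_k; the matrix and the polynomial below are arbitrary illustrative choices, not from the handouts.

    % Sketch: p(A)*v_k = p(lambda_k)*v_k for a symmetric matrix A.
    N = 6;
    A = full(gallery('tridiag', N));   % a small symmetric positive definite example
    [V, D] = eig(A);                   % orthonormal eigenvectors and eigenvalues
    k = 3;  vk = V(:,k);  lk = D(k,k);
    pA = 2*eye(N) - 3*A + A^2;         % an arbitrary polynomial p(z) = 2 - 3z + z^2, evaluated at A
    lhs = pA*vk;                       % the polynomial in A applied to the eigenvector
    rhs = (2 - 3*lk + lk^2)*vk;        % p(lambda_k) times the eigenvector
    disp(norm(lhs - rhs))              % should be at the level of rounding error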
372 00:47:49,100 --> 00:47:55,310 And this is where you get the kind of picture that is often drawn in write-ups of conjugate gradients. 373 00:47:55,700 --> 00:48:00,200 People like to draw a picture of a polynomial, let's say p(lambda). 374 00:48:02,080 --> 00:48:05,590 It always has to go through one at the origin, 375 00:48:07,650 --> 00:48:16,500 but we want it to be small at the eigenvalues. So A is a big matrix with a thousand or a million eigenvalues, and they're all positive, 376 00:48:16,500 --> 00:48:24,320 so they're on the real axis somewhere, and there's just a lot of them. 377 00:48:24,650 --> 00:48:30,170 The question we ask in considering the convergence of conjugate gradients is, does 378 00:48:30,170 --> 00:48:35,450 there exist a polynomial that passes through this point and yet is very small 379 00:48:37,890 --> 00:48:48,900 at all of these eigenvalues? And of course, I've been writing these things down quickly, but I've sort of told you the whole logic of the argument: 380 00:48:49,410 --> 00:48:52,830 if that polynomial exists, then thanks to Theorem 1' (or 1), 381 00:48:53,700 --> 00:48:58,620 we know that there will be a small error, because conjugate gradients will find that polynomial. 382 00:49:00,360 --> 00:49:08,130 Okay, so that's what it all comes down to. And I just want to show you a corollary and one more computation, and then we'd better stop. 383 00:49:09,870 --> 00:49:17,510 So what this all leads to, I'll call it Theorem 2, 384 00:49:19,350 --> 00:49:25,800 is that the ratio of the error at the nth step to the initial error, 385 00:49:26,430 --> 00:49:34,259 ||e_n||_A / ||e_0||_A, when you do conjugate gradients, is bounded by the minimum over all polynomials p_n, with the usual 386 00:49:34,260 --> 00:49:44,010 restriction, of the maximum over all eigenvalues lambda_k of the matrix of |p_n(lambda_k)|. 387 00:49:47,010 --> 00:49:55,320 Okay. So as always, p_n(0) has to be one. 388 00:49:55,980 --> 00:50:00,480 Now there's more to say, but I don't really have time to say it. Let me show you a final demo. 389 00:50:02,700 --> 00:50:06,630 So let's turn to the second half of the MATLAB page, 390 00:50:14,940 --> 00:50:20,580 just to give you a sense of how, when the eigenvalues of a matrix behave, you can get beautiful behaviour. 391 00:50:20,760 --> 00:50:24,180 So suppose I say A = randn(500). 392 00:50:24,540 --> 00:50:27,900 That's now a 500-by-500 random matrix. 393 00:50:28,080 --> 00:50:36,210 It's not symmetric positive definite. But if I say A = A'*A, then it is symmetric positive definite. 394 00:50:37,200 --> 00:50:41,370 So I've got a 500-by-500 symmetric positive definite matrix. 395 00:50:42,360 --> 00:50:53,030 If I make the usual right-hand side, what we'll find here is that the eigenvalues of A are not so well behaved. 396 00:50:53,050 --> 00:50:57,730 So if I take the eigenvalues of A and I look at the smallest eigenvalue, 397 00:50:59,140 --> 00:51:07,060 it's very small, and the largest eigenvalue is of order one, or of order a thousand, actually. 398 00:51:07,420 --> 00:51:12,130 So that's bad. The eigenvalues are badly distributed for this matrix. 399 00:51:12,370 --> 00:51:15,460 If I try to run conjugate gradients, 400 00:51:17,560 --> 00:51:20,680 it eventually gets there, but it doesn't do well. 401 00:51:20,680 --> 00:51:29,140 It's very slow convergence. On the other hand, suppose I shift the matrix to make it a better-behaved matrix.
402 00:51:30,010 --> 00:51:34,780 Suppose it happens to have eigenvalues that are bigger by 100. 403 00:51:36,160 --> 00:51:42,310 So I've now taken my badly behaved matrix and moved it over by 100 times the identity. 404 00:51:45,360 --> 00:51:54,450 If I define lmin to be the minimal eigenvalue, it's now about a hundred, and lmax is 100 more than what it was. 405 00:51:54,630 --> 00:51:59,610 We now have a good set of eigenvalues which are well separated from zero. 406 00:52:00,270 --> 00:52:05,850 So there exist polynomials that are very small on the eigenvalues, and now the convergence is completely different. 407 00:52:06,150 --> 00:52:12,930 If I say cg, so, let's put 'more on', you know about that in MATLAB: 408 00:52:12,930 --> 00:52:16,380 if you say 'more on', it pauses after each screenful. So let's do that again. 409 00:52:17,610 --> 00:52:23,159 So you see, I've got this 500-by-500, but it could be 5000-by-5000, matrix, 410 00:52:23,160 --> 00:52:27,239 and we're converging at a reasonable rate: after 20 or 30 steps 411 00:52:27,240 --> 00:52:32,040 we have a few digits, and after 20 or 30 more steps, we have six digits. 412 00:52:32,040 --> 00:52:35,459 So this is working just fine. Okay, I guess it's time. 413 00:52:35,460 --> 00:52:36,180 We'd better stop.
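For reference, this last experiment can be reconstructed roughly as follows, again with the cg_sketch function above standing in for the handout's cg. The shift of 100 is the one used in the demo; the random seed is an assumption added only for repeatability.

    % Sketch: CG on a random SPD matrix, before and after shifting the spectrum.
    rng(1)                              % assumed seed, just for repeatability
    A = randn(500);
    A = A'*A;                           % symmetric positive definite, but badly conditioned
    b = ones(500,1);
    x1 = cg_sketch(A, b);               % very slow convergence

    A2 = A + 100*eye(500);              % eigenvalues now well separated from zero
    lmin = min(eig(A2));                % about 100
    lmax = max(eig(A2));                % about 100 more than before
    x2 = cg_sketch(A2, b);              % a few digits after 20 or 30 steps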