1 00:00:07,900 --> 00:00:14,920 Okay. Good afternoon, everybody. Welcome to this afternoon's colloquium. 2 00:00:15,850 --> 00:00:20,770 It's a great pleasure to welcome Mark Newman to give the colloquium today. 3 00:00:21,070 --> 00:00:26,200 Mark was in fact an undergraduate student here, 1988, 4 00:00:26,200 --> 00:00:30,850 and then he stayed on to do his DPhil with David Sherrington, who I think I saw coming in just now, 5 00:00:32,140 --> 00:00:38,020 and he worked with David on sort of neural network type problems, 6 00:00:38,020 --> 00:00:46,360 which is how he got started on these complex network problems. After leaving Oxford, 7 00:00:46,360 --> 00:00:52,540 he was a post-doc at Cornell and then a post-doc at the Santa Fe Institute. 8 00:00:53,470 --> 00:00:59,250 Then he became an assistant professor at the University of Michigan, and basically he's been there ever since. 9 00:00:59,980 --> 00:01:06,400 And he was recently made the Anatol Rapoport Distinguished University Professor of Physics at the University of Michigan. 10 00:01:06,670 --> 00:01:11,440 So he's had a very distinguished career in the theory of networks and things like that. 11 00:01:11,440 --> 00:01:15,759 And I think it's an interesting illustration of how physics has spread out over the last 30 12 00:01:15,760 --> 00:01:20,290 years, from being the hard core stuff that people of my generation did in their youth, 13 00:01:20,650 --> 00:01:28,830 to actually finding physicists tackling all sorts of problems which are of interest and applicability to other parts of science. 14 00:01:28,840 --> 00:01:34,870 And really, that's the definition of physics these days: it's physics if a physicist can make a useful contribution. 15 00:01:35,780 --> 00:01:41,590 And so it's a great pleasure to welcome Mark today, who is going to talk to us on the physics of networks. 16 00:01:43,030 --> 00:01:49,270 Thank you very much, John. Nice introduction. Hello, everybody. It's a pleasure to be back here again after 30 years. 17 00:01:50,800 --> 00:01:57,220 So I'm going to talk about networks today, not, as John says, a traditional topic for a physicist to be working on, 18 00:01:57,220 --> 00:02:06,700 but physics has actually made very substantial contributions in this area in the last 15 or 20 years, some of which I'd like to tell you about today. 19 00:02:08,950 --> 00:02:18,550 So a network, for the purposes of this talk, is a bunch of dots joined together by lines. In the jargon of the field a dot is called a node or a vertex, 20 00:02:19,090 --> 00:02:22,630 and a line is called an edge. I will use that jargon throughout this talk. 21 00:02:24,430 --> 00:02:29,410 There's a long history in mathematics of studying networks as pure mathematical objects, 22 00:02:29,830 --> 00:02:36,670 but physicists have been interested more in the applications, using networks as a 23 00:02:36,670 --> 00:02:42,850 mathematical representation of the structure of many real world systems of interest to us, 24 00:02:43,450 --> 00:02:49,089 including technological networks like the Internet, which is a network of computers. 25 00:02:49,090 --> 00:02:53,140 The nodes are computers and the edges are data connections between computers. 26 00:02:54,010 --> 00:03:02,230 Other technological networks include transport networks like the rail network, or this network of gas pipelines in Europe.
27 00:03:03,580 --> 00:03:10,090 Another big class of networks, more abstract, is information networks, 28 00:03:10,300 --> 00:03:14,170 which are networks that represent the structure of some body of information. 29 00:03:15,010 --> 00:03:17,020 The classic example here would be the World Wide Web. 30 00:03:17,020 --> 00:03:23,860 If you take a web page like this, it's got things on it that you can click on, and when you click on them, they take you to some other page. 31 00:03:23,860 --> 00:03:28,510 So there's a link between the first web page and the second web page, what we call a hyperlink in the web jargon. 32 00:03:28,870 --> 00:03:35,760 And you can represent that as a network: you can make a network where the nodes are web pages and the edges are the links between them. 33 00:03:35,770 --> 00:03:40,050 This is a picture of just a small portion of the web. 34 00:03:40,930 --> 00:03:47,709 The whole thing has tens of millions of nodes and is far too large to make a picture of on just one slide like this; 35 00:03:47,710 --> 00:03:50,740 it's probably the largest network of any kind that we study. 36 00:03:51,190 --> 00:03:57,280 And its structure presumably represents somehow the structure of the body of knowledge that's stored in the World Wide Web. 37 00:03:57,550 --> 00:04:04,930 So its structure is an interesting thing to look at. Another example of an information network is a citation network, where the nodes represent 38 00:04:05,200 --> 00:04:11,290 scientific papers and the edges represent one scientific paper citing another one, 39 00:04:11,470 --> 00:04:19,600 something that we all do when we write papers. Again, presumably the structure of this network somehow reflects the structure of 40 00:04:19,600 --> 00:04:22,870 the scientific knowledge that's stored in the body of scientific literature. 41 00:04:22,870 --> 00:04:28,060 And by looking at that structure, we could hope to learn something about the structure of our knowledge. 42 00:04:29,980 --> 00:04:33,309 Biological networks are another large and important class of networks. 43 00:04:33,310 --> 00:04:37,210 There's a lot of excitement about these, and biologists are very excited about work in this area at the moment. 44 00:04:37,780 --> 00:04:40,690 Here's one example. This is a metabolic network. 45 00:04:40,960 --> 00:04:49,620 The nodes in this network are chemicals in the cell, metabolites, and the edges are metabolic reactions that turn one metabolite into another one. 46 00:04:49,630 --> 00:04:53,470 So it's a network representation of the biochemical machinery of a cell. 47 00:04:54,100 --> 00:04:57,909 Here's another biological example. This is an ecological network. 48 00:04:57,910 --> 00:05:05,110 A node here represents a species in an ecosystem. The edges represent the ecological interactions between species, 49 00:05:05,110 --> 00:05:09,310 in this case predator-prey interactions: one species eats another species. 50 00:05:10,690 --> 00:05:14,590 There's an interesting little twist here for this network. 51 00:05:14,710 --> 00:05:18,130 See that arrow over there? That's what they call a self-edge. 52 00:05:18,130 --> 00:05:21,190 That's an edge that connects a node back to itself. That's allowed. You can have that. 53 00:05:22,360 --> 00:05:27,430 So in this case, that's the graph-theoretical representation of cannibalism: 54 00:05:27,830 --> 00:05:36,280 a species that eats itself. But a lot of my work is focussed on social networks these days.
55 00:05:36,280 --> 00:05:42,270 If you say the words social network to a lot of people, they think of Facebook or LinkedIn, online social networks. 56 00:05:42,550 --> 00:05:44,620 But in the scientific literature, 57 00:05:44,620 --> 00:05:53,440 the term social network just refers to any network where the nodes are people and the edges are any kind of connections between people. 58 00:05:53,980 --> 00:06:03,309 The study of social networks in this broader sense goes back way farther than their online incarnations, back to the 1930s. 59 00:06:03,310 --> 00:06:10,840 In fact, sociologists probably have the longest history of the quantitative study of empirical networks. 60 00:06:11,140 --> 00:06:20,440 Here's one example. This is a friendship network of pupils in a high school, an American high school. 61 00:06:20,440 --> 00:06:25,900 The nodes represent the students in the school and the edges represent who's friends with whom. 62 00:06:26,590 --> 00:06:30,909 Here's an example from my own work. This is a collaboration network of scientists. 63 00:06:30,910 --> 00:06:35,140 The nodes represent scientists and the edges represent who's written a paper with whom. 64 00:06:36,340 --> 00:06:43,990 Here's an example I got from my friend Valdis Krebs. This is a network that comes from a study he did on the spread of tuberculosis, 65 00:06:44,200 --> 00:06:49,029 and it represents a network of close physical proximity between people. 66 00:06:49,030 --> 00:06:57,129 One of the reasons why social networks have attracted a lot of research funding recently is because they are 67 00:06:57,130 --> 00:07:02,500 implicated in the spread of disease: diseases spread between people when they have contact with one another, 68 00:07:02,590 --> 00:07:06,729 so the network of who has contact with whom governs the way diseases spread. 69 00:07:06,730 --> 00:07:14,110 If you want to understand how diseases spread, and potentially how to stop them spreading, then you need to understand the structure of these networks. 70 00:07:14,260 --> 00:07:19,330 And I'll touch a little bit later on the epidemiological aspects of this. 71 00:07:20,350 --> 00:07:25,090 So there are some examples of the kinds of networks that we're interested in looking at. 72 00:07:25,810 --> 00:07:32,860 One very basic question that you can ask in this field is just: what is the structure of these things? 73 00:07:32,860 --> 00:07:38,170 How do you characterise the structure of these things? They're very big, complicated objects with potentially billions of nodes. 74 00:07:38,590 --> 00:07:45,100 How can I say something meaningful about what their shape is? There has been a lot of work done on that basic question. 75 00:07:45,250 --> 00:07:54,340 A lot of it focuses on what you might call small-scale structure in networks, things like degrees and path lengths and subgraphs and so forth. 76 00:07:55,060 --> 00:08:01,780 So these are things which you can measure just by looking at a single node in the network at a time, or maybe a small group of nodes. 77 00:08:01,780 --> 00:08:05,560 They're very easy to measure. There's really no complicated maths here. 78 00:08:05,590 --> 00:08:11,550 You just have to be able to count. But nonetheless, these are useful measures of network structure. 79 00:08:11,560 --> 00:08:13,990 They can tell you a lot. Let me give you just two quick examples, 80 00:08:14,770 --> 00:08:18,760 perhaps the two best studied, best known examples of this kind of structure in networks.
81 00:08:18,910 --> 00:08:25,090 One is degree distributions. The degree of a node in a network is just the number of other nodes it's connected to. 82 00:08:25,360 --> 00:08:29,320 So if I'm talking about a friendship network, it's how many friends I have in the network. 83 00:08:30,040 --> 00:08:36,460 And if you measure the degrees of all of the nodes in the network and then look at their distribution, make a histogram of them, 84 00:08:37,060 --> 00:08:42,430 almost invariably you find something which is very strongly right-skewed, meaning there are a lot of 85 00:08:42,430 --> 00:08:47,409 nodes that have low degree, few connections, and there are a few nodes that have high degree. 86 00:08:47,410 --> 00:08:53,140 This plot here is for a network of collaborations amongst mathematicians, who's written a paper with whom. 87 00:08:53,500 --> 00:08:59,469 And what it's telling you is that there are a lot of mathematicians that only have one or two collaborators, but there are a few that have very many. 88 00:08:59,470 --> 00:09:05,180 So Paul Erdős, for example, the famous 20th century mathematician, was legendary 89 00:09:05,200 --> 00:09:09,280 for having an extraordinary number of different collaborators, over a thousand collaborators during his lifetime. 90 00:09:09,400 --> 00:09:12,700 So he would be way out in the tail of the distribution, somewhere over here. 91 00:09:12,820 --> 00:09:18,400 There are not many people like that, but it turns out that they have a profound effect on the behaviour of these systems. 92 00:09:18,580 --> 00:09:21,700 Even though they're small in number, they have an outsize effect. 93 00:09:22,870 --> 00:09:30,130 They're typically referred to as the hubs in the network, for obvious reasons, and I'll talk about them a little more later on. 94 00:09:31,180 --> 00:09:36,580 Another example of a well-known property of networks is the thing called the small-world effect. 95 00:09:36,790 --> 00:09:40,389 There was a famous series of experiments done in the sixties by this chap here. 96 00:09:40,390 --> 00:09:43,870 His name is Stanley Milgram. He was an experimental psychologist at Harvard. 97 00:09:44,710 --> 00:09:49,510 He did these experiments where he got people to pass letters from one person to another in a 98 00:09:49,510 --> 00:09:54,010 sort of cooperative game that involved trying to get the letter to an intended target person. 99 00:09:54,310 --> 00:10:00,250 And his finding was that it typically took only about half a dozen steps to get the 100 00:10:00,250 --> 00:10:04,030 letter from some randomly chosen starting person to anyone else in the country. 101 00:10:04,270 --> 00:10:11,980 And this is the origin of the semi-mythical result that you are only six handshakes away from anybody else in the world, 102 00:10:12,010 --> 00:10:17,020 sometimes referred to by the phrase six degrees of separation, which is the title of a successful play, 103 00:10:17,320 --> 00:10:21,880 later made into a movie in the nineties, in which they discuss Milgram's work. 104 00:10:22,900 --> 00:10:27,550 So this result in particular is now very famous. It's sort of passed into legend. 105 00:10:27,700 --> 00:10:32,259 Almost everybody seems to know about this. It's the topic of party games. 106 00:10:32,260 --> 00:10:35,620 It's the topic of satirical headlines on the Internet. 107 00:10:36,100 --> 00:10:39,760 It's a part of the mythology.
108 00:10:39,790 --> 00:10:45,849 So these are examples of the sort of small-scale things that one can easily measure. 109 00:10:45,850 --> 00:10:51,130 And they're still very useful: just because they're easy to measure doesn't mean we should turn our noses up at them. 110 00:10:51,520 --> 00:10:57,429 However, that's not what I want to talk about today. What I want to talk about today is large-scale structure in networks. 111 00:10:57,430 --> 00:11:02,770 What's the big picture? If I take this whole network of millions of nodes, what is its sort of overall structure? 112 00:11:02,770 --> 00:11:06,069 How does it divide up? How are the different pieces linked to one another? 113 00:11:06,070 --> 00:11:12,520 This is a much harder question to answer. It's much harder to characterise large-scale structure because, well, 114 00:11:12,520 --> 00:11:17,589 it has to involve taking information from all over a network and somehow combining it to tell you something useful. 115 00:11:17,590 --> 00:11:20,320 And that's going to be a more difficult thing than just looking at one node at a time. 116 00:11:21,010 --> 00:11:28,090 There are a range of different techniques that we use to do this kind of analysis, but the ones I want to talk about today are spectral methods. 117 00:11:28,270 --> 00:11:33,160 Spectral methods means methods that are based on matrix representations of networks. 118 00:11:33,340 --> 00:11:37,480 So you can take any network, like this small one here. 119 00:11:37,690 --> 00:11:43,690 Suppose it has n nodes, and I number the nodes from 1 to n, in any order, it doesn't matter. 120 00:11:44,140 --> 00:11:49,390 And then I construct the so-called adjacency matrix, which is the matrix where 121 00:11:50,410 --> 00:11:53,319 you stick a one in to represent each edge. 122 00:11:53,320 --> 00:11:59,560 So a one means that there's an edge between the corresponding two nodes in the network: there's an edge here between nodes one and two, 123 00:11:59,710 --> 00:12:05,110 so I put a one in the one-two element of the matrix, and zeros everywhere else. 124 00:12:05,470 --> 00:12:09,700 Well, I also put a one in the two-one element, because if node one is connected to node 125 00:12:09,700 --> 00:12:13,810 two, then obviously node two is connected to node one as well. So it's going to be a symmetric matrix. 126 00:12:14,230 --> 00:12:19,630 Binary valued, just zeros and ones. Symmetric is important because that means its eigenvalues will be real. 127 00:12:20,710 --> 00:12:27,740 This matrix plays a role sort of analogous to the role played by the Hamiltonian in a quantum mechanical system. 128 00:12:28,280 --> 00:12:35,269 So its eigenvalues are sort of analogous to energy levels, and its eigenvectors are analogous to the wave functions of the network. 129 00:12:35,270 --> 00:12:43,219 Actually, I should be a little bit careful here. The adjacency matrix is analogous to minus the Hamiltonian of a quantum system, 130 00:12:43,220 --> 00:12:47,720 because its highest-lying eigenvalue is the one that plays the role of the ground 131 00:12:47,720 --> 00:12:51,320 state, and the ones immediately below that play the role of the first excited states.
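[Editor's note: to make the adjacency matrix concrete, here is a minimal sketch in Python with numpy; the five-node network is a made-up example, not one from the talk.]

```python
import numpy as np

# Edges of a small made-up network, nodes numbered 0..4
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1   # a one marks an edge between nodes i and j...
    A[j, i] = 1   # ...and symmetrically, since the edges are undirected

# A symmetric real matrix has real eigenvalues, like (minus) a Hamiltonian.
# numpy's eigh is for symmetric matrices and returns eigenvalues in
# ascending order, so the last one is the "ground state" of the analogy.
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)        # all real
print(eigenvalues[-1])    # the highest-lying eigenvalue
```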
132 00:12:51,320 --> 00:12:55,790 So if you take a network, any network like this one here, this is a real network. 133 00:12:55,790 --> 00:13:02,120 It's a social network that I got from the social networks literature, and you calculate its adjacency matrix and then you calculate its spectrum, 134 00:13:02,570 --> 00:13:06,020 and you typically get something that looks like this. This is very typical of what you find. 135 00:13:06,200 --> 00:13:08,030 That's the ground state at the top. 136 00:13:08,660 --> 00:13:15,200 Then there's a few excited states well separated below it, and then there's a band of closely spaced states below that. 137 00:13:16,130 --> 00:13:22,190 This turns out to be what you see in many cases. And as we'll see, there's a reason why you would expect this sort of structure. 138 00:13:23,060 --> 00:13:28,940 Just like in quantum systems, we tend to focus most of our attention on the ground state and on the first few excited states. 139 00:13:28,940 --> 00:13:31,460 That turns out to be where most of the useful information is. 140 00:13:33,030 --> 00:13:41,940 So the claim is that by looking at the eigenvalues and eigenvectors, we can learn things about the large-scale structure of this network. 141 00:13:41,950 --> 00:13:47,430 There are deep connections between these wave-function-like eigenvectors and large-scale structure in networks. 142 00:13:48,030 --> 00:13:52,380 Let me start by giving you just two very simple examples of that kind of connection. 143 00:13:53,610 --> 00:14:02,970 So here's one. Centrality is any measure that gives scores to the nodes in a network in an attempt to work out which ones are most important. 144 00:14:04,500 --> 00:14:08,969 Obviously, there are many different definitions of what you might mean by important nodes in a network, 145 00:14:08,970 --> 00:14:15,570 and there are correspondingly many centrality measures. But a simple one is just what we call degree centrality, or just plain degree. 146 00:14:15,810 --> 00:14:20,130 Again, that's the number of connections that a node has. It's just how many friends you have. 147 00:14:20,520 --> 00:14:24,330 So by this measure, the person who has the most friends is the most important person. 148 00:14:25,740 --> 00:14:28,920 Okay, so, you know, that's not totally stupid. 149 00:14:29,160 --> 00:14:35,140 That's quite a useful measure. It's really easy to calculate, but it has some shortcomings too. 150 00:14:35,160 --> 00:14:40,560 In particular, basically all it's doing is giving you one point for everybody that you know. 151 00:14:41,340 --> 00:14:42,780 You just count up the number of people you know. 152 00:14:44,790 --> 00:14:51,540 But in real life, of course, not everybody you know is equally important for connecting you to the rest of the world. 153 00:14:51,870 --> 00:14:56,939 If I know the Prime Minister, then that instantly makes me a more important person, 154 00:14:56,940 --> 00:15:00,180 sort of by reflection, because I have the ear of the Prime Minister. 155 00:15:00,420 --> 00:15:04,320 So it's not just how many people you know. It's also who you know that matters. 156 00:15:04,680 --> 00:15:10,590 So a more sophisticated centrality measure is one where, instead of just getting one point for everybody I know, 157 00:15:10,830 --> 00:15:15,000 I get a number of points that's proportional to their score. 158 00:15:15,030 --> 00:15:21,930 My score is the sum of the scores of the people I know, and that's the sum of the scores of the people they know, and so on and so forth.
159 00:15:23,190 --> 00:15:30,600 So I can create a measure like this. Suppose I'm node i in the network. 160 00:15:31,650 --> 00:15:36,389 My score is going to be denoted x_i, and it's proportional to, with a constant of proportionality 161 00:15:36,390 --> 00:15:43,020 which I write as 1/λ for reasons that'll become apparent, the sum of the scores of all the people in my immediate neighbourhood. 162 00:15:43,020 --> 00:15:47,639 That means all the people around me in the network. So this is my neighbourhood in the network, 163 00:15:47,640 --> 00:15:51,620 and I'm just summing over it. Another way of writing that would be as a sum over all nodes 164 00:15:51,690 --> 00:15:54,870 j, but with an element of the adjacency matrix stuck in here: x_i = λ⁻¹ Σ_j A_ij x_j. 165 00:15:55,020 --> 00:16:00,600 A_ij is an element of that matrix. That's one if there's a connection between nodes i and j, and zero if there isn't. 166 00:16:00,900 --> 00:16:06,270 So this automatically just picks out the scores of my neighbours and sets all of the other ones to zero. 167 00:16:06,870 --> 00:16:12,390 So I can rearrange that equation and write it in matrix form, and it's just matrix A times vector x equals λ times x, 168 00:16:12,600 --> 00:16:16,860 where the vector x is the vector that has all these centrality scores in it. 169 00:16:17,040 --> 00:16:24,510 So in other words, your centrality, your importance in the network, is just an element of an eigenvector of the adjacency matrix. Which 170 00:16:25,510 --> 00:16:30,430 eigenvector should it be? Well, if you want your scores to be positive, then there's only one choice. 171 00:16:30,430 --> 00:16:34,690 It has to be the leading eigenvector. Perhaps the most famous theorem in linear algebra, 172 00:16:34,690 --> 00:16:40,540 the Perron-Frobenius theorem, tells you that there is only one eigenvector with all elements positive, subject to some conditions, 173 00:16:40,870 --> 00:16:44,260 and that's the one with the highest eigenvalue. 174 00:16:44,620 --> 00:16:52,030 So this is the ground state of the network, and it represents the centrality of the nodes in the network. 175 00:16:52,040 --> 00:16:57,220 A famous example of the use of this kind of centrality measure is the Google search engine. 176 00:16:57,790 --> 00:17:06,280 They use a slight variant on this idea to decide which web pages are important and which web pages are not important when you do web searches. 177 00:17:06,580 --> 00:17:12,820 So apparently you can make a lot of money by doing this kind of stuff.
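[Editor's note: here is a minimal numerical sketch of the eigenvector centrality just described, in Python with numpy and a hypothetical five-node test network; it simply takes the eigenvector of the largest eigenvalue and reads the scores off its elements.]

```python
import numpy as np

def eigenvector_centrality(A):
    # Leading eigenvector of the adjacency matrix, normalised to sum to one.
    # By the Perron-Frobenius theorem its elements can all be chosen positive
    # for a connected network, so they can serve as scores.
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    v = eigenvectors[:, -1]        # eigenvector of the largest eigenvalue
    v = np.abs(v)                  # fix the arbitrary overall sign
    return v / v.sum()

# Hypothetical example: node 0 is the centre of a star; node 3 also knows node 4
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

print(eigenvector_centrality(A))
# Node 0 scores highest; node 3 outscores nodes 1 and 2 because it is both
# how many you know and who you know that matters.
```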
Here's another example, 178 00:17:13,570 --> 00:17:21,870 only slightly more complicated. Percolation is a process, of course, well known to lots of physicists. 179 00:17:21,880 --> 00:17:31,690 It's the process where you fill in, or occupy, at random some fraction of the sites or bonds on a lattice, or the nodes and edges of a network. 180 00:17:33,130 --> 00:17:37,000 So here's a network, and I've coloured in some of the edges, in this case 181 00:17:37,480 --> 00:17:46,660 edges of the network. Percolation is used in physics, for instance, to model actual percolation, like the percolation of fluids through porous materials. 182 00:17:46,660 --> 00:17:52,090 In networks, it's used, for instance, to model the spread of disease. 183 00:17:52,420 --> 00:17:59,319 I've touched on this before. You can imagine here I'm a person in a social network. 184 00:17:59,320 --> 00:18:06,700 I have some friends, and I get the flu. I could give the flu to my friends, but I probably don't give it to all of them. 185 00:18:06,850 --> 00:18:10,690 I just give it to some of them, you know, depending on circumstances. 186 00:18:10,990 --> 00:18:15,670 So I can represent that by sort of colouring in the edges at random, 187 00:18:15,850 --> 00:18:20,860 with some probability, we'll call it p, to represent the probability that the 188 00:18:20,860 --> 00:18:24,850 disease is transmitted along those edges and not along some other edges. 189 00:18:25,420 --> 00:18:31,060 So then if I'm the red dot in the middle there, who gets the flu, it's only going to spread along the coloured-in edges, 190 00:18:31,330 --> 00:18:38,500 meaning the disease spreads to a cluster, a percolation cluster, of connected nodes within the network. 191 00:18:38,770 --> 00:18:45,040 So in other words, the percolating clusters in the network represent the sizes of potential outbreaks of the disease. 192 00:18:46,800 --> 00:18:51,900 And so if I can calculate those sizes, then I can say something about the spread of the disease. 193 00:18:52,020 --> 00:18:59,280 Perhaps the most famous property of percolation systems is that if you just start occupying sites or bonds at random, 194 00:19:00,600 --> 00:19:08,670 what you get is clusters of occupied sites which grow as you occupy more and more of them. 195 00:19:08,850 --> 00:19:16,440 They start to sort of gel together, and eventually they all stick together to form one big cluster that spans all the way across the system. 196 00:19:17,690 --> 00:19:22,069 There is a sharp percolation threshold at which that transition happens, at which you 197 00:19:22,070 --> 00:19:27,200 go from separate clusters spread out to percolating all the way across the system. 198 00:19:27,320 --> 00:19:32,090 And in the epidemiology context, that corresponds to the so-called epidemic threshold, 199 00:19:32,270 --> 00:19:36,169 where you go from a regime where the disease is never going to spread very far, 200 00:19:36,170 --> 00:19:38,030 it's only going to spread to a small local cluster, 201 00:19:38,240 --> 00:19:43,190 to a regime where it can in principle spread all the way across the entire system and cause an epidemic outbreak. 202 00:19:43,550 --> 00:19:44,870 So the epidemic threshold, 203 00:19:45,110 --> 00:19:50,900 the way epidemiologists talk about it, and the percolation threshold, the way physicists talk about it, are essentially the same thing. 204 00:19:51,320 --> 00:19:56,740 So one might ask, for instance, where does that percolation threshold happen? 205 00:19:56,750 --> 00:20:02,330 What is the critical value of this probability p at which percolation happens on a network? 206 00:20:02,720 --> 00:20:05,030 I can address that by doing a calculation like this. 207 00:20:06,170 --> 00:20:11,840 Suppose I'm in the regime where there is this giant cluster, this percolating cluster, in my network, 208 00:20:12,170 --> 00:20:14,690 and I want to ask, what's the probability that I belong to it? 209 00:20:14,960 --> 00:20:22,850 Actually, I'm going to define u_i to be the probability that node i does not belong to the giant cluster; it turns out to be easier to do it that way. 210 00:20:23,390 --> 00:20:27,060 So u_i is the probability that I don't belong to the percolating cluster. 211 00:20:27,800 --> 00:20:36,080 So now consider, this is me, I'm node i. In order to not belong to the giant cluster, I have to be not connected to it via any of my neighbours. 212 00:20:36,320 --> 00:20:39,830 If I'm connected to it via even one of my neighbours, then I'm in the percolating cluster too.
213 00:20:40,550 --> 00:20:44,840 So what's the probability that I'm not connected to the percolating cluster via my neighbour j? 214 00:20:45,590 --> 00:20:53,180 Well, there's two ways that can happen. One is that the edge between us could be not occupied, meaning the disease never gets transmitted to me. 215 00:20:53,990 --> 00:20:58,700 That happens with probability 1 − p: p is the occupation probability, so 1 − p is the probability the edge is unoccupied. 216 00:20:59,090 --> 00:21:04,820 The alternative is that the edge could be occupied, but j itself is not a member of the giant cluster. 217 00:21:05,200 --> 00:21:08,810 Right. That happens with probability p, for the edge being occupied, times u_j, 218 00:21:08,960 --> 00:21:11,570 which by definition is the probability that j is not in the giant cluster. 219 00:21:11,870 --> 00:21:17,899 So 1 − p + p u_j is the total probability that my neighbour j doesn't 220 00:21:17,900 --> 00:21:20,960 give me the disease, that I'm not connected to the giant cluster via them. 221 00:21:21,350 --> 00:21:26,930 And then the total probability that I'm not in the giant cluster via any of my neighbours is just the product of that over all my neighbours: u_i equals the product over my neighbours j of (1 − p + p u_j). 222 00:21:28,120 --> 00:21:33,790 So this gives me a self-consistent condition that I can solve to find these probabilities. 223 00:21:34,780 --> 00:21:41,410 So now imagine lowering the occupation probability, the probability of transmission of the disease. 224 00:21:41,560 --> 00:21:46,810 At some point you go through the percolation transition, and your giant cluster disappears. 225 00:21:47,020 --> 00:21:52,030 When that happens, all of these probabilities u_i must equal one: u_i is the probability of not being 226 00:21:52,030 --> 00:21:56,050 in the giant cluster, and if there is no giant cluster then everybody is not in it. So all of the probabilities are one. 227 00:21:56,800 --> 00:22:01,959 So that's how you know you've got to the transition: all of these probabilities go to one. 228 00:22:01,960 --> 00:22:07,540 So what we're going to do is just expand around that point. I'm going to write u_i as equal to one minus some small ε_i. 229 00:22:08,020 --> 00:22:11,230 It has to be minus, because u_i as a probability has to be between zero and one. 230 00:22:12,110 --> 00:22:16,479 And I'm going to take that, substitute it into this equation here, and expand to leading order in ε. 231 00:22:16,480 --> 00:22:23,110 And you get this equation here, which can be rewritten in matrix form as: the vector ε equals p_c, 232 00:22:23,380 --> 00:22:26,380 the critical percolation probability, times the adjacency matrix, 233 00:22:26,380 --> 00:22:35,630 times ε. So this immediately tells you that one over the critical percolation probability is an eigenvalue of the adjacency matrix. 234 00:22:35,630 --> 00:22:42,170 Which eigenvalue? It's again the leading one, because ε is a vector of probabilities, so its elements all have to be positive, 235 00:22:42,380 --> 00:22:44,570 and there's only one eigenvector that does that for you, the leading one. 236 00:22:45,590 --> 00:22:53,060 So by just looking at the leading eigenvalue, the ground state energy of this adjacency matrix, 237 00:22:53,270 --> 00:22:58,880 I can tell you the percolation threshold, or alternatively the epidemic threshold, for this network.
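[Editor's note: that result, p_c = 1/λ_max with λ_max the leading eigenvalue of the adjacency matrix, is easy to check numerically. Here is a sketch under simple assumptions of my own: an Erdős-Rényi test graph, with bond percolation simulated using a union-find structure.]

```python
import numpy as np

def largest_cluster_fraction(edges, n, p, rng):
    # Bond percolation: occupy each edge with probability p, then return
    # the size of the largest connected cluster as a fraction of n,
    # using a union-find (disjoint set) structure to merge clusters.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in edges:
        if rng.random() < p:                # the edge is occupied
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
    sizes = {}
    for v in range(n):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values()) / n

rng = np.random.default_rng(0)
n, c = 3000, 8                              # n nodes, average degree c
A = np.triu(rng.random((n, n)) < c / n, 1)  # random graph, upper triangle
edges = list(zip(*np.nonzero(A)))
A_full = (A | A.T).astype(float)
lam_max = np.linalg.eigvalsh(A_full)[-1]

print("predicted threshold 1/lambda_max =", 1 / lam_max)
for p in [0.05, 0.08, 0.11, 0.14, 0.2]:
    print(p, largest_cluster_fraction(edges, n, p, rng))
```

Below the predicted threshold (about 0.11 here, since λ_max is close to c + 1 = 9 for this graph) the largest cluster stays a vanishing fraction of the network; above it, a giant cluster appears.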
238 00:23:00,760 --> 00:23:04,630 Okay. So there are some examples of why these are useful things to look at. 239 00:23:05,920 --> 00:23:12,670 So what I want to do is understand the relationship between the values of these things, 240 00:23:12,670 --> 00:23:15,970 these eigenvalues and eigenvectors, and the structure of the network. 241 00:23:16,300 --> 00:23:20,140 And the way I'm going to do that is a very sort of physicist-type approach. 242 00:23:20,290 --> 00:23:25,689 I'm going to make simple models of networks, and then I'm going to calculate the eigenvalues and eigenvectors of those simple models, 243 00:23:25,690 --> 00:23:29,410 and I'm going to see what they do as a function of the structure of the model network. 244 00:23:29,980 --> 00:23:35,860 Before I do that, however, I want to make an important point. How would I go about actually calculating these things? 245 00:23:35,870 --> 00:23:43,000 Suppose, for instance, I wanted to calculate the leading eigenvector. A simple way of doing that is just repeated multiplication. 246 00:23:43,180 --> 00:23:47,799 I can calculate the leading eigenvector of a matrix just by taking a random starting vector, 247 00:23:47,800 --> 00:23:51,670 it doesn't matter really what it is, and then just multiplying repeatedly by the matrix. 248 00:23:52,150 --> 00:23:58,870 If I do that, then the component in the direction of the leading eigenvector will get multiplied by the leading eigenvalue, the biggest one. 249 00:23:59,080 --> 00:24:01,240 Everything else will get multiplied by something smaller. 250 00:24:01,390 --> 00:24:06,030 And if I just keep doing this for long enough, I end up with a vector that's pointing in the direction of the leading eigenvector. 251 00:24:06,160 --> 00:24:10,090 It's called the power method. Very simple, but actually quite a good practical way of doing this. 252 00:24:10,570 --> 00:24:13,930 So now here's the question: how long is this going to take? 253 00:24:14,440 --> 00:24:21,130 How long it's going to take depends on the separation between the leading eigenvalue and the next one, right? 254 00:24:21,190 --> 00:24:22,960 If they're far apart, it won't take long. 255 00:24:23,050 --> 00:24:27,850 But as they get close together, it's going to take more and more multiplications to separate those two apart. 256 00:24:28,120 --> 00:24:33,280 And if they actually cross over, then the time to do it is going to diverge, as you see in this picture here. 257 00:24:33,670 --> 00:24:36,999 So if you're someone who studies phase transitions, this looks very familiar. 258 00:24:37,000 --> 00:24:40,510 It looks like the thing that we call critical slowing down at a phase transition. 259 00:24:40,750 --> 00:24:43,540 If you study phase transitions numerically on a computer, for instance, 260 00:24:43,870 --> 00:24:50,320 you'll be familiar with the phenomenon where the time the algorithm takes to run diverges as you pass through a phase transition. 261 00:24:51,010 --> 00:25:01,239 So a wild conjecture might be that when the levels of the adjacency matrix cross, you have phase transitions going on in the network. 262 00:25:01,240 --> 00:25:07,300 This is again kind of like quantum mechanics, right? When energy levels cross in quantum systems, you get quantum phase transitions. 263 00:25:07,480 --> 00:25:10,540 So it's a similar kind of idea, and indeed it turns out to be true. 264 00:25:10,720 --> 00:25:19,970 And it's sort of interesting, because it means that we can identify places where phase transitions should take place by looking for these crossings.
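[Editor's note: a minimal version of the power method just described, with the role of the eigenvalue separation spelled out in the comments; an illustrative sketch, not code from the talk.]

```python
import numpy as np

def power_method(A, tol=1e-10, max_iter=1_000_000):
    # Repeatedly multiply a random starting vector by A.  Each multiplication
    # shrinks the unwanted components relative to the leading one by a factor
    # of roughly |lambda_2 / lambda_1|, so the closer the top two eigenvalues
    # are, the more iterations are needed: the "critical slowing down" above.
    # (This simple form assumes the leading eigenvalue is positive, which it
    # is for an adjacency matrix.)
    rng = np.random.default_rng(1)
    x = rng.random(A.shape[0])
    x /= np.linalg.norm(x)
    for k in range(max_iter):
        y = A @ x
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            return y, k      # converged after k multiplications
        x = y
    return x, max_iter
```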
265 00:25:20,020 --> 00:25:25,209 I'll give you a few examples where that happens and we know what the phase transition is. 266 00:25:25,210 --> 00:25:29,200 The interesting thing is there are lots of other examples where we get these crossings happening, 267 00:25:29,200 --> 00:25:32,080 but we don't know what the corresponding phase transition is. 268 00:25:32,230 --> 00:25:35,620 We believe there is some kind of phase transition there, but we don't actually know what it is. 269 00:25:36,640 --> 00:25:41,020 So let me give you some examples. I'm going to start with a very simple example. 270 00:25:42,460 --> 00:25:46,000 This is almost the simplest model network that you can imagine making. 271 00:25:46,000 --> 00:25:50,829 It's called the random graph. It's really simple. 272 00:25:50,830 --> 00:25:53,590 You just take n nodes and you throw down edges completely at random. 273 00:25:53,800 --> 00:25:58,060 So every possible edge is present with probability p, or not with probability 1 − p. 274 00:25:58,390 --> 00:26:02,080 That's it. That's the whole model. It just has two parameters, n, the number of nodes, and p, 275 00:26:02,080 --> 00:26:06,790 the probability of edges. It has been studied for a long time, first proposed in the 1950s. 276 00:26:07,270 --> 00:26:11,889 A lot is known about this model. An important quantity here will be the average degree. 277 00:26:11,890 --> 00:26:18,549 Degree, again, remember, is the number of connections you have. The average degree of a node in this random network is just the number of nodes 278 00:26:18,550 --> 00:26:23,380 times the probability of connection: there are n nodes you can be connected to, and you've got probability p of being connected to each one. 279 00:26:23,560 --> 00:26:25,930 So n times p is your expected number of connections. 280 00:26:26,830 --> 00:26:34,180 There's a well-known phase transition that happens in this model, very similar to the percolation transitions that we were just talking about. 281 00:26:34,450 --> 00:26:40,600 If p is zero, then you have no edges at all and everything's sort of separate, like this. If p is one, 282 00:26:40,750 --> 00:26:44,920 it all gels together into one big connected clump, what we call a component in the network. 283 00:26:45,070 --> 00:26:49,480 And in between these two extremes, there is known to be a percolation-type phase transition. 284 00:26:50,080 --> 00:26:56,260 The clusters sort of get bigger and bigger as you increase p, and at some point they gel together and you get a spanning cluster that spans the system. 285 00:26:58,090 --> 00:27:06,430 So above that transition, the structure of the network is as follows. A component in a network just means a clump of connected nodes; 286 00:27:06,440 --> 00:27:10,479 each of these is a component. So you expect to have one percolating component, 287 00:27:10,480 --> 00:27:14,830 the so-called giant component, and a bunch of little ones scattered around the edge. 288 00:27:15,160 --> 00:27:18,280 That's the general structure that we expect to see. And there's a phase transition. 289 00:27:18,460 --> 00:27:24,010 That phase transition happens at the point where the average degree c is equal to one. 290 00:27:24,460 --> 00:27:31,210 That will be an important point in a moment. So now what I'm going to do is calculate the spectrum of this thing. 291 00:27:31,810 --> 00:27:35,379 And the claim is that there will be a crossing of the eigenvalues exactly at the point where this phase transition happens.
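[Editor's note: a quick numerical sketch of that percolation transition in the random graph, with my own parameter choices, watching the largest component become extensive as the average degree c = np passes one.]

```python
import numpy as np
from collections import deque

def giant_component_fraction(n, c, rng):
    # Generate G(n, p) with p = c/n and return the size of the largest
    # connected component as a fraction of n (breadth-first search).
    A = np.triu(rng.random((n, n)) < c / n, 1)
    neighbours = [np.nonzero(A[i] | A[:, i])[0] for i in range(n)]
    seen = np.zeros(n, dtype=bool)
    best = 0
    for s in range(n):
        if seen[s]:
            continue
        queue, size = deque([s]), 0
        seen[s] = True
        while queue:
            v = queue.popleft()
            size += 1
            for w in neighbours[v]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        best = max(best, size)
    return best / n

rng = np.random.default_rng(0)
for c in [0.5, 0.9, 1.0, 1.1, 1.5, 2.0]:
    print(c, giant_component_fraction(2000, c, rng))
```

Below c = 1 the largest component is a vanishing fraction of the 2000 nodes; above it, a giant component holding a finite fraction of the network emerges.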
292 00:27:35,380 --> 00:27:39,640 We already know about this phase transition. 293 00:27:39,640 --> 00:27:45,640 This is just sort of a demonstrative example, to show you that crossings of eigenvalues correspond to phase transitions. 294 00:27:46,950 --> 00:27:50,490 So I don't want to spend a lot of this talk going through maths in detail. 295 00:27:50,640 --> 00:27:55,710 But I did want to run just briefly through this one calculation to show you the kinds of techniques involved. 296 00:27:56,070 --> 00:28:00,210 This is actually a very old calculation. It goes back to the 1950s as well. 297 00:28:00,900 --> 00:28:08,790 It's essentially equivalent to calculations you'll find in any graduate-level textbook on random matrix theory. 298 00:28:09,180 --> 00:28:12,600 This kind of stuff has been done for a long time. 299 00:28:13,260 --> 00:28:17,400 So I'm going to calculate the spectrum of this random graph, in other words of its adjacency matrix. 300 00:28:18,600 --> 00:28:19,640 We do it in two steps. 301 00:28:19,650 --> 00:28:27,630 First of all, we're actually going to calculate the spectrum of the centred adjacency matrix, which is just the adjacency matrix minus its mean value. 302 00:28:27,900 --> 00:28:31,920 The mean value of any element in this matrix is just p, the probability of an edge. 303 00:28:32,640 --> 00:28:38,219 So that's the same as saying: take the adjacency matrix and subtract p from every element, which 304 00:28:38,220 --> 00:28:43,950 I write as the matrix A minus p times 11ᵀ, where 1 is the uniform vector of all ones. 305 00:28:45,090 --> 00:28:48,660 So the first step is going to be to calculate the spectrum of that. 306 00:28:48,900 --> 00:28:52,170 And by the spectrum I mean the density of states, which is just the usual thing. 307 00:28:52,350 --> 00:28:59,220 It's just the function that has a delta-function spike at every eigenvalue of the matrix. 308 00:28:59,880 --> 00:29:03,720 So it's a thing that looks like this. It's a bunch of spikes. 309 00:29:03,810 --> 00:29:09,960 It's the function where, if you integrate it over any region, it'll just count the number of eigenvalues in that region. 310 00:29:11,100 --> 00:29:14,970 That's a difficult function to handle. It's got these spiky things in it, 311 00:29:15,210 --> 00:29:20,820 so we regularise it by putting some broadened function on each spike; in this particular 312 00:29:20,820 --> 00:29:25,770 case we'll use a Lorentzian, also called a Cauchy distribution by mathematicians. 313 00:29:25,950 --> 00:29:31,770 And then we're going to take the limit as the Lorentzian's width goes to zero to recover our density of states. 314 00:29:32,770 --> 00:29:33,880 So here's my density of states, 315 00:29:34,030 --> 00:29:41,260 and here's my Lorentzian: with width η, I can write it as a normalising coefficient divided by x² + η², and I take the limit 316 00:29:41,520 --> 00:29:42,460 as η goes to zero. 317 00:29:44,100 --> 00:29:53,370 I can factorise that denominator as (x + iη) times (x − iη) and write it also, in this case, as the imaginary part of 1 over (x + iη); 318 00:29:53,370 --> 00:29:57,450 that's another way of writing it. So this is one of the standard forms of the delta function. 319 00:29:57,660 --> 00:30:04,130 So what I do is I take that, I substitute it in here, and I get this expression for the density of states. 320 00:30:04,530 --> 00:30:07,620 Notice that it has an x + iη in the denominator.
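[Editor's note: spelled out, the regularisation being described is the standard Lorentzian representation of the delta function. Writing the density of states as ρ(x) = (1/n) Σ_i δ(x − λ_i) and broadening each spike, this is a reconstruction of the slide's equations from the description; signs and normalisation follow the usual convention.]

```latex
\delta(x) \;=\; \lim_{\eta\to 0^{+}} \frac{1}{\pi}\,\frac{\eta}{x^{2}+\eta^{2}}
\;=\; -\frac{1}{\pi}\,\lim_{\eta\to 0^{+}} \operatorname{Im}\frac{1}{x+i\eta},
\qquad\text{so}\qquad
\rho(x) \;=\; -\frac{1}{n\pi}\,\lim_{\eta\to 0^{+}} \operatorname{Im}\sum_{i=1}^{n}\frac{1}{x+i\eta-\lambda_{i}} .
```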
321 00:30:07,740 --> 00:30:11,520 A common thing to do at this point is just to generalise this to the complex plane. 322 00:30:11,520 --> 00:30:15,270 Instead of thinking of this as a function on the real line, let's just go into the complex plane, 323 00:30:15,450 --> 00:30:18,630 make it a function of x + iη, which is called z here. 324 00:30:18,750 --> 00:30:22,110 So here's ρ of complex z. It's the same thing. 325 00:30:22,110 --> 00:30:26,700 It's just got a z here. And just remember that at the end what we want to do is take the limit as 326 00:30:27,120 --> 00:30:35,129 η goes to zero, i.e. the limit as you go back to the real line. Then after that it's just arithmetic, and boring. 327 00:30:35,130 --> 00:30:41,520 We just do a series expansion of this function and we end up with a sum over powers of eigenvalues. 328 00:30:41,520 --> 00:30:48,330 The powers of eigenvalues here, and sums of powers of eigenvalues, can be written in terms of traces of powers of the original matrix. 329 00:30:48,660 --> 00:30:53,940 So you end up with this formula for the density of states down here, a well-known thing called the Stieltjes transform formula. 330 00:30:55,230 --> 00:31:01,500 And the only difficult thing here is calculating the trace of this matrix. But it's made easier by the fact that this is a random model. 331 00:31:01,830 --> 00:31:03,690 So what I'm going to do is just average over the randomness: 332 00:31:03,810 --> 00:31:11,700 calculate the average of this density of states, in other words the average of this trace here, and that turns out to be an easy thing to do. 333 00:31:13,790 --> 00:31:19,019 It turns out, if you're sort of sloppy about it, then it just vanishes. You can persuade yourself that all terms in the traces are actually zero. 334 00:31:19,020 --> 00:31:24,959 But this is wrong. You have to be a little more careful and realise that there are some terms in the traces where the same matrix element occurs 335 00:31:24,960 --> 00:31:30,780 twice, and then you get the variance of that matrix element, and that can give you terms that are non-zero. 336 00:31:31,470 --> 00:31:35,670 So you get a term that depends on a power of the variance times 337 00:31:36,090 --> 00:31:40,860 however many terms there are that don't vanish. So this becomes just a counting problem. 338 00:31:40,890 --> 00:31:45,930 It just becomes a question of working out how many terms don't vanish. I won't go through the geometry of the counting problem, 339 00:31:46,110 --> 00:31:52,080 but it turns out that the number of terms here is just equal to something called the Catalan number, which is a well-known thing in number theory. 340 00:31:52,350 --> 00:31:56,220 And so I can rewrite my expression in terms of a sum of these Catalan numbers. 341 00:31:56,310 --> 00:32:02,280 And that can be done in closed form. It has this expression here. Take that, substitute back into the formula I started off with, 342 00:32:02,460 --> 00:32:05,460 and you get this nice expression for the density of states. 343 00:32:05,670 --> 00:32:14,210 This is a famous, well-known expression, the so-called semicircle spectrum, originally derived by Eugene Wigner in the 1950s. 344 00:32:14,220 --> 00:32:21,840 It looks like that, and it works really well. That histogram there is an actual random graph that I generated on my computer the day before yesterday. 345 00:32:22,080 --> 00:32:26,700 I diagonalised it, and you get a nice semicircle. The blue line is the prediction of the theory.
346 00:32:26,970 --> 00:32:33,360 So it works well, but we're not quite done yet, because that was the centred adjacency matrix, where we subtracted the mean. 347 00:32:33,780 --> 00:32:39,300 So you have to add the mean back on, meaning we're really solving an eigenproblem that looks like this one here. 348 00:32:40,480 --> 00:32:49,630 And you do a series of sort of boring manipulations and you end up with a neat little equation in which this average over here can be done in closed form. 349 00:32:49,810 --> 00:32:55,840 You get this thing here, this can be solved, and you get a value for the eigenvalue here. 350 00:32:56,380 --> 00:33:02,170 The bottom line is, basically, when you add that shift back on, when you add the mean back onto the matrix, 351 00:33:02,290 --> 00:33:11,110 nothing changes. You still have your semicircle distribution, except for one thing: one of the eigenvalues breaks off and moves up to here, c + 1, 352 00:33:11,350 --> 00:33:15,520 where c again is the average degree of the network. So this is your final answer for your spectrum. 353 00:33:15,520 --> 00:33:24,670 It looks like this, and you can calculate the whole thing, the whole expected spectrum, analytically in this case. 354 00:33:25,720 --> 00:33:32,640 So now look at this spectrum. Here's my eigenvector centrality eigenvalue, that's my ground state, my leading eigenvalue. 355 00:33:32,880 --> 00:33:39,030 And then here's the band that I was talking about. There's a continuous band of eigenvalues, and there's a band gap here. 356 00:33:40,150 --> 00:33:46,030 And the thing is, the size of that band gap depends on the one parameter, the average degree of the network, because 357 00:33:46,030 --> 00:33:53,050 the band edge here goes like 2√c and the ground state goes like c, so I can change the size of that gap. 358 00:33:53,230 --> 00:33:56,620 If I change it enough, I can make those two eigenvalues cross over. 359 00:33:56,920 --> 00:34:01,329 Where do they cross over? Well, solve 2√c = c + 1 and you find 360 00:34:01,330 --> 00:34:06,490 the answer is they cross over when c equals one, exactly the point at which that phase transition happens. 361 00:34:07,300 --> 00:34:10,360 So this is not proving anything, but it's again suggestive: 362 00:34:10,630 --> 00:34:13,650 when you get crossings of these eigenvalues, you get phase transitions happening. 363 00:34:13,780 --> 00:34:18,720 This is not a new phase transition; we knew about this phase transition already, we haven't discovered anything here. 364 00:34:18,800 --> 00:34:24,640 But it's illustrative that crossings of the levels in these things correspond to phase transitions.
365 00:34:25,330 --> 00:34:30,310 So now let me give you two other examples, 366 00:34:31,650 --> 00:34:36,660 a little bit less trivial than this one, of phase transitions that happen in these networks. 367 00:34:37,470 --> 00:34:42,630 So the first one is just a slight extension of the model I already described. 368 00:34:42,810 --> 00:34:50,550 I mentioned that a lot of these networks have hubs in them: they have a few nodes, like the Paul Erdős node, that have an awful lot of connections. 369 00:34:52,020 --> 00:34:55,889 So what's the effect of that going to be on my network? 370 00:34:55,890 --> 00:35:02,610 I can model that by taking the same model as I had before, the random graph model, and I'm just going to add one extra node to it, 371 00:35:02,610 --> 00:35:06,150 that red node there, which is going to be my hub, and it's going to have a very high degree. 372 00:35:06,660 --> 00:35:11,790 So it's going to have some probability q of being connected to every other node in the network. 373 00:35:12,970 --> 00:35:18,790 And I'm going to write nq = d, meaning the average degree of that node is going to be d, and I'm going to make d big. 374 00:35:19,270 --> 00:35:22,690 So it's a hub. It's the one node in the network that has a lot of friends. 375 00:35:23,110 --> 00:35:26,260 I'm going to ask, what effect does that have on the spectrum? 376 00:35:26,530 --> 00:35:31,240 So I won't do the calculation again, but it's not terrifically hard to do. 377 00:35:31,420 --> 00:35:35,860 You can do this calculation. What happens is that most of the spectrum stays the same, 378 00:35:36,010 --> 00:35:42,130 but you get one new eigenvalue with this particular value here, d/√(d − c), 379 00:35:42,340 --> 00:35:46,330 where d is the average degree of the hub and c is the average degree in the rest of the network. 380 00:35:46,930 --> 00:35:48,400 So in other words, the spectrum looks like this: 381 00:35:48,670 --> 00:35:57,790 a semicircle, an edge, a gap, and two states, the ground state and first excited state, outside the band. 382 00:35:58,730 --> 00:36:05,600 So now I have two parameters, d and c, and I can move those two eigenvalues around, and if I want, I can make them cross over. 383 00:36:06,050 --> 00:36:09,500 So the claim is something interesting happens when they cross over. 384 00:36:10,790 --> 00:36:13,100 Where do they cross over? Well, that's easy: 385 00:36:13,490 --> 00:36:20,460 set d/√(d − c) equal to c + 1 and solve, and I get that d is equal to basically the square of the average degree in the rest of the network. 386 00:36:20,470 --> 00:36:26,360 So the claim is, when the average degree of the hub is of the order of magnitude of the square of the degree of the rest of the network, 387 00:36:26,540 --> 00:36:31,429 something interesting will happen. And that's a situation that could easily happen in real networks. 388 00:36:31,430 --> 00:36:37,579 The hubs that we see in real networks are easily, often, the square of the average degree of the rest of the network. 389 00:36:37,580 --> 00:36:44,739 So it's not an uncommon situation at all. So indeed there is a transition that happens there, and it's an interesting one. 390 00:36:44,740 --> 00:36:46,690 It's a localisation transition. 391 00:36:47,200 --> 00:36:56,980 So we're looking at the ground state, we're looking at the leading eigenvector, the thing that gives us the eigenvector centrality of the nodes. 392 00:36:57,430 --> 00:37:01,270 And normally, if you just make a plot of it, here's a figure 393 00:37:01,780 --> 00:37:07,389 my student Travis made. The heights of these little tents represent the eigenvector centralities. 394 00:37:07,390 --> 00:37:08,950 There's the hub in the middle. 395 00:37:09,070 --> 00:37:14,650 It's got the biggest one, which is perhaps not surprising, but there are plenty of others that have got quite high centralities as well. 396 00:37:15,220 --> 00:37:24,520 Now you go through that transition where the levels cross, and suddenly what happens is all the weight condenses onto that hub. 397 00:37:24,760 --> 00:37:27,910 It behaves very like a defect in a condensed matter system. 398 00:37:28,000 --> 00:37:32,530 You get localisation around that defect, and all the weight dies away from everything else.
399 00:37:33,190 --> 00:37:40,599 In effect, this means that the eigenvector centrality, used by Google and other people to analyse networks, has failed, because now 400 00:37:40,600 --> 00:37:47,589 all it does is identify the one defect, and it is no good at distinguishing between any of the other nodes in the network. 401 00:37:47,590 --> 00:37:50,740 So this is a failure mode of the eigenvector centrality. 402 00:37:52,410 --> 00:37:59,150 If you want to see more evidence that this really is a phase transition, here is the order parameter for this phase transition, 403 00:37:59,170 --> 00:38:05,820 something called the inverse participation ratio, which basically measures what fraction of the weight is deposited on the defect node. 404 00:38:06,540 --> 00:38:11,429 And it's basically zero, a vanishing fraction, until you get to the phase transition, and then suddenly 405 00:38:11,430 --> 00:38:15,870 some significant fraction of all of the weight is deposited on the defect node. 406 00:38:16,080 --> 00:38:24,360 And in these numerical studies, the positions of these transitions coincide nicely with the theoretical predictions, marked by those dotted lines there.
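[Editor's note: here is a numerical sketch of the localisation transition and its inverse-participation-ratio order parameter just described; the planted-hub construction follows the model in the talk, but the parameter values are my own.]

```python
import numpy as np

def leading_vector_ipr(n, c, d, rng):
    # Random graph with average degree c, plus one planted hub (node 0)
    # with expected degree d.  Returns the inverse participation ratio
    # sum_i v_i**4 of the normalised leading eigenvector: of order 1/n
    # when the vector is spread out, of order one when it localises.
    A = np.triu(rng.random((n + 1, n + 1)) < c / n, 1).astype(float)
    A[0, 1:] = rng.random(n) < d / n        # the hub's connections
    A = A + A.T
    w, V = np.linalg.eigh(A)
    v = V[:, -1]                            # leading eigenvector (unit norm)
    return np.sum(v ** 4)

rng = np.random.default_rng(0)
n, c = 2000, 6                   # transition predicted near d ~ c**2 = 36
for d in [10, 30, 60, 120]:
    print(d, leading_vector_ipr(n, c, d, rng))
```

Below the transition the inverse participation ratio is of order 1/n; once the hub degree passes roughly c², a significant fraction of the weight condenses onto the hub and the ratio jumps.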
407 00:38:26,180 --> 00:38:33,950 Okay. So there's two fairly simple examples of cases where the crossing of these levels in the 408 00:38:33,950 --> 00:38:38,870 network corresponds to an interesting structural phase transition going on in the network. 409 00:38:39,350 --> 00:38:46,370 So in the remaining time, I wanted to talk at a little more length about a third example, 410 00:38:46,370 --> 00:38:50,060 which is something that we've been working on in my group for some time now, 411 00:38:50,990 --> 00:38:56,000 and it concerns what we call community structure in networks, which is structure that looks like this. 412 00:38:56,000 --> 00:38:59,750 A lot of the networks we look at have this kind of structure, where there's a clump of 413 00:38:59,750 --> 00:39:03,920 nodes and there's another clump of nodes, and there's lots of connections within clumps, 414 00:39:04,400 --> 00:39:06,709 but there's relatively few connections between clumps. 415 00:39:06,710 --> 00:39:10,490 So in a social network, this might be a group of friends and this might be another group of friends, 416 00:39:10,640 --> 00:39:14,090 and they all know each other and they all know each other, but there are not many connections between the two groups. 417 00:39:15,260 --> 00:39:21,500 So one very basic question that you could ask about structure like this in a network is: can I find it? 418 00:39:21,710 --> 00:39:26,780 If you give me a big network, can I find the groups, the communities, in the network? 419 00:39:27,440 --> 00:39:31,760 If a network looks like this friendship network that I showed you earlier, then of course you can. 420 00:39:31,790 --> 00:39:37,670 It's very easy to figure out what the groups are in this network. But if it looks like, say, this network, then it's a bit harder. 421 00:39:38,030 --> 00:39:39,410 Looking at this network, 422 00:39:39,710 --> 00:39:46,250 you could maybe convince yourself that, well, maybe there's a clump over here, maybe there's another one over here, but it's not entirely clear. 423 00:39:46,580 --> 00:39:51,950 So what we'd like is some automated way of teasing apart the structure of the network. 424 00:39:52,580 --> 00:39:55,880 So there are many ways, in fact, of doing this that have been proposed. 425 00:39:56,090 --> 00:39:59,390 But the most widely used by far is something called modularity maximisation, 426 00:39:59,630 --> 00:40:04,730 which is an optimisation method based on something called modularity. 427 00:40:04,730 --> 00:40:07,760 Modularity is a quality function. 428 00:40:07,760 --> 00:40:15,020 What it does is: you give me a network and you give me a candidate division of that network into parts, into communities, 429 00:40:15,200 --> 00:40:23,030 and it gives a score to that division, which is good if it's a good division, in a sense that I'll define, and bad if it's a bad division. 430 00:40:23,300 --> 00:40:27,590 And then modularity maximisation is just: let's look through every possible division of the network 431 00:40:27,590 --> 00:40:33,050 and find the one that gets the highest score, and that's the best division. So it's turning it into an optimisation problem. 432 00:40:33,500 --> 00:40:37,230 So the whole trick here is to define the score, right? 433 00:40:38,000 --> 00:40:42,650 So what makes a good score? If you look at a picture like this, 434 00:40:43,370 --> 00:40:48,380 a good division of a network is one that puts most of the edges inside the groups in the network. 435 00:40:48,500 --> 00:40:53,540 So you give me the candidate division into groups, and I count up how many of the edges are inside the groups. 436 00:40:55,040 --> 00:41:02,240 So a very simple example of a score that I could use is just the number of edges, or the fraction of edges, that fall inside groups. 437 00:41:03,080 --> 00:41:04,670 However, if you think about it for a moment, 438 00:41:04,670 --> 00:41:12,260 this is not a very good score, because it's clear that the global maximum of that is: let's just put everybody all in one big group together, right? 439 00:41:12,260 --> 00:41:15,770 Then 100% of your edges are inside groups, and you've learned absolutely nothing. 440 00:41:15,890 --> 00:41:20,000 So that's not quite the right thing, but it's nearly the right thing. It's basically the right answer. 441 00:41:20,270 --> 00:41:25,310 The score that does work is the so-called modularity, which is the number of edges within groups 442 00:41:25,520 --> 00:41:30,920 minus the expected number of edges that would fall within groups if you just placed the edges at random. 443 00:41:32,030 --> 00:41:37,460 In other words, you're asking: are there more edges within groups than I would expect on the basis of chance? 444 00:41:37,850 --> 00:41:47,970 If there are, then I've found something. So in order to actually define that, I have to say how I'm going to randomise the positions of my edges. 445 00:41:48,150 --> 00:41:52,980 And there's more than one way you could do that. But in the simplest case, for instance, you could just randomise them uniformly: 446 00:41:53,190 --> 00:41:57,000 just take all your edges and move them around completely uniformly at random. 447 00:41:57,440 --> 00:42:03,420 Okay. So you're comparing the actual number of edges within groups with the number that you would get after you randomise things. 448 00:42:04,840 --> 00:42:06,910 So there's a nice matrix formulation of this. 449 00:42:07,300 --> 00:42:12,730 We've already defined the adjacency matrix, which is 1 or 0 depending on whether there's an edge between i and j. 450 00:42:14,070 --> 00:42:20,940 So A_ij, an element of the adjacency matrix, tells you the actual number of edges observed to fall between nodes i and j. 451 00:42:21,420 --> 00:42:23,069 It's either one or zero: the edge is either there
452 00:42:23,070 --> 00:42:31,800 From A_ij you want to subtract the expected number of edges that would fall there if the edge positions were randomised. 453 00:42:32,520 --> 00:42:36,690 So that's the thing I'm calling P_ij here. In the simplest case, 454 00:42:36,690 --> 00:42:42,110 that would just be equal to the overall probability of an edge, if you just uniformly randomise them. 455 00:42:42,990 --> 00:42:48,390 But for whatever randomisation procedure you define, there's some P_ij, which is the expected number of edges between nodes i and j. 456 00:42:49,320 --> 00:42:54,360 So the actual minus expected number of edges between nodes i and j is A_ij minus P_ij. 457 00:42:54,600 --> 00:43:00,510 And then the modularity itself is just the sum of that over all pairs of nodes that fall in the same group. 458 00:43:02,200 --> 00:43:07,090 So consider, for instance, the simple case where I'm just dividing the network into two groups. 459 00:43:07,960 --> 00:43:15,940 I'm going to turn that into something that I know about by putting an Ising spin on every node of the network. 460 00:43:16,090 --> 00:43:19,920 So spin up or spin down to represent: I'm in group one or I'm in group two. 461 00:43:19,930 --> 00:43:25,959 I'm just dividing into two. So, plus or minus one: I've got some spin variable s_i, which is plus one if node i 462 00:43:25,960 --> 00:43:33,910 is in group one and minus one if it's in group two. So now it turns out I can write this modularity very simply in terms of that. 463 00:43:34,830 --> 00:43:40,930 I just do some arithmetic and I end up with this expression here, which has a leading constant that I don't care about, 464 00:43:40,930 --> 00:43:47,700 ignore this one over four, and then a sum over pairs of spins, s_i B_ij s_j. 465 00:43:47,830 --> 00:43:51,069 So the s's are the spin variables, and B_ij is just this thing we had before: 466 00:43:51,070 --> 00:43:54,190 it's just A_ij minus P_ij. So those are both matrices that I know. 467 00:43:55,140 --> 00:44:01,070 Right. So this thing looks very like some sort of disordered spin model that we're used to. 468 00:44:01,460 --> 00:44:09,170 And my goal is to maximise this. So you give me a network; that fixes this matrix B_ij, what's called the modularity matrix. 469 00:44:09,680 --> 00:44:13,579 And my goal is to maximise this by finding the spins that give me the largest 470 00:44:13,580 --> 00:44:17,270 value of the modularity; that gives me my division of the network into two groups. 471 00:44:18,790 --> 00:44:21,220 So I can't do that, really. That's a hard problem. 472 00:44:21,670 --> 00:44:27,850 If there are n spins, there are two to the n possible configurations of spins, too many of them to look through exhaustively. 473 00:44:28,420 --> 00:44:38,250 So I'm going to use an approximate strategy. I'm going to write my modularity in matrix terms here as vector s-transpose times matrix B times vector s. 474 00:44:38,320 --> 00:44:40,890 So matrix B is again this modularity matrix that we had 475 00:44:40,900 --> 00:44:47,470 on the previous transparency, and vector s is just the n-element vector that has the Ising spins in it, the plus and minus ones. 476 00:44:47,800 --> 00:44:51,910 So the goal is: you give me a B, and I want to find the s that maximises this number, right? 477 00:44:53,170 --> 00:44:57,520 So how do I do that? It's a hard maximisation, a discrete maximisation that I can't do directly.
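Before the relaxation step, here is a minimal sketch of the matrix formulation just described, assuming the simplest uniform randomisation mentioned above, where P_ij is just the overall edge probability. It is a sketch under those assumptions, not the speaker's implementation.

```python
import numpy as np

def modularity_matrix(A):
    """B = A - P for the uniform null model: P_ij is the overall edge probability."""
    n = A.shape[0]
    p = A.sum() / (n * (n - 1))   # probability of an edge between a pair of nodes
    P = np.full((n, n), p)
    np.fill_diagonal(P, 0.0)      # assume no self-edges in this sketch
    return A - P                  # B_ij = A_ij - P_ij

def modularity_score(B, s):
    """s^T B s for Ising spins s in {+1, -1}; equals the modularity up to
    the leading constant (the one-over-four) that the talk ignores."""
    return s @ B @ s
```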
478 00:44:57,970 --> 00:45:09,070 So I relax it: I use a relaxation method in which, instead of confining the spins to be only plus one or minus one, I let them be any real number. 479 00:45:09,760 --> 00:45:13,719 Okay, so this is a fairly standard thing. In statistical physics language, 480 00:45:13,720 --> 00:45:20,980 I'm using a spherical model instead of an Ising model. It then becomes an easy problem to solve, 481 00:45:20,980 --> 00:45:24,430 because these are continuous variables and I can just differentiate to find where the maximum is. 482 00:45:24,640 --> 00:45:30,460 And it turns out the maximum is given by the leading eigenvector of this matrix, the modularity matrix. 483 00:45:31,520 --> 00:45:35,750 That's only an approximation to the true answer, but it turns out it's an approximation that works well. 484 00:45:36,500 --> 00:45:43,730 Here's one fun example that I got from all this. Again, this is a network of books. 485 00:45:43,970 --> 00:45:52,220 So nodes in this network represent books. They're all books about US politics, and they were all written around 2004, 486 00:45:52,220 --> 00:45:58,700 the time of the last US presidential election, the one between George W Bush and John Kerry, 487 00:45:59,150 --> 00:46:02,420 and they're conveniently colour coded blue and red here. 488 00:46:02,420 --> 00:46:05,569 For some reason in America, blue means left wing and red means right wing. 489 00:46:05,570 --> 00:46:10,650 I have no idea why. It's just the opposite to everyone else in the world. But so, blue and red. 490 00:46:10,670 --> 00:46:16,790 So, for instance, over here on the red side you have The Faith of George W Bush, and over here you have The Lies of George W Bush. 491 00:46:18,470 --> 00:46:23,690 So the person who compiled the data went through all of these and decided whether they were left wing books or right wing books by reading the blurbs. 492 00:46:23,990 --> 00:46:27,110 And then, what are the edges in this network? 493 00:46:27,440 --> 00:46:32,600 The edges are frequent co-purchasing on the online bookseller Amazon.com. 494 00:46:33,590 --> 00:46:37,640 So you go on Amazon, you know, the thing that says people who bought this book also bought these other books. 495 00:46:38,180 --> 00:46:41,690 So it's that, right? He just looked up all of these books on Amazon and made this network. 496 00:46:41,960 --> 00:46:46,010 So the point here is that everybody is just preaching to the choir. 497 00:46:46,010 --> 00:46:49,489 People who buy blue books buy other blue books. People who buy red books buy other red books. 498 00:46:49,490 --> 00:46:53,720 In other words, we believe that there is strong community structure in this network, two communities, 499 00:46:53,930 --> 00:46:58,880 Democrats and Republicans. And indeed, feed it through this spectral algorithm 500 00:46:59,030 --> 00:47:06,500 and you find two very nice communities that correspond very closely to the classification of the books the compiler made from the blurbs. 501 00:47:06,590 --> 00:47:12,590 So I'm only using purchasing information; I'm telling you nothing about the titles, the contents, the authors, anything about the books. 502 00:47:12,830 --> 00:47:20,240 And yet you can very clearly see the polarisation of politics into left and right just by looking at the structure of this network. 503 00:47:21,660 --> 00:47:24,989 Okay. So fine. This is a nice algorithm. 504 00:47:24,990 --> 00:47:28,500 It works well in practice for actually finding structure.
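As a sketch of the full spectral step just demonstrated (again assuming the uniform null model; the speaker does not specify his implementation): take the leading eigenvector of the modularity matrix and split the nodes by the signs of its elements.

```python
import numpy as np

def spectral_bisection(A):
    """Divide a network in two by the signs of the leading eigenvector of B."""
    n = A.shape[0]
    p = A.sum() / (n * (n - 1))      # uniform null model, as in the sketch above
    B = A - p                        # subtracting p on the diagonal too only shifts
                                     # eigenvalues uniformly; eigenvectors are unchanged
    eigvals, eigvecs = np.linalg.eigh(B)
    v = eigvecs[:, -1]               # eigenvector of the most positive eigenvalue
    return np.where(v >= 0, 1, -1)   # spin up / spin down = group one / group two
```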
505 00:47:28,830 --> 00:47:31,920 It's a widely used method for finding community structure in networks. 506 00:47:33,180 --> 00:47:37,880 What I want to ask, though, is: how do we expect it to behave? 507 00:47:37,910 --> 00:47:40,340 I want to understand how it works. And again, 508 00:47:40,340 --> 00:47:51,140 I'm going to do that by looking at a simple model of a network that has community structure in it and asking what the spectrum is going to look like. 509 00:47:51,710 --> 00:47:56,840 So the model that I use in this case is only a little bit more complicated than the other models I've been talking about. 510 00:47:56,990 --> 00:48:01,640 It's something called a stochastic block model. It's a well-known model that goes back to the 1980s. 511 00:48:02,210 --> 00:48:06,080 The definition is: I take some number of nodes and I divide them up into groups. 512 00:48:06,560 --> 00:48:10,340 And then I throw down edges at random, just like I did in the random graph, 513 00:48:10,700 --> 00:48:19,370 but now with two different probabilities for edges: p_in for edges inside groups and p_out for edges between groups. 514 00:48:19,550 --> 00:48:24,050 I'm going to make p_in bigger than p_out, so I'm going to get more edges inside the groups and fewer edges between the groups. 515 00:48:24,200 --> 00:48:28,340 In other words, I'm going to get a network that looks very like the ones I sketched before. 516 00:48:28,370 --> 00:48:32,990 It's got traditional community structure: dense clumps of nodes in the network. 517 00:48:35,690 --> 00:48:39,649 This model is still sufficiently simple, however, 518 00:48:39,650 --> 00:48:46,220 that we can actually calculate the entire spectrum of the matrix for it, so we can understand how this algorithm is going to work. 519 00:48:46,580 --> 00:48:50,240 So again, I won't go through the calculation in detail; it's more random matrix theory. 520 00:48:50,420 --> 00:49:00,620 The end result is a semicircle, like before, plus an outlying eigenvalue here, and we can calculate where the band edge is and where that outlier lies. 521 00:49:00,860 --> 00:49:04,849 The parameters here, c_in and c_out, are just rescaled versions of those 522 00:49:04,850 --> 00:49:09,500 probabilities p_in and p_out, rescaled by the number n of nodes in the network. 523 00:49:09,920 --> 00:49:11,900 So we can calculate the entire spectrum. 524 00:49:12,080 --> 00:49:21,680 And again, my conjecture is that if two of these eigenvalues cross, like, for instance, if this outlying eigenvalue meets the band edge as they come together, 525 00:49:21,920 --> 00:49:25,940 I'm going to get some interesting structural phase transition in my system. 526 00:49:25,940 --> 00:49:29,840 And indeed that turns out to be true. So. 527 00:49:31,960 --> 00:49:39,370 So the argument goes something like this: if there were no community structure in my network at all, 528 00:49:40,000 --> 00:49:46,030 in other words, if c_in was just equal to c_out, then this eigenvalue would disappear altogether 529 00:49:46,030 --> 00:49:47,530 and I would just have the semicircle. 530 00:49:49,070 --> 00:49:59,870 The way to see that is that this matrix B that I'm using here is precisely the shifted adjacency matrix that I talked about before. 531 00:50:00,050 --> 00:50:05,540 It's the adjacency matrix minus its average value, its expected value. 532 00:50:05,720 --> 00:50:08,900 So I get just the semicircle and nothing else 533 00:50:09,170 --> 00:50:13,790 when I do that calculation.
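A minimal generator for the two-group version of this model might look as follows; the half-and-half planted split, the even n, and the function name are assumptions for illustration.

```python
import numpy as np

def stochastic_block_model(n, p_in, p_out, seed=None):
    """Two equal planted groups (n assumed even); edges appear with
    probability p_in inside groups and p_out between groups."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([1, -1], n // 2)            # planted group memberships
    same_group = np.equal.outer(labels, labels)
    probs = np.where(same_group, p_in, p_out)
    A = (rng.random((n, n)) < probs).astype(int)
    A = np.triu(A, 1)                              # keep one triangle, drop self-edges
    return A + A.T, labels                         # symmetrise; return planted truth
```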
So if there is no community structure, I expect this eigenvalue to disappear. 534 00:50:14,250 --> 00:50:19,160 Right? So the presence of that eigenvalue reveals to me that I have structure in my network. 535 00:50:20,360 --> 00:50:29,150 That's how I can tell that it's there. Right. So now what I do is vary the values of c_in and c_out, 536 00:50:29,300 --> 00:50:33,080 make them closer together, which makes the structure in my network weaker and weaker. 537 00:50:33,350 --> 00:50:42,140 And when they actually meet, it's going to vanish altogether. But the interesting thing is that that eigenvalue disappears before that point. 538 00:50:42,290 --> 00:50:45,080 It meets the edge of the band and so disappears into the band. 539 00:50:45,230 --> 00:50:50,000 You'd think it might happen when there was no community structure in the network, when c_in was just equal to c_out, 540 00:50:50,210 --> 00:50:56,440 but it actually happens before that. There is an earlier point where there's still community structure in the network. 541 00:50:56,450 --> 00:51:00,920 We know it's there because we put it there. Right? c_in and c_out are still different from each other. 542 00:51:01,130 --> 00:51:04,520 So the structure is there, but the spectrum doesn't reveal it. 543 00:51:04,760 --> 00:51:09,440 It just disappears, and that happens at this phase transition point here; you can calculate where the transition is. 544 00:51:09,920 --> 00:51:17,380 All right. I won't go through this in detail, but you can calculate it. 545 00:51:17,390 --> 00:51:22,220 So the order parameter in this case is basically: how many nodes do we get right? 546 00:51:22,730 --> 00:51:27,860 Right. So the problem is: I give you this network, which has, say, two planted communities in it. 547 00:51:28,070 --> 00:51:31,100 I know who belongs to which communities because I made up everything. 548 00:51:31,490 --> 00:51:36,680 Right? I created the network myself. And then I feed it into this algorithm and I ask: does it get it right? 549 00:51:36,950 --> 00:51:41,030 How many nodes does it classify correctly? So that's my order parameter in this particular case. 550 00:51:41,180 --> 00:51:47,450 And it turns out I can calculate that. So the idea basically is you look at the eigenvector of this. 551 00:51:47,580 --> 00:51:49,640 You look at the elements of this leading eigenvector, 552 00:51:49,790 --> 00:51:56,030 and you classify the nodes into one group or another, depending on whether they're plus one or minus one. 553 00:51:56,120 --> 00:52:01,940 Right, spin up or spin down. Well, they're not exactly plus one or minus one, because, remember, we relaxed the problem. 554 00:52:02,120 --> 00:52:07,010 We made them real numbers. But they're actually not a bad approximation: a bunch of them are plus and a bunch of them minus. 555 00:52:07,130 --> 00:52:12,050 So in this particular case, the first 500 nodes were group one and the second 500 were group two. 556 00:52:12,860 --> 00:52:19,460 However, some of the nodes it gets wrong, and you can see that those are the ones where the line crosses zero. 557 00:52:19,640 --> 00:52:25,190 Right. We're classifying them into one group or the other depending on whether they're positive spins or negative spins. 558 00:52:25,520 --> 00:52:28,520 So if they cross the zero line, a positive one becomes negative: 559 00:52:28,700 --> 00:52:33,920 you got it wrong. So the number of places where the line crosses zero is your estimate of how many spins you've classified wrongly.
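Combining the sketches above, the numerical experiment described here could be reproduced roughly like this; it assumes the stochastic_block_model and spectral_bisection helpers sketched earlier, and 0.5 is chance level for two equal groups.

```python
import numpy as np

def fraction_correct(n, c_in, c_out, seed=None):
    """Order parameter: fraction of correctly classified nodes, where
    c_in = n * p_in and c_out = n * p_out, as in the talk's rescaling."""
    A, truth = stochastic_block_model(n, c_in / n, c_out / n, seed)
    guess = spectral_bisection(A)
    frac = np.mean(guess == truth)
    return max(frac, 1.0 - frac)   # the two group labels are interchangeable
```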
560 00:52:34,130 --> 00:52:38,770 Okay. 561 00:52:39,260 --> 00:52:44,060 We can actually calculate that exactly, again by doing some random matrix theory. 562 00:52:45,050 --> 00:52:49,340 We can get an exact expression for the number of spins we expect to classify correctly. 563 00:52:49,550 --> 00:52:53,370 So here's what the order parameter looks like. It actually doesn't go to zero; 564 00:52:53,390 --> 00:52:57,580 it goes to 0.5, because 0.5 is complete failure. 565 00:52:57,590 --> 00:53:01,399 In this case, I've got two groups of nodes, so I could 566 00:53:01,400 --> 00:53:05,870 classify things correctly at the 50% level just by tossing a coin. 567 00:53:06,320 --> 00:53:10,190 Right. So the worst you can do here is 50%, not zero. 568 00:53:10,400 --> 00:53:17,060 So this is the fraction of correctly classified nodes. When the structure in the network is very strong, you actually get 100% of them right. 569 00:53:17,270 --> 00:53:21,020 But there is a sharp phase transition at which you fail completely. 570 00:53:21,200 --> 00:53:28,400 The solid lines here are the theoretical calculation, and the dots are numerical calculations on real networks. 571 00:53:28,640 --> 00:53:32,630 And they agree pretty well, except for some sort of fuzziness around the phase transition. 572 00:53:34,350 --> 00:53:40,950 So what this is saying is: in this region over here to the left, there is structure in the network, but you cannot detect it. 573 00:53:41,520 --> 00:53:45,330 In fact, it turns out that the result is even stronger than this. 574 00:53:45,780 --> 00:53:52,260 There is some beautiful recent work by a group in France, Aurélien Decelle and collaborators in Paris, 575 00:53:52,680 --> 00:54:00,630 which shows that not only does this spectral algorithm fail to detect the structure in the network below this phase transition, 576 00:54:00,810 --> 00:54:04,560 but every algorithm must fail below this transition. 577 00:54:05,940 --> 00:54:09,100 This is, I think, a really deep and interesting finding. 578 00:54:09,120 --> 00:54:15,480 It says that there exist networks that have structure in them, but it's fundamentally impossible to detect it. 579 00:54:15,690 --> 00:54:17,490 No algorithm will detect it. 580 00:54:18,000 --> 00:54:25,230 It's kind of like the incompleteness theorem, which shows that there are results that are true, but you'll never be able to prove that they're true. 581 00:54:25,380 --> 00:54:29,459 It's a similar kind of thing. These networks, they're artificial. 582 00:54:29,460 --> 00:54:34,860 You know, we're not really interested in the stochastic block model. But it's interesting as an existence proof. 583 00:54:35,250 --> 00:54:40,260 It tells you that there are certain kinds of structures in networks which you will never be able to detect. 584 00:54:40,680 --> 00:54:43,740 And this is a new understanding that we didn't have before. 585 00:54:43,830 --> 00:54:50,160 You know, we might have hoped, just as Bertrand Russell hoped that you could prove everything in mathematics, 586 00:54:50,310 --> 00:54:56,310 we might have hoped that if there is structure in a network, we will always be able to come up with some way of detecting it. 587 00:54:56,700 --> 00:55:04,050 This result tells us that that is not true: there are some kinds of structure in networks which will always be undetectable, even though they're there.
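For reference, the detectability threshold this work established for the two-group symmetric block model can be written as follows; this is quoted from the published literature (Decelle, Krzakala, Moore, and Zdeborová, 2011), not derived in the talk:

```latex
\[
  c_{\mathrm{in}} - c_{\mathrm{out}} \;>\; \sqrt{2\,(c_{\mathrm{in}} + c_{\mathrm{out}})}
\]
```

Detection is possible only above this line; below it the planted structure is present whenever c_in exceeds c_out, yet no algorithm can beat a coin toss.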
588 00:55:04,170 --> 00:55:08,760 We know they're there in this case because we put them there, and yet no method can detect them. 589 00:55:09,600 --> 00:55:16,890 I'm just about out of time, so I should stop talking now. Before I do, I just want to say thank you to my terrific collaborators on this work: 590 00:55:17,610 --> 00:55:22,470 the first three, students and former students in my group; the last two, faculty collaborators. 591 00:55:22,710 --> 00:55:25,920 And these folks gave us money. Thank you for your attention. That's it.