1
00:00:07,900 --> 00:00:14,920
Okay. Good afternoon, everybody. So welcome to this afternoon's colloquium.
2
00:00:15,850 --> 00:00:20,770
It's a great pleasure to welcome Mark Newman to give the colloquium today.
3
00:00:21,070 --> 00:00:26,200
So Mark was in fact an undergraduate student here in 1988,
4
00:00:26,200 --> 00:00:30,850
and then he stayed on to do his DPhil with David Sherrington, who I think I just saw coming in,
5
00:00:32,140 --> 00:00:38,020
and he worked with David on sort of neural-network-type problems,
6
00:00:38,020 --> 00:00:46,360
which is how he got started on these complex network problems after leaving Oxford.
7
00:00:46,360 --> 00:00:52,540
He was a post-doc at Cornell and then a post-doc at the Santa Fe Institute.
8
00:00:53,470 --> 00:00:59,250
Um, then he was an assistant professor at the University of Michigan, where he's basically been ever since.
9
00:00:59,980 --> 00:01:06,400
And he was recently made the Anatol Rapoport Distinguished University Professor of Physics at the University of Michigan.
10
00:01:06,670 --> 00:01:11,440
So he's had a very distinguished career in the theory of networks and things like that.
11
00:01:11,440 --> 00:01:15,759
And I think it's an interesting illustration of how physics has spread out over the last 30
12
00:01:15,760 --> 00:01:20,290
years from being part of the hard core stuff that people of my generation did in their youth.
13
00:01:20,650 --> 00:01:28,830
to actually finding physicists tackling all sorts of problems which are of interest and applicability to other parts of science.
14
00:01:28,840 --> 00:01:34,870
And really, that's the definition of physics these days: it's physics if a physicist can make a useful contribution.
15
00:01:35,780 --> 00:01:41,590
And so it's a great pleasure to welcome Mark today, who is going to talk to us on the physics of networks.
16
00:01:43,030 --> 00:01:49,270
Thank you very much, John. Nice introduction. Hello, everybody. It's a pleasure to be back here again after 30 years.
17
00:01:50,800 --> 00:01:57,220
So I'm going to talk about networks today, not, as John says, a traditional topic for a physicist to be working on,
18
00:01:57,220 --> 00:02:06,700
but physics has actually made very substantial contributions in this area in the last 15 or 20 years, some of which I'd like to tell you about today.
19
00:02:08,950 --> 00:02:18,550
So a network, for the purposes of this talk, is a bunch of dots joined together by lines. In the jargon of the field, a dot is called a node or a vertex,
20
00:02:19,090 --> 00:02:22,630
and a line is called an edge. I will use that jargon throughout this talk.
21
00:02:24,430 --> 00:02:29,410
There's a long history in mathematics of studying networks as pure mathematical objects,
22
00:02:29,830 --> 00:02:36,670
but physicists have been interested more in the applications of using networks as a
23
00:02:36,670 --> 00:02:42,850
mathematical representation of the structure of many real world systems of interest to us,
24
00:02:43,450 --> 00:02:49,089
including technological networks like the Internet, which is a network of computers.
25
00:02:49,090 --> 00:02:53,140
The nodes are computers and the edges are data connections between computers.
26
00:02:54,010 --> 00:03:02,230
Other technological networks include transport networks like the rail network or this network of gas pipelines in Europe.
27
00:03:03,580 --> 00:03:10,090
Another big class of networks, more abstract, is information networks,
28
00:03:10,300 --> 00:03:14,170
which are networks that represent the structure of some body of information.
29
00:03:15,010 --> 00:03:17,020
The classic example here would be the World Wide Web.
30
00:03:17,020 --> 00:03:23,860
If you take a Web page like this, it's got things on it that you can click on, and when you click on them, they get you to some other page.
31
00:03:23,860 --> 00:03:28,510
So there's a link between the first web page and the second web page, what we call a hyperlink in the web jargon.
32
00:03:28,870 --> 00:03:35,760
And you can represent that as a network: you can make a network where the nodes are web pages and the edges are the links between them.
33
00:03:35,770 --> 00:03:40,050
This is a picture of just a small portion of the web.
34
00:03:40,930 --> 00:03:47,709
The whole thing has tens of millions of nodes and is far too large to make a picture of on just one slide like this;
35
00:03:47,710 --> 00:03:50,740
probably the largest network of any kind that we study.
36
00:03:51,190 --> 00:03:57,280
And its structure presumably represents somehow the structure of the body of knowledge that's stored in the World Wide Web.
37
00:03:57,550 --> 00:04:04,930
So its structure is an interesting thing to look at. Another example of an information network is a citation network where the nodes represent
38
00:04:05,200 --> 00:04:11,290
science papers and the edges represent one scientific paper citing another one,
39
00:04:11,470 --> 00:04:19,600
something that we all do when we write papers. Again, presumably the structure of this network somehow reflects the structure of
40
00:04:19,600 --> 00:04:22,870
the scientific knowledge that's stored in the body of scientific literature.
41
00:04:22,870 --> 00:04:28,060
And by looking at that structure, we could hope to learn something about the structure of our knowledge.
42
00:04:29,980 --> 00:04:33,309
Biological networks are another large and important class of networks.
43
00:04:33,310 --> 00:04:37,210
There's a lot of excitement about these; biologists are very excited about work in this area at the moment.
44
00:04:37,780 --> 00:04:40,690
Here's one example. This is a metabolic network.
45
00:04:40,960 --> 00:04:49,620
So the nodes in this network are chemicals in the cell, metabolites, and the edges are metabolic reactions that turn one metabolite into another one.
46
00:04:49,630 --> 00:04:53,470
So it's a network representation of the biochemical machinery of a cell.
47
00:04:54,100 --> 00:04:57,909
Here's another biological example. This is an ecological network.
48
00:04:57,910 --> 00:05:05,110
So a node here represents a species in an ecosystem. The edges represent the ecological interactions between species.
49
00:05:05,110 --> 00:05:09,310
In this case, predator-prey interactions: one species eats another species.
50
00:05:10,690 --> 00:05:14,590
There's an interesting little twist here for this network.
51
00:05:14,710 --> 00:05:18,130
See that ring over there? That's what they call a self-edge.
52
00:05:18,130 --> 00:05:21,190
That's an edge that connects a node back to itself. That's allowed. You can have that.
53
00:05:22,360 --> 00:05:27,430
So in this case, that's the graph-theoretical representation of cannibalism.
54
00:05:27,830 --> 00:05:36,280
That's a species that eats itself. But a lot of my work is focussed on social networks these days.
55
00:05:36,280 --> 00:05:42,270
If you say the word social networks to a lot of people, they think of Facebook or LinkedIn, online social networks.
56
00:05:42,550 --> 00:05:44,620
But in the scientific literature,
57
00:05:44,620 --> 00:05:53,440
the word social network just refers to any network where the nodes are people and the edges are any kind of connections between people.
58
00:05:53,980 --> 00:06:03,309
The study of social networks in this broader sense goes back way farther than their online incarnations, back to the 1930s.
59
00:06:03,310 --> 00:06:10,840
In fact, sociologists probably have the longest history of the quantitative study of empirical networks.
60
00:06:11,140 --> 00:06:20,440
Here's one example. This is a friendship network of pupils in a high school, an American high school.
61
00:06:20,440 --> 00:06:25,900
So the nodes represent the students in the school and the edges represent who's friends with whom.
62
00:06:26,590 --> 00:06:30,909
Here's an example from my own work. This is a collaboration network of scientists.
63
00:06:30,910 --> 00:06:35,140
The nodes represent scientists and the edges represent who's written a paper with whom.
64
00:06:36,340 --> 00:06:43,990
Here's an example I got from my friend Valdis Krebs. This is a network that comes from a study he did on the spread of tuberculosis,
65
00:06:44,200 --> 00:06:49,029
and it represents a network of close physical proximity between people.
66
00:06:49,030 --> 00:06:57,129
One of the reasons why social networks have attracted a lot of research funding recently is because they are
67
00:06:57,130 --> 00:07:02,500
implicated in the spread of disease: diseases spread between people when they have contact with one another.
68
00:07:02,590 --> 00:07:06,729
So the network of who has contact with whom governs the way diseases spread.
69
00:07:06,730 --> 00:07:14,110
If you want to understand how diseases spread and potentially how to stop them spreading, then you need to understand the structure of these networks.
70
00:07:14,260 --> 00:07:19,330
And I'll touch a little bit later on the epidemiological aspects of this.
71
00:07:20,350 --> 00:07:25,090
So those are some examples of the kinds of networks that we're interested in looking at.
72
00:07:25,810 --> 00:07:32,860
One very basic question that you can ask in this field is just what is the structure of these things?
73
00:07:32,860 --> 00:07:38,170
How do you characterise the structure of these things? They're very big, complicated objects with potentially billions of nodes.
74
00:07:38,590 --> 00:07:45,100
How can I say something meaningful about what their shape is? There has been a lot of work done on that basic question.
75
00:07:45,250 --> 00:07:54,340
A lot of it focuses on what you might call small-scale structure in networks, things like degrees and path lengths and subgraphs and so forth.
76
00:07:55,060 --> 00:08:01,780
So these are things which you can measure just by looking at a single node in the network at a time or maybe a small group of nodes,
77
00:08:01,780 --> 00:08:05,560
They're very easy to measure. There's really no complicated math here.
78
00:08:05,590 --> 00:08:11,550
You just have to be able to count. But nonetheless, these are useful measures of network structure.
79
00:08:11,560 --> 00:08:13,990
They can tell you a lot. Let me give you just two quick examples.
80
00:08:14,770 --> 00:08:18,760
Perhaps the two best-studied, best-known examples of this kind of structure in networks.
81
00:08:18,910 --> 00:08:25,090
One is degree distributions. So the degree of a node in a network is just the number of other nodes it's connected to.
82
00:08:25,360 --> 00:08:29,320
So if I'm talking about a friendship network, it's how many friends I have in the network.
83
00:08:30,040 --> 00:08:36,460
And if you measure the degrees of all of the nodes in the network and then look at their distribution, make a histogram of them,
84
00:08:37,060 --> 00:08:42,430
almost invariably you find something which is very strongly right-skewed, meaning there are a lot of
85
00:08:42,430 --> 00:08:47,409
nodes that have low degree, few connections, and there are a few nodes that have high degree.
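As a minimal sketch of the measurement just described (my own illustration in Python, not code from the talk; the toy network is made up): compute each node's degree as a row sum of the adjacency matrix and histogram the results.

```python
import numpy as np

# A small made-up network with one hub (node 0), stored as an adjacency
# matrix: a star with 5 spokes plus one extra edge between two spokes.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
n = 6
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# The degree of a node is just the number of others it connects to:
# a row sum of the adjacency matrix.
degrees = A.sum(axis=1)

# The degree distribution is a histogram of those degrees:
# counts[k] = number of nodes with degree k.
counts = np.bincount(degrees)
print(degrees)  # [5 2 2 1 1 1] - many low-degree nodes, one hub
print(counts)   # [0 3 2 0 0 1] - a right-skewed distribution
```

Even in this tiny example the distribution is right-skewed: most nodes have one or two connections, and one hub sits alone in the tail.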
86
00:08:47,410 --> 00:08:53,140
This plot here is for a network of collaborations amongst mathematicians: who's written a paper with whom.
87
00:08:53,500 --> 00:08:59,469
And so what it's telling you is that there are a lot of mathematicians that only have one or two collaborators, but there are few that have very many.
88
00:08:59,470 --> 00:09:05,180
So Paul Erdős, for example, the famous 20th-century mathematician, he was legendary,
89
00:09:05,200 --> 00:09:09,280
he had an extraordinary number of different collaborators, over a thousand collaborators during his lifetime.
90
00:09:09,400 --> 00:09:12,700
So he would be way out in the tail of the distribution somewhere over here.
91
00:09:12,820 --> 00:09:18,400
There are not many people like that, but it turns out that they have a profound effect on the behaviour of these systems.
92
00:09:18,580 --> 00:09:21,700
Even though they're small in number, they have an outsize effect.
93
00:09:22,870 --> 00:09:30,130
They're typically referred to as the hubs in the network for obvious reasons, and I'll talk about them a little more too later on.
94
00:09:31,180 --> 00:09:36,580
Another example of a well-known property of networks is the thing called the Small World Effect.
95
00:09:36,790 --> 00:09:40,389
There was a famous series of experiments done in the sixties by this chap here.
96
00:09:40,390 --> 00:09:43,870
His name is Stanley Milgram. He was an experimental psychologist at Harvard.
97
00:09:44,710 --> 00:09:49,510
He did these experiments where he got people to pass letters from one person to another in a
98
00:09:49,510 --> 00:09:54,010
sort of cooperative game that involved trying to get the letter to an intended target person.
99
00:09:54,310 --> 00:10:00,250
And his finding was that it typically took only about half a dozen steps to get the
100
00:10:00,250 --> 00:10:04,030
letter from some randomly chosen starting person to anyone else in the country.
101
00:10:04,270 --> 00:10:11,980
And this is the origin of this semi-mythical result that you are only six handshakes away from anybody else in the world,
102
00:10:12,010 --> 00:10:17,020
sometimes referred to by the phrase The Six Degrees of Separation, which is the title of a successful play,
103
00:10:17,320 --> 00:10:21,880
later made into a movie in the nineties in which they discussed Milgram's work.
104
00:10:22,900 --> 00:10:27,550
So this result in particular is now very famous. It's sort of passed into legend.
105
00:10:27,700 --> 00:10:32,259
Almost everybody seems to know about this. It's the topic of party games.
106
00:10:32,260 --> 00:10:35,620
It's the topic of satirical headlines on the Internet.
107
00:10:36,100 --> 00:10:39,760
It's part of the mythology.
108
00:10:39,790 --> 00:10:45,849
So these are examples of sort of small scale things that one can easily measure.
109
00:10:45,850 --> 00:10:51,130
But they're still very useful. Just because they're easy to measure doesn't mean we should turn our noses up at them.
110
00:10:51,520 --> 00:10:57,429
However, that's not what I want to talk about today. What I want to talk about today is large scale structure in networks.
111
00:10:57,430 --> 00:11:02,770
What's the big picture? If I take this whole network of millions of nodes, what is its sort of overall structure?
112
00:11:02,770 --> 00:11:06,069
How does it divide up? How are the different pieces linked to one another?
113
00:11:06,070 --> 00:11:12,520
This is a much harder question to answer, much harder to characterise large scale structure because, well,
114
00:11:12,520 --> 00:11:17,589
it has to involve taking information from all over a network and somehow combining it to tell you something useful.
115
00:11:17,590 --> 00:11:20,320
And that's going to be a more difficult thing than just looking at one node at a time.
116
00:11:21,010 --> 00:11:28,090
There are a range of different techniques that we use to do this kind of analysis, but the ones I want to talk about today are spectral methods.
117
00:11:28,270 --> 00:11:33,160
So spectral methods means methods that are based on matrix representations of networks.
118
00:11:33,340 --> 00:11:37,480
So you can take any network like this small one here.
119
00:11:37,690 --> 00:11:43,690
So suppose it has N nodes, and I number the nodes from one to N in any order; it doesn't matter.
120
00:11:44,140 --> 00:11:49,390
And then I construct the so-called adjacency matrix.
121
00:11:50,410 --> 00:11:53,319
You stick a one in the matrix to represent an edge.
122
00:11:53,320 --> 00:11:59,560
So a one means that there's an edge between the corresponding nodes in the network; like, there's an edge here between nodes one and two.
123
00:11:59,710 --> 00:12:05,110
So I put a one in the one two element of the matrix and zeros everywhere else.
124
00:12:05,470 --> 00:12:09,700
Well, I also put a one in the two-one element, because if node one is connected to node
125
00:12:09,700 --> 00:12:13,810
two, then node two is connected to node one as well. So it's going to be a symmetric matrix.
126
00:12:14,230 --> 00:12:19,630
Binary valued, just zeros and ones. Symmetric is important because that means its eigenvalues will be real.
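The construction just described can be sketched in a few lines of Python (my own illustration, not from the talk): put a one in both the (i, j) and (j, i) elements for every edge, giving a binary, symmetric matrix with a real spectrum.

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Binary, symmetric adjacency matrix of an undirected network
    with nodes numbered 0..n-1 (the numbering order doesn't matter)."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1  # a one marks the edge between i and j...
        A[j, i] = 1  # ...and the symmetric (j, i) entry
    return A

# A small example network: a square with one diagonal.
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
assert (A == A.T).all()              # symmetric...
eigenvalues = np.linalg.eigvalsh(A)  # ...so the spectrum is real
```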
127
00:12:20,710 --> 00:12:27,740
This matrix plays a role sort of analogous to the role played by the Hamiltonian in a quantum mechanical system.
128
00:12:28,280 --> 00:12:35,269
So its eigenvalues are sort of analogous to energy levels, and its eigenvectors are analogous to the wave functions of the network.
129
00:12:35,270 --> 00:12:43,219
Actually, I should be a little bit careful here. The adjacency matrix is analogous to minus the Hamiltonian of a quantum system,
130
00:12:43,220 --> 00:12:47,720
because its highest-lying eigenvalue is the one that plays the role of the ground
131
00:12:47,720 --> 00:12:51,320
state and the ones immediately below that play the role of the first excited state.
132
00:12:51,320 --> 00:12:55,790
So if you take a network, any network like this one here, this is a real network.
133
00:12:55,790 --> 00:13:02,120
It's a social network that I got from the social networks literature. You calculate its adjacency matrix and then you calculate its spectrum,
134
00:13:02,570 --> 00:13:06,020
you typically get something that looks like this. This is very typical of what you find.
135
00:13:06,200 --> 00:13:08,030
That's the ground state at the top.
136
00:13:08,660 --> 00:13:15,200
Then there's a few excited states well separated below it, and then there's a band of closely spaced states below that.
137
00:13:16,130 --> 00:13:22,190
This turns out to be what you see in many cases. And as we'll see, there's a reason for that, why you would expect that to be this sort of structure.
138
00:13:23,060 --> 00:13:28,940
Just like in quantum systems, we tend to focus most of our attention on the ground state and on the first few excited states.
139
00:13:28,940 --> 00:13:31,460
That turns out to be where most of the useful information is.
140
00:13:33,030 --> 00:13:41,940
So the claim is that by looking at the eigenvalues and eigenvectors, we can learn things about the large scale structure of this network.
141
00:13:41,950 --> 00:13:47,430
There are deep connections between these wave-function-like eigenvectors and large-scale structure in networks.
142
00:13:48,030 --> 00:13:52,380
Let me start by giving you two just very simple examples of that kind of connection.
143
00:13:53,610 --> 00:14:02,970
So here's one: centrality. A centrality is any measure that gives scores to the nodes in a network in an attempt to work out which ones are most important.
144
00:14:04,500 --> 00:14:08,969
Obviously, there's many different definitions of what you might mean by important nodes in a network,
145
00:14:08,970 --> 00:14:15,570
and there are correspondingly many centrality measures. But a simple one is just what we call degree centrality, or just plain degree.
146
00:14:15,810 --> 00:14:20,130
Again, that's the number of connections that a node has. It's just how many friends you have.
147
00:14:20,520 --> 00:14:24,330
So by this measure, the person who has the most friends is the most important person.
148
00:14:25,740 --> 00:14:28,920
Okay, so, you know, that's not totally stupid.
149
00:14:29,160 --> 00:14:35,140
That's quite a useful measure. It's really easy to calculate, but it has some shortcomings, too.
150
00:14:35,160 --> 00:14:40,560
In particular, basically all it's doing is it's giving you one point for everybody that you know.
151
00:14:41,340 --> 00:14:42,780
You just count up the number of people you know.
152
00:14:44,790 --> 00:14:51,540
But in real life, of course, not everybody you know is equally important for connecting you to the rest of the world.
153
00:14:51,870 --> 00:14:56,939
If I know the Prime Minister, then that instantly makes me a more important person,
154
00:14:56,940 --> 00:15:00,180
sort of by reflection, because I have the ear of the Prime Minister.
155
00:15:00,420 --> 00:15:04,320
So it's not just how many people you know. It's also who you know that matters.
156
00:15:04,680 --> 00:15:10,590
So a more sophisticated centrality measure is one where, instead of just getting one point for everybody I know,
157
00:15:10,830 --> 00:15:15,000
I get a number of points that's proportional to their score.
158
00:15:15,030 --> 00:15:21,930
My score is the sum of the scores of the people I know, and that's the sum of the scores of people they know and so on and so forth.
159
00:15:23,190 --> 00:15:30,600
So I can create a measure like this. Suppose I'm node i in the network:
160
00:15:31,650 --> 00:15:36,389
my score is going to be denoted x_i and is proportional, with a constant of proportionality
161
00:15:36,390 --> 00:15:43,020
which I write as one over lambda for reasons that'll become apparent, times the sum of the scores of all the people in my immediate neighbourhood.
162
00:15:43,020 --> 00:15:47,639
That means all the people around me in the network. So this is my neighbourhood in the network.
163
00:15:47,640 --> 00:15:51,620
And just summing over those. Another way of writing that would be as a sum over all nodes
164
00:15:51,690 --> 00:15:54,870
j, but sticking an element of the adjacency matrix in here.
165
00:15:55,020 --> 00:16:00,600
A_ij is an element of that matrix: it's one if there's a connection between nodes i and j, and zero if there isn't.
166
00:16:00,900 --> 00:16:06,270
Right. So this automatically just picks out the scores of my neighbours and sets all of the other ones to zero.
167
00:16:06,870 --> 00:16:12,390
So I can rearrange that equation and write it in matrix form, and it's just the matrix A times the vector x equals lambda times x,
168
00:16:12,600 --> 00:16:16,860
where the vector x is the vector that has all these centrality scores.
169
00:16:17,040 --> 00:16:24,510
So in other words, your centrality, your importance in the network, is just an element of an eigenvector of the adjacency matrix. Which
170
00:16:25,510 --> 00:16:30,430
Eigenvector should it be? Well, if you want your scores to be positive, then there's only one choice.
171
00:16:30,430 --> 00:16:34,690
It has to be the leading eigenvector. Perhaps the most famous theorem in linear algebra,
172
00:16:34,690 --> 00:16:40,540
the Perron–Frobenius theorem, tells you that there is only one eigenvector with all elements positive, subject to some conditions,
173
00:16:40,870 --> 00:16:44,260
and that's the one with the highest eigenvalue.
174
00:16:44,620 --> 00:16:52,030
So this is the ground state of the network, and it represents the centrality of nodes in the network.
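A minimal sketch of this eigenvector centrality (my illustration, not the speaker's code), assuming the symmetric adjacency matrix described earlier: the scores are the entries of the leading eigenvector, which the Perron–Frobenius theorem lets us take with all entries positive on a connected network.

```python
import numpy as np

def eigenvector_centrality(A):
    """Entries of the leading eigenvector of the adjacency matrix A,
    normalised to sum to 1. For a connected undirected network the
    Perron-Frobenius theorem guarantees one eigenvector whose entries
    all share a sign: the one with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(A)  # eigh sorts eigenvalues ascending
    x = vecs[:, -1]                 # leading eigenvector
    if x.sum() < 0:                 # fix the arbitrary overall sign
        x = -x
    return x / x.sum()

# A path 0-1-2: the middle node should score highest.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
scores = eigenvector_centrality(A)
```

On the path, the two end nodes have the same degree as each other but lower scores than the middle node, which is what "it's who you know" is meant to capture.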
175
00:16:52,040 --> 00:16:57,220
A famous example of the use of this kind of centrality measure is the Google search engine.
176
00:16:57,790 --> 00:17:06,280
They use a slight variant on this idea to decide which web pages are important and which web pages are not important when you do web searches.
177
00:17:06,580 --> 00:17:12,820
So apparently you can make a lot of money by doing this kind of stuff. Here's another example.
178
00:17:13,570 --> 00:17:21,870
Only slightly more complicated. Percolation is a process, of course, well known to lots of physicists.
179
00:17:21,880 --> 00:17:31,690
It's the process where you fill in or occupy at random some fraction of the sites or bonds on a lattice or the nodes and edges on a network.
180
00:17:33,130 --> 00:17:37,000
So here's a network, and I've coloured in some of the edges in this case.
181
00:17:37,480 --> 00:17:46,660
Percolation is used in physics, for instance, to model actual percolation, like the percolation of fluids through porous materials.
182
00:17:46,660 --> 00:17:52,090
In networks, it's used, for instance, to model the spread of disease.
183
00:17:52,420 --> 00:17:59,319
I've touched on this before. You can imagine I'm a person in a social network.
184
00:17:59,320 --> 00:18:06,700
I have some friends, and suppose I get the flu. I could give the flu to my friends, but I probably don't give it to all of them.
185
00:18:06,850 --> 00:18:10,690
I just give it to some of them, you know, depending on circumstances.
186
00:18:10,990 --> 00:18:15,670
So I can represent that by sort of colouring in the edges at random,
187
00:18:15,850 --> 00:18:20,860
with some probability we'll call it P to represent the probability that the
188
00:18:20,860 --> 00:18:24,850
disease is transmitted along those edges and then not along some other edges.
189
00:18:25,420 --> 00:18:31,060
So then, if I'm the red dot in the middle there who gets the flu, it's only going to spread along the coloured-in edges,
190
00:18:31,330 --> 00:18:38,500
meaning the disease spreads to a cluster, a percolation cluster of connected nodes within the network.
191
00:18:38,770 --> 00:18:45,040
So in other words, the percolating clusters in the network represent the sizes of potential outbreaks of the disease.
192
00:18:46,800 --> 00:18:51,900
And so if I can calculate those sizes, then I can say something about the spread of the disease.
193
00:18:52,020 --> 00:18:59,280
Perhaps the most famous property of percolation systems is that if you just start occupying sites or bonds at random,
194
00:19:00,600 --> 00:19:08,670
what you get is clusters of occupied sites which grow as you occupy more and more of them.
195
00:19:08,850 --> 00:19:16,440
Eventually they start to sort of gel together and eventually they all stick together to form one big cluster that spans all the way across the system.
196
00:19:17,690 --> 00:19:22,069
There is a sharp percolation threshold at which that transition happens, at which you
197
00:19:22,070 --> 00:19:27,200
go from separate clusters spread out to percolating all the way across the system.
198
00:19:27,320 --> 00:19:32,090
And in the epidemiology context, that corresponds to the so-called epidemic threshold,
199
00:19:32,270 --> 00:19:36,169
where you go from a regime where the disease is never going to spread very far,
200
00:19:36,170 --> 00:19:38,030
it's only going to spread to a small local cluster,
201
00:19:38,240 --> 00:19:43,190
to a regime where it can in principle spread all the way across the entire system and cause an epidemic outbreak.
202
00:19:43,550 --> 00:19:44,870
So the epidemic threshold,
203
00:19:45,110 --> 00:19:50,900
the way the epidemiologists talk about it, and the percolation threshold, the way physicists talk about it, are essentially the same thing.
204
00:19:51,320 --> 00:19:56,740
So you might ask, for instance, where does that percolation threshold happen?
205
00:19:56,750 --> 00:20:02,330
What is the critical value of this probability p at which percolation happens on a network?
206
00:20:02,720 --> 00:20:05,030
So I can address that by doing a calculation like this.
207
00:20:06,170 --> 00:20:11,840
Suppose I'm in the regime where there is this giant cluster, this percolating cluster in my network,
208
00:20:12,170 --> 00:20:14,690
and I want to ask, what's the probability that I belong to it?
209
00:20:14,960 --> 00:20:22,850
Actually, I'm going to define u_i to be the probability that I don't belong to the giant cluster; it turns out to be easier to do it that way.
210
00:20:23,390 --> 00:20:27,060
So u_i is the probability that I don't belong to the percolating cluster.
211
00:20:27,800 --> 00:20:36,080
So now consider: this is me, I'm node i. In order to not belong to the giant cluster, I have to be not connected to it via any of my neighbours.
212
00:20:36,320 --> 00:20:39,830
If I'm connected to it via even one of my neighbours, then I'm in the percolating cluster too.
213
00:20:40,550 --> 00:20:44,840
So what's the probability that I'm not connected to the percolating cluster via my neighbour j?
214
00:20:45,590 --> 00:20:53,180
Well, there's two ways that can happen. One is that that edge between us could be not occupied, meaning the disease never gets transmitted to me.
215
00:20:53,990 --> 00:20:58,700
That happens with probability one minus p: p is the occupation probability, so one minus p is the non-occupation probability.
216
00:20:59,090 --> 00:21:04,820
The alternative is that the edge could be occupied, but j itself is not a member of the giant cluster.
217
00:21:05,200 --> 00:21:08,810
Right. That happens with probability p for the edge being occupied, times u_j,
218
00:21:08,960 --> 00:21:11,570
which by definition is the probability that j is not in the giant cluster.
219
00:21:11,870 --> 00:21:17,899
So one minus p plus p times u_j is the total probability that my neighbour j doesn't
220
00:21:17,900 --> 00:21:20,960
give me the disease, that I'm not connected to the giant cluster via them.
221
00:21:21,350 --> 00:21:26,930
And then the total probability that I'm not in the giant cluster via any of my neighbours is just the product of that over all my neighbours.
222
00:21:28,120 --> 00:21:33,790
So this gives me a self-consistent condition that I can solve to find these probabilities.
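The self-consistent condition can be solved numerically by simple fixed-point iteration. A minimal sketch, assuming the notation above (u_i, occupation probability p, adjacency matrix A); this is my illustration, not code from the talk. Iterating from u = 0 lands on the nontrivial solution when one exists, rather than the trivial solution u = 1.

```python
import numpy as np

def non_percolation_probabilities(A, p, iters=500):
    """Solve u_i = prod_j (1 - p + p*u_j)^(A_ij) by fixed-point
    iteration. u_i is the probability that node i is NOT in the
    giant (percolating) cluster; raising each factor to the power
    A_ij makes non-neighbours contribute a factor of one."""
    u = np.zeros(A.shape[0])
    for _ in range(iters):
        u = np.prod((1.0 - p + p * u) ** A, axis=1)
    return u

# Complete graph on 4 nodes as a toy example.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
u_never = non_percolation_probabilities(K4, p=0.0)   # no transmission
u_always = non_percolation_probabilities(K4, p=1.0)  # certain transmission
```

With p = 0 nobody joins the giant cluster (all u_i = 1); with p = 1 on a connected network everybody does (all u_i = 0).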
223
00:21:34,780 --> 00:21:41,410
So now imagine lowering the percolation, the occupation probability, the probability of transmission of the disease.
224
00:21:41,560 --> 00:21:46,810
At some point you go through the percolation transition, and your giant cluster disappears.
225
00:21:47,020 --> 00:21:52,030
When that happens, all of these probabilities u_i must equal one. u_i is the probability of not being
226
00:21:52,030 --> 00:21:56,050
in the giant cluster; if there is no giant cluster, then everybody is not in it. So all of the probabilities are one.
227
00:21:56,800 --> 00:22:01,959
So that's how you know you've got to the transition: all of these probabilities go to one.
228
00:22:01,960 --> 00:22:07,540
So what we're going to do is just expand around that point. I'm going to write u_i as equal to one minus some small epsilon_i.
229
00:22:08,020 --> 00:22:11,230
It has to be minus because u, as a probability, has to be between zero and one.
230
00:22:12,110 --> 00:22:16,479
And I'm going to take that, substitute it into this equation here, and expand to leading order in epsilon.
231
00:22:16,480 --> 00:22:23,110
And you get this equation here, which can be rewritten in matrix form as the vector epsilon equals p_c,
232
00:22:23,380 --> 00:22:26,380
the critical occupation probability, times the adjacency matrix,
233
00:22:26,380 --> 00:22:35,630
times epsilon. So this immediately tells you that one over the critical occupation probability is an eigenvalue of the adjacency matrix,
234
00:22:35,630 --> 00:22:42,170
Which eigenvalue? It's again the leading one, because epsilon is a vector of probabilities; they all have to be positive.
235
00:22:42,380 --> 00:22:44,570
There's only one eigenvector that does that for you, the leading one.
236
00:22:45,590 --> 00:22:53,060
So by just looking at the leading eigenvalue, the ground state energy of this adjacency matrix,
237
00:22:53,270 --> 00:22:58,880
I can tell you the percolation threshold or alternatively, the epidemic threshold for this network.
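As a sketch of the result just stated (my illustration, not from the talk): the threshold is one over the leading eigenvalue of the adjacency matrix, checked here on a complete graph, where the leading eigenvalue is n − 1.

```python
import numpy as np

def epidemic_threshold(A):
    """p_c = 1 / (leading eigenvalue of the adjacency matrix A),
    from the leading-order expansion around the transition."""
    leading = np.linalg.eigvalsh(A)[-1]  # eigvalsh sorts ascending
    return 1.0 / leading

# Complete graph on n nodes: the leading eigenvalue is n - 1,
# so the threshold should be 1 / (n - 1).
n = 5
Kn = np.ones((n, n)) - np.eye(n)
p_c = epidemic_threshold(Kn)  # 0.25 for n = 5
```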
238
00:23:00,760 --> 00:23:04,630
Okay. So there are some examples of why these are useful things to look at.
239
00:23:05,920 --> 00:23:12,670
So what I want to do is understand the relationship between the values of these things,
240
00:23:12,670 --> 00:23:15,970
these eigenvalues and eigenvectors, and the structure of the network.
241
00:23:16,300 --> 00:23:20,140
And the way I'm going to do that is a very sort of physicist-type approach.
242
00:23:20,290 --> 00:23:25,689
I'm going to make simple models of networks, and then I'm going to calculate the eigenvalues and eigenvectors of those simple models,
243
00:23:25,690 --> 00:23:29,410
and I'm going to see what they do as a function of the structure of the model network.
244
00:23:29,980 --> 00:23:35,860
Before I do that, however, I want to make an important point. How would I go about actually calculating these things?
245
00:23:35,870 --> 00:23:43,000
Suppose, for instance, I wanted to calculate the leading eigenvector. So a simple way of doing that is just repeated multiplication.
246
00:23:43,180 --> 00:23:47,799
I can calculate the leading eigenvector of the matrix just by taking a random starting vector.
247
00:23:47,800 --> 00:23:51,670
It doesn't matter really what it is and then just multiply repeatedly by the matrix.
248
00:23:52,150 --> 00:23:58,870
If I do that, then the components in the direction of the leading eigenvector will get multiplied by the leading eigenvalue, the biggest one.
249
00:23:59,080 --> 00:24:01,240
Everything else will get multiplied by something smaller.
250
00:24:01,390 --> 00:24:06,030
And if I just keep doing this for long enough, I end up with a vector that's pointing in the direction of the leading eigenvector.
251
00:24:06,160 --> 00:24:10,090
It's called the power method. Very simple, but actually quite a good practical way of doing this.
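The power method just described can be sketched in a few lines (a toy illustration, not the speaker's code; the triangle graph is my own example):

```python
# Minimal sketch of the power method: repeatedly multiply a random starting
# vector by the matrix and normalise; it converges to the leading eigenvector.
import numpy as np

def power_method(A, iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])        # random starting vector
    for _ in range(iterations):
        v = A @ v
        v /= np.linalg.norm(v)        # keep the vector from blowing up
    return v, v @ A @ v               # leading eigenvector and its eigenvalue

# Example on the adjacency matrix of a triangle graph:
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
vec, val = power_method(A)
print(val)   # close to 2.0, the leading eigenvalue of the triangle
```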
252
00:24:10,570 --> 00:24:13,930
So now here's the question how long is this going to take?
253
00:24:14,440 --> 00:24:21,130
How long it's going to take depends on the separation between the leading eigenvalue and the next one, right?
254
00:24:21,190 --> 00:24:22,960
If they're far apart, it won't take long.
255
00:24:23,050 --> 00:24:27,850
But as they get close together, it's going to take more and more multiplications to separate those two apart.
256
00:24:28,120 --> 00:24:33,280
And if they actually cross over, then the time to do it is going to diverge, as you see in this picture here.
257
00:24:33,670 --> 00:24:36,999
So if you're someone who studies phase transitions, this looks very familiar.
258
00:24:37,000 --> 00:24:40,510
It looks like the thing that we call critical slowing down in a phase transition.
259
00:24:40,750 --> 00:24:43,540
If you study phase transitions numerically on a computer, for instance,
260
00:24:43,870 --> 00:24:50,320
you'll be familiar with a phenomenon where the time the algorithm takes to run diverges as you pass through a phase transition.
261
00:24:51,010 --> 00:25:01,239
So a wild conjecture might be that when the levels cross in the adjacency matrix, you have phase transitions going on in networks.
262
00:25:01,240 --> 00:25:07,300
This is again kind of like in quantum mechanics, right? When energy levels cross in quantum systems, you get quantum phase transitions.
263
00:25:07,480 --> 00:25:10,540
So it's a similar kind of idea and indeed it turns out to be true.
264
00:25:10,720 --> 00:25:19,970
And it's sort of interesting because it means that we can identify places where phase transitions should take place by looking for these crossings.
265
00:25:20,020 --> 00:25:25,209
I'll give you a few examples where that happens and we know what the phase transition is for.
266
00:25:25,210 --> 00:25:29,200
Interesting thing is there are lots of other examples where we get these crossings happening,
267
00:25:29,200 --> 00:25:32,080
but we don't know what the corresponding phase transition is.
268
00:25:32,230 --> 00:25:35,620
We believe there is some kind of phase transition there, but we don't actually know what it is.
269
00:25:36,640 --> 00:25:41,020
So let me give you some examples. I'm going to start with a very simple example.
270
00:25:42,460 --> 00:25:46,000
This is almost the simplest model network that you can imagine making.
271
00:25:46,000 --> 00:25:50,829
It's called the random graph. It's really simple.
272
00:25:50,830 --> 00:25:53,590
You just take n nodes and you throw down edges completely at random.
273
00:25:53,800 --> 00:25:58,060
So every possible edge is present with probability P or not with probability one minus P.
274
00:25:58,390 --> 00:26:02,080
That's it. That's the whole model. It just has two parameters: n, the number of nodes, and P,
275
00:26:02,080 --> 00:26:06,790
the probability of an edge. It's been studied for a long time; it was first proposed in the 1950s.
276
00:26:07,270 --> 00:26:11,889
A lot is known about this model. An important quantity here will be the average degree.
277
00:26:11,890 --> 00:26:18,549
So degree, again, remember, is the number of connections you have. The average degree of a node in this random network is just the number of nodes
278
00:26:18,550 --> 00:26:23,380
times the probability of connection: there are n nodes you can be connected to, and you've got probability P of being connected to each one.
279
00:26:23,560 --> 00:26:25,930
So n times P is your expected number of connections.
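A quick sketch (mine, not from the slides) of generating this model and checking the mean degree against n times P; strictly the expectation is (n minus 1) times P, since self-edges are excluded:

```python
# Generate the random graph G(n, p) and check the mean degree is close to n*p.
import numpy as np

rng = np.random.default_rng(42)
n, p = 2000, 0.002                    # illustrative parameter values
coin_flips = rng.random((n, n)) < p   # decide each potential edge
A = np.triu(coin_flips, k=1)          # keep the upper triangle, no self-loops
A = (A | A.T).astype(int)             # symmetrise: undirected network

degrees = A.sum(axis=1)
print(degrees.mean())                 # close to n * p = 4
```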
280
00:26:26,830 --> 00:26:34,180
So there's a well-known phase transition that happens in this model, very similar to the percolation transitions that we were just talking about.
281
00:26:34,450 --> 00:26:40,600
If P is zero, then you have no edges at all and everything's sort of separate, like this. If P is one,
282
00:26:40,750 --> 00:26:44,920
it all gels together into one big connected clump, what we call a component in the network.
283
00:26:45,070 --> 00:26:49,480
And in between these two extremes, there is known to be a percolation type phase transition.
284
00:26:50,080 --> 00:26:56,260
The clusters sort of get bigger and bigger as you increase P and at some point they gel together and you get a spanning cluster that spans the system.
285
00:26:58,090 --> 00:27:06,430
So above that transition, the structure of the network is as follows. A component in a network just means a clump of connected nodes.
286
00:27:06,440 --> 00:27:10,479
Each of these is a component, so you expect to have one percolating component,
287
00:27:10,480 --> 00:27:14,830
the giant component, so called, and a bunch of little ones scattered around the edges.
288
00:27:15,160 --> 00:27:18,280
That's the general structure that we expect to see. And there's a phase transition.
289
00:27:18,460 --> 00:27:24,010
That phase transition happens at the point where the average degree C is equal to one.
290
00:27:24,460 --> 00:27:31,210
That will be an important point in a moment. So now what I'm going to do is calculate the spectrum of this thing.
291
00:27:31,810 --> 00:27:35,379
And the claim is that there will be a crossing of the eigenvalues exactly
292
00:27:35,380 --> 00:27:39,640
at the point where this phase transition happens. We already know about this phase transition.
293
00:27:39,640 --> 00:27:45,640
This is just sort of a demonstrative example to show you that crossing of eigenvalues corresponds to phase transitions.
294
00:27:46,950 --> 00:27:50,490
So I don't want to spend a lot of this talk going through maths in detail.
295
00:27:50,640 --> 00:27:55,710
But I did want to run just briefly through this one calculation to show you the kinds of techniques involved.
296
00:27:56,070 --> 00:28:00,210
This is actually a very old calculation. It goes back to the 1950s as well.
297
00:28:00,900 --> 00:28:08,790
It's essentially equivalent to calculations you'll find in any graduate level textbook on random matrix theory.
298
00:28:09,180 --> 00:28:12,600
This kind of stuff has been done for a long time.
299
00:28:13,260 --> 00:28:17,400
So I'm going to calculate the spectrum of this random graph, in other words, this adjacency matrix.
300
00:28:18,600 --> 00:28:19,640
We do it in two steps.
301
00:28:19,650 --> 00:28:27,630
First of all, we're actually going to calculate the spectrum of the centred adjacency matrix, which is just the adjacency matrix minus its mean value.
302
00:28:27,900 --> 00:28:31,920
The mean value of any element in this matrix is just p, the probability of an edge.
303
00:28:32,640 --> 00:28:38,219
So that's the same as saying adjacency matrix minus p, subtract p from every element, which
304
00:28:38,220 --> 00:28:43,950
I write as this: p times one one-transpose, where one is the uniform vector of all ones.
305
00:28:45,090 --> 00:28:48,660
So that's the first step: to calculate the spectrum of that.
306
00:28:48,900 --> 00:28:52,170
And by the spectrum I mean the density of states, which is just the usual thing.
307
00:28:52,350 --> 00:28:59,220
It's just the function that has a delta function spike at every eigenvalue of the matrix.
308
00:28:59,880 --> 00:29:03,720
So it's a thing that looks like this. It's a bunch of spikes.
309
00:29:03,810 --> 00:29:09,960
It's the function where if you integrate it over any region, it'll just count the number of eigenvalues in that region.
310
00:29:11,100 --> 00:29:14,970
That's a difficult function to handle. It's got these spiky things in it,
311
00:29:15,210 --> 00:29:20,820
so we regularise it by putting some broadened function on each spike. In this particular
312
00:29:20,820 --> 00:29:25,770
case we'll use a Lorentzian, also called a Cauchy distribution by mathematicians.
313
00:29:25,950 --> 00:29:31,770
And then we're going to take the limit as the Lorentzian's width goes to zero to recover our density of states.
314
00:29:32,770 --> 00:29:33,880
So here's my density of states,
315
00:29:34,030 --> 00:29:41,260
and here's my Lorentzian with width eta. I can write it as this normalising coefficient divided by x squared plus eta squared, and in the limit
316
00:29:41,520 --> 00:29:42,460
eta goes to zero.
317
00:29:44,100 --> 00:29:53,370
I can factorise that denominator as x plus i eta times x minus i eta, and write it also in this form: the imaginary part of one over x plus
318
00:29:53,370 --> 00:29:57,450
i eta. That's another way of writing it. So this is one of the standard forms of the delta function.
319
00:29:57,660 --> 00:30:04,130
So what I do is I take that, I substitute into here, I get this expression for the density of states.
320
00:30:04,530 --> 00:30:07,620
Notice that it has an x plus i eta in the denominator.
321
00:30:07,740 --> 00:30:11,520
A common thing to do at this point is just to generalise this to the complex plane.
322
00:30:11,520 --> 00:30:15,270
Instead of thinking of this as function on the real line, let's just go into the complex plane,
323
00:30:15,450 --> 00:30:18,630
make it a function of x plus i eta, which is called z here.
324
00:30:18,750 --> 00:30:22,110
So here's rho of complex z. It's the same thing.
325
00:30:22,110 --> 00:30:26,700
It's just got a Z here. And just remember that at the end what we want to do is take the limit.
326
00:30:27,120 --> 00:30:35,129
eta goes to zero, i.e. the limit as you go back to the real line. Then after that it's just arithmetic and boring.
327
00:30:35,130 --> 00:30:41,520
We just do a series expansion of this function and we end up with a sum over powers of eigenvalues.
328
00:30:41,520 --> 00:30:48,330
The powers of eigenvalues here and sums of powers of eigenvalues can be written in terms of traces of powers of the original matrix.
329
00:30:48,660 --> 00:30:53,940
So you end up with this formula for the density of states down here, a well-known thing, the Stieltjes transform formula.
330
00:30:55,230 --> 00:31:01,500
And the only difficult thing here is calculating the trace of this matrix. But it's made easier by the fact that this is a random model.
331
00:31:01,830 --> 00:31:03,690
So what I'm going to do is just average over the randomness.
332
00:31:03,810 --> 00:31:11,700
Calculate the average of this density of states, in other words, the average of this trace here, and that turns out to be an easy thing to do.
333
00:31:13,790 --> 00:31:19,019
It turns out that if you're sort of sloppy about it, then it just vanishes. You can persuade yourself that all terms in the trace are actually zero.
334
00:31:19,020 --> 00:31:24,959
But this is wrong. You have to be a little more careful and realise that there are some traces where the same matrix element occurs
335
00:31:24,960 --> 00:31:30,780
twice and then you get the variance of that matrix element and that can give you terms that are non-zero.
336
00:31:31,470 --> 00:31:35,670
So you get a term that depends on a power of the variance times
337
00:31:36,090 --> 00:31:40,860
however many terms there are that don't vanish. So this becomes just a counting problem.
338
00:31:40,890 --> 00:31:45,930
It just becomes a question of working out how many terms don't vanish. I won't go through the details of the counting problem,
339
00:31:46,110 --> 00:31:52,080
but it turns out that the number of terms here is just equal to something called the Catalan number, which is a well-known thing in combinatorics.
340
00:31:52,350 --> 00:31:56,220
And so I can rewrite my expression in terms of the sum of these Catalan numbers.
341
00:31:56,310 --> 00:32:02,280
And that can be done in closed form. It has this expression here, take that substitute back into the formula I started off with,
342
00:32:02,460 --> 00:32:05,460
and you get this nice expression for the density of states.
343
00:32:05,670 --> 00:32:14,210
This is a famous, well-known expression, the so-called semicircle spectrum, originally derived by Eugene Wigner in the 1950s.
344
00:32:14,220 --> 00:32:21,840
It looks like that, and it works really well. That histogram there is an actual random graph that I generated on my computer the day before yesterday.
345
00:32:22,080 --> 00:32:26,700
Diagonalise it and you get a nice semicircle. The blue line is the prediction of the theory.
346
00:32:26,970 --> 00:32:33,360
So it works well, but we're not quite done yet because that was that centre adjacency matrix where we subtracted the mean.
347
00:32:33,780 --> 00:32:39,300
So you have to add the mean back on, meaning we're really solving an eigenproblem that looks like this one here.
348
00:32:40,480 --> 00:32:49,630
And you do a series of sort of boring manipulations and you end up with a neat little equation in which this average over here can be done in closed form.
349
00:32:49,810 --> 00:32:55,840
You get this thing here, this can be solved and you get a value for the eigenvalue here.
350
00:32:56,380 --> 00:33:02,170
Bottom line is basically when you add that shift back on, you add the mean back onto the matrix.
351
00:33:02,290 --> 00:33:11,110
Nothing changes. You still have your semicircle distribution, except for one thing: one of the eigenvalues breaks off and moves up to here, C plus one,
352
00:33:11,350 --> 00:33:15,520
where C again is the average degree of the network. So this is your final answer for your spectrum.
353
00:33:15,520 --> 00:33:24,670
It looks like this, and you can calculate the whole thing, the average expected spectrum, analytically in this case.
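As a numerical sketch of this final spectrum (my own toy check, with made-up parameters, not the figure from the talk): diagonalise a generated random graph and look at the top two eigenvalues, which should sit near C plus one and near the band edge two root C.

```python
# Toy check: for G(n, p) with mean degree c = n*p, the bulk eigenvalues fill a
# semicircle of half-width ~ 2*sqrt(c), and one eigenvalue splits off near c + 1.
import numpy as np

rng = np.random.default_rng(1)
n, c = 2000, 10.0
p = c / n
upper = np.triu(rng.random((n, n)) < p, k=1)  # each possible edge once
A = (upper | upper.T).astype(float)           # symmetric adjacency matrix

eigs = np.linalg.eigvalsh(A)                  # sorted ascending
print(eigs[-1])   # close to c + 1 = 11, the split-off leading eigenvalue
print(eigs[-2])   # close to the band edge 2*sqrt(c) ~ 6.3
```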
354
00:33:25,720 --> 00:33:32,640
So now look at this spectrum. Here's my C plus one; that's my ground state, my leading eigenvalue.
355
00:33:32,880 --> 00:33:39,030
And then here's the band that I was talking about. There's a continuous band of eigenvalues, and there's a band gap here.
356
00:33:40,150 --> 00:33:46,030
And the thing is, the size of that band gap depends on the one parameter, the average degree of the network, because
357
00:33:46,030 --> 00:33:53,050
the band edge here goes like two root C and the ground state goes like C, so I can change the size of that gap.
358
00:33:53,230 --> 00:33:56,620
If I change it enough, I can make those two eigenvalues crossover.
359
00:33:56,920 --> 00:34:01,329
Where do they cross over? Well, solve two root C equals C plus one and you find
360
00:34:01,330 --> 00:34:06,490
the answer is they cross over when C equals one, exactly the point at which that phase transition happens.
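Spelling out that one line of algebra:

```latex
2\sqrt{c} = c + 1
\;\Longrightarrow\;
c - 2\sqrt{c} + 1 = \left(\sqrt{c} - 1\right)^2 = 0
\;\Longrightarrow\;
c = 1 .
```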
361
00:34:07,300 --> 00:34:10,360
So this is not proving anything, but it's again suggestive.
362
00:34:10,630 --> 00:34:13,650
When you get crossings of these eigenvalues, you get phase transitions.
363
00:34:13,780 --> 00:34:18,720
This is not a new phase transition. We knew about this phase transition already, so we haven't discovered anything here; it was
364
00:34:18,800 --> 00:34:24,640
illustrative that crossings of the levels in these things correspond to phase transitions.
365
00:34:25,330 --> 00:34:30,310
So now let me give you two other examples.
366
00:34:31,650 --> 00:34:36,660
a little bit less trivial than this one, of phase transitions that happen in these networks.
367
00:34:37,470 --> 00:34:42,630
So the first one is just a slight extension of the model I already described.
368
00:34:42,810 --> 00:34:50,550
So I mentioned that a lot of these networks have hubs in them: they have a few nodes, like the Paul Erdős node, that have an awful lot of connections.
369
00:34:52,020 --> 00:34:55,889
So what's the effect of that going to be on my network?
370
00:34:55,890 --> 00:35:02,610
So I can model that by taking the same model as I had before, the random graph model, and I'm just going to add one extra node to it,
371
00:35:02,610 --> 00:35:06,150
that red node there, which is going to be my hub and it's going to have a very high degree.
372
00:35:06,660 --> 00:35:11,790
So it's going to have some probability Q of being connected to every other node in the network.
373
00:35:12,970 --> 00:35:18,790
And I'm going to write this Q equals D over n, meaning that the average degree of that node is going to be D, and I'm going to make D big.
374
00:35:19,270 --> 00:35:22,690
So it's a hub. It's the one node in the network that has a lot of friends.
375
00:35:23,110 --> 00:35:26,260
I'm going to ask, what does that what effect does that have on the spectrum?
376
00:35:26,530 --> 00:35:31,240
So I won't do the calculation again, but you can; it's not terrifically hard to do.
377
00:35:31,420 --> 00:35:35,860
You can do this calculation. What happens is that most of the spectrum stays the same,
378
00:35:36,010 --> 00:35:42,130
but you get one new eigenvalue with this particular value here, D over the square root of D minus C,
379
00:35:42,340 --> 00:35:46,330
where D is the average degree of the hub and C is the average degree of the rest of the network.
380
00:35:46,930 --> 00:35:48,400
So in other words, the spectrum looks like this.
381
00:35:48,670 --> 00:35:57,790
a semicircle, a band gap, and two states, the ground state and first excited state, outside the band there.
382
00:35:58,730 --> 00:36:05,600
So now I have two parameters D and C, and I can move those two eigenvalues around, and if I want, I can make them crossover.
383
00:36:06,050 --> 00:36:09,500
So the claim is that something interesting happens when they cross over.
384
00:36:10,790 --> 00:36:13,100
Where do they cross over? Well, that's easy: set
385
00:36:13,490 --> 00:36:20,460
D over the square root of D minus C equal to C plus one, solve, and I get that D is equal to basically the square of the degree in the rest of the network.
386
00:36:20,470 --> 00:36:26,360
So the claim is that when the average degree of the hub is, in order of magnitude, the square of the average degree of the rest of the network,
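Spelling out that algebra, with the hub eigenvalue written as d over the square root of d minus c as reconstructed from the slide:

```latex
\frac{d}{\sqrt{d - c}} = c + 1
\;\Longrightarrow\;
d^2 = (c+1)^2 \,(d - c)
\;\Longrightarrow\;
d \approx (c+1)^2 \approx c^2
\quad\text{for } d \gg c .
```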
387
00:36:26,540 --> 00:36:31,429
Something interesting will happen, and that's a situation that could easily happen in real networks.
388
00:36:31,430 --> 00:36:37,579
The hubs that we see in real networks are easily, often, the square of the average degree of the rest of the network.
389
00:36:37,580 --> 00:36:44,739
So it's not an uncommon situation at all. And indeed there is a transition that happens there, and it's an interesting one.
390
00:36:44,740 --> 00:36:46,690
It's a localisation transition.
391
00:36:47,200 --> 00:36:56,980
So we're looking at the ground state, the leading eigenvector, the thing that gives us the eigenvector centrality of the nodes.
392
00:36:57,430 --> 00:37:01,270
And normally, if you just make a plot of it, here's a figure.
393
00:37:01,780 --> 00:37:07,389
My student Travis made these. The heights of these little tents represent the eigenvector centralities.
394
00:37:07,390 --> 00:37:08,950
There's the hub in the middle.
395
00:37:09,070 --> 00:37:14,650
It's got the biggest one, which is perhaps not surprising, but there are plenty of others that have quite high centralities as well.
396
00:37:15,220 --> 00:37:24,520
Now you go through that transition where the levels cross, and suddenly what happens is all the weight condenses onto that hub.
397
00:37:24,760 --> 00:37:27,910
It behaves very like a defect in a condensed matter system.
398
00:37:28,000 --> 00:37:32,530
You get localisation around that defect and all the weight dies away from everything else.
399
00:37:33,190 --> 00:37:40,599
In effect, this means that the eigenvector centrality, used by Google and other people to analyse networks, has failed, because now
400
00:37:40,600 --> 00:37:47,589
all it does is identify the one defect, and it is no good at distinguishing between any of the other nodes in the network.
401
00:37:47,590 --> 00:37:50,740
So there's a failure mode of the eigenvector centrality.
402
00:37:52,410 --> 00:37:59,150
If you want to see more evidence that this really is a phase transition, here is the order parameter for this phase transition,
403
00:37:59,170 --> 00:38:05,820
something called the inverse participation ratio, which basically measures what fraction of the weight is deposited on the defect node,
404
00:38:06,540 --> 00:38:11,429
and it's basically zero. It's a vanishing fraction until you get to the phase transition and then suddenly it
405
00:38:11,430 --> 00:38:15,870
becomes significant; a large fraction of all of the weight is deposited on the defect node.
406
00:38:16,080 --> 00:38:24,360
And in these numerical studies, the positions of these transitions coincide nicely with the theoretical predictions, marked by those dotted lines there.
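The inverse participation ratio described here is easy to compute numerically. A sketch (my own toy version, with made-up parameters, planting a hub in the same spirit as the talk's figures): sum of the fourth powers of the normalised leading eigenvector's components, which is of order one over n for a delocalised vector and of order one for a localised one.

```python
# Inverse participation ratio (IPR) of the leading eigenvector: sum_i v_i^4.
import numpy as np

def ipr(A):
    eigvals, eigvecs = np.linalg.eigh(A)
    v = eigvecs[:, -1]                # leading eigenvector, already normalised
    return np.sum(v ** 4)

rng = np.random.default_rng(0)
n, c = 500, 8.0
upper = np.triu(rng.random((n, n)) < c / n, k=1)
A = (upper | upper.T).astype(float)

weak = ipr(A)                         # plain random graph: delocalised, IPR ~ 1/n
A_hub = A.copy()
A_hub[0, 1:] = A_hub[1:, 0] = 1.0     # plant a huge hub at node 0, d >> c^2
strong = ipr(A_hub)
print(weak, strong)                   # strong is much larger than weak
```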
407
00:38:26,180 --> 00:38:33,950
Okay. So those are two fairly simple examples of cases where the crossing of these levels in the
408
00:38:33,950 --> 00:38:38,870
network corresponded to an interesting structural phase transition going on in the network.
409
00:38:39,350 --> 00:38:46,370
So in the remaining time, I wanted to talk at a little more length about a third example,
410
00:38:46,370 --> 00:38:50,060
which is something that we've been working on in my group for some time now,
411
00:38:50,990 --> 00:38:56,000
and it concerns what we call community structure in networks, which is structure that looks like this.
412
00:38:56,000 --> 00:38:59,750
A lot of the networks we look at have this kind of structure where there's a clump of
413
00:38:59,750 --> 00:39:03,920
nodes and there's another clump of nodes and there's lots of connections within clumps,
414
00:39:04,400 --> 00:39:06,709
but there are relatively few connections between clumps.
415
00:39:06,710 --> 00:39:10,490
So in a social network, this might be a group of friends and this might be another group of friends,
416
00:39:10,640 --> 00:39:14,090
and they all know each other and they all know each other. But there are not many connections between two groups.
417
00:39:15,260 --> 00:39:21,500
So one very basic question that you could ask about structure like this in a network is, can I find it?
418
00:39:21,710 --> 00:39:26,780
If you give me a big network, can I find the groups, the communities in the network?
419
00:39:27,440 --> 00:39:31,760
If a network looks like this friendship network that I showed you earlier, then of course you can.
420
00:39:31,790 --> 00:39:37,670
It's easy. It's very easy to figure out what the groups are in this network. But if it looks like, say, this network, then it's a bit harder.
421
00:39:38,030 --> 00:39:39,410
If you look at this network,
422
00:39:39,710 --> 00:39:46,250
you could maybe convince yourself that, well, maybe there's a clump over here, maybe there's another one over here, but it's not entirely clear.
423
00:39:46,580 --> 00:39:51,950
So what we'd like is some automated way of teasing apart the structure of the network.
424
00:39:52,580 --> 00:39:55,880
So there are many ways, in fact, of doing this that have been proposed.
425
00:39:56,090 --> 00:39:59,390
But the most widely used by far is something called modularity maximisation,
426
00:39:59,630 --> 00:40:04,730
which is an optimisation method based on something called modularity.
427
00:40:04,730 --> 00:40:07,760
Modularity is a quality function.
428
00:40:07,760 --> 00:40:15,020
What it does is you give me a network and you give me a candidate division of that network into parts, into communities,
429
00:40:15,200 --> 00:40:23,030
and it gives a score to that division, which is good if it's a good division in a way that I'll define and bad if it's a bad division.
430
00:40:23,300 --> 00:40:27,590
And then modularity, maximisation is just let's look through every possible division of the network
431
00:40:27,590 --> 00:40:33,050
and find the one that gets the highest score and that's the best division. So it's turning it into an optimisation problem.
432
00:40:33,500 --> 00:40:37,230
So the whole trick here is to define the score, right?
433
00:40:38,000 --> 00:40:42,650
So what makes a good score is if you look at a picture like this,
434
00:40:43,370 --> 00:40:48,380
a good division of network is one that puts most of the edges inside the groups in the network.
435
00:40:48,500 --> 00:40:53,540
So you give me the candidate division into groups, and I go and count up how many of the edges are inside the groups.
436
00:40:55,040 --> 00:41:02,240
So a very simple example of a score that I could use is just the number of edges, or the fraction of edges, that fall inside groups.
437
00:41:03,080 --> 00:41:04,670
However, if you think about it for a moment,
438
00:41:04,670 --> 00:41:12,260
this is not a very good score because it's clear that the global maximum of that is let's just put everybody all in one big group together, right?
439
00:41:12,260 --> 00:41:15,770
Then 100% of your edges are inside groups, and you learn absolutely nothing.
440
00:41:15,890 --> 00:41:20,000
So that's not quite the right thing, but it's nearly the right thing. It's basically the right answer.
441
00:41:20,270 --> 00:41:25,310
So the score that does work is the so-called modularity, which is the number of edges within groups,
442
00:41:25,520 --> 00:41:30,920
minus the expected number of edges that would fall within groups if you just placed the edges at random.
443
00:41:32,030 --> 00:41:37,460
In other words, you're asking, are there more edges in groups than I would expect on the basis of chance?
444
00:41:37,850 --> 00:41:47,970
If there are, then I found something here. So in order to actually define that, I have to say how I'm going to randomise the positions of my edges.
445
00:41:48,150 --> 00:41:52,980
And there's more than one way you could do that. But in the simplest case, for instance, you could just randomise them uniformly.
446
00:41:53,190 --> 00:41:57,000
Just take all your edges and move them around completely uniformly at random.
447
00:41:57,440 --> 00:42:03,420
Okay. So you're comparing the actual number of edges within groups with the number that you would get after you randomise things.
448
00:42:04,840 --> 00:42:06,910
So there's a nice matrix formulation of this.
449
00:42:07,300 --> 00:42:12,730
So we've already defined the adjacency matrix, which is 1 or 0, depending on whether there's an edge between i and j.
450
00:42:14,070 --> 00:42:20,940
So A_ij, an element of the adjacency matrix, tells you the actual number of edges observed to fall between nodes i and j.
451
00:42:21,420 --> 00:42:23,069
It's either one or zero: the edge is either there or it isn't.
452
00:42:23,070 --> 00:42:31,800
Then you want to subtract from that the expected number of edges that would fall there if the edge positions were randomised.
453
00:42:32,520 --> 00:42:36,690
So that's the thing I'm calling here P_ij. In the simplest case,
454
00:42:36,690 --> 00:42:42,110
that would just be equal to the overall probability of an edge, if you just uniformly randomise them.
455
00:42:42,990 --> 00:42:48,390
But for whatever randomisation procedure you define, there's some P_ij, which is the probability of an edge.
456
00:42:49,320 --> 00:42:54,360
So the actual minus expected number of edges between nodes i and j is A_ij minus P_ij.
457
00:42:54,600 --> 00:43:00,510
And then the modularity itself is just the sum of that over all pairs of nodes that fall in the same group.
458
00:43:02,200 --> 00:43:07,090
So consider, for instance, the simple case where I'm just dividing the network into two groups.
459
00:43:07,960 --> 00:43:15,940
I'm going to do that. I'm going to turn that into something that I know about by putting an ising spin on every node of the network.
460
00:43:16,090 --> 00:43:19,920
So spin up or spin down represents whether I'm in group one or in group two.
461
00:43:19,930 --> 00:43:25,959
I'm just dividing into two, so plus or minus one. I've got some spin variable s_i, which is plus one if node i
462
00:43:25,960 --> 00:43:33,910
is in group one and minus one if it's in group two. So now it turns out I can write this modularity very simply in terms of that.
463
00:43:34,830 --> 00:43:40,930
I just do some arithmetic and I end up with this expression here, which has a leading constant that I don't care about,
464
00:43:40,930 --> 00:43:47,700
ignore this one over four, and then a sum over spins of s_i s_j B_ij.
465
00:43:47,830 --> 00:43:51,069
So these are the spin variables, and B_ij is just this thing we had before.
466
00:43:51,070 --> 00:43:54,190
It's just A_ij minus P_ij. So those are both matrices that I know.
467
00:43:55,140 --> 00:44:01,070
Right. So this thing looks very like some sort of disordered spin model that we're used to.
468
00:44:01,460 --> 00:44:09,170
And my goal is to maximise this. So you give me a network; that fixes this matrix B_ij, what's called the modularity matrix.
469
00:44:09,680 --> 00:44:13,579
And my goal is to maximise this by finding the spins that give me the largest
470
00:44:13,580 --> 00:44:17,270
value of the modularity that gives me my division of the network into two groups.
471
00:44:18,790 --> 00:44:21,220
So I can't do that, really. That's a hard problem.
472
00:44:21,670 --> 00:44:27,850
If there are n spins, there are two to the n possible configurations of spins, too many of them to look through exhaustively.
473
00:44:28,420 --> 00:44:38,250
So I'm going to use an approximate strategy. I'm going to write my modularity in matrix terms here as vector s transpose times matrix B times vector s.
474
00:44:38,320 --> 00:44:40,890
So matrix B is again this modularity matrix that we had
475
00:44:47,800 --> 00:44:51,910
on the previous slide, and vector s is just the n-element vector that has the Ising spins in it, the plus and minus ones.
476
00:44:47,800 --> 00:44:51,910
So the goal is: you give me a B, I want to find the s to maximise this number, right?
477
00:44:53,170 --> 00:44:57,520
So how do I do that? It's a hard maximisation, a discrete maximisation I can't do.
478
00:44:57,970 --> 00:45:09,070
So I relax it: I use a relaxation method in which, instead of confining the spins to be only plus one or minus one, I let them be any real number.
479
00:45:09,760 --> 00:45:13,719
Okay, so this is a fairly standard thing. In statistical physics language,
480
00:45:13,720 --> 00:45:20,980
I'm using a spherical model instead of an Ising model. It then becomes an easy problem to solve,
481
00:45:20,980 --> 00:45:24,430
because there are continuous variables and I can just differentiate to find where the maximum is.
482
00:45:24,640 --> 00:45:30,460
And it turns out the maximum is given by the leading eigenvector of this matrix, the modularity matrix.
483
00:45:31,520 --> 00:45:35,750
That's only an approximation to the true answer, but it turns out it's an approximation that works well.
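This relaxed spectral method can be sketched in a few lines (my own minimal version, not the speaker's code; I use the common degree-based choice P_ij = k_i k_j / 2m for the null model, which is one of the randomisation choices mentioned, and a made-up toy network of two cliques):

```python
# Sketch of spectral modularity maximisation: build the modularity matrix
# B = A - k k^T / 2m, take its leading eigenvector, and split the nodes by
# the signs of its entries (the relaxed Ising spins).
import numpy as np

def spectral_bisection(A):
    k = A.sum(axis=1)                 # degree vector
    two_m = k.sum()                   # twice the number of edges
    B = A - np.outer(k, k) / two_m    # modularity matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    v = eigvecs[:, -1]                # leading eigenvector
    return np.where(v >= 0, 1, -1)    # Ising-spin group labels

# Toy network: two 5-cliques joined by a single bridging edge (nodes 4 and 5).
A = np.zeros((10, 10))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1

labels = spectral_bisection(A)
print(labels)   # nodes 0-4 in one group, nodes 5-9 in the other
```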
484
00:45:36,500 --> 00:45:43,730
Here's one fun example that I got from all this again. This is a network of books.
485
00:45:43,970 --> 00:45:52,220
So nodes in this network represent books. They're all books about US politics, and they're all written around 2004,
486
00:45:52,220 --> 00:45:58,700
the time of the last US presidential election, the one between George W Bush and John Kerry,
487
00:45:59,150 --> 00:46:02,420
and they're conveniently colour coded blue and red here.
488
00:46:02,420 --> 00:46:05,569
For some reason in America, blue means left wing and red means right wing.
489
00:46:05,570 --> 00:46:10,650
I have no idea why. It's just the opposite to everyone else in the world. But so blue and red.
490
00:46:10,670 --> 00:46:16,790
So like over here on the red side, you have The Faith of George W. Bush, and over here you have The Lies of George W. Bush.
491
00:46:18,470 --> 00:46:23,690
So Valdis Krebs went through all of these and decided whether they were left-wing books or right-wing books by reading the blurb.
492
00:46:23,990 --> 00:46:27,110
And then what are the edges in this network?
493
00:46:27,440 --> 00:46:32,600
The edges are frequent co-purchasing on the online bookseller Amazon.com.
494
00:46:33,590 --> 00:46:37,640
So you go on Amazon, you know, the thing that says people who bought this book also bought these other books.
495
00:46:38,180 --> 00:46:41,690
So it's that, right? He just looked up all of these books on Amazon and made this network.
496
00:46:41,960 --> 00:46:46,010
So the point here is that everybody is just preaching to the choir.
497
00:46:46,010 --> 00:46:49,489
People who buy blue books buy other blue books; people who buy red books buy other red books.
498
00:46:49,490 --> 00:46:53,720
In other words, we believe that there is strong community structure in this network, two communities,
499
00:46:53,930 --> 00:46:58,880
Democrats and Republicans, and indeed, if you feed it through this spectral algorithm,
500
00:46:59,030 --> 00:47:06,500
you find two very nice communities that correspond very closely to the classification of the books that Valdis Krebs did based on their blurbs.
501
00:47:06,590 --> 00:47:12,590
So I'm only using purchasing information; I'm telling you nothing about the titles, the contents, the authors, anything about the books.
502
00:47:12,830 --> 00:47:20,240
And yet you can very clearly see the polarisation of politics into left and right just by looking at the structure of this network.
503
00:47:21,660 --> 00:47:24,989
Okay. So fine. This is a nice algorithm.
504
00:47:24,990 --> 00:47:28,500
It works well in practice for actually finding structure.
505
00:47:28,830 --> 00:47:31,920
It's a widely used method for finding community structure in networks.
506
00:47:33,180 --> 00:47:37,880
What I want to ask, though, is: how do we expect it to behave?
507
00:47:37,910 --> 00:47:40,340
I want to understand how it works. And again,
508
00:47:40,340 --> 00:47:51,140
I'm going to do that by looking at a simple model of a network that has community structure in it and asking what the spectrum is going to look like.
509
00:47:51,710 --> 00:47:56,840
So the model that I use in this case is only a little bit more complicated than the other models I've been talking about.
510
00:47:56,990 --> 00:48:01,640
It's something called a stochastic block model. It's a well-known model that goes back to the 1980s.
511
00:48:02,210 --> 00:48:06,080
The definition is I take some number of nodes. I divide them up into groups.
512
00:48:06,560 --> 00:48:10,340
And then I throw down edges at random, just like I did in the random graph.
513
00:48:10,700 --> 00:48:19,370
But now with two different probabilities for edges: p_in is the probability of edges inside groups and p_out is the probability of edges between groups.
514
00:48:19,550 --> 00:48:24,050
I'm going to make p_in bigger than p_out, so I'm going to get more edges inside the groups and a few edges between the groups.
515
00:48:24,200 --> 00:48:28,340
In other words, I'm going to get a network that looks very like the ones I sketched before.
516
00:48:28,370 --> 00:48:32,990
It's got traditional community structure, dense clumps of nodes in the network.
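The generative process just defined, dividing the nodes into groups and then placing each possible edge independently with probability p_in inside groups and p_out between them, might be written like this (a minimal sketch with two equal-sized planted groups; the function name and parameter values are my own):

```python
import numpy as np

def sbm_adjacency(n, p_in, p_out, rng):
    """Stochastic block model: symmetric 0/1 adjacency matrix with two
    equal planted groups (first half vs. second half of the nodes)."""
    groups = np.arange(n) // (n // 2)
    same = groups[:, None] == groups[None, :]
    prob = np.where(same, p_in, p_out)               # per-pair edge probability
    upper = np.triu(rng.random((n, n)) < prob, k=1)  # independent coin flips
    A = upper.astype(int)
    return A + A.T, groups

rng = np.random.default_rng(0)
A, groups = sbm_adjacency(200, 0.5, 0.05, rng)
```

With p_in well above p_out, the within-group edge density comes out far higher than the between-group density, giving exactly the dense clumps of traditional community structure.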
517
00:48:35,690 --> 00:48:39,649
This model is still sufficiently simple, however,
518
00:48:39,650 --> 00:48:46,220
that we can actually calculate the entire spectrum of the matrix for it so we can understand how this algorithm is going to work.
519
00:48:46,580 --> 00:48:50,240
So again, I won't go through the calculation in detail, but it's more random matrix theory.
520
00:48:50,420 --> 00:49:00,620
The end result is a semicircle like before and an outlying eigenvalue here, and we can calculate where the band edge is and where that eigenvalue lies.
521
00:49:00,860 --> 00:49:04,849
The parameters here, c_in and c_out, are just rescaled versions of those
522
00:49:04,850 --> 00:49:09,500
probabilities p_in and p_out, rescaled by the number n of nodes in the network.
523
00:49:09,920 --> 00:49:11,900
So we can calculate the entire spectrum.
524
00:49:12,080 --> 00:49:21,680
And again, my conjecture is that if two of these eigenvalues cross over, like, for instance, if this one meets the band edge,
525
00:49:21,920 --> 00:49:25,940
I'm going to get some interesting structural phase transition in my system.
526
00:49:25,940 --> 00:49:29,840
And indeed that turns out to be true. So.
527
00:49:31,960 --> 00:49:39,370
So the argument goes something like this: if there were no community structure in my network at all,
528
00:49:40,000 --> 00:49:46,030
in other words, if c_in were just equal to c_out, then this eigenvalue would disappear altogether,
529
00:49:46,030 --> 00:49:47,530
And I would just have the semicircle.
530
00:49:49,070 --> 00:49:59,870
The way to see that is that this matrix B that I'm using here is precisely the shifted adjacency matrix that I talked about before.
531
00:50:00,050 --> 00:50:05,540
It's the adjacency matrix minus its average value, its expected value.
532
00:50:05,720 --> 00:50:08,900
So I get just the semicircle and nothing else.
533
00:50:09,170 --> 00:50:13,790
when I do that calculation. So if there is no community structure, I expect this eigenvalue to disappear.
534
00:50:14,250 --> 00:50:19,160
Right? So the presence of that value reveals to me that I have structure in my network.
535
00:50:20,360 --> 00:50:29,150
That's how I can tell that it's there, right? So now what I do is I'm going to vary the values of c_in and c_out,
536
00:50:29,300 --> 00:50:33,080
make them closer together, which makes the structure in my network weaker and weaker.
537
00:50:33,350 --> 00:50:42,140
And when they actually meet, it's going to vanish altogether. But the interesting thing is that that eigenvalue disappears before that point.
538
00:50:42,290 --> 00:50:45,080
It meets the edge of the band and so disappears into the band.
539
00:50:45,230 --> 00:50:50,000
You'd think it might happen when there was no community structure in the network, when c_in was just equal to c_out.
540
00:50:50,210 --> 00:50:56,440
But it actually happens before that. There is an earlier point where there's still community structure in the network.
541
00:50:56,450 --> 00:51:00,920
We know it's there because we put it there, right? c_in and c_out are still different from each other.
542
00:51:01,130 --> 00:51:04,520
So the structure is there, but the spectrum doesn't reveal it.
543
00:51:04,760 --> 00:51:09,440
It just disappears, and it happens at this phase transition point here; you can calculate where the transition is.
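For the two-group case, the transition point has a simple closed form: writing c = (c_in + c_out)/2 for the mean degree, the structure is detectable by the spectral method only when c_in − c_out > 2√c. That is the threshold as reported in the Nadakuditi–Newman random-matrix analysis; the function below is just a sketch of the condition, with the name my own:

```python
import math

def detectable(c_in, c_out):
    """Spectral detectability condition for a two-group stochastic block
    model: the outlying eigenvalue sits outside the semicircle band only
    when c_in - c_out exceeds twice the square root of the mean degree."""
    c = (c_in + c_out) / 2.0       # mean degree
    return (c_in - c_out) > 2.0 * math.sqrt(c)
```

For instance, c_in = 10, c_out = 2 (mean degree 6, gap 8 > 2√6 ≈ 4.9) lies above the transition, while c_in = 7, c_out = 5 (same mean degree, gap 2) lies below it, even though planted structure is still present there.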
544
00:51:09,920 --> 00:51:17,380
All right. So I won't go through this in detail, but you can calculate it.
545
00:51:17,390 --> 00:51:22,220
So the order parameter in this case is basically: how many nodes do we get right?
546
00:51:22,730 --> 00:51:27,860
Right. So the problem is I give you this network which has, say, two planted communities in it.
547
00:51:28,070 --> 00:51:31,100
I know who belongs to which communities because I made up everything.
548
00:51:31,490 --> 00:51:36,680
Right? I created the network myself. And then I feed it into this algorithm and I ask, does it get it right?
549
00:51:36,950 --> 00:51:41,030
How many nodes does it classify right? So that's my order parameter in this particular case.
550
00:51:41,180 --> 00:51:47,450
And it turns out I can calculate that. So the idea basically is you look at the eigenvectors of this matrix.
551
00:51:47,580 --> 00:51:49,640
You look at the elements of this leading eigenvector,
552
00:51:49,790 --> 00:51:56,030
and you classify the nodes into one group or another, depending on whether they're plus one or minus one.
553
00:51:56,120 --> 00:52:01,940
Right. Spin up or spin down. Well, they're not exactly plus one or minus one, because remember, we relaxed the problem.
554
00:52:02,120 --> 00:52:07,010
We made them real numbers. But they're actually not a bad approximation: a bunch of them are plus and a bunch of them are minus.
555
00:52:07,130 --> 00:52:12,050
So in this particular case, the first 500 nodes were group one and the second 500 were group two.
556
00:52:12,860 --> 00:52:19,460
However, some of the nodes, it gets wrong and you can see that those are the ones where the line crosses zero.
557
00:52:19,640 --> 00:52:25,190
Right. We're classifying them into one group or the other, depending on whether they're positive spins or negative spins.
558
00:52:25,520 --> 00:52:28,520
So if the line crosses zero and a positive one becomes negative,
559
00:52:28,700 --> 00:52:33,920
You got it wrong. So it's the number of places where it crosses the zero line.
560
00:52:34,130 --> 00:52:38,770
That's your estimate of how many spins you classified wrongly.
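Putting the pieces together, the whole numerical experiment, planting two communities, forming the modularity matrix, taking the leading eigenvector, and counting how many nodes land on the correct side of zero, can be sketched as follows (parameter values are illustrative, and accuracy is measured up to the overall sign ambiguity of the eigenvector):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
truth = np.arange(n) // (n // 2)   # planted labels: first half 0, second half 1

# Stochastic block model with strong structure (p_in much larger than p_out).
prob = np.where(truth[:, None] == truth[None, :], 0.3, 0.05)
upper = np.triu((rng.random((n, n)) < prob).astype(int), k=1)
A = (upper + upper.T).astype(float)

k = A.sum(axis=1)                  # degrees
B = A - np.outer(k, k) / k.sum()   # modularity matrix (k.sum() equals 2m)
v = np.linalg.eigh(B)[1][:, -1]    # leading eigenvector

guess = (v > 0).astype(int)        # classify nodes by the sign of their element
frac = (guess == truth).mean()
acc = max(frac, 1.0 - frac)        # the eigenvector's overall sign is arbitrary
```

With structure this strong, the method classifies essentially every node correctly; pushing p_in down towards p_out drives the accuracy towards the 0.5 coin-toss floor described below.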
561
00:52:39,260 --> 00:52:44,060
Okay. We can actually calculate that exactly, again by doing some random matrix theory.
562
00:52:45,050 --> 00:52:49,340
We can get an exact expression for the number of spins we expect to classify correctly.
563
00:52:49,550 --> 00:52:53,370
So here's what the order parameter looks like. It actually doesn't go to zero.
564
00:52:53,390 --> 00:52:57,580
It goes to 0.5, because 0.5 is complete failure.
565
00:52:57,590 --> 00:53:01,399
In this case, I've got two groups of nodes.
566
00:53:01,400 --> 00:53:05,870
I could classify things correctly at the 50% level just by tossing a coin.
567
00:53:06,320 --> 00:53:10,190
Right. So the worst you can do is 50%, not zero here.
568
00:53:10,400 --> 00:53:17,060
So this is the fraction of correctly classified nodes. When the structure in the network is very strong, you actually get 100% of them right.
569
00:53:17,270 --> 00:53:21,020
But there is a sharp phase transition at which you fail completely.
570
00:53:21,200 --> 00:53:28,400
The solid lines here are the theoretical calculation, and the dots are numerical calculations on real networks.
571
00:53:28,640 --> 00:53:32,630
And they agree pretty well, except for some sort of fuzziness around the phase transition.
572
00:53:34,350 --> 00:53:40,950
So what this is saying is in this region, over here to the left, there is structure in the network, but you cannot detect it.
573
00:53:41,520 --> 00:53:45,330
In fact, it turns out that the result is even stronger than this.
574
00:53:45,780 --> 00:53:52,260
There is some beautiful recent work by a group in France, Aurélien Decelle, Florent Krzakala and collaborators in Paris,
575
00:53:52,680 --> 00:54:00,630
which shows that not only does this spectral algorithm fail to detect the structure in the network below this phase transition,
576
00:54:00,810 --> 00:54:04,560
but every algorithm must fail below this transition.
577
00:54:05,940 --> 00:54:09,100
This is, I think, a really deep and interesting finding.
578
00:54:09,120 --> 00:54:15,480
It says that there exist networks that have structure in them, but it's fundamentally impossible to detect it.
579
00:54:15,690 --> 00:54:17,490
No algorithm will detect it.
580
00:54:18,000 --> 00:54:25,230
It's kind of like Gödel's incompleteness theorem, which shows that there are results that are true, but you'll never be able to prove that they're true.
581
00:54:25,380 --> 00:54:29,459
It's a similar kind of thing. These networks, they're artificial.
582
00:54:29,460 --> 00:54:34,860
You know, we're not really interested in the stochastic block model. But it's interesting as an existence proof.
583
00:54:35,250 --> 00:54:40,260
It tells you that there are certain kinds of structures in networks which you will never be able to detect.
584
00:54:40,680 --> 00:54:43,740
And this is a new understanding that we didn't have before.
585
00:54:43,830 --> 00:54:50,160
You know, we might have hoped, just as Bertrand Russell hoped that you could prove everything in mathematics.
586
00:54:50,310 --> 00:54:56,310
We might have hoped that if there is structure in a network, we will always be able to come up with some way of detecting it.
587
00:54:56,700 --> 00:55:04,050
This result tells us that that is not true, that there are some kinds of structure in networks which will always be undetectable, even though they're there.
588
00:55:04,170 --> 00:55:08,760
We know that in this case because we put them there and yet no method can detect them.
589
00:55:09,600 --> 00:55:16,890
I'm just about out of time, so I should stop talking now. Before I do, I just want to say thank you to my terrific collaborators on this work.
590
00:55:17,610 --> 00:55:22,470
The first three are students and former students in my group; the last are faculty collaborators.
591
00:55:22,710 --> 00:55:25,920
These folks gave us money. Thank you for your attention. That's it.