1
00:00:15,950 --> 00:00:21,200
So, yes, machine learning has ventured into many parts of physics,
2
00:00:21,200 --> 00:00:25,160
string theory is one, and that's happened quite recently, so I'm going to talk about that.
3
00:00:25,160 --> 00:00:34,280
And the plan is: I'll give you a little bit of motivation and the history, which is going to be very short.
4
00:00:34,280 --> 00:00:40,220
I'll go over some machine learning basics, which is going to do something very similar to what Ard did earlier,
5
00:00:40,220 --> 00:00:42,800
but perhaps phrasing it slightly differently.
6
00:00:42,800 --> 00:00:49,520
So maybe that's helpful, because hearing it twice, presented in a slightly different way, might be useful.
7
00:00:49,520 --> 00:00:54,800
I have to tell you a little bit about string theory, because otherwise I can't explain what the applications are going to be.
8
00:00:54,800 --> 00:01:03,570
And then eventually I'm going to put these two together and tell you what you might be able to do with machine learning and string theory.
9
00:01:03,570 --> 00:01:08,820
So let me give you a short history, and it's really a very short history,
10
00:01:08,820 --> 00:01:13,830
because this subject, machine learning and string theory, started about three years ago,
11
00:01:13,830 --> 00:01:20,070
and I think it's fair to say that it started in Oxford Theoretical Physics with these three gentlemen.
12
00:01:20,070 --> 00:01:27,300
One of them was a postdoc here — he is happily smiling, standing in front of the CMS detector at CERN, where he is now a fellow.
13
00:01:27,300 --> 00:01:33,130
And this is Sven Krippendorf, who was a postdoc at the time; he's now a long-term fellow in Munich. And Yang-Hui
14
00:01:33,130 --> 00:01:38,630
He, who is at City, University of London, but has a long affiliation with Oxford.
15
00:01:38,630 --> 00:01:46,850
And four years ago, we were discussing precisely how we would get this topic started,
16
00:01:46,850 --> 00:01:50,660
how we would go about applying machine learning in string theory.
17
00:01:50,660 --> 00:02:00,650
And this led to two papers, one by Fabian, one by Yang-Hui, which were kind of the first papers exploring this sort of thing.
18
00:02:00,650 --> 00:02:05,450
So this was two years ago, and there's been a burst of activity since.
19
00:02:05,450 --> 00:02:07,310
But of course, it's a very new subject,
20
00:02:07,310 --> 00:02:13,910
so people are still exploring; there are no final conclusions yet, and there won't be any final conclusions today.
21
00:02:13,910 --> 00:02:21,930
So all I can tell you about is basically what this beginning looks like and what we're hoping for.
22
00:02:21,930 --> 00:02:29,130
OK, so what are the motivations, why do you want to do this in the first place?
23
00:02:29,130 --> 00:02:37,170
Now, string theory leads to very large data sets, and I'll explain a bit later why that is.
24
00:02:37,170 --> 00:02:43,800
But these are data sets which are very different from the usual data sets that you use, such as pictures and videos.
25
00:02:43,800 --> 00:02:51,210
And these numbers keep changing, but the current world record for the number of solutions in string theory is this somewhat ridiculous number here.
26
00:02:51,210 --> 00:02:57,220
Right? And so these are really, really large data sets, but quite different.
27
00:02:57,220 --> 00:03:01,960
And, of course, machine learning provides techniques to deal with large sets of data,
28
00:03:01,960 --> 00:03:08,890
so it's an obvious thought that you might be able to put those two together and be able to make some progress.
29
00:03:08,890 --> 00:03:13,810
So can we uncover features of string data using techniques from machine learning?
30
00:03:13,810 --> 00:03:17,160
So that's the obvious question.
31
00:03:17,160 --> 00:03:28,920
There's a perhaps slightly less obvious question, which has to do with the example that I talked about earlier, the DeepMind Go enterprise.
32
00:03:28,920 --> 00:03:30,780
So let me come back to that.
33
00:03:30,780 --> 00:03:39,360
So as you can see, the number of possible Go games is quite large, but still a lot less than the number of string solutions.
34
00:03:39,360 --> 00:03:47,450
But it is also a lot larger than the number of possible chess games, which is why Go was a challenge for a very long time, computer-wise.
35
00:03:47,450 --> 00:03:52,370
And these curves illustrate what DeepMind was able to do in this context:
36
00:03:52,370 --> 00:04:07,480
the dashed line here corresponds to the supervised learning system that they initially devised, which beat the world champion Lee Sedol.
37
00:04:07,480 --> 00:04:15,550
So this was a system that was trained using human Go games — basically, with many human Go games.
38
00:04:15,550 --> 00:04:20,260
And this blue line up here that goes up,
39
00:04:20,260 --> 00:04:28,480
that is the reinforcement learning curve, which was for a system that did not use human input.
40
00:04:28,480 --> 00:04:35,650
It just knew the rules of Go and learnt to play the game by playing against itself.
41
00:04:35,650 --> 00:04:41,290
As you can see, it trained for about two days — and after two days, its strength —
42
00:04:41,290 --> 00:04:47,320
so this vertical axis is the strength with which it plays — after about two days,
43
00:04:47,320 --> 00:04:57,130
its strength exceeded the strength of the supervised system and, of course, then also the strength of the world champion.
44
00:04:57,130 --> 00:05:03,250
So that's quite impressive. And this curve here makes it even more impressive, because these two curves —
45
00:05:03,250 --> 00:05:08,500
they show the number of humanlike moves that the system made in this context — humanlike
46
00:05:08,500 --> 00:05:13,360
moves, that is, professional player moves. And the fact that this blue curve,
47
00:05:13,360 --> 00:05:22,780
which corresponds to the reinforcement learning, is below the other curve says that the reinforcement learning system, which hadn't used
48
00:05:22,780 --> 00:05:28,090
any human input, is in fact making moves that are not humanlike — it invents new moves.
49
00:05:28,090 --> 00:05:37,780
Yet it is stronger than the best human player. And they illustrate this in a paper by presenting some of these moves that the system came up with.
50
00:05:37,780 --> 00:05:46,420
So that is quite impressive, and that makes you think: well, perhaps if such a system can reveal new structures in a board game,
51
00:05:46,420 --> 00:05:55,900
maybe it can reveal new structures in physics or mathematics, and we might be able to learn something new in those contexts as well.
52
00:05:55,900 --> 00:06:00,610
So just to summarise, the two basic questions are then: first, exactly this one, right?
53
00:06:00,610 --> 00:06:07,780
Can we somehow use machine learning to reveal structures — mathematical structures — within string theory?
54
00:06:07,780 --> 00:06:13,940
And this is sort of related to one of the questions we had earlier: can we understand better what the system is actually learning?
55
00:06:13,940 --> 00:06:18,850
Can we look at it as more than a black box?
56
00:06:18,850 --> 00:06:27,830
And this is all very new, so I will illustrate this with a paper that is just from last year.
57
00:06:27,830 --> 00:06:31,970
And then the second question is the one that was posed earlier, which is:
58
00:06:31,970 --> 00:06:39,260
can we somehow use machine learning to help sort through this enormous amount of data?
59
00:06:39,260 --> 00:06:45,200
OK, so let me go through some of the basics. This is very similar to what Ard did, but perhaps presented slightly differently.
60
00:06:45,200 --> 00:06:51,140
I think a very useful language, if you know it, is just the basic language of mathematics.
61
00:06:51,140 --> 00:06:57,770
So if you just remember a few bits of very basic mathematics, neural networks are actually very easy to understand.
62
00:06:57,770 --> 00:07:03,260
And the first thing to say is that you should, for the time being, just think of a neural network as a box,
63
00:07:03,260 --> 00:07:07,310
which corresponds to a function f-theta, which takes as input
64
00:07:07,310 --> 00:07:14,570
an n-dimensional vector of real numbers and produces an m-dimensional output of real numbers.
65
00:07:14,570 --> 00:07:19,400
And it's actually more complicated than that: it's not just a single function.
66
00:07:19,400 --> 00:07:27,740
It's a whole family of functions which depends on some set of parameters that we collectively call theta, right?
67
00:07:27,740 --> 00:07:37,010
And so the art of training a neural network is to somehow pick these parameters theta in some suitable fashion.
68
00:07:37,010 --> 00:07:42,740
And how do we do that? Well, typically — in supervised learning, which is what I'm discussing here —
69
00:07:42,740 --> 00:07:51,170
we have a training set, which consists of instances of inputs x and intended target outputs
70
00:07:51,170 --> 00:07:58,280
y. So these y are sort of the real results that we would like to get, right?
71
00:07:58,280 --> 00:08:09,140
So if the x were pictures of cats and dogs, then y would be one or zero — one for dogs, zero for cats, for example.
72
00:08:09,140 --> 00:08:15,410
And how do we train this? Well, we form a function like this, which is called the loss function, right?
73
00:08:15,410 --> 00:08:21,080
So we look at the difference between the output of the neural net
74
00:08:21,080 --> 00:08:30,390
for a certain input, and given adjustable parameters theta; we subtract from it the correct intended output
75
00:08:30,390 --> 00:08:37,910
y, we square the whole thing, and we sum this over some sort of batch from this training sample, right?
76
00:08:37,910 --> 00:08:43,130
And then we think of this whole expression as a function of those parameters theta.
77
00:08:43,130 --> 00:08:50,000
And we try to minimise it by a certain method that is normally taken to be what is called stochastic gradient descent,
78
00:08:50,000 --> 00:08:55,220
which basically means you go down the steepest gradient in the theta direction, and
79
00:08:55,220 --> 00:09:00,740
you keep repeating this as you pick batches from this training set.
80
00:09:00,740 --> 00:09:06,800
And you hope that that way you get to a well-trained network.
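(As an aside: a minimal sketch in Python of this train-by-batches loop. The finite-difference gradient, toy model and learning rate are illustrative assumptions, not the speaker's code — real libraries compute the gradient by backpropagation.)

```python
import numpy as np

def sgd(loss, theta, batches, lr=0.01, eps=1e-4):
    # Stochastic gradient descent: for each batch drawn from the
    # training set, estimate the gradient of the loss with respect
    # to the parameters theta (here by finite differences, purely
    # for illustration) and step down the steepest direction.
    for batch in batches:
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            d = np.zeros_like(theta)
            d[i] = eps
            grad[i] = (loss(theta + d, batch) - loss(theta - d, batch)) / (2 * eps)
        theta = theta - lr * grad
    return theta

# Toy usage: recover a in y = a*x from noisy samples, squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 0.1 * rng.normal(size=100)

def sq_loss(theta, batch):
    xb, yb = batch
    return np.sum((theta[0] * xb - yb) ** 2)

batches = [(X[i:i + 10], y[i:i + 10]) for i in range(0, 100, 10)] * 50
print(sgd(sq_loss, np.array([0.0]), batches))  # close to [3.]
```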
81
00:09:06,800 --> 00:09:15,830
And then, of course, you use unseen data of the same kind to test this network, to see if it generalises well —
82
00:09:15,830 --> 00:09:27,100
in other words, if the loss on this data, computed for the values of the parameters that you've arrived at, is actually sufficiently small.
83
00:09:27,100 --> 00:09:28,870
And if that works out,
84
00:09:28,870 --> 00:09:38,950
you declare success, and you use the so-trained network, with the parameters now set to those specific values, to make predictions.
85
00:09:38,950 --> 00:09:43,610
All right. So that's the basic process. Now I have to say a little bit about this black box here.
86
00:09:43,610 --> 00:09:51,190
So what's in the black box? This is the simplest version of what could be in this box, which is called the perceptron.
87
00:09:51,190 --> 00:09:56,470
And the perceptron is — well, it just performs a sequence of two steps.
88
00:09:56,470 --> 00:10:06,460
The first one is just an affine map, right? So it takes an input vector x, forms the dot product with a vector w, and adds a b.
89
00:10:06,460 --> 00:10:10,690
And then the second step is: it applies some function to the output of that.
90
00:10:10,690 --> 00:10:15,550
So if you combine this thing in mathematical language, this is what the function looks like.
91
00:10:15,550 --> 00:10:19,930
And the vector w is called the weights,
92
00:10:19,930 --> 00:10:24,790
b is called the bias, and this is the activation function.
93
00:10:24,790 --> 00:10:30,140
And these two sets of parameters together are what I called theta previously.
94
00:10:30,140 --> 00:10:33,610
Right, so these are the things that you want to train in this particular
95
00:10:33,610 --> 00:10:37,330
context. And the activation function is what Ard drew on the board earlier —
96
00:10:37,330 --> 00:10:44,050
for example, there's this one choice, which is this function that sort of interpolates between zero and one.
97
00:10:44,050 --> 00:10:52,650
But there are other choices. OK. One thing that hopefully you remember is that this kind of equation here, which is evaluated
98
00:10:52,650 --> 00:10:57,390
in the first step, is very closely related to the equation of a plane, or a hyperplane.
99
00:10:57,390 --> 00:11:02,370
If I have a hyperplane, it is defined by this equation, right? In two dimensions
100
00:11:02,370 --> 00:11:08,510
it would just be a line — if you think of x as a 2D vector, it would just be the equation of a line.
101
00:11:08,510 --> 00:11:19,980
All right. Now, with this sort of geometry in the back of your mind, it's very easy to understand what the system does.
102
00:11:19,980 --> 00:11:26,010
Suppose that your vector x is such that it is above the line, right?
103
00:11:26,010 --> 00:11:30,950
Then the output of the first element here will be positive.
104
00:11:30,950 --> 00:11:37,160
All right, that's what it means to be above the line: because zero is exactly on the line, and greater than zero is above the line, right?
105
00:11:37,160 --> 00:11:44,540
If this output is positive, you are somewhere here on the branch of this activation function where it is plus one.
106
00:11:44,540 --> 00:11:51,950
So the output will be roughly plus one. If it's below the line, then this affine transformation is negative.
107
00:11:51,950 --> 00:11:56,630
You'll be on the negative branch of this sigmoid and the output will be zero.
108
00:11:56,630 --> 00:12:06,020
So you can see what this is doing is a very basic system which decides whether a given point is above or below a line or a plane.
109
00:12:06,020 --> 00:12:12,230
All right. So it's a very simple example, if you want, of pattern recognition.
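(A minimal sketch of this perceptron in Python — the specific weights, the bias and the sigmoid choice of activation are illustrative assumptions:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # Step 1: the affine map w.x + b; step 2: the activation function.
    return sigmoid(np.dot(w, x) + b)

# The hyperplane w.x + b = 0 here is the line y = x in the plane;
# points above it give outputs near one, points below near zero.
w, b = np.array([-10.0, 10.0]), 0.0
print(perceptron(np.array([0.0, 1.0]), w, b))  # above the line -> ~1
print(perceptron(np.array([1.0, 0.0]), w, b))  # below the line -> ~0
```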
110
00:12:12,230 --> 00:12:15,810
OK, so this is actually a hands on subject.
111
00:12:15,810 --> 00:12:21,840
So I want to do something hands on. I want to actually show you how this works in real time.
112
00:12:21,840 --> 00:12:31,650
So — this will take a little while, but let's see. These systems are set up, for example, within Mathematica,
113
00:12:31,650 --> 00:12:40,800
so here is a set of random points generated in a box, and there are two kinds, the blue and the yellow,
114
00:12:40,800 --> 00:12:44,790
and you see they're roughly separated by a line. All right.
115
00:12:44,790 --> 00:12:48,160
But let's suppose that we don't actually know that just yet.
116
00:12:48,160 --> 00:12:56,370
We want to train a system to actually recognise that line, so it can distinguish between the two kinds of points.
117
00:12:56,370 --> 00:13:00,780
So this is the training set plotted — this is what it looks like in practice, right?
118
00:13:00,780 --> 00:13:03,930
It has the two coordinates, the X and the Y coordinate.
119
00:13:03,930 --> 00:13:13,580
And then the target is either one, if it's a blue point, or two, if it's a yellow point.
120
00:13:13,580 --> 00:13:18,330
OK, so then we can define ourselves a perceptron.
121
00:13:18,330 --> 00:13:23,880
So here is the perceptron — that's how Mathematica displays it; this first bit is basically the affine transformation.
122
00:13:23,880 --> 00:13:28,990
This second bit is the logistic sigmoid activation function.
123
00:13:28,990 --> 00:13:33,730
And then we can train this in real time.
124
00:13:33,730 --> 00:13:40,100
And what you see here is precisely the loss function that I defined earlier.
125
00:13:40,100 --> 00:13:43,860
And so if this goes down, that's a good thing.
126
00:13:43,860 --> 00:13:53,910
And the orange one is the loss on the training set, and the blue curve is the loss on the validation set.
127
00:13:53,910 --> 00:14:00,780
You expect the blue one to be higher than the orange one, but you definitely want both of them to go down.
128
00:14:00,780 --> 00:14:11,700
And now it's finished. And I can go and extract the values of the weights and the bias from my network,
129
00:14:11,700 --> 00:14:18,300
and I can plot, from those, the line that they define in two dimensions.
130
00:14:18,300 --> 00:14:22,380
And if I do that, that's what I get, right? Not surprisingly, right.
131
00:14:22,380 --> 00:14:27,870
So — someone was asking earlier, can you understand better what a neural network does?
132
00:14:27,870 --> 00:14:33,140
This is a little bit of an understanding of what it does, but of course, in a very, very simple case.
133
00:14:33,140 --> 00:14:37,730
So this clearly distinguishes the blue and the yellow points.
134
00:14:37,730 --> 00:14:44,570
And so if you now had an arbitrary point — picked an arbitrary set of x and y coordinates — and fed it into this network,
135
00:14:44,570 --> 00:14:58,830
you would get a zero output or a one output. And depending on what you get, you would be able to decide whether the point is above or below that line.
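(The readout step of the demo, sketched in Python; the trained values used here are hypothetical placeholders:)

```python
import numpy as np

def line_from_weights(w, b):
    # The perceptron's decision boundary is w[0]*x + w[1]*y + b = 0;
    # solving for y gives the slope and intercept of the learnt line.
    return -w[0] / w[1], -b / w[1]

# Hypothetical trained weights and bias, just to show the readout:
slope, intercept = line_from_weights(np.array([-1.9, 2.1]), 0.05)
print(slope, intercept)
```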
136
00:14:58,830 --> 00:15:06,570
OK, so this was the simplest building block. But of course, it gets more complicated.
137
00:15:06,570 --> 00:15:14,130
You can look at several of these in parallel. So each one of these is now one of those perceptrons from the previous slide.
138
00:15:14,130 --> 00:15:20,190
You can look at m of them in parallel, but they have independent weights and independent biases.
139
00:15:20,190 --> 00:15:27,530
And now the output, of course, is not a single real number — it's a vector with m components.
140
00:15:27,530 --> 00:15:31,040
And of course, that's a very inefficient way of writing this down. A much better way,
141
00:15:31,040 --> 00:15:37,670
if you sort of remember basic vectors and matrices, is to combine all these weights into a matrix,
142
00:15:37,670 --> 00:15:42,500
which I call W, and to combine all these biases into a vector, which I call b.
143
00:15:42,500 --> 00:15:48,050
And of course, these are then all the parameters that I called theta previously.
144
00:15:48,050 --> 00:15:53,450
And you can sort of symbolise this whole operation like
145
00:15:53,450 --> 00:16:02,720
so. Now this becomes just a multiplication of a vector with a matrix, plus an extra bit.
146
00:16:02,720 --> 00:16:07,900
And then there's an activation function as before. So,
147
00:16:07,900 --> 00:16:12,150
so one way of saying what this does is: where the previous system learnt about the
148
00:16:12,150 --> 00:16:17,700
existence of a single hyperplane, this system, which combines such perceptrons in parallel,
149
00:16:17,700 --> 00:16:22,530
learns about the existence of m hyperplanes.
150
00:16:22,530 --> 00:16:28,710
And then, of course, you can go further: you can take one of those building blocks — one of these layers of perceptrons in parallel —
151
00:16:28,710 --> 00:16:34,650
and you can construct from it several layers; you can just apply them sequentially, one after the other.
152
00:16:34,650 --> 00:16:36,480
And of course, in each step in general,
153
00:16:36,480 --> 00:16:42,180
you change the dimension of your input, depending on the size of this weight matrix.
154
00:16:42,180 --> 00:16:49,700
So I've indicated that with those various numbers n, n1, n2 and so forth.
155
00:16:49,700 --> 00:16:55,770
And of course, what these dimensions are depends on the details.
156
00:16:55,770 --> 00:17:01,380
And there's a much longer story there, but that is the basic structure.
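(A minimal Python sketch of this layered structure; the sigmoid activation, the random weights and the particular dimensions are illustrative assumptions:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(x, params):
    # params is a list of (W, b) pairs, one per layer; each weight
    # matrix may change the dimension, n -> n1 -> n2 -> ... -> m.
    for W, b in params:
        x = sigmoid(W @ x + b)
    return x

# Illustrative dimensions only: 2 inputs, 4 hidden units, 1 output
# (the same shape as the four-perceptron demo coming up).
rng = np.random.default_rng(0)
dims = [2, 4, 1]
params = [(rng.normal(size=(dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
print(mlp(np.array([0.3, -0.7]), params))
```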
157
00:17:01,380 --> 00:17:14,500
And I want to show you another example where this is done in a slightly more complicated way.
158
00:17:14,500 --> 00:17:21,730
So it's the same sort of principle, but now the set of points is a lot less simple, right?
159
00:17:21,730 --> 00:17:24,200
It forms a sort of pattern, right?
160
00:17:24,200 --> 00:17:33,460
And as before, we would like the neural network to distinguish between the blue and the yellow points, but they're not linearly separated anymore.
161
00:17:33,460 --> 00:17:41,180
So we try the same method as before — we just try a simple perceptron,
162
00:17:41,180 --> 00:17:47,440
right, the one that represents a line — and we try to train this.
163
00:17:47,440 --> 00:17:53,200
Well, that doesn't work, right? As you can see, the loss is not really getting very small.
164
00:17:53,200 --> 00:17:57,190
I mean, you can see the numbers here — it's sort of 0.5, right?
165
00:17:57,190 --> 00:17:59,960
That's not very impressive.
166
00:17:59,960 --> 00:18:06,890
And that is, of course, expected: you would not expect a single line to be able to tell the difference between those two shapes.
167
00:18:06,890 --> 00:18:16,820
All right. So we can stop this, but we can still look at what it's actually done — and it's done something very silly.
168
00:18:16,820 --> 00:18:27,330
Right? OK. But of course, that was a silly way of going about it anyway.
169
00:18:27,330 --> 00:18:36,720
So we do something more complicated, right? We look at a neural network which has in its first layer four perceptrons in parallel.
170
00:18:36,720 --> 00:18:45,450
All right. So this is how Mathematica displays it: it says that four of these perceptrons are arranged
171
00:18:45,450 --> 00:18:52,980
in parallel, and then we have a sort of final layer to put them all back together and make the output a single real number.
172
00:18:52,980 --> 00:18:57,000
So we train again.
173
00:18:57,000 --> 00:19:19,640
And while initially you get a little bit worried, then clearly there is a difference — it starts going down quite dramatically.
174
00:19:19,640 --> 00:19:30,340
OK, we can probably stop it now. And then we look at the same picture as before: we read out the weights.
175
00:19:30,340 --> 00:19:34,900
And so this is what it's done. If I run this again, it would probably do something else,
176
00:19:34,900 --> 00:19:40,120
but the point is that it's somehow arranged the four lines that those four perceptrons correspond to in a
177
00:19:40,120 --> 00:19:48,010
way that allows you to distinguish the blue and the yellow points, based on whether they're above or below
178
00:19:48,010 --> 00:19:57,030
each one of those four lines. And you can see how you could generalise this to get to proper pattern recognition.
179
00:19:57,030 --> 00:20:01,670
And in fact, this is something we can actually do.
180
00:20:01,670 --> 00:20:06,870
Ard was talking about this MNIST set, which is this set of handwritten numbers.
181
00:20:06,870 --> 00:20:10,440
It's sort of a standard test set that has been used.
182
00:20:10,440 --> 00:20:15,150
Here is a small sample of the set. So it contains these handwritten numbers.
183
00:20:15,150 --> 00:20:21,420
And of course, the target is what they actually represent — which numbers they actually are.
184
00:20:21,420 --> 00:20:27,780
And I can use a network which is in practice very similar to the previous one,
185
00:20:27,780 --> 00:20:32,190
just a bit more complicated: it uses the 28 times 28 inputs,
186
00:20:32,190 --> 00:20:42,320
which is the pixel size of these handwritten numbers, and then just goes to an output.
187
00:20:42,320 --> 00:20:43,660
And of course, I should say this,
188
00:20:43,660 --> 00:20:50,990
these are all the handwritten numbers from zero to nine, and I've just picked the one and the nine here as an illustration.
189
00:20:50,990 --> 00:20:57,840
So we have a binary classifier. And we can train this thing.
190
00:20:57,840 --> 00:21:23,280
As before. Right — and it works quite well.
191
00:21:23,280 --> 00:21:29,330
OK, let's stop it now.
192
00:21:29,330 --> 00:21:31,970
So then this is just giving you an impression,
193
00:21:31,970 --> 00:21:40,340
so this uses a test set — something that has not been used for training, a fraction of the original set.
194
00:21:40,340 --> 00:21:49,190
I've just fed the pictures into the trained network, and the right-hand side, after the arrow, is the output that the network actually provides.
195
00:21:49,190 --> 00:21:57,520
So as you can see, in practically all cases that are plotted, it actually correctly identifies whether it's a one or a nine.
196
00:21:57,520 --> 00:22:09,120
And in fact, the number down here tells you that there's a 99.7 percent accuracy of predicting the correct outcome.
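(How such an accuracy figure can be computed on unseen data, as a hedged sketch; encoding the two classes as 0 and 1 is an assumption here — the actual demo uses Mathematica's own classifier output:)

```python
import numpy as np

def accuracy(net, X_test, y_test):
    # Fraction of unseen test images whose thresholded network
    # output matches the true label (classes encoded as 0 and 1).
    preds = np.array([net(x) for x in X_test]) > 0.5
    return np.mean(preds == y_test)
```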
197
00:22:09,120 --> 00:22:16,610
OK. Right, so that was the multilayer perceptron.
198
00:22:16,610 --> 00:22:25,280
There's one more type of network I want to discuss, which is within the context of unsupervised learning.
199
00:22:25,280 --> 00:22:33,080
So that's learning where you're not providing the targets; you're kind of hoping that the network
200
00:22:33,080 --> 00:22:37,910
will discover the pattern within the data by itself, without being told anything about the answer.
201
00:22:37,910 --> 00:22:42,410
And this particular network is called an autoencoder.
202
00:22:42,410 --> 00:22:45,260
So let me explain how that works.
203
00:22:45,260 --> 00:22:51,080
So the first part of the autoencoder has pretty much the same structure as the multilayer perceptron that we had on the previous slide —
204
00:22:51,080 --> 00:22:54,140
And so just a sequence of such layers.
205
00:22:54,140 --> 00:23:03,470
And this is combined with another such multilayer perceptron, where now the dimensions go in the opposite direction.
206
00:23:03,470 --> 00:23:10,220
So this one goes, in a sequence of steps, from an n-dimensional vector down to a vector with some smaller dimension.
207
00:23:10,220 --> 00:23:16,100
The other starts from that same smaller dimension and gets back to an n-dimensional one.
208
00:23:16,100 --> 00:23:23,210
And it's done in such a way that in the top one, the dimension is decreasing as I go from left to right.
209
00:23:23,210 --> 00:23:28,750
and then in the bottom one, the dimension is increasing as you go from left to right.
210
00:23:28,750 --> 00:23:33,580
OK. And the idea is
211
00:23:33,580 --> 00:23:35,530
about what happens somewhere here in the middle.
212
00:23:35,530 --> 00:23:42,010
Well, first of all, I will combine these two networks, right — this first network is what is called the encoder —
213
00:23:42,010 --> 00:23:46,180
and I will feed the output of that into the input of the second one.
214
00:23:46,180 --> 00:23:51,190
And that way, I will ensure that my input goes through this bottleneck,
215
00:23:51,190 --> 00:23:57,080
keeping in mind that the dimension here in the middle is a lot smaller typically than the one you started with.
216
00:23:57,080 --> 00:24:01,120
Right? And what do we train on? Well, I don't provide targets.
217
00:24:01,120 --> 00:24:08,160
I just have a set of possible inputs x that I put into the autoencoder on the left.
218
00:24:08,160 --> 00:24:19,380
And what I will try to minimise as the loss function in this case just makes sure that the input x and the output are the same.
219
00:24:19,380 --> 00:24:23,910
All right. So I'm trying to reproduce, at the end, whatever is fed in here.
220
00:24:23,910 --> 00:24:28,500
But while doing so, I have to feed it through this bottleneck.
221
00:24:28,500 --> 00:24:38,880
Right. So this must mean that somehow, in the middle, this autoencoder must learn some successful compression of the data.
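(A minimal sketch of this encoder-decoder shape in Python. The 784 = 28 × 28 input size and the two-dimensional bottleneck anticipate the demo that follows; the intermediate width of 64, the sigmoid activations and the random weights are illustrative assumptions:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

# Encoder: 784 -> 64 -> 2 (the bottleneck); decoder: 2 -> 64 -> 784,
# mirroring the dimensions back up.
rng = np.random.default_rng(0)
def make(dims):
    return [(rng.normal(scale=0.1, size=(dims[i + 1], dims[i])),
             np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
encoder, decoder = make([784, 64, 2]), make([2, 64, 784])

x = rng.random(784)              # stand-in for a 28x28 image
latent = forward(x, encoder)     # compressed 2D representation
x_out = forward(latent, decoder) # attempted reconstruction
loss = np.sum((x - x_out) ** 2)  # minimise |input - output|^2
```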
222
00:24:38,880 --> 00:24:56,260
OK. So again, let's do an example of that.
223
00:24:56,260 --> 00:25:03,200
So this is for the same data set, the MNIST data set. And this starts looking a bit more complicated:
224
00:25:03,200 --> 00:25:09,380
as you can see, it takes the 28 times 28 inputs, which correspond to one of those pictures.
225
00:25:09,380 --> 00:25:13,390
It increases the dimension first, but then it starts decreasing:
226
00:25:13,390 --> 00:25:21,690
It goes all the way to two dimensions in the middle and then it goes back up to the same size again.
227
00:25:21,690 --> 00:25:31,180
So let's train this thing.
228
00:25:31,180 --> 00:25:40,920
So this is now the loss — remember that the loss measured the difference between the input and the output.
229
00:25:40,920 --> 00:25:47,100
So this looks reasonably good. And it's finished now.
230
00:25:47,100 --> 00:25:55,920
And so — now, this actually isn't very good, so maybe I should run this again.
231
00:25:55,920 --> 00:25:57,480
So this actually happens, right?
232
00:25:57,480 --> 00:26:05,070
This is something that can happen: depending on the initialisation of your neural network,
233
00:26:05,070 --> 00:26:26,720
sometimes you might not go in the right direction. Let's hope this time it's better.
234
00:26:26,720 --> 00:26:32,120
No, it's not. OK, anyway — that's not what was supposed to happen.
235
00:26:32,120 --> 00:26:36,020
Of course, what was supposed to happen is that the blue and yellow points
236
00:26:36,020 --> 00:26:40,400
would be split apart, and I'll show you an example where it actually did work later on.
237
00:26:40,400 --> 00:26:46,160
Right. So the blue points correspond to the nines, the yellow points correspond to the ones, right?
238
00:26:46,160 --> 00:26:51,020
So that's the idea: that this autoencoder would be able to tell
239
00:26:51,020 --> 00:26:55,280
them apart without actually telling the machine that there are these two types.
240
00:26:55,280 --> 00:27:01,730
Yeah — I mean, with a bit of goodwill, you could maybe see it anyway.
241
00:27:01,730 --> 00:27:10,520
So again, yeah, the axes: remember that the autoencoder would compress everything to something two-dimensional in the middle.
242
00:27:10,520 --> 00:27:21,500
So these two dimensions are the two axes. It's kind of a latent space in the middle.
243
00:27:21,500 --> 00:27:33,600
OK. Right. OK, so complete switch of topics.
244
00:27:33,600 --> 00:27:40,430
Let's go to string theory. So let me remind you about some basics of string theory, because I'll have to put all this into context.
245
00:27:40,430 --> 00:27:48,200
So string theory is a theory of strings, which are meant to be the fundamental constituents of nature, and they come in two types, the open and the closed strings.
246
00:27:48,200 --> 00:27:51,470
The open strings are, well, exactly what you think they are,
247
00:27:51,470 --> 00:27:59,760
and when they propagate, they sweep out this sheet; the closed strings sweep out cylinders instead as they move along.
248
00:27:59,760 --> 00:28:09,720
And it's a theory that starts out with one free, undetermined dimensionful constant, which is the string tension,
249
00:28:09,720 --> 00:28:16,110
and which is, it turns out, only consistently defined in 10 or 11 dimensions —
250
00:28:16,110 --> 00:28:23,320
and this is where the root cause of the trouble is, as we will see very shortly.
251
00:28:23,320 --> 00:28:29,170
Now, the spectrum of this theory, very schematically, is pretty much what you would expect for a string.
252
00:28:29,170 --> 00:28:36,700
So the masses of the excitations — the massless excitations, and the massive excitations with mass M —
253
00:28:36,700 --> 00:28:41,830
are measured in units of this string tension, which is 1/(2π α′).
254
00:28:41,830 --> 00:28:51,420
And in those units, they're basically integers. And amongst the modes here where N equals zero — so amongst the massless modes —
255
00:28:51,420 --> 00:29:00,260
we always find a graviton, and we always find the kind of force carriers that we need to mediate the strong and the electroweak forces.
256
00:29:00,260 --> 00:29:10,710
So in other words, string theory generically always has the typical types of forces — gravitation and the forces that we know exist in nature.
257
00:29:10,710 --> 00:29:15,450
So from that point of view, it looks like a reasonable starting point.
258
00:29:15,450 --> 00:29:24,600
And it turns out that the string tension — because gravity is always in there and we somehow have to reproduce Newton's constant,
259
00:29:24,600 --> 00:29:30,750
which corresponds, in the appropriate units, to a very large energy, 10 to the 19 GeV —
260
00:29:30,750 --> 00:29:38,310
because we have to reproduce that, it turns out that in most cases the string tension gets tied to the Newton constant.
261
00:29:38,310 --> 00:29:43,230
In other words, it is very large — basically of the order of the Planck scale.
262
00:29:43,230 --> 00:29:49,530
And for that reason, all these modes with N bigger than zero, which you might have worried about —
263
00:29:49,530 --> 00:29:51,930
they become very, very heavy. All right.
264
00:29:51,930 --> 00:30:05,400
So the idea, then, is that the physics that we currently observe will only be tied to these modes with N equal to zero.
265
00:30:05,400 --> 00:30:09,810
OK, so what about these dimensions? That's really very embarrassing.
266
00:30:09,810 --> 00:30:17,940
And the way to get out of that is that we need to think of six or seven of them — depending on whether we start in 10 or 11 dimensions —
267
00:30:17,940 --> 00:30:25,680
as being curled up at a very small scale, something string theorists call compactification.
268
00:30:25,680 --> 00:30:29,070
So schematically: you start in 10 or 11 dimensions,
269
00:30:29,070 --> 00:30:39,060
you curl up six or seven of these dimensions on a space that I will keep calling X, and then effectively — at least at scales
270
00:30:39,060 --> 00:30:42,300
which are much longer than the scale at which you've curled these things up —
271
00:30:42,300 --> 00:30:46,770
you end up with a four-dimensional theory, as you would like to.
272
00:30:46,770 --> 00:30:51,230
And the spaces on which you do this curling up can be very complicated.
273
00:30:51,230 --> 00:30:54,780
That is a picture of one of them. They're called Calabi-Yau
274
00:30:54,780 --> 00:31:00,790
manifolds. But this is a very particular one, called the bicubic, which I'll come back to later.
275
00:31:00,790 --> 00:31:07,520
But that's the basic process.
276
00:31:07,520 --> 00:31:14,460
And then the question is: how does the four-dimensional effective theory that you obtain actually depend on the way that you curl this up?
277
00:31:14,460 --> 00:31:17,400
And here is, schematically, how that works.
278
00:31:17,400 --> 00:31:26,670
One feature of the curling up is, of course, its topology — I can only draw 2D curled-up manifolds here,
279
00:31:26,670 --> 00:31:31,320
so I am drawing a torus and a sphere; these are different topologies.
280
00:31:31,320 --> 00:31:37,590
And the four-dimensional theory would depend on which one of those you picked.
281
00:31:37,590 --> 00:31:47,640
More specifically, what the topology determines is the actual forces that you get in your four-dimensional theory, and the matter content that you get.
282
00:31:47,640 --> 00:31:54,890
And mathematically, this whole process is sort of tied up with the field of mathematics that's called algebraic geometry.
283
00:31:54,890 --> 00:32:00,740
And then there are, of course, also the shapes: you could have a very fat or a very thin torus.
284
00:32:00,740 --> 00:32:07,830
So the shape also matters, and the shape will somehow determine the coupling constants of the four-dimensional theory.
285
00:32:07,830 --> 00:32:15,060
So that's roughly how the correspondence works. And for the purpose of this talk, I will only focus on this first aspect here.
286
00:32:15,060 --> 00:32:27,260
So I will be worrying about the kinds of forces and the kinds of particles that you obtain in four dimensions from this kind of construction.
287
00:32:27,260 --> 00:32:32,600
So what are these topologies for the curling up? Again, I can only really draw the 2D pictures.
288
00:32:32,600 --> 00:32:35,090
So in 2D you have the sphere, you have the torus,
289
00:32:35,090 --> 00:32:39,660
but you could have something with two handles, something with three handles, etc.
290
00:32:39,660 --> 00:32:45,870
This is called the genus of the curve. All right. So in two dimensions it's very simple:
291
00:32:45,870 --> 00:32:52,170
you can basically classify the topology by the number of holes — by this single integer, the genus.
292
00:32:52,170 --> 00:32:53,490
And that's it, right?
293
00:32:53,490 --> 00:33:04,050
And in fact, if you were to compactify just two dimensions, then from this infinite sequence only one of them, the torus, would actually be allowed.
294
00:33:04,050 --> 00:33:08,640
But we're not just curling up two dimensions; we're curling up six dimensions.
295
00:33:08,640 --> 00:33:15,570
So a single integer is not enough — in fact, there is usually a multitude of integers.
296
00:33:15,570 --> 00:33:25,620
So the bottom line here is that the topology on which you do this is classified by a bunch of integers — by integer data.
297
00:33:25,620 --> 00:33:36,590
OK, that's the main message. And in six dimensions there will typically be many choices for that data.
298
00:33:36,590 --> 00:33:42,020
And this is precisely related to the enormous number of solutions to string theory that I mentioned earlier:
299
00:33:42,020 --> 00:33:49,110
it counts the different topologies which you can use to go from ten to four dimensions.
300
00:33:49,110 --> 00:33:55,620
And some of these choices — remember, the topology is tied to the particle content — some of these choices
301
00:33:55,620 --> 00:34:07,150
will lead to particle content which looks realistic, like the one that we actually see in nature, and many others will not.
302
00:34:07,150 --> 00:34:16,720
So how do we actually find this, more specifically? I have to add a little bit more information — I haven't told you quite the whole truth.
303
00:34:16,720 --> 00:34:20,530
Typically this space X carries extra structure; it's not just the space, there are also things on this space.
304
00:34:20,530 --> 00:34:25,990
It's also things on this space. And one one thing that could be on the space is what is called line bundles.
305
00:34:25,990 --> 00:34:29,530
So what's a line bundle? Well,
306
00:34:29,530 --> 00:34:30,880
I have to sort of go to lower dimensions.
307
00:34:30,880 --> 00:34:38,650
Now the space X, which is the black circle, is just drawn one-dimensional, because I need the other dimension to draw the bundle.
308
00:34:38,650 --> 00:34:46,330
And a line bundle would be a structure where you attach to each point on this circle just a line, right?
309
00:34:46,330 --> 00:34:50,740
I've drawn it here as an arrow to indicate an orientation, but in principle it should
310
00:34:50,740 --> 00:34:55,060
go from minus to plus infinity, like a proper line.
311
00:34:55,060 --> 00:35:00,970
So that's one way of introducing a line bundle on this space X.
312
00:35:00,970 --> 00:35:04,840
This bundle is usually called O_X by the mathematicians, right?
313
00:35:04,840 --> 00:35:11,290
But you can see that you can do this in other ways. This is sort of not a very good picture, but I hope you can see what I mean.
314
00:35:11,290 --> 00:35:18,040
I start with a line oriented this way, and then as I go round the circle, it changes its orientation.
315
00:35:18,040 --> 00:35:22,270
And when it comes back, it's sort of pointing downwards. All right.
316
00:35:22,270 --> 00:35:28,270
So you can twist — in other words, you can twist the line as you go around the circle.
317
00:35:28,270 --> 00:35:34,390
And this is called O(1). And then, of course, the next one I couldn't draw, right —
318
00:35:34,390 --> 00:35:38,680
but you can see that, while you go around the circle, you can twist it twice, three times,
319
00:35:38,680 --> 00:35:46,300
etc. And you could change the orientation of the twisting, which means these numbers could also be negative in principle.
320
00:35:46,300 --> 00:35:53,650
So you see that line bundles, in this simple case, are classified by integers.
321
00:35:53,650 --> 00:35:58,450
Of course, the manifold on which we've considered this is very simple.
322
00:35:58,450 --> 00:36:02,320
In reality, we want to do this on a six-dimensional manifold, and a six-dimensional
323
00:36:02,320 --> 00:36:08,800
manifold might have many different loops, and around each loop the line may twist as you go round it.
324
00:36:08,800 --> 00:36:12,730
So in other words, you need many integers in this case, typically, to describe line bundles.
325
00:36:12,730 --> 00:36:16,990
So the message here is: we are again stuck with integer data.
326
00:36:16,990 --> 00:36:23,390
Line bundles in six dimensions are classified, typically, by a bunch of integers.
327
00:36:23,390 --> 00:36:32,850
So what does this have to do with the physics that we get in four dimensions?
328
00:36:32,850 --> 00:36:38,650
Well, line bundles have what are called sections. So what's a section? Let's draw the line bundle
329
00:36:38,650 --> 00:36:45,920
over the manifold again — all of these red arrows, like the ones that I discussed previously — and a section is just,
330
00:36:45,920 --> 00:36:54,740
basically, a kind of function which picks out a value on each of those lines.
331
00:36:54,740 --> 00:36:59,840
And the physical importance of these sections is that they're kind of the internal
332
00:36:59,840 --> 00:37:06,100
wave functions of the particles that we're trying to obtain in four dimensions.
333
00:37:06,100 --> 00:37:11,230
So the upshot concerns the number of independent sections that such a line bundle has.
334
00:37:11,230 --> 00:37:17,480
First of all, it's counted by a mathematical concept which is called cohomology.
335
00:37:17,480 --> 00:37:26,210
And cohomology is denoted by this H symbol, and it comes in sort of different flavours —
336
00:37:26,210 --> 00:37:29,000
H0, H1 and so forth — never mind the details.
337
00:37:29,000 --> 00:37:35,930
But the point is that these numbers, which are just dimensions — numbers of independent sections that you get —
338
00:37:35,930 --> 00:37:40,410
in fact count the number of particles that you get in four dimensions.
339
00:37:40,410 --> 00:37:44,620
And so it's a very nice way in which mathematics ties into physics.
340
00:37:44,620 --> 00:37:49,400
This happens very frequently: you have some sort of mathematical theory,
341
00:37:49,400 --> 00:37:56,930
and it precisely plugs into this problem of wanting to compute the spectrum of the four-dimensional theory.
342
00:37:56,930 --> 00:38:03,200
So mathematically, what you have to do is compute those numbers, H1
343
00:38:03,200 --> 00:38:11,040
and H2, for a given line bundle. And the point is that it's an absolutely horrendous calculation, right?
344
00:38:11,040 --> 00:38:16,980
I mean, it looks simple — you put a bunch of integer numbers in and you get a bunch of integer numbers out —
345
00:38:16,980 --> 00:38:22,110
but to actually do this in practice is totally horrendous.
346
00:38:22,110 --> 00:38:28,650
And so, for this reason, it seems like a good problem for machine learning: it's something that is absolutely not obvious and
347
00:38:28,650 --> 00:38:34,560
takes a very, very long computation. Actually, how am I doing on time? I'm thinking, should I show this calculation or not?
348
00:38:34,560 --> 00:38:45,070
I think probably not. OK, good. So take my word for it: it's horrendous, and therefore may be a good problem for machine learning.
349
00:38:45,070 --> 00:38:50,250
So I'll skip the example and get to the final part: machine learning string theory.
350
00:38:50,250 --> 00:38:58,500
So, this line bundle cohomology that I need to compute in order to know what my particle content is —
351
00:38:58,500 --> 00:39:04,320
can I somehow teach a machine to learn it? In other words, can I teach a machine to learn
352
00:39:04,320 --> 00:39:12,270
this particular map? Ard has explained to us that these neural networks can basically represent any function.
353
00:39:12,270 --> 00:39:19,770
They're very expressive, right — they represent these function spaces. So can I teach a neural network to learn this function,
354
00:39:19,770 --> 00:39:26,730
which takes an integer vector representing one of those line bundles, twisting in a particular way,
355
00:39:26,730 --> 00:39:33,160
and gives out the cohomology — or the number of particles, if you want?
356
00:39:33,160 --> 00:39:39,100
And then perhaps the next question, which is more ambitious: can I somehow not do this in a black-box way,
357
00:39:39,100 --> 00:39:44,210
but learn something about the mathematical structure of this map
358
00:39:44,210 --> 00:39:46,250
as I go about it?
359
00:39:46,250 --> 00:39:54,950
So the training data is of this form, right: we have an integer input vector, and we have one of those cohomology dimensions as the output,
360
00:39:54,950 --> 00:39:59,030
and we get this training data from the horrendous calculation that we have to do, right —
361
00:39:59,030 --> 00:40:08,490
it's supervised learning, so we need the data. OK, so there is a space which is called dP2 — never mind what it is.
362
00:40:08,490 --> 00:40:14,860
What is important is that line bundles on this space are classified by three integers, k0, k1 and k2.
363
00:40:14,860 --> 00:40:20,560
And we can compute the cohomology for, say, about a thousand of these integer triplets.
364
00:40:20,560 --> 00:40:30,110
We get the answer, and we do this, of course, for a certain range of these integers — say in a box of size 10 or 20.
365
00:40:30,110 --> 00:40:34,490
And this is the same kind of picture that you saw previously — this is how the loss function evolves.
366
00:40:34,490 --> 00:40:42,740
So you see it trains reasonably well. And you can then check that, within this box of size 10 that we've trained in,
367
00:40:42,740 --> 00:40:49,240
it will predict these dimensions correctly with a 98 percent accuracy.
368
00:40:49,240 --> 00:40:58,810
So that looks very good. But if you then go and increase the size of the box to 15, and you ask your neural network,
369
00:40:58,810 --> 00:41:06,970
which was trained on the smaller box, to predict values in that range, the success rate drops very dramatically.
370
00:41:06,970 --> 00:41:14,140
So this is what I alluded to earlier: generalisations beyond the domain that you've used for training
371
00:41:14,140 --> 00:41:22,660
typically don't work very well — because why should the network know how the function continues? It was only trained on this particular box.
372
00:41:22,660 --> 00:41:26,380
So the upshot here is:
373
00:41:26,380 --> 00:41:35,230
well, this works to some degree, with reasonably high accuracy, and it's of course much faster than the horrendous method once it's trained.
374
00:41:35,230 --> 00:41:40,840
But 90 percent accuracy or something like that might, in fact, for some applications, not be good enough.
375
00:41:40,840 --> 00:41:50,620
After all, you're computing the dimension of a space — you don't really want any uncertainty in that answer. And, as I just explained,
376
00:41:50,620 --> 00:41:54,640
if you go outside the box that you used for training, it becomes very bad.
377
00:41:54,640 --> 00:41:58,390
And of course, at this point, we have absolutely no insight into what it even means.
378
00:41:58,390 --> 00:42:03,070
It doesn't tell us anything about the mathematics.
379
00:42:03,070 --> 00:42:12,660
So this is where the second question comes back: can we actually do this in a more sophisticated way and learn something about the mathematics?
380
00:42:12,660 --> 00:42:18,120
And of course, we need some kind of intuition of what the mathematics is, and fortunately, we do have that.
381
00:42:18,120 --> 00:42:27,510
And what we suspect from experience is that the function that we're trying to learn is actually not such a complicated function,
382
00:42:27,510 --> 00:42:34,380
even though the calculation is horrendous when performed in the standard way: it's a function which is piecewise polynomial.
383
00:42:34,380 --> 00:42:40,470
So in this space of k-vectors there are regions — which we don't know a priori —
384
00:42:40,470 --> 00:42:49,790
and in each region this cohomology function is described by a polynomial of a certain degree.
385
00:42:49,790 --> 00:42:53,150
But we don't know what that is, either.
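(The simplest known instance of this piecewise-polynomial behaviour — a standard mathematical fact, added here for illustration rather than taken from the talk — is the line bundle O(k) on the complex projective line P^1:)

```python
def h0(k):
    # On P^1 the k-fold twist O(k) has k+1 independent sections for
    # k >= 0 and none for k < 0: two regions in k, with a polynomial
    # of degree at most one describing the count in each region.
    return k + 1 if k >= 0 else 0

print([h0(k) for k in range(-3, 4)])  # [0, 0, 0, 1, 2, 3, 4]
```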
386
00:42:53,150 --> 00:42:58,550
So we can devise a somewhat more sophisticated neural network — and I don't want to go too much into the details —
387
00:42:58,550 --> 00:43:06,680
but the idea of this network is that we have two branches. The upper one is somehow supposed to recognise what these regions are,
388
00:43:06,680 --> 00:43:16,660
and you can think of it working in much the same way as the sort of pattern recognisers that I presented in those earlier runs.
389
00:43:16,660 --> 00:43:21,050
All right. So it recognises these regions, and the lower branch —
390
00:43:21,050 --> 00:43:28,830
this one here — sort of recognises the polynomial, and it puts the two together. And I can train this network on data
391
00:43:28,830 --> 00:43:33,740
and then read out certain bits of information from these weights here:
392
00:43:33,740 --> 00:43:40,790
which of these input vectors come with the same polynomial formula.
393
00:43:40,790 --> 00:43:46,760
If I use this information, I can basically identify the regions. So let me just show you an example.
394
00:43:46,760 --> 00:43:51,680
This example is actually for this bicubic that I showed the picture of before.
395
00:43:51,680 --> 00:43:56,240
Its line bundles are characterised by two integers, k1 and k2.
396
00:43:56,240 --> 00:44:00,620
And if I run the neural network over my training data, I get this kind of plot.
397
00:44:00,620 --> 00:44:13,650
The different colours here indicate the different regions. So in each of those regions, I know that the cohomology must be described by a polynomial.
398
00:44:13,650 --> 00:44:18,210
And knowing what the regions are, I can now just go and fit the right polynomial in each;
399
00:44:18,210 --> 00:44:22,710
I need very few points, because I know it's at most a cubic polynomial, right?
400
00:44:22,710 --> 00:44:28,890
A cubic polynomial in two variables doesn't have very many coefficients. I only need a certain number of points to do that.
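(A hedged sketch of that fitting step in Python: a least-squares fit of a general cubic in the two integers over one region; the region's sample points and values are assumed given.)

```python
import numpy as np

def fit_cubic(points, values):
    # Least-squares fit of a cubic polynomial in two variables
    # (k1, k2) to sampled cohomology values; there are only ten
    # monomials of degree <= 3, so a handful of points per region
    # suffices, as described above.
    k1, k2 = points[:, 0].astype(float), points[:, 1].astype(float)
    A = np.stack([k1**i * k2**j
                  for i in range(4) for j in range(4 - i)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    return coeffs
```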
401
00:44:28,890 --> 00:44:31,980
I can do that for the blue region; I find it's just zero.
402
00:44:31,980 --> 00:44:37,920
And I can do it for the yellow and green regions, which in fact turn out to be two parts of the same region,
403
00:44:37,920 --> 00:44:44,850
and I find this formula. I can use the formula to clean up the regions, which were a little bit fuzzy at the edges,
404
00:44:44,850 --> 00:44:53,130
find the boundaries, and end up with a final formula which looks so simple that it's almost embarrassing.
405
00:44:53,130 --> 00:44:59,940
Right? And in fact, formulae of this kind were not known in mathematics,
406
00:44:59,940 --> 00:45:06,280
and it's not completely clear to this date why the standard way in which this calculation is normally performed,
407
00:45:06,280 --> 00:45:11,450
which can take hours on a computer, ends up with such a simple result.
408
00:45:11,450 --> 00:45:17,240
So you can think of this as some kind of conjecture generator.
409
00:45:17,240 --> 00:45:23,240
This is a conjecture that has been generated; it's not been proven, because it's only been inferred from a finite amount of data.
410
00:45:23,240 --> 00:45:31,150
But you can now go and try to prove it mathematically. And that's indeed been done, for this formula and for others of a similar kind.
411
00:45:31,150 --> 00:45:42,450
So that's an example where you might have learnt something more from a neural network than the typical black-box thing.
412
00:45:42,450 --> 00:45:43,620
OK, so this is another example,
413
00:45:43,620 --> 00:45:50,820
which is just here because it's pretty. This is again for this space dP2, where line bundles are characterised by three integers,
414
00:45:50,820 --> 00:45:57,600
so it's a three-dimensional plot. You see, it's a lot more complicated — there are six regions there, but they are all identified.
415
00:45:57,600 --> 00:46:04,170
They've been cleaned up, and I can read off the polynomials. So, a bit more complicated, and there are six regions.
416
00:46:04,170 --> 00:46:15,450
This is basically just to show that it works, and you can also go and mathematically prove this conjecture.
417
00:46:15,450 --> 00:46:20,490
OK, so the summary here is: at least in a modest way,
418
00:46:20,490 --> 00:46:30,890
machine learning can be used to generate mathematical conjectures within string theory, but probably also more generally.
419
00:46:30,890 --> 00:46:38,210
OK. So the second part has to do with: can we somehow use machine learning to sift through this enormous amount of string data?
420
00:46:38,210 --> 00:46:46,400
And one simple question that you might ask is: well, I have these constructions of all these four-dimensional models that come from compactifying
421
00:46:46,400 --> 00:46:52,970
from ten to four dimensions, and the spectrum of these 4D theories depends on exactly how things are curled up.
422
00:46:52,970 --> 00:46:59,960
Some of them will be good models from a physics point of view, others not. Would a neural network be able to tell the difference?
423
00:46:59,960 --> 00:47:07,850
That is, would a neural network be able to tell me which one of these choices corresponds to a standard model of particle physics and which one does not?
424
00:47:07,850 --> 00:47:11,360
So that's one basic question you might ask.
425
00:47:11,360 --> 00:47:19,220
And one thing I need to say beforehand is that in order to get a model with the right forces,
426
00:47:19,220 --> 00:47:24,290
I not only need to curl up six dimensions, I also need five of these line bundles.
427
00:47:24,290 --> 00:47:33,320
Right. But that just means picking five of these integer vectors, which we can summarise into some sort of matrix that I call K.
428
00:47:33,320 --> 00:47:44,750
All right. So just think of it like this: we pick a space, and the models on this space are then described by just picking an integer matrix.
429
00:47:44,750 --> 00:47:49,910
And we can create ourselves training data, which is a set of these integer matrices,
430
00:47:49,910 --> 00:47:56,480
which map to zero or one, depending on whether they lead to a standard model or not.
431
00:47:56,480 --> 00:48:01,110
OK, so that's the kind of training data — a binary choice.
432
00:48:01,110 --> 00:48:09,260
And the question again is, can we somehow teach the machine to distinguish between those two?
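(A hedged sketch of how such training data might be organised in Python — the helper names, the 0/1 labelling and the entry-size split that's about to be described are assumptions standing in for the real pipeline:)

```python
import numpy as np

def to_input(K):
    # Flatten the integer matrix K (five line-bundle vectors) into
    # one real vector the network can take as input.
    return K.astype(float).flatten()

def split_by_entry_size(K_list, labels, cutoff):
    # Train on matrices with small entries and test on large ones,
    # mirroring the split described in a moment; K_list and labels
    # (1 = standard model, 0 = not) are placeholders for the
    # brute-force data mentioned in the talk.
    size = np.array([np.max(np.abs(K)) for K in K_list])
    X = np.stack([to_input(K) for K in K_list])
    y = np.array(labels)
    small = size <= cutoff
    return (X[small], y[small]), (X[~small], y[~small])
```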
433
00:48:09,260 --> 00:48:16,040
So we have to pick a space. I mean, never mind what the space is, but the point here is that it is also described by some set of integers,
434
00:48:16,040 --> 00:48:19,520
so that's a particular 6D space for curling up.
435
00:48:19,520 --> 00:48:27,030
The nice thing about this space is that there are 17,000 standard models on it, which we have found by brute force.
436
00:48:27,030 --> 00:48:37,080
And we add to those 17,000 the same number of non-standard models, which we just get by randomly generating integer matrices.
437
00:48:37,080 --> 00:48:43,230
And here are two examples: that's one matrix which in fact corresponds to a standard model,
438
00:48:43,230 --> 00:48:48,660
and that's one matrix which does not correspond to a standard model. So the question is, can you tell the difference?
439
00:48:48,660 --> 00:48:58,180
I can't. Right. But the question is, can the machine?
440
00:48:58,180 --> 00:49:06,670
So it's a relatively simple network, two or three layers. And this is another way of representing the data:
441
00:49:06,670 --> 00:49:12,430
this gives you an indication of how big the entries in these matrices are.
442
00:49:12,430 --> 00:49:21,460
This is the distribution of this typical size, and we take training and validation data from the lower end of the spectrum,
443
00:49:21,460 --> 00:49:29,050
which corresponds to the matrices with small entries. And we take a test set from this upper end.
444
00:49:29,050 --> 00:49:37,390
And we train the neural network, and we find it's extremely successful on this low end — it trains very well, validates very well —
445
00:49:37,390 --> 00:49:41,020
but surprisingly it's also very successful on the test set.
446
00:49:41,020 --> 00:49:46,000
So this is an example — which I don't completely understand yet — where the neural
447
00:49:46,000 --> 00:49:54,000
network actually generalises well beyond the domain that it has been trained in.
448
00:49:54,000 --> 00:49:58,590
So the bottom line here is we can, in fact, distinguish between those two types.
449
00:49:58,590 --> 00:50:04,940
And again, there is a very complicated calculation needed to actually do this in the standard way,
450
00:50:04,940 --> 00:50:12,420
so once this network is trained, it will be a lot faster than doing that computation.
451
00:50:12,420 --> 00:50:18,630
OK, so now comes the autoencoder, which went so spectacularly wrong before — but of course
452
00:50:18,630 --> 00:50:24,210
this time I'm not doing it in real time, because it would take too long.
453
00:50:24,210 --> 00:50:27,780
So I used the same dataset for an autoencoder,
454
00:50:27,780 --> 00:50:33,930
and the latent space in the middle is again two-dimensional, so I can produce a nice two-dimensional plot.
455
00:50:33,930 --> 00:50:38,250
And this is what the plot looks like, right? The red points are the standard models,
456
00:50:38,250 --> 00:50:42,330
the blue points are the non-standard models, and they are neatly separated, right?
457
00:50:42,330 --> 00:50:47,430
And this is actually from the set with small entries, the one on which it was trained.
458
00:50:47,430 --> 00:50:54,030
And if I use a test set of unseen data from the matrices with bigger entries, the split persists, right?
459
00:50:54,030 --> 00:51:00,870
Which again seems to be saying that it's somehow generalising beyond the domain of training.
460
00:51:00,870 --> 00:51:06,150
OK, so the autoencoder works very well in this case as well.
461
00:51:06,150 --> 00:51:16,860
OK, so that's the end of it. Machine learning and string theory has really just begun, and we don't really know what the good problems are,
462
00:51:16,860 --> 00:51:22,640
what the good techniques are and how we combine the two. So this is all developing still.
463
00:51:22,640 --> 00:51:33,080
But at least we can see that perhaps there's an avenue there where machine learning can be used to generate mathematical conjectures.
464
00:51:33,080 --> 00:51:45,500
And I think there's some hope that machine learning can help us sifting through this enormous landscape of string solutions.
465
00:51:45,500 --> 00:51:50,990
I think the question as to whether machine learning can really lead to substantial progress in string theory is still open.
466
00:51:50,990 --> 00:51:59,330
But there we have to wait and see. And somehow you might hope that, because the data sets that we have in string
467
00:51:59,330 --> 00:52:03,050
theory and the kind of questions that we're asking are so different from the
468
00:52:03,050 --> 00:52:07,310
usual kinds of questions which often have to do with pictures and videos and
469
00:52:07,310 --> 00:52:11,600
speech recognition — because those things are so different in the science context,
470
00:52:11,600 --> 00:52:17,510
they might eventually also teach us something about machine learning that we didn't know before. But we'll have to see.
471
00:52:17,510 --> 00:52:23,445
And thanks very much.