1 00:00:02,520 --> 00:00:05,860 [Auto-generated transcript. Edits may have been applied for clarity.] 2 00:00:14,570 --> 00:00:18,649 Okay, let us, uh, slowly start. Thank you very much for coming, 3 00:00:18,650 --> 00:00:21,770 and in such a nice number, to this wonderful place. 4 00:00:22,100 --> 00:00:25,850 It is a great pleasure to introduce Professor Ion Stoica. 5 00:00:26,270 --> 00:00:34,260 Ion is a professor at UC Berkeley, where he holds the Xu Bao Chancellor's Chair in computer science. 6 00:00:34,280 --> 00:00:43,190 It is exciting: he has been one of the most influential computer scientists in the field of distributed systems. 7 00:00:43,910 --> 00:00:51,350 He has been instrumental in developing systems that have shaped how large-scale data processing, 8 00:00:51,380 --> 00:01:00,380 data analytics, and AI workloads are built and deployed today. 9 00:01:01,310 --> 00:01:09,230 Some of these contributions include Apache Spark, Apache Mesos, SkyPilot, and Ray. 10 00:01:09,920 --> 00:01:13,370 In my time as a PhD student, peer-to-peer networks were popular, 11 00:01:13,370 --> 00:01:18,080 and there he was, with distributed hash tables and the system called Chord. 12 00:01:18,860 --> 00:01:24,320 And most importantly, all these amazing projects are open-source projects. 13 00:01:25,190 --> 00:01:35,630 Now, Ion has also been a very successful entrepreneur. He has co-founded Databricks and Anyscale, both companies based on his open-source projects. 14 00:01:36,050 --> 00:01:41,690 So we are delighted to welcome Ion here as our Strachey Speaker. 15 00:01:42,050 --> 00:01:58,790 Please join me in welcoming him. Thank you. Thank you, Ivan, for such a kind introduction; very happy to be here. 16 00:01:59,820 --> 00:02:05,820 So today I am going to talk about some research I've done over the past ten years or so. 17 00:02:06,540 --> 00:02:12,990 And I have to start with a disclaimer. I know that you here are famous for theoretical computer science. 18 00:02:13,590 --> 00:02:16,920 I only have one equation, at some point in my slides. 19 00:02:17,580 --> 00:02:22,170 So this is a very systems-oriented talk. And without further ado, 20 00:02:22,170 --> 00:02:30,240 let me start. So over the years I've been involved in quite a few projects, some of them more successful than others, 21 00:02:30,960 --> 00:02:34,890 and these are a few of them. 22 00:02:35,670 --> 00:02:42,750 And in this talk I am going to talk about three projects which represent a slice through my work: 23 00:02:43,560 --> 00:02:46,139 Ray, vLLM, and Chatbot Arena. 24 00:02:46,140 --> 00:02:54,620 And they can actually be composed into a mini stack: vLLM running on top of Ray, and Chatbot Arena, 25 00:02:54,640 --> 00:03:02,550 you know, running on top of vLLM. So I'm going to talk about these projects, and the template I am going to use for each project 26 00:03:02,580 --> 00:03:09,810 is this one: I'm going to talk about trends, because in many cases the trends are what shaped the work I've done; 27 00:03:10,590 --> 00:03:16,770 the challenges these trends create; and then tell you a little bit about the project and about its impact. 28 00:03:17,430 --> 00:03:24,509 So let me start.
Ray started in 2016, and the trends then, which still hold today, 29 00:03:24,510 --> 00:03:33,570 are that AI demands are growing faster than single-node capabilities, and that AI workloads are becoming more complex. 30 00:03:34,500 --> 00:03:40,559 So this is some more recent data about the growth in the computation demands of training 31 00:03:40,560 --> 00:03:49,560 the top models versus time, from 2010 to 2024; the y-axis is a log scale. 32 00:03:49,920 --> 00:03:53,730 And depending on which, uh, 33 00:03:54,610 --> 00:03:58,670 period you look at, you see that the demands have grown: 34 00:03:58,690 --> 00:04:03,430 computation demand grows between 4.2x and 6.7x every year. 35 00:04:03,700 --> 00:04:09,579 So let's say 4x every year. At the same time, the node capabilities, 36 00:04:09,580 --> 00:04:18,640 and here we are talking about GPUs: if you look at the same precision, compute grows about 1.35x every year. 37 00:04:19,120 --> 00:04:26,040 And if you look at memory capacity, it's about 1.2x every year. 38 00:04:26,050 --> 00:04:36,460 And the memory bandwidth between the GPU compute and the GPU memory grows at around the same rate, 1.2x to 1.22x. 39 00:04:37,610 --> 00:04:47,840 So if you put this together, what you see is that compute demands are growing, every year, three times faster than the compute and memory capacities of, 40 00:04:48,140 --> 00:04:51,560 say, GPUs in this case, and of course much faster than CPUs. 41 00:04:52,460 --> 00:05:01,610 And a corollary here is that even if the demands stopped growing today, so 42 00:05:01,610 --> 00:05:03,890 you are not going to have larger and larger models, 43 00:05:04,250 --> 00:05:16,100 it would still take maybe decades until you are able to train such a model on one chip. Which means that distributed computing, 44 00:05:16,100 --> 00:05:20,990 in particular heterogeneous distributed computing, is becoming more and more the norm, 45 00:05:21,590 --> 00:05:29,219 right, rather than the exception. The second point I made is that AI workloads are becoming more complex. 46 00:05:29,220 --> 00:05:32,730 And let me start with one of the older workloads, 47 00:05:32,730 --> 00:05:36,600 a recommendation system. So how do you build a recommendation system? 48 00:05:36,600 --> 00:05:41,250 You get the logs of the users' interactions with the site. 49 00:05:41,790 --> 00:05:48,240 You do some featurization, some data pre-processing, to extract the main features. 50 00:05:48,510 --> 00:05:53,640 Then you do some training to build the models that predict, you know, 51 00:05:54,210 --> 00:05:58,650 what recommendation to give you, like a movie recommendation or things like that. 52 00:05:59,070 --> 00:06:04,890 You do tuning, so when you have a new model you are going to, um, 53 00:06:05,220 --> 00:06:11,730 play a little bit with different parameters of the model to see which parameters are going to give you the best results. 54 00:06:12,150 --> 00:06:21,660 Then you push that into production, then you get new logs, and then you try to improve the model to get better and better results. 55 00:06:22,560 --> 00:06:29,250 So this is basically what was happening, and there were many more such workloads, around 2015.
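[For clarity, a quick back-of-envelope check of the growth rates above, as Python; the million-fold gap in the second part is a hypothetical illustration, not a figure from the talk:]

    # Check of the growth-rate gap described above (approximate figures from the talk).
    import math

    demand_growth = 4.0   # training-compute demand, x per year (4.2x to 6.7x cited)
    node_growth = 1.35    # single-GPU compute, x per year
    print(demand_growth / node_growth)     # ~2.96: demand pulls ahead ~3x per year

    # Even if demand froze, closing a hypothetical 10**6x gap between what a model
    # needs and what one chip provides would take decades at 1.35x per year:
    print(math.log(1e6) / math.log(1.35))  # ~46 years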
56 00:06:29,250 --> 00:06:40,739 Then you have the emergence of reinforcement learning: if you remember, DeepMind was first playing Atari games, and 57 00:06:40,740 --> 00:06:47,850 then you have AlphaGo, then, you know, reinforcement learning being applied to data centres, reducing 58 00:06:47,910 --> 00:06:53,640 data-centre energy usage for cooling, and many more uses. 59 00:06:54,570 --> 00:07:01,680 So these workloads keep coming, and it all becomes more and more complex: recommendation systems, reinforcement learning. 60 00:07:01,680 --> 00:07:08,040 Then you have batch inference, and, as of a year and a half ago now, the rise of test-time compute 61 00:07:08,820 --> 00:07:11,760 and post-training. Okay. 62 00:07:12,180 --> 00:07:21,810 So the main challenge here is that when you want to build an end-to-end application, an AI or ML application, in the past 63 00:07:21,960 --> 00:07:30,180 you had to put together all these components, right, and you need to scale them, because, as we just discussed, 64 00:07:30,180 --> 00:07:37,680 the demands of processing the data and training these models are growing faster than the capabilities of a single chip or a single node. 65 00:07:38,100 --> 00:07:48,450 Okay. So you want to scale them. And at that time, and even today in big part, you have separate distributed systems to scale each of these stages, 66 00:07:48,450 --> 00:07:56,520 like data processing or pre-processing, training, tuning, batch prediction, and so forth. 67 00:07:56,610 --> 00:08:00,599 Batch prediction, by the way, is basically, in many of the use cases, 68 00:08:00,600 --> 00:08:05,219 that before you are going to deploy the model in production, 69 00:08:05,220 --> 00:08:10,020 you are going to test it on the old logs to see how it performs, before pushing it to production. 70 00:08:10,170 --> 00:08:18,450 Okay. So in order to build your end-to-end pipeline, your end-to-end system, you need to stitch together these components, 71 00:08:18,900 --> 00:08:24,000 right? And this is challenging for a bunch of reasons. 72 00:08:24,510 --> 00:08:29,790 First, it's hard to develop, because each of these components comes with its own API, 73 00:08:30,510 --> 00:08:35,940 and it's hard to deploy and manage, because you need to manage these different distributed systems, 74 00:08:35,940 --> 00:08:43,890 and each system has its own semantics in terms of recovery from failures, data-consistency semantics, and things like that. 75 00:08:44,550 --> 00:08:54,060 It's also hard to use resources efficiently, because typically you are going to have a cluster for each of these components, 76 00:08:54,270 --> 00:08:59,370 so it's hard to share the resources across components. 77 00:08:59,760 --> 00:09:04,110 Right. So, for instance, if you do training, 78 00:09:04,110 --> 00:09:09,510 you have a cluster for training and then you have a cluster for inference, and when you are done with the training, 79 00:09:09,720 --> 00:09:12,990 that training cluster is hard to reuse for other things. 80 00:09:13,500 --> 00:09:19,889 Okay. And finally, it's also slow, because you need to move the data between these components, and the data, 81 00:09:19,890 --> 00:09:25,320 it's a lot of data. Typically you write it to a kind of blob store, like S3 or something like that.
82 00:09:25,320 --> 00:09:30,930 Then you have to read it back, so there is a lot of overhead: reading, writing, serialisation, deserialisation. 83 00:09:31,350 --> 00:09:38,610 Okay. So what Ray provides is one system which is general enough to support all these workloads. 84 00:09:39,030 --> 00:09:44,670 Okay, so this is Ray, and the component you can see at the bottom is what we call 85 00:09:44,820 --> 00:09:49,290 Ray Core, and it is a unified computing framework for distributed applications. 86 00:09:49,890 --> 00:09:55,379 And then on top of it you have a bunch of libraries, running on Ray Core, to support these different workloads. 87 00:09:55,380 --> 00:10:00,570 So now you have one system which you can use for your entire end-to-end application. 88 00:10:00,930 --> 00:10:04,200 Okay. So that's basically what Ray is. So what is the key idea 89 00:10:04,200 --> 00:10:15,270 behind Ray? What it does is take a procedural language, like Python, and generalise it to a distributed setting. 90 00:10:15,540 --> 00:10:21,810 That's basically what it is. And it focuses on flexibility, by exposing the parallelism to the developers. 91 00:10:22,290 --> 00:10:31,020 Right. So the developers can build the application with various patterns of parallelism, like nested parallelism and things like that. 92 00:10:31,680 --> 00:10:36,430 And one of the questions we have always been asked is: why not a declarative language? 93 00:10:36,430 --> 00:10:40,260 You just tell the language what to do, and then 94 00:10:41,440 --> 00:10:46,510 the back end, the compiler or whatever, is going to decide how to do it, right? 95 00:10:47,030 --> 00:10:55,750 I think there are a few challenges with that. First of all, there is no proper general-purpose declarative language. 96 00:10:56,290 --> 00:10:59,859 And I've been working on Datalog and other languages, 97 00:10:59,860 --> 00:11:03,070 so I've done a little bit of work on that. 98 00:11:03,430 --> 00:11:06,880 And, you know, adopting a new language is hard; 99 00:11:07,240 --> 00:11:12,880 let's face the truth. So we didn't want to invent one, for these reasons. 100 00:11:13,450 --> 00:11:19,930 And at the same time, when we started, Python was emerging as the lingua franca for AI. 101 00:11:19,990 --> 00:11:25,930 Everyone knew Python, and all the libraries, like TensorFlow and PyTorch, 102 00:11:25,990 --> 00:11:28,990 are libraries within Python. Okay. 103 00:11:30,760 --> 00:11:35,499 So what do you have in these kinds of procedural languages? Out of the many concepts, 104 00:11:35,500 --> 00:11:40,060 there are a few which are relevant here. 105 00:11:40,730 --> 00:11:46,680 First of all, in terms of compute, you have functions and classes, right? 106 00:11:46,690 --> 00:11:50,799 Classes are stateful operators, right? And functions, in general, 107 00:11:50,800 --> 00:11:57,910 if they don't have side effects, are stateless operators. And then you have concurrency, obviously, 108 00:11:57,910 --> 00:12:05,400 and this is in some form provided by Python with asyncio, which allows you to execute 109 00:12:05,410 --> 00:12:10,780 things in parallel, or to overlap, for instance, computation and communication, and other things. 110 00:12:11,140 --> 00:12:13,780 And the other one is much more low level,
but it's important: it is passing, especially for large amounts of data, the ability to pass data by reference, right, instead of by value. 112 00:12:22,420 --> 00:12:25,840 Right? It's far more efficient when you have large amounts of data. Okay. 113 00:12:26,050 --> 00:12:29,140 So these are the things we tried to capture in Ray. 114 00:12:29,710 --> 00:12:34,870 And we encapsulate this in Ray's task-based compute model, as we call it. 115 00:12:35,500 --> 00:12:43,180 And so, again, you are going to take classes and functions and provide the 116 00:12:43,480 --> 00:12:51,880 ability to instantiate these functions and classes transparently, remotely, as tasks and, of course, 117 00:12:52,060 --> 00:12:56,410 actors; right, you know, you've seen a lot of actor languages. 118 00:12:56,860 --> 00:13:03,580 Then we have a shared in-memory distributed object store, which enables passing arguments by reference. 119 00:13:03,910 --> 00:13:12,400 And then we have these futures, in particular distributed futures, which are references to objects: arguments as well as results. 120 00:13:12,820 --> 00:13:19,750 And in some cases these are results which will be created by tasks or actors even before being scheduled, 121 00:13:19,750 --> 00:13:25,570 so they exist at an abstract level: you don't even know where the object is going to be created. 122 00:13:26,350 --> 00:13:31,630 And this obviously enables concurrency, concurrent execution, and parallelism. 123 00:13:32,380 --> 00:13:37,000 Ray Core has a very minimal API. 124 00:13:37,370 --> 00:13:41,769 These are the main functions, and I'm not going to read all of them; 125 00:13:41,770 --> 00:13:44,770 I'm going to demonstrate a few of them with a few examples. 126 00:13:45,940 --> 00:13:54,250 So here is a very trivial example. Say you have a function which computes for one second and returns the result. 127 00:13:54,700 --> 00:13:59,799 If you execute that in traditional Python, and I'm not talking about multiprocessing 128 00:13:59,800 --> 00:14:02,890 Python: you call this function twice, 129 00:14:03,090 --> 00:14:07,660 and it is going to take you two seconds to run, right? Because it's sequential. 130 00:14:09,670 --> 00:14:17,799 Now, what Ray does: you are going to use this decorator, ray.remote, and here is what it's doing under the hood. 131 00:14:17,800 --> 00:14:21,820 When you are going to call this, you are going to call f.remote. 132 00:14:22,420 --> 00:14:26,229 Then Ray is going to take that function, 133 00:14:26,230 --> 00:14:32,139 so here, when we execute that, it's going to submit a task for that function, 134 00:14:32,140 --> 00:14:36,910 which will be scheduled by the system; it can be on a different worker 135 00:14:36,910 --> 00:14:41,620 or a different node. This call is not blocking, okay? 136 00:14:41,860 --> 00:14:49,290 So it's going to return immediately, and it's going to return a reference, a future, to the result. 137 00:14:50,130 --> 00:14:54,340 Okay. Because it's non-blocking, you can execute now the next, 138 00:14:54,720 --> 00:15:04,300 the next call to the function, f.remote, and it is again going to give you back a reference to the result. 139 00:15:04,960 --> 00:15:09,010 And finally you call ray.get to get the results. Now, this is a blocking call.
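[A minimal runnable version of this example, assuming the standard Ray API (ray.init, the ray.remote decorator, .remote() calls, and ray.get):]

    import time
    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(1)   # "compute" for one second
        return 1

    start = time.time()
    refs = [f.remote(), f.remote()]    # two non-blocking submissions; futures return immediately
    print(ray.get(refs), time.time() - start)  # tasks run in parallel: ~1s total, not ~2s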
140 00:15:09,640 --> 00:15:16,360 And under the hood, you know, Ray submitted these tasks to a local scheduler, 141 00:15:16,360 --> 00:15:25,390 and the local scheduler will figure out where to schedule these tasks for each of the functions. 142 00:15:26,050 --> 00:15:32,710 And then, you know, these are going to be scheduled and executed, in this case in parallel, because there is no data dependency. 143 00:15:33,010 --> 00:15:42,820 And you are going to get the results. And because these two tasks are executed in parallel, the entire program is going to 144 00:15:44,400 --> 00:15:49,430 take only one second. Okay. So this is functions. 145 00:15:49,670 --> 00:15:54,710 You also have classes, which are instantiated as actors. 146 00:15:55,040 --> 00:15:59,720 Again, to do that you just need to say ray.remote, and then .remote 147 00:16:00,230 --> 00:16:03,800 when you instantiate the class, and then when you call a method of the class. 148 00:16:04,040 --> 00:16:08,000 By the way, this .remote is actually not needed. 149 00:16:08,300 --> 00:16:13,370 The reason we preserve it in the language is to make the developer, the programmer, 150 00:16:13,370 --> 00:16:18,049 aware that this is going to be, can be, executed remotely, on a different machine, 151 00:16:18,050 --> 00:16:24,350 and it can be slower; it is just to give an extra hint to the developers. 152 00:16:25,040 --> 00:16:39,079 Ray also has the ability, and early on we made this decision, to let the programmer specify resource demands for 153 00:16:39,080 --> 00:16:46,790 a particular function, or for a particular class when it is going to be instantiated, in terms of the number of CPUs, 154 00:16:46,790 --> 00:16:50,360 GPUs, memory requirements, and things like that. 155 00:16:50,840 --> 00:16:59,720 Okay. So that's classes and actors.
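[A sketch of the class-to-actor pattern just described, again assuming the standard Ray API; the Counter class and the num_cpus value are illustrative, not from the talk:]

    import ray

    @ray.remote(num_cpus=1)        # resource demand declared up front, as described
    class Counter:
        def __init__(self):
            self.value = 0         # actor state lives in the actor's own process

        def incr(self):
            self.value += 1
            return self.value

    c = Counter.remote()               # instantiates the actor, possibly on another node
    print(ray.get(c.incr.remote()))    # method call returns a future; ray.get blocks -> 1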
Then, I mentioned that you have a shared in-memory object store. 156 00:17:00,140 --> 00:17:05,450 So this, uh, shared memory, 157 00:17:05,900 --> 00:17:12,230 it's, um, 158 00:17:13,270 --> 00:17:20,700 it is write-once, read-many: you only write and then read, so you don't update. 159 00:17:20,720 --> 00:17:24,820 It doesn't provide the ability to update the data, 160 00:17:24,820 --> 00:17:28,630 and this obviously makes keeping consistency much easier. 161 00:17:29,050 --> 00:17:36,130 Of course, at the programming level you are going to update it, but under the hood you are just going to create a new version, 162 00:17:36,430 --> 00:17:41,020 a new version of the object with the modified data. 163 00:17:41,410 --> 00:17:52,090 Okay. So here is another simple example, where you have two functions, f and g, and, um, 164 00:17:53,240 --> 00:17:58,130 you call f, and then you pass the result from f to g. 165 00:17:58,880 --> 00:18:04,460 Okay. So what happens here is: you first call f.remote; 166 00:18:04,490 --> 00:18:13,490 again, it's a non-blocking call. Eventually Ray is going to instantiate the task associated with f, 167 00:18:14,120 --> 00:18:19,400 and the return is a reference to the result of f, this distributed future, 168 00:18:19,700 --> 00:18:23,990 and that is going to be passed to g. 169 00:18:24,440 --> 00:18:32,450 Right. And when you call g.remote, Ray is going to instantiate a task for g, 170 00:18:32,810 --> 00:18:39,860 maybe on another worker, on another node, and return the reference, the distributed future, for g. 171 00:18:39,860 --> 00:18:42,620 And then you can get the result. 172 00:18:43,070 --> 00:18:52,129 So now you have a dependency between f and g; it's a data dependency. Ray knows that, and basically, before executing g, 173 00:18:52,130 --> 00:18:57,390 it waits for the reference to be resolved, which means that f should 174 00:18:57,390 --> 00:19:01,550 finish and produce the result, in this case x. 175 00:19:01,940 --> 00:19:09,710 And then, now that x is available, Ray will transfer x to the node which runs g, 176 00:19:10,070 --> 00:19:13,850 and now you are going to finally run g. 177 00:19:14,420 --> 00:19:25,850 The main point here is that in this particular case x is transferred only once, between node one and node two, which run f and g respectively. 178 00:19:26,240 --> 00:19:34,250 And the reason I'm giving this example is because I want to contrast it with remote procedure calls. 179 00:19:34,340 --> 00:19:38,870 Right. A remote procedure call is again a way to run a function remotely, 180 00:19:38,930 --> 00:19:47,280 and you have that in many languages and so forth. But in most remote procedure call implementations, the data, 181 00:19:47,290 --> 00:19:50,840 the arguments, are passed by value, right? 182 00:19:51,410 --> 00:19:55,469 And in this particular case, for instance, you would have to execute f, 183 00:19:55,470 --> 00:20:01,400 and then you are going to get the value x back at the caller, 184 00:20:01,910 --> 00:20:05,840 just to pass it to g, okay? 185 00:20:06,470 --> 00:20:10,220 And then g will be executed on 186 00:20:10,220 --> 00:20:16,340 node two. And in this particular case, as you can see, you have two transfers of x in the system instead of one. 187 00:20:16,790 --> 00:20:25,579 But this, again, you knew: that's the advantage of using references, passing the argument by reference to avoid extra copies. 188 00:20:25,580 --> 00:20:28,700 And that's true on a single node, 189 00:20:28,730 --> 00:20:32,450 and it's also true, obviously, for a distributed system.
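[The f-and-g example above as runnable code, assuming the standard Ray API; the function bodies are illustrative:]

    import ray

    @ray.remote
    def f():
        return list(range(1_000_000))  # a large result, kept in the object store

    @ray.remote
    def g(x):                          # receives the resolved object, not the future
        return sum(x)

    x_ref = f.remote()       # non-blocking; returns a distributed future for x
    y_ref = g.remote(x_ref)  # pass by reference: x is shipped at most once,
                             # directly to g's node, never back through the caller
    print(ray.get(y_ref))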
190 00:20:33,620 --> 00:20:42,949 Okay. Now I will also try to mention some ongoing work, and obviously a lot of the ongoing work 191 00:20:42,950 --> 00:20:46,730 is about improving support for GPUs. What we are witnessing today: 192 00:20:47,030 --> 00:20:55,520 we have already moved from a CPU-dominated compute infrastructure to a GPU- or accelerator-dominated infrastructure. 193 00:20:56,210 --> 00:21:02,330 And we are doing a lot of work in Ray to reduce the overhead of GPU-to-GPU communications, 194 00:21:02,810 --> 00:21:07,730 and implementing different collective operations and so forth. 195 00:21:08,000 --> 00:21:17,300 This is one example I can give you, about GPU-to-GPU communication in the original Ray. 196 00:21:17,810 --> 00:21:21,170 Remember, we have this object store, 197 00:21:21,170 --> 00:21:24,470 and the object store is implemented in CPU memory. 198 00:21:24,980 --> 00:21:31,160 So what happens if we want to transfer some data, like a tensor, from one GPU to another? 199 00:21:31,430 --> 00:21:37,249 What happens is that this tensor is moved from the GPU memory to the local RAM of that node, 200 00:21:37,250 --> 00:21:43,370 then it goes through the object store to the RAM of the receiver, 201 00:21:43,730 --> 00:21:49,820 and then from the CPU memory of the receiver it is transferred into the GPU memory. 202 00:21:50,330 --> 00:21:59,300 So it's quite slow, right? There are a lot of copies, a lot of RPC calls. 203 00:21:59,540 --> 00:22:08,540 So what we want, obviously, is direct GPU-to-GPU communication, with memory statically allocated and everything optimised. 204 00:22:08,960 --> 00:22:15,560 One thing I wanted to show you is this figure, because initially, when we did this, and this is still experimental, 205 00:22:15,800 --> 00:22:22,040 we assumed that everyone would only use direct transfer between GPU zero and GPU one. 206 00:22:22,400 --> 00:22:25,850 But there is one case, it turns out, where, surprisingly, people preferred 207 00:22:26,150 --> 00:22:29,900 the one on the left, 208 00:22:29,900 --> 00:22:35,000 the traditional way of transferring the data through the CPU memory. 209 00:22:35,300 --> 00:22:40,730 Do you know what is the case in which you may still prefer 210 00:22:41,850 --> 00:22:52,520 the left, the slow transfer, over the right one? It turns out that when you do GPU-to-GPU transfer, it's a synchronous operation. 211 00:22:53,240 --> 00:22:57,440 Okay? You also need to synchronise the sender and the receivers: 212 00:22:57,440 --> 00:23:03,950 you need to allocate the memory on the receiving GPU to be ready to receive the data from the sender. 213 00:23:04,700 --> 00:23:09,410 Okay. Now, in the other case, it's more asynchronous: 214 00:23:09,800 --> 00:23:16,490 you only need to allocate the receiving GPU 215 00:23:16,950 --> 00:23:28,250 memory when you actually need the tensor; until then, the sender holds the tensor on its local machine, in CPU memory. 216 00:23:29,000 --> 00:23:32,930 So the amount of time for which you keep the GPU memory allocated 217 00:23:33,970 --> 00:23:43,350 to get the tensor is going to be higher when you have direct GPU-to-GPU communication than when you go through the slow CPU memory, right? 218 00:23:43,360 --> 00:23:51,970 Again, because it's synchronous. So if your GPU memory is extremely at a premium and is the bottleneck, you may prefer the left-hand side. 219 00:23:54,470 --> 00:24:01,610 Okay. So, like I mentioned to you, Ray is very flexible; 220 00:24:01,970 --> 00:24:05,510 it also provides support for very heterogeneous infrastructure. 221 00:24:05,900 --> 00:24:11,719 And here you have, you know, batch inference, where you read data, 222 00:24:11,720 --> 00:24:15,980 you pre-process the data, you do the inference, and you write the data. 223 00:24:16,520 --> 00:24:24,559 And here the data pre-processing and post-processing can run on CPUs, while the inference runs on GPUs. 224 00:24:24,560 --> 00:24:30,440 And by doing so you can save a lot of cost, instead of running everything on GPUs. 225 00:24:30,710 --> 00:24:35,930 And there are many workloads you can support here, multimodal search and so forth, 226 00:24:36,440 --> 00:24:42,979 fine-tuning, embedding computation.
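[A sketch of the batch-inference pattern just described, assuming the Ray Data API (ray.data, map_batches); the toy dataset and toy "model" are stand-ins, and exact parameters vary across Ray versions:]

    import ray

    ds = ray.data.from_items([{"text": " Hello "}, {"text": "WORLD"}])  # stand-in for a real read

    def preprocess(batch):               # runs on cheap CPU workers
        batch["text"] = batch["text"].str.strip().str.lower()
        return batch

    class Predictor:                     # stateful worker; in practice holds the model
        def __init__(self):
            self.positive = {"hello"}    # toy "model"
        def __call__(self, batch):       # in practice runs on GPUs (e.g. add num_gpus=1)
            batch["label"] = [t in self.positive for t in batch["text"]]
            return batch

    preds = (ds.map_batches(preprocess, batch_format="pandas")
               .map_batches(Predictor, concurrency=2, batch_format="pandas"))
    print(preds.take_all())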
Um, Ray is also emerging as a standard, you know, 227 00:24:42,980 --> 00:24:48,320 well, not a standard, but the framework of choice for post-training. 228 00:24:48,590 --> 00:24:52,370 There are a lot of post-training frameworks that have been developed over the past year, 229 00:24:52,640 --> 00:24:59,270 many of them using Ray, and also using vLLM, the inference system also developed at Berkeley. 230 00:24:59,630 --> 00:25:03,980 Okay. So, impact: it has been a very popular framework. Actually, 231 00:25:03,980 --> 00:25:12,800 it was used by OpenAI in their training infrastructure to train ChatGPT-4, 232 00:25:13,010 --> 00:25:23,270 and others, and by many other companies to build their AI infrastructure. Very fast growth, especially in the past year. 233 00:25:23,540 --> 00:25:31,610 This is the number of downloads, and these are the GitHub stars, compared with other popular distributed frameworks. 234 00:25:31,910 --> 00:25:35,540 So you can see the growth is accelerating. 235 00:25:36,530 --> 00:25:41,060 So you saw Ray; I have two other projects, and I have to finish quickly. 236 00:25:41,420 --> 00:25:45,500 So next is vLLM. Okay. So vLLM, it's about inference. 237 00:25:47,090 --> 00:25:51,799 And, you know, so what changed? 238 00:25:51,800 --> 00:25:55,940 What was the trend? The trend, obviously, was large language models. 239 00:25:55,940 --> 00:25:59,690 So what was different about these models from the point of view of training and inference? 240 00:26:00,020 --> 00:26:06,200 It used to be that in traditional machine learning, for every task, for every use case, 241 00:26:06,200 --> 00:26:09,710 you are going to train a model, and then you are going to serve that model. 242 00:26:09,800 --> 00:26:18,680 In contrast, large language models are much more general, and they can do multiple tasks. 243 00:26:18,980 --> 00:26:23,090 So you train a large language model, and then you can use it for multiple tasks: 244 00:26:23,090 --> 00:26:26,900 you train once, and you do the inference for many, many tasks. 245 00:26:26,930 --> 00:26:31,850 Okay. So that means that inference actually becomes even more important. 246 00:26:32,330 --> 00:26:38,570 And a lot of this was happening by 2023; 247 00:26:38,780 --> 00:26:50,570 this is when we started this research. And at that time it was very expensive to serve these models. 248 00:26:51,260 --> 00:26:58,310 Okay. If you remember, each GPU could only serve a handful of requests per second. 249 00:26:58,580 --> 00:27:05,540 That was the A100; if you remember, that was the top GPU in 2023. 250 00:27:05,720 --> 00:27:13,520 Um, okay. And so, you know, the cost per request was very high. 251 00:27:14,390 --> 00:27:24,140 And why is that? Everyone here knows this, so I'm going to go quickly: it's because, as you know, these models are autoregressive models, 252 00:27:24,380 --> 00:27:30,620 and when they generate a token, they generate the new token based on all the previous tokens. 253 00:27:31,130 --> 00:27:35,570 Okay. So you have these strong dependencies, this attention mechanism. 254 00:27:36,080 --> 00:27:47,570 And so, generating one token at a time, because you have the dependency, doesn't exploit the GPU parallelism. 255 00:27:47,870 --> 00:27:54,560 Right.
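[A toy illustration of the autoregressive dependency just described; the "model" here is a stand-in, not a real LLM:]

    import random

    def toy_model(tokens):                 # stand-in for one transformer forward pass
        random.seed(sum(tokens))           # pretend the logits depend on the whole prefix
        return [random.random() for _ in range(100)]

    tokens = [1, 2, 3]                     # the prompt
    for _ in range(10):                    # strictly one token per forward pass
        logits = toy_model(tokens)         # must attend over all previous tokens
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    print(tokens)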
The GPUs are fast and have high, you know, compute capabilities, 256 00:27:54,830 --> 00:27:58,740 but they deliver that only when you can exploit their parallelism. Right? 257 00:27:58,830 --> 00:28:02,340 But now you cannot exploit it, because of these dependencies, for one request. 258 00:28:02,700 --> 00:28:07,200 So of course, if you cannot exploit it for one request, then what is the solution? 259 00:28:08,190 --> 00:28:12,450 Well, I'm going to try to serve multiple requests at the same time. 260 00:28:13,140 --> 00:28:16,710 Right. So you batch the requests: you are going to process a batch of requests. 261 00:28:17,910 --> 00:28:21,389 The problem is that now you are going to run out of memory; 262 00:28:21,390 --> 00:28:24,450 the memory becomes a bottleneck. So why is that? 263 00:28:24,780 --> 00:28:32,669 Well, because, again, of this dependency: to generate the next token you need to store state in memory. 264 00:28:32,670 --> 00:28:36,660 And this is how the memory is organised. You have the A100 with 40 GB. 265 00:28:36,940 --> 00:28:40,850 Then there are the weights of, at that time, a 13-billion-parameter LLM; 266 00:28:41,100 --> 00:28:46,860 that was a LLaMA model, and the precision at that time was two bytes per parameter. 267 00:28:47,100 --> 00:28:52,379 So you have 26 GB, which is 65% of the entire GPU memory. 268 00:28:52,380 --> 00:28:57,510 And then you have the activations, a few percent, and then the KV cache. 269 00:28:57,540 --> 00:29:03,390 So, okay, we need to keep this data, the temporary data, to generate the next token. 270 00:29:03,630 --> 00:29:10,680 In this KV cache you need to keep the embeddings, you know, for the entire prefix, right, 271 00:29:10,680 --> 00:29:15,920 for each request. Right. And each token, each embedding, was like one megabyte, 272 00:29:15,930 --> 00:29:19,200 so one request can need several gigabytes. Okay. 273 00:29:19,560 --> 00:29:23,370 So that's a problem. Okay.
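[The memory arithmetic above, spelled out; the layer and hidden-size figures are the published LLaMA-13B ones, consistent with the roughly one megabyte per token mentioned in the talk:]

    gpu_memory = 40e9                       # A100: 40 GB
    weights = 13e9 * 2                      # 13B parameters x 2 bytes (fp16) = 26 GB
    print(weights / gpu_memory)             # ~0.65, i.e. 65% of the GPU memory

    layers, hidden = 40, 5120               # LLaMA-13B architecture
    kv_per_token = 2 * layers * hidden * 2  # keys + values, fp16: ~0.8 MB per token
    print(kv_per_token / 1e6)               # so a 2048-token request needs ~1.7 GB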
274 00:29:25,280 --> 00:29:36,290 And when we looked, actually, this was compounded by the fact that, before, the memory was contiguously allocated. 275 00:29:36,430 --> 00:29:42,050 That's the easiest way to allocate memory: you contiguously allocate the memory you are going to need. 276 00:29:42,320 --> 00:29:45,469 Right. And, because it's contiguous, 277 00:29:45,470 --> 00:29:49,480 you also need to provide space for the future tokens. 278 00:29:49,850 --> 00:29:56,929 Okay. Of course, you know, if you took Systems 101, you know 279 00:29:56,930 --> 00:30:02,300 that contiguous memory allocation can lead to a lot of fragmentation. 280 00:30:02,750 --> 00:30:09,580 You can get internal fragmentation, because I am going to allocate space for a request, 281 00:30:09,590 --> 00:30:12,830 but I don't know, eventually, the length of the output. Right. 282 00:30:12,920 --> 00:30:16,880 And because I don't know, I'm probably going to allocate more than I should. 283 00:30:17,060 --> 00:30:23,270 So you have fragmentation there. You also have, um, and this is an interesting one, 284 00:30:23,930 --> 00:30:31,910 the interesting one is the following. Let's say that you know exactly how long the output is, okay? 285 00:30:32,090 --> 00:30:38,690 So then I can allocate exactly a block of memory to fit all the output and no more. 286 00:30:39,170 --> 00:30:43,220 Well, it turns out that even this is not optimal. So why is it not optimal? 287 00:30:43,490 --> 00:30:46,580 Say I have to allocate 1,000 token slots for the output. 288 00:30:46,940 --> 00:30:50,720 Now, these slots are going to be filled sequentially, one by one, right? 289 00:30:50,960 --> 00:30:59,090 So initially I have 1,000 slots free; I generate the next token, and now one is used, 999 are free, and so forth. 290 00:30:59,480 --> 00:31:07,730 So during the process, you know, dynamically, I am going to waste a lot of token slots, space which is allocated for the future, 291 00:31:08,090 --> 00:31:13,549 right, which is not yet used. Those slots are going to be used eventually, but for a long time they are not used. 292 00:31:13,550 --> 00:31:18,080 So this is a kind of reservation waste. And there is external fragmentation too; 293 00:31:18,080 --> 00:31:22,130 I'm not going to talk much about that. 294 00:31:22,540 --> 00:31:27,080 And at that stage, this was how the memory was used. 295 00:31:27,470 --> 00:31:32,420 The only thing you need to look at here is the green part; 296 00:31:32,480 --> 00:31:36,470 the other colours are the different kinds of fragmentation. 297 00:31:36,650 --> 00:31:41,209 The green is the effective memory which was used by 298 00:31:41,210 --> 00:31:49,790 the inference systems at that time, and for most of it, the largest is 38%. Where you see 38%, 299 00:31:49,790 --> 00:31:57,019 you have this kind of oracle. The oracle is what I told you: assuming that you know the future, 300 00:31:57,020 --> 00:32:03,380 how long the request is, still you are going to effectively use only 38% of the memory. If you don't know the future, 301 00:32:03,390 --> 00:32:07,610 the best at that time was 26.8%. The rest was basically wasted. 302 00:32:09,140 --> 00:32:16,010 So, you know, what is the solution? You have fragmentation and similar challenges with segments. 303 00:32:16,010 --> 00:32:21,510 If you remember your operating systems course: what do you do there? Paging. 304 00:32:21,510 --> 00:32:27,270 You have paging, with pages of fixed size, you know, a few kilobytes and so forth, 305 00:32:27,870 --> 00:32:33,659 and then there is no external fragmentation between these pages. 306 00:32:33,660 --> 00:32:46,530 And then you still provide to the application the illusion of, you know, a nice address space, which is contiguous, 307 00:32:46,800 --> 00:32:50,700 but this address space is going to be mapped to different pages under the hood. 308 00:32:50,790 --> 00:32:56,930 Right? This is exactly what happens here. So this is OS memory management: you have the processes, 309 00:32:56,940 --> 00:33:00,959 and they have their virtual memory, which is contiguous for each of them, 310 00:33:00,960 --> 00:33:08,670 and the virtual memory under the hood is mapped to these pages, which are all over the place in the physical memory. 311 00:33:09,150 --> 00:33:15,240 The same thing we are going to do here. These are the logical KV blocks, 312 00:33:15,240 --> 00:33:18,480 so everything is contiguous, each block after another, 313 00:33:18,930 --> 00:33:27,750 but in the physical memory, the different blocks are going to be spread across different locations.
314 00:33:27,990 --> 00:33:36,960 A block is of fixed size and consists of the KV entries of several tokens, in this case four tokens. 315 00:33:37,560 --> 00:33:44,890 Okay. So then, how do you go from the logical to the physical block space? 316 00:33:44,910 --> 00:33:50,640 Well, you have a mapping: you have a table which indexes 317 00:33:51,330 --> 00:33:54,720 each logical block to a physical block in memory. 318 00:33:54,900 --> 00:34:05,940 And then you have another field, the "filled" field here, which basically says how much of that block has been filled with tokens. 319 00:34:06,420 --> 00:34:09,420 Okay. So in this case, for block one it is two. 320 00:34:09,840 --> 00:34:13,290 And now you are going to generate the next token, in this case, 321 00:34:13,290 --> 00:34:18,350 and you are going to put it in the corresponding block in the physical memory, and so forth. 322 00:34:18,360 --> 00:34:25,140 And when the block is full, you allocate another block in the physical memory. 323 00:34:25,440 --> 00:34:36,000 Right. So it's kind of simple. And again, this also allows, if you have multiple, you know, 324 00:34:36,020 --> 00:34:41,250 requests and so forth that share the same prefix, to share that prefix more efficiently. 325 00:34:41,790 --> 00:34:51,090 So, you know, you can do parallel sampling, or multi-turn, when you have multiple conversations, or anything like that.
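[A toy version of the logical-to-physical block mapping just described; illustrative only, not vLLM's actual data structures:]

    BLOCK_SIZE = 4                          # tokens per KV block, as in the example

    free_blocks = list(range(100))          # ids of free physical blocks
    physical = {}                           # physical block id -> per-token KV entries

    class BlockTable:                       # one per request
        def __init__(self):
            self.blocks = []                # logical block index -> physical block id
            self.filled = 0                 # how many token slots are used

        def append(self, kv):
            if self.filled % BLOCK_SIZE == 0:        # last block full (or none yet):
                bid = free_blocks.pop()              # grab any free physical block
                self.blocks.append(bid)
                physical[bid] = []
            physical[self.blocks[-1]].append(kv)     # fill the next slot
            self.filled += 1

    req = BlockTable()
    for t in range(6):                      # 6 tokens -> two blocks, second half full
        req.append(("k%d" % t, "v%d" % t))
    print(req.blocks, req.filled)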
326 00:34:51,210 --> 00:34:58,950 Okay. Now let me tell you, because it's interesting: it's very much like paging in, kind of, 327 00:34:58,980 --> 00:35:05,879 virtual memory in an operating system, but there are also a few differences. 328 00:35:05,880 --> 00:35:11,010 So, similarities: the KV blocks are very similar to pages. 329 00:35:11,340 --> 00:35:18,270 And you can share pages across processes; similarly here, if you have common prefixes for the, 330 00:35:18,270 --> 00:35:24,479 for the requests, or if you have a system prompt: the system prompt is used for every prompt, right, 331 00:35:24,480 --> 00:35:28,890 so it can be shared across the prompts of all the requests. 332 00:35:29,730 --> 00:35:32,400 But there are some differences. Okay. 333 00:35:34,080 --> 00:35:42,240 So, when you talk about evictions: when you run out of physical memory space, what do you do? 334 00:35:42,270 --> 00:35:46,200 You evict pages, right? To make room to bring a new page in. 335 00:35:46,470 --> 00:35:49,770 Right. And the evicted page is typically stored on, 336 00:35:50,100 --> 00:35:57,150 you know, swap space on the disk, and then you can bring it back later: 337 00:35:57,540 --> 00:36:07,949 a slower storage, but a bigger storage. So in this case there are two differences. You see, because of the dependency, you know, 338 00:36:07,950 --> 00:36:14,040 when you need to generate the next token, you need all the tokens before that. 339 00:36:14,340 --> 00:36:21,510 Okay? So it doesn't make sense to just page out one block; 340 00:36:21,810 --> 00:36:25,260 you have to page out all of them. Right. Okay, 341 00:36:25,410 --> 00:36:29,430 because you still need all of them, and you are going to need them to get the next token, right? 342 00:36:29,460 --> 00:36:33,030 So you do it, I think you can call it, at request level: 343 00:36:33,330 --> 00:36:39,750 request-level eviction versus page-level eviction, which will free more memory for the other requests. The other difference is also interesting. 344 00:36:40,080 --> 00:36:48,300 Like I said, in the operating system you do page-in, page-out, and on page-out you store the page on the disk. 345 00:36:48,930 --> 00:36:56,520 But here, because, you know, the GPUs have so much compute, 346 00:36:56,760 --> 00:37:03,629 sometimes it's more efficient to throw away the block and to 347 00:37:03,630 --> 00:37:05,370 recompute it, to rematerialise it when you need it, 348 00:37:05,880 --> 00:37:12,720 because the recomputation is faster than storing it, in this case in CPU memory, and bringing it back. 349 00:37:13,170 --> 00:37:17,790 Okay, so these are some of the differences. So what was the result? 350 00:37:17,850 --> 00:37:25,170 The result we got was 96% in terms of memory utilisation at that time. 351 00:37:25,350 --> 00:37:28,560 And this converts directly into throughput. Right. 352 00:37:28,800 --> 00:37:32,280 So 96%, compared to this 26.8%, 353 00:37:32,580 --> 00:37:37,020 is almost four times higher throughput, because you can now fit in a batch, 354 00:37:37,020 --> 00:37:40,800 and process at the same time, four times more requests. 355 00:37:41,130 --> 00:37:45,870 Okay. And this is actually one of the fastest-growing 356 00:37:46,970 --> 00:37:52,240 projects I have seen or was part of. This is the GitHub star history of vLLM. 357 00:37:52,610 --> 00:37:59,480 SGLang, unfortunately, I don't have time to go over; it's another project we developed in the lab. 358 00:38:00,020 --> 00:38:03,650 It was a top open-source project on GitHub in 2025. 359 00:38:03,770 --> 00:38:07,190 And I also wanted to show how hard these things are in practice. 360 00:38:07,550 --> 00:38:10,940 Um, and this is about, uh, you know, 361 00:38:10,940 --> 00:38:17,149 what is shown here: the number of models which were supported, 362 00:38:17,150 --> 00:38:22,400 as of September last year, by vLLM, and the number of accelerators it supports. 363 00:38:22,520 --> 00:38:27,950 It's 260 accelerators, and, I don't know, it's like 500 model architectures. 364 00:38:28,130 --> 00:38:33,320 It's very, very complicated, very complex. Um, okay. 365 00:38:33,440 --> 00:38:38,059 So, yeah, what I want to say here is about PagedAttention: 366 00:38:38,060 --> 00:38:42,680 this technique, the artefact behind vLLM, PagedAttention, 367 00:38:42,680 --> 00:38:49,339 is used now by all of these inference systems in the industry, not only vLLM. 368 00:38:49,340 --> 00:38:53,059 So it's nice to see that. So, the final one, and it'll take me ten minutes, 369 00:38:53,060 --> 00:38:58,570 so this is fine. It's Chatbot Arena: rethinking LLM 370 00:38:58,610 --> 00:39:06,169 evaluation. So this is a different one, and, you know, I think it's a nice study, and it's the end of the talk, 371 00:39:06,170 --> 00:39:09,140 so hopefully it's going to be more engaging. 372 00:39:09,410 --> 00:39:20,630 So, look: this is David Patterson, the Turing Award winner, and he has this famous quote: for better or worse, benchmarks shape a field. 373 00:39:20,900 --> 00:39:29,059 Right. And this is absolutely true about AI. You have ImageNet and CIFAR and all of these early benchmarks.
374 00:39:29,060 --> 00:39:35,660 And you have more and more benchmarks now, right, to compare and to see which models are better. 375 00:39:36,230 --> 00:39:49,040 So here is our story. When Meta open-sourced its set of models, the open-source LLaMA, in 2023, 376 00:39:49,040 --> 00:39:52,309 in February, a group of students 377 00:39:52,310 --> 00:39:59,209 then took that model and took this ShareGPT data; ShareGPT is a dataset of the prompts 378 00:39:59,210 --> 00:40:08,300 and the results that people were sharing over the web, and they used that to fine-tune, right, to further train this 379 00:40:08,330 --> 00:40:12,710 LLaMA model, and they called the model Vicuna. 380 00:40:13,130 --> 00:40:20,110 Okay. And this was high-quality data, ShareGPT, because people only share the prompts and the results 381 00:40:20,270 --> 00:40:23,390 they are, you know, excited about, 382 00:40:23,480 --> 00:40:24,440 that they are proud about. 383 00:40:24,860 --> 00:40:33,920 Okay. And the trend here, which we confirmed, is the rise of chatbots, because you are going to do Q&A via a chatbot interface. 384 00:40:34,130 --> 00:40:38,540 Those were the ChatGPT early days. And the question 385 00:40:38,540 --> 00:40:44,989 is about how you are going to evaluate these models. You had these static benchmarks at that time; 386 00:40:44,990 --> 00:40:48,200 I mean, you had many of them, but they are static. 387 00:40:48,650 --> 00:40:52,760 Static means they are also prone to contamination. By contamination, 388 00:40:52,760 --> 00:41:00,350 what I mean is that these large language models, as you know, are trained on all the data which is available on the internet, 389 00:41:00,350 --> 00:41:04,040 and these benchmarks are available on the internet. So models are going to be trained on these benchmarks, 390 00:41:04,250 --> 00:41:09,530 and then they are going to be evaluated on the same benchmarks. And static benchmarks also do not capture human preferences. 391 00:41:09,980 --> 00:41:13,490 Right? It's not only about correctness; it's about the style and so forth, too. 392 00:41:13,790 --> 00:41:18,559 So that's kind of the problem, okay? So how do you do it? Now, about contamination, 393 00:41:18,560 --> 00:41:27,530 here is an example to show you that it's real. This is some message I think was on Twitter. 394 00:41:28,190 --> 00:41:41,240 Um, so, Codeforces. Right. It turns out that GPT was solving ten out of ten correctly, you know, for problems from before 2021, and zero out of ten after that. 395 00:41:41,600 --> 00:41:48,330 And that was, incidentally, the cut-off date for the data used to train GPT, right? 396 00:41:48,350 --> 00:41:54,080 Of course: if you are trained on the solutions, you are going to remember all these solutions. 397 00:41:54,320 --> 00:41:57,590 So this is a real problem. So you need human evaluation. 398 00:41:57,590 --> 00:42:05,380 So we did it. What do you do as faculty, right? Okay, so you get some students, you give them some pizza, and you say: okay, ask, 399 00:42:05,390 --> 00:42:09,200 and here are the prompts, the different models, and see which is better, okay? 400 00:42:09,920 --> 00:42:13,520 So we realised that this doesn't really scale, okay? 401 00:42:14,060 --> 00:42:20,450 And why doesn't it scale? Because the answers and the evaluations take time; the answers are not obvious. 402 00:42:20,450 --> 00:42:24,290 It's not like a math problem, where
this is good, and that's it. And here are two examples. 403 00:42:25,220 --> 00:42:30,590 One of them, the question is: develop a Python program that reads all the text files 404 00:42:31,850 --> 00:42:37,130 under a directory and returns the top five words with the most number of occurrences. 405 00:42:38,030 --> 00:42:43,410 Okay. You know, so everyone here can do that. 406 00:42:44,100 --> 00:42:47,830 And I'm not going to ask you to program it, no, that's not the question. 407 00:42:47,850 --> 00:42:54,240 But I am going to give you two solutions, provided by two large language models, two chatbots, 408 00:42:54,810 --> 00:43:02,790 and I'm going to ask you which one is better. How many of you think it's 409 00:43:04,390 --> 00:43:09,430 assistant A? 410 00:43:09,710 --> 00:43:21,020 And what about assistant B? Okay. Now, I know that Oxford is very strong in all the other departments too, 411 00:43:21,620 --> 00:43:26,420 so here is a biology one. Photosynthesis is a vital 412 00:43:27,640 --> 00:43:33,430 process for life on Earth. Could you outline the two main stages of photosynthesis, 413 00:43:33,760 --> 00:43:40,090 including where they take place within the chloroplast, and the primary inputs and outputs for each stage? 414 00:43:42,510 --> 00:43:49,710 Okay. I'm not going to ask you to answer it; I'm going to give you two answers, and I'm going to ask you again which one is better. 415 00:43:59,860 --> 00:44:03,830 How many say A? How many say B? 416 00:44:05,660 --> 00:44:14,840 Okay. So now we have a challenge, you know, on our hands: how are we going to scale this? 417 00:44:15,060 --> 00:44:23,370 Right. It turns out, well, what we did is: GPT-4 was released two weeks before, right? 418 00:44:23,730 --> 00:44:32,790 And, you know, students are very inventive; they said, why don't you use this? 419 00:44:32,820 --> 00:44:36,350 Use GPT-4 420 00:44:37,050 --> 00:44:43,260 as the judge, right? Say: okay, GPT, this is the question, these are the answers; you know, tell me which one is the best. 421 00:44:43,380 --> 00:44:48,180 And each answer was graded from 1 to 10, or something like that. 422 00:44:49,370 --> 00:44:53,540 Okay. And that was, you know, LLM-as-a-judge, right? 423 00:44:53,570 --> 00:44:57,469 And there is, you know, a paper about that and things like that. 424 00:44:57,470 --> 00:45:03,500 So this is, I think, the first use of an LLM as a judge in the open; 425 00:45:03,500 --> 00:45:08,450 I think Microsoft did it when they did the internal evaluation of GPT, 426 00:45:09,110 --> 00:45:12,410 you know, they had a very strong relationship at that time with OpenAI, 427 00:45:13,100 --> 00:45:16,520 but this was the first application in the open. Okay. 428 00:45:17,060 --> 00:45:23,930 And we applied it. And, you know, for instance, for the first question, the Python question, it turns out 429 00:45:24,290 --> 00:45:32,570 that assistant A didn't handle case sensitivity, punctuation, and so forth, 430 00:45:32,750 --> 00:45:43,110 so in this case B was better. And in the second case, while A's answer looks much more comprehensive and detailed, 431 00:45:43,170 --> 00:45:48,900 actually, again, B is better, because A confuses which are the inputs and outputs of the two stages. 432 00:45:50,280 --> 00:45:53,990 Okay. So when you look at it with this explanation, now it's much easier, right?
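[For reference, a sketch of a correct answer to the first question above; note the lower-casing and punctuation handling that assistant A reportedly missed:]

    import os, re
    from collections import Counter

    def top_words(root, n=5):
        counts = Counter()
        for dirpath, _, files in os.walk(root):         # all text files under root
            for name in files:
                if name.endswith(".txt"):
                    with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                        # lower-case and strip punctuation so "Word" == "word!"
                        counts.update(re.findall(r"[a-z']+", fh.read().lower()))
        return counts.most_common(n)

    print(top_words("."))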
433 00:45:54,000 --> 00:45:58,230 It's much easier to do, okay? So we did that. 434 00:45:58,290 --> 00:46:01,920 But people came back to us and said: hey, yeah, 435 00:46:01,980 --> 00:46:08,940 and this was very early on, it sounds good, but these are anecdotes, you know, anecdotal data points. 436 00:46:09,300 --> 00:46:15,060 But really, how does it perform? You are back at square one: you have to have human evaluation to compare against. 437 00:46:15,330 --> 00:46:18,810 Right. And, again, how to scale it? 438 00:46:19,260 --> 00:46:25,350 So, human evaluation: ideally, for every question you want to rank the LLMs. 439 00:46:25,590 --> 00:46:29,700 I have one question, I have the answers from all of them: rank them, right? 440 00:46:29,700 --> 00:46:38,400 And then you decide which is better. Now, we know that full ranking is hard, and it's easier to pick the best out of n, right? 441 00:46:38,760 --> 00:46:42,540 But even that is hard; you know, this paradox of choice. 442 00:46:42,720 --> 00:46:46,980 So the easiest thing is to pick the best of two: 443 00:46:47,340 --> 00:46:52,530 you cannot do, you know, less than that; it's the least you can do. 444 00:46:52,740 --> 00:46:58,280 So that's the first angle. Okay: pick the best answer between two LLMs. 445 00:46:58,290 --> 00:47:03,180 And now there are many ways you can do it. One: you organise a tournament, right, 446 00:47:03,600 --> 00:47:07,260 for each question, and so forth. 447 00:47:07,920 --> 00:47:11,910 The problem is that it doesn't scale, and it doesn't scale for many, many reasons. 448 00:47:12,270 --> 00:47:16,059 One is because you have to play n-squared games, right: 449 00:47:16,060 --> 00:47:20,430 everyone needs to play everyone. But the other, the bigger problem, is that it's static. 450 00:47:21,580 --> 00:47:25,760 Right. You cannot enter a new team into the competition, right, 451 00:47:25,780 --> 00:47:34,659 unless it ends. And this doesn't fit a world with, you know, new models being released, you know, every month, 452 00:47:34,660 --> 00:47:39,010 every week even. Okay. But fortunately, there is another way to do it, 453 00:47:39,010 --> 00:47:45,030 and this is our key idea here: you know, ratings, right? 454 00:47:45,040 --> 00:47:49,829 Like in chess. And there are many other similar ratings in other sports; 455 00:47:49,830 --> 00:47:53,700 almost every sport has one, in which you can have a meaningful rating 456 00:47:54,000 --> 00:48:00,440 with players not playing everyone against everyone. And it takes into account the strength of the opponent: 457 00:48:00,780 --> 00:48:03,930 if you are going to beat a stronger opponent, you are going to get a bigger bump; 458 00:48:04,170 --> 00:48:07,800 if you are going to beat a weaker opponent, you may get a very little bump. 459 00:48:08,010 --> 00:48:12,470 Okay, so that's the basic idea. So this is what we developed; this is Chatbot Arena. 460 00:48:12,480 --> 00:48:16,530 So what we did: we provided this kind of interface to people. 461 00:48:17,130 --> 00:48:20,940 We provided these models for free: you ask questions, you get two answers. 462 00:48:20,970 --> 00:48:25,200 This is, you know, the very early days, so this question is very simplistic, 463 00:48:25,200 --> 00:48:33,330 and the answers in this example are also simplistic. And then people can say A is better, B is better, it's a tie, or both are bad. 464 00:48:33,900 --> 00:48:38,270 And then you compute a rating. We use a,
465 00:48:38,280 --> 00:48:41,840 sorry, we use a Bradley-Terry model 466 00:48:43,190 --> 00:48:50,590 to compute this rating, the Arena rating, and we rank these models with proper confidence intervals, 467 00:48:50,610 --> 00:48:54,920 something like that. Okay. So that's what we've done. Okay. 468 00:48:54,930 --> 00:49:01,260 There are many categories here for which you can rank these models; 469 00:49:01,500 --> 00:49:04,650 we had a lot of them. Uh, yeah. 470 00:49:04,680 --> 00:49:13,770 And, uh, let me just, okay. Now you have user evaluation, and you have LLM-as-a-judge, right? 471 00:49:13,890 --> 00:49:20,820 So now you can do a proper study. And this was actually in the same paper, the LLM-as-a-judge paper. 472 00:49:21,300 --> 00:49:33,120 And what did we find? The findings are that LLMs are biased when they make judgements, very much like humans. 473 00:49:33,450 --> 00:49:40,200 You have, for instance, positional bias: they prefer the first answer. If you give them a question, answer one, answer two, 474 00:49:40,350 --> 00:49:45,690 they prefer the first one. Verbosity bias: sometimes they prefer longer answers. 475 00:49:46,080 --> 00:49:50,460 Self-enhancement bias: they slightly prefer answers from themselves, 476 00:49:50,520 --> 00:49:53,700 okay, or from others in the same model family. 477 00:49:53,820 --> 00:49:57,720 And at the same time, they have limited reasoning capabilities: 478 00:49:57,810 --> 00:50:05,990 they are not good at math. Okay. And then we looked at the agreement between two humans, right, 479 00:50:06,000 --> 00:50:09,420 experts, and also between humans and large language models. 480 00:50:09,900 --> 00:50:15,150 And, you know, at the end of the day, it was quite similar: 481 00:50:15,450 --> 00:50:23,430 the human-to-human agreement was 81%, and the human-to-GPT-4 agreement was 85%. 482 00:50:24,030 --> 00:50:28,950 Okay, so that's quite close. We also have many other modalities. 483 00:50:29,400 --> 00:50:34,530 This is one question I asked last night: please show an image of Oxford University at 4 p.m. in February. 484 00:50:35,490 --> 00:50:40,750 Okay, it's very, very close to that. I selected A as better. 485 00:50:41,580 --> 00:50:46,210 And, yeah, this works again across many modalities. 486 00:50:46,660 --> 00:50:50,210 So, what can you do with the data? I'm going to be very quick here. 487 00:50:50,230 --> 00:50:54,400 Um, let me just. So here is one idea: 488 00:50:54,430 --> 00:51:00,280 a per-prompt leaderboard. So from this data we developed a model, and I think this is, 489 00:51:00,280 --> 00:51:04,420 uh, you can look at this equation on the slide; I'll tell you what is here. 490 00:51:04,780 --> 00:51:09,760 So what you do here: you have a prompt, and if I have a prompt, I can generate, 491 00:51:10,880 --> 00:51:16,070 you know, on the fly, a rating for the different models for that prompt. 492 00:51:16,340 --> 00:51:23,870 So how do you do that? Maybe we have never seen that prompt; however, we've seen a lot of prompts similar to that prompt. 493 00:51:24,590 --> 00:51:29,950 So you use the votes on the prompts 494 00:51:29,960 --> 00:51:38,950 which are similar to the new prompt as a proxy, to compute the leaderboard for that particular prompt. 495 00:51:38,960 --> 00:51:40,610 And this has many applications.
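[A minimal sketch of fitting a Bradley-Terry model to pairwise votes, in the spirit of the Arena rating described above; this is not the project's actual pipeline, and the Elo-style rescaling at the end is a common convention assumed here:]

    import numpy as np

    def fit_bradley_terry(n_models, battles, iters=2000, lr=0.5):
        """battles: list of (winner_index, loser_index) pairs."""
        s = np.zeros(n_models)                         # log-strength per model
        for _ in range(iters):
            g = np.zeros_like(s)
            for w, l in battles:
                p_win = 1.0 / (1.0 + np.exp(s[l] - s[w]))  # P(w beats l)
                g[w] += 1.0 - p_win                    # gradient of the log-likelihood
                g[l] -= 1.0 - p_win
            s += lr * g / len(battles)                 # gradient ascent
            s -= s.mean()                              # fix the scale
        return 400.0 * s / np.log(10.0) + 1000.0       # Elo-like scale

    # Model 0 beats model 1 twice and loses once, so it gets the higher rating.
    print(fit_bradley_terry(2, [(0, 1), (0, 1), (1, 0)]))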
496 00:51:40,910 --> 00:51:51,470 You can, um, you know, given a cost target, you can maximise the Arena score, or given an Arena score target, you can minimise the cost. 497 00:51:51,680 --> 00:52:05,000 Right. And with the models available at the time we wrote the paper, with this, you know, for the same cost you can get 30 more Elo points, 498 00:52:05,540 --> 00:52:10,820 um, better models, and for the same accuracy, 499 00:52:10,830 --> 00:52:16,130 the Arena score, you can get 2x cheaper by using this prompt leaderboard and picking the right model. 500 00:52:16,430 --> 00:52:25,000 Okay. You know, it has been very rewarding, because right now most foundation labs, 501 00:52:25,270 --> 00:52:35,690 before they release a model, they, um, uh, evaluate these models also on, uh, the Arena. 502 00:52:35,720 --> 00:52:40,360 Now it is called LMArena. Okay. And we have a lot of user votes and things like that. 503 00:52:41,380 --> 00:52:47,830 Yeah, and some, uh, people tweeting when the new results come out. 504 00:52:48,370 --> 00:52:51,880 Okay. So I am going to go very briefly now. 505 00:52:52,540 --> 00:52:55,659 So I am going to have these three, uh, slides, 506 00:52:55,660 --> 00:53:00,260 and then I am going to go to the conclusions. Uh, so what are the lessons we learned? 507 00:53:00,280 --> 00:53:07,450 I think these are important; I do not want to skip any of these. Lesson number one is that trends do matter, right? 508 00:53:07,900 --> 00:53:14,020 For Ray, it was the emergence of complex AI workloads and heterogeneous distributed systems; what was really important for vLLM 509 00:53:14,020 --> 00:53:21,910 was the autoregressive nature of LLMs. So the problem changes, and if the problem changes, in most of the cases the solution is going to change, right? 510 00:53:22,360 --> 00:53:33,909 And so chatbots, you know, coming out of basically all the big labs, and complex AI workloads, are what actually drove us to work on these systems, 511 00:53:33,910 --> 00:53:41,530 and are actually also the drivers behind the problems that had impact. Lesson two: 512 00:53:41,560 --> 00:53:50,680 simple solutions matter, obviously. Um, you know, Ray, as you have seen, has a minimalist API. 513 00:53:50,880 --> 00:53:59,020 In vLLM, PagedAttention is quite simple. Uh, LLM-as-a-judge and head-to-head evaluation using, you know, Elo-like ratings are very simple ideas, 514 00:53:59,020 --> 00:54:07,690 maybe obvious, too obvious in retrospect. And the reason this matters is that it is easier for other people to understand too, right, 515 00:54:07,690 --> 00:54:11,980 and they apply them. Like I showed you, PagedAttention is used now by everyone. 516 00:54:12,670 --> 00:54:17,770 And also, you know, Elo-like ratings: you see a lot of other people doing it, or LLM-as-a-judge. 517 00:54:18,160 --> 00:54:23,560 And this is related to the impact, because if you understand how something is working and how to do it, 518 00:54:23,690 --> 00:54:27,790 it is much more likely you are going to apply it, versus something that is hard to understand. 519 00:54:28,420 --> 00:54:34,390 And, uh, the other one I want to say is that, um, 520 00:54:35,560 --> 00:54:43,360 you have to plan for flexibility and rewriting when you develop these kinds of systems, because you do not have the requirements up front; 521 00:54:43,360 --> 00:54:47,469 the requirements evolve as you develop the system.
522 00:54:47,470 --> 00:54:53,440 So, you know, we already rewrote Ray, uh, four times, uh, and, um, 523 00:54:53,440 --> 00:55:00,640 vLLM we already rewrote once, and it is, again, only a two-and-a-half-year-old project, and now we are at it another time. 524 00:55:00,660 --> 00:55:11,290 Right. And so forth. Um, so I am going to skip over the next one, and I am going to move to the questions. 525 00:55:11,290 --> 00:55:17,049 So, um, the summary is that, um, I do believe in open source. 526 00:55:17,050 --> 00:55:20,530 Everything we have done, virtually, is open source. Actually, in the lab, 527 00:55:20,530 --> 00:55:24,850 I mean, you know, in my lab, we do not file for any patents. 528 00:55:25,090 --> 00:55:33,820 So everything is for public, uh, consumption. And, you know, I talked today about three of these, uh, projects, right: 529 00:55:34,390 --> 00:55:38,230 Ray, vLLM, and Chatbot Arena. And I do believe that, you know, 530 00:55:38,230 --> 00:55:44,800 we are just scratching the surface, and there will be many, many more, uh, challenges we are going to have to address. 531 00:55:45,070 --> 00:55:51,430 And I hope that most of them will be addressed in, uh, the open, using open source technologies. 532 00:55:51,700 --> 00:55:52,210 Thank you.