1 00:00:02,520 --> 00:00:05,860 [Auto-generated transcript. Edits may have been applied for clarity.] 2 00:00:14,570 --> 00:00:18,649 Okay, let us, uh, slowly start. Thank you very much for coming, 3 00:00:18,650 --> 00:00:21,770 and in such a nice number, to this wonderful place. 4 00:00:22,100 --> 00:00:25,850 It is a great pleasure to introduce Professor Ion Stoica. 5 00:00:26,270 --> 00:00:34,260 Ion is a professor at UC Berkeley, where he holds the Xu Bao Chancellor's Chair in computer science. 6 00:00:34,280 --> 00:00:43,190 It is exciting: he has been one of the most influential computer scientists in the field of distributed systems. 7 00:00:43,910 --> 00:00:51,350 He has been instrumental in developing systems that have shaped how large-scale data processing, 8 00:00:51,380 --> 00:01:00,380 data analytics, and AI workloads are built and deployed today. 9 00:01:01,310 --> 00:01:09,230 Some of these contributions include Apache Spark, Apache Mesos, SkyPilot, and Ray. 10 00:01:09,920 --> 00:01:13,370 In my time as a PhD student, peer-to-peer networks were popular, 11 00:01:13,370 --> 00:01:18,080 and there he was, with distributed hash tables and the system called Chord. 12 00:01:18,860 --> 00:01:24,320 And most importantly, all these amazing projects are open-source projects. 13 00:01:25,190 --> 00:01:35,630 Now, Ion has also been a very successful entrepreneur. He has co-founded Databricks and Anyscale, both companies based on his open-source projects. 14 00:01:36,050 --> 00:01:41,690 So we are delighted to welcome Ion here as our Strachey Speaker. 15 00:01:42,050 --> 00:01:58,790 Please join me in welcoming him. Thank you. Thank you, Ivan, for such a kind introduction; very happy to be here. 16 00:01:59,820 --> 00:02:05,820 So today I am going to talk about some research I've done over the past ten years or so. 17 00:02:06,540 --> 00:02:12,990 And I have to start with a disclaimer. I know that you here are famous for theoretical computer science. 18 00:02:13,590 --> 00:02:16,920 I only have one equation, at some point in my slides. 19 00:02:17,580 --> 00:02:22,170 So this is a very systems-oriented talk. And without further ado, 20 00:02:22,170 --> 00:02:30,240 let me start. So over the years I've been involved in quite a few projects, some of them more successful than others, 21 00:02:30,960 --> 00:02:34,890 and these are a few of them. 22 00:02:35,670 --> 00:02:42,750 And in this talk I am going to talk about three projects which represent a slice through my work: 23 00:02:43,560 --> 00:02:46,139 Ray, vLLM, and Chatbot Arena. 24 00:02:46,140 --> 00:02:54,620 And they can actually be composed into a mini stack: vLLM running on top of Ray, and Chatbot Arena, 25 00:02:54,640 --> 00:03:02,550 you know, running on top of vLLM. So I'm going to talk about these projects, and the template I am going to use for each project 26 00:03:02,580 --> 00:03:09,810 is this one: I'm going to talk about trends, because in many cases the trends are what shaped the work I've done; 27 00:03:10,590 --> 00:03:16,770 the challenges these trends create; and then tell you a little bit about the project and about its impact. 28 00:03:17,430 --> 00:03:24,509 So let me start.
Ray started in 2016, and the trends then, which still hold today, 29 00:03:24,510 --> 00:03:33,570 are that AI demands are growing faster than single-node capabilities, and that AI workloads are becoming more complex. 30 00:03:34,500 --> 00:03:40,559 So this is some more recent data about the growth in the computation demands of training 31 00:03:40,560 --> 00:03:49,560 the top models versus time, from 2010 to 2024; the y-axis is a log scale. 32 00:03:49,920 --> 00:03:53,730 And depending on which, uh, 33 00:03:54,610 --> 00:03:58,670 period you look at, you see that the demands have grown: 34 00:03:58,690 --> 00:04:03,430 computation demand grows between 4.2x and 6.7x every year. 35 00:04:03,700 --> 00:04:09,579 So let's say 4x every year. At the same time, the node capabilities, 36 00:04:09,580 --> 00:04:18,640 and here we are talking about GPUs: if you look at the same precision, compute grows about 1.35x every year. 37 00:04:19,120 --> 00:04:26,040 And if you look at memory capacity, it's about 1.2x every year. 38 00:04:26,050 --> 00:04:36,460 And the memory bandwidth between the GPU compute and the GPU memory grows at around the same rate, 1.2x to 1.22x. 39 00:04:37,610 --> 00:04:47,840 So if you put this together, what you see is that compute demands are growing, every year, three times faster than the compute and memory capacities of, 40 00:04:48,140 --> 00:04:51,560 say, GPUs in this case, and of course much faster than CPUs. 41 00:04:52,460 --> 00:05:01,610 And a corollary here is that even if the demands stopped growing today, so 42 00:05:01,610 --> 00:05:03,890 you are not going to have larger and larger models, 43 00:05:04,250 --> 00:05:16,100 it would still take maybe decades until you are able to train such a model on one chip. Which means that distributed computing, 44 00:05:16,100 --> 00:05:20,990 in particular heterogeneous distributed computing, is becoming more and more the norm, 45 00:05:21,590 --> 00:05:29,219 right, rather than the exception. The second point I made is that AI workloads are becoming more complex. 46 00:05:29,220 --> 00:05:32,730 And let me start with one of the older workloads, 47 00:05:32,730 --> 00:05:36,600 a recommendation system. So how do you build a recommendation system? 48 00:05:36,600 --> 00:05:41,250 You get the logs of the users' interactions with the site. 49 00:05:41,790 --> 00:05:48,240 You do some featurization, some data pre-processing, to extract the main features. 50 00:05:48,510 --> 00:05:53,640 Then you do some training to build the models that predict, you know, 51 00:05:54,210 --> 00:05:58,650 what recommendation to give you, like a movie recommendation or things like that. 52 00:05:59,070 --> 00:06:04,890 You do tuning, so when you have a new model you are going to, um, 53 00:06:05,220 --> 00:06:11,730 play a little bit with different parameters of the model to see which parameters are going to give you the best results. 54 00:06:12,150 --> 00:06:21,660 Then you push that into production, then you get new logs, and then you try to improve the model to get better and better results. 55 00:06:22,560 --> 00:06:29,250 So this is basically what was happening, and there were many more such workloads, around 2015.
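[For clarity, a quick back-of-envelope check of the growth rates above, as Python; the million-fold gap in the second part is a hypothetical illustration, not a figure from the talk:]

    # Check of the growth-rate gap described above (approximate figures from the talk).
    import math

    demand_growth = 4.0   # training-compute demand, x per year (4.2x to 6.7x cited)
    node_growth = 1.35    # single-GPU compute, x per year
    print(demand_growth / node_growth)     # ~2.96: demand pulls ahead ~3x per year

    # Even if demand froze, closing a hypothetical 10**6x gap between what a model
    # needs and what one chip provides would take decades at 1.35x per year:
    print(math.log(1e6) / math.log(1.35))  # ~46 years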
56 00:06:29,250 --> 00:06:40,739 Then you have the emergence of reinforcement learning: if you remember, DeepMind was first playing Atari games, and 57 00:06:40,740 --> 00:06:47,850 then you have AlphaGo, then, you know, reinforcement learning being applied to data centres, reducing 58 00:06:47,910 --> 00:06:53,640 data-centre energy usage for cooling, and many more uses. 59 00:06:54,570 --> 00:07:01,680 So these workloads keep coming, and it all becomes more and more complex: recommendation systems, reinforcement learning. 60 00:07:01,680 --> 00:07:08,040 Then you have batch inference, and, as of a year and a half ago now, the rise of test-time compute 61 00:07:08,820 --> 00:07:11,760 and post-training. Okay. 62 00:07:12,180 --> 00:07:21,810 So the main challenge here is that when you want to build an end-to-end application, an AI or ML application, in the past 63 00:07:21,960 --> 00:07:30,180 you had to put together all these components, right, and you need to scale them, because, as we just discussed, 64 00:07:30,180 --> 00:07:37,680 the demands of processing the data and training these models are growing faster than the capabilities of a single chip or a single node. 65 00:07:38,100 --> 00:07:48,450 Okay. So you want to scale them. And at that time, and even today in big part, you have separate distributed systems to scale each of these stages, 66 00:07:48,450 --> 00:07:56,520 like data processing or pre-processing, training, tuning, batch prediction, and so forth. 67 00:07:56,610 --> 00:08:00,599 Batch prediction, by the way, is basically, in many of the use cases, 68 00:08:00,600 --> 00:08:05,219 that before you are going to deploy the model in production, 69 00:08:05,220 --> 00:08:10,020 you are going to test it on the old logs to see how it performs, before pushing it to production. 70 00:08:10,170 --> 00:08:18,450 Okay. So in order to build your end-to-end pipeline, your end-to-end system, you need to stitch together these components, 71 00:08:18,900 --> 00:08:24,000 right? And this is challenging for a bunch of reasons. 72 00:08:24,510 --> 00:08:29,790 First, it's hard to develop, because each of these components comes with its own API, 73 00:08:30,510 --> 00:08:35,940 and it's hard to deploy and manage, because you need to manage these different distributed systems, 74 00:08:35,940 --> 00:08:43,890 and each system has its own semantics in terms of recovery from failures, data-consistency semantics, and things like that. 75 00:08:44,550 --> 00:08:54,060 It's also hard to use resources efficiently, because typically you are going to have a cluster for each of these components, 76 00:08:54,270 --> 00:08:59,370 so it's hard to share the resources across components. 77 00:08:59,760 --> 00:09:04,110 Right. So, for instance, if you do training, 78 00:09:04,110 --> 00:09:09,510 you have a cluster for training and then you have a cluster for inference, and when you are done with the training, 79 00:09:09,720 --> 00:09:12,990 that training cluster is hard to reuse for other things. 80 00:09:13,500 --> 00:09:19,889 Okay. And finally, it's also slow, because you need to move the data between these components, and the data, 81 00:09:19,890 --> 00:09:25,320 it's a lot of data. Typically you write it to a kind of blob store, like S3 or something like that.
82 00:09:25,320 --> 00:09:30,930 Then you have to read it back, so there is a lot of overhead: reading, writing, serialisation, deserialisation. 83 00:09:31,350 --> 00:09:38,610 Okay. So what Ray provides is one system which is general enough to support all these workloads. 84 00:09:39,030 --> 00:09:44,670 Okay, so this is Ray, and the component you can see at the bottom is what we call 85 00:09:44,820 --> 00:09:49,290 Ray Core, and it is a unified computing framework for distributed applications. 86 00:09:49,890 --> 00:09:55,379 And then on top of it you have a bunch of libraries, running on Ray Core, to support these different workloads. 87 00:09:55,380 --> 00:10:00,570 So now you have one system which you can use for your entire end-to-end application. 88 00:10:00,930 --> 00:10:04,200 Okay. So that's basically what Ray is. So what is the key idea 89 00:10:04,200 --> 00:10:15,270 behind Ray? What it does is take a procedural language, like Python, and generalise it to a distributed setting. 90 00:10:15,540 --> 00:10:21,810 That's basically what it is. And it focuses on flexibility, by exposing the parallelism to the developers. 91 00:10:22,290 --> 00:10:31,020 Right. So the developers can build the application with various patterns of parallelism, like nested parallelism and things like that. 92 00:10:31,680 --> 00:10:36,430 And one of the questions we have always been asked is: why not a declarative language? 93 00:10:36,430 --> 00:10:40,260 You just tell the language what to do, and then 94 00:10:41,440 --> 00:10:46,510 the back end, the compiler or whatever, is going to decide how to do it, right? 95 00:10:47,030 --> 00:10:55,750 I think there are a few challenges with that. First of all, there is no proper general-purpose declarative language. 96 00:10:56,290 --> 00:10:59,859 And I've been working on Datalog and other languages, 97 00:10:59,860 --> 00:11:03,070 so I've done a little bit of work on that. 98 00:11:03,430 --> 00:11:06,880 And, you know, adopting a new language is hard; 99 00:11:07,240 --> 00:11:12,880 let's face the truth. So we didn't want to invent one, for these reasons. 100 00:11:13,450 --> 00:11:19,930 And at the same time, when we started, Python was emerging as the lingua franca for AI. 101 00:11:19,990 --> 00:11:25,930 Everyone knew Python, and all the libraries, like TensorFlow and PyTorch, 102 00:11:25,990 --> 00:11:28,990 are libraries within Python. Okay. 103 00:11:30,760 --> 00:11:35,499 So what do you have in these kinds of procedural languages? Out of the many concepts, 104 00:11:35,500 --> 00:11:40,060 there are a few which are relevant here. 105 00:11:40,730 --> 00:11:46,680 First of all, in terms of compute, you have functions and classes, right? 106 00:11:46,690 --> 00:11:50,799 Classes are stateful operators, right? And functions, in general, 107 00:11:50,800 --> 00:11:57,910 if they don't have side effects, are stateless operators. And then you have concurrency, obviously, 108 00:11:57,910 --> 00:12:05,400 and this is in some form provided by Python with asyncio, which allows you to execute 109 00:12:05,410 --> 00:12:10,780 things in parallel, or to overlap, for instance, computation and communication, and other things. 110 00:12:11,140 --> 00:12:13,780 And the other one is much more low level,
but it's important: it is passing, especially for large amounts of data, the ability to pass data by reference, right, instead of by value. 112 00:12:22,420 --> 00:12:25,840 Right? It's far more efficient when you have large amounts of data. Okay. 113 00:12:26,050 --> 00:12:29,140 So these are the things we tried to capture in Ray. 114 00:12:29,710 --> 00:12:34,870 And we encapsulate this in Ray's task-based compute model, as we call it. 115 00:12:35,500 --> 00:12:43,180 And so, again, you are going to take classes and functions and provide the 116 00:12:43,480 --> 00:12:51,880 ability to instantiate these functions and classes transparently, remotely, as tasks and, of course, 117 00:12:52,060 --> 00:12:56,410 actors; right, you know, you've seen a lot of actor languages. 118 00:12:56,860 --> 00:13:03,580 Then we have a shared in-memory distributed object store, which enables passing arguments by reference. 119 00:13:03,910 --> 00:13:12,400 And then we have these futures, in particular distributed futures, which are references to objects: arguments as well as results. 120 00:13:12,820 --> 00:13:19,750 And in some cases these are results which will be created by tasks or actors even before being scheduled, 121 00:13:19,750 --> 00:13:25,570 so they exist at an abstract level: you don't even know where the object is going to be created. 122 00:13:26,350 --> 00:13:31,630 And this obviously enables concurrency, concurrent execution, and parallelism. 123 00:13:32,380 --> 00:13:37,000 Ray Core has a very minimal API. 124 00:13:37,370 --> 00:13:41,769 These are the main functions, and I'm not going to read all of them; 125 00:13:41,770 --> 00:13:44,770 I'm going to demonstrate a few of them with a few examples. 126 00:13:45,940 --> 00:13:54,250 So here is a very trivial example. Say you have a function which computes for one second and returns the result. 127 00:13:54,700 --> 00:13:59,799 If you execute that in traditional Python, and I'm not talking about multiprocessing 128 00:13:59,800 --> 00:14:02,890 Python: you call this function twice, 129 00:14:03,090 --> 00:14:07,660 and it is going to take you two seconds to run, right? Because it's sequential. 130 00:14:09,670 --> 00:14:17,799 Now, what Ray does: you are going to use this decorator, ray.remote, and here is what it's doing under the hood. 131 00:14:17,800 --> 00:14:21,820 When you are going to call this, you are going to call f.remote. 132 00:14:22,420 --> 00:14:26,229 Then Ray is going to take that function, 133 00:14:26,230 --> 00:14:32,139 so here, when we execute that, it's going to submit a task for that function, 134 00:14:32,140 --> 00:14:36,910 which will be scheduled by the system; it can be on a different worker 135 00:14:36,910 --> 00:14:41,620 or a different node. This call is not blocking, okay? 136 00:14:41,860 --> 00:14:49,290 So it's going to return immediately, and it's going to return a reference, a future, to the result. 137 00:14:50,130 --> 00:14:54,340 Okay. Because it's non-blocking, you can execute now the next, 138 00:14:54,720 --> 00:15:04,300 the next call to the function, f.remote, and it is again going to give you back a reference to the result. 139 00:15:04,960 --> 00:15:09,010 And finally you call ray.get to get the results. Now, this is a blocking call.
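[A minimal runnable version of this example, assuming the standard Ray API (ray.init, the ray.remote decorator, .remote() calls, and ray.get):]

    import time
    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(1)   # "compute" for one second
        return 1

    start = time.time()
    refs = [f.remote(), f.remote()]    # two non-blocking submissions; futures return immediately
    print(ray.get(refs), time.time() - start)  # tasks run in parallel: ~1s total, not ~2s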
140 00:15:09,640 --> 00:15:16,360 And under the hood, you know, Ray submitted these tasks to a local scheduler, 141 00:15:16,360 --> 00:15:25,390 and the local scheduler will figure out where to schedule these tasks for each of the functions. 142 00:15:26,050 --> 00:15:32,710 And then, you know, these are going to be scheduled and executed, in this case in parallel, because there is no data dependency. 143 00:15:33,010 --> 00:15:42,820 And you are going to get the results. And because these two tasks are executed in parallel, the entire program is going to 144 00:15:44,400 --> 00:15:49,430 take only one second. Okay. So this is functions. 145 00:15:49,670 --> 00:15:54,710 You also have classes, which are instantiated as actors. 146 00:15:55,040 --> 00:15:59,720 Again, to do that you just need to say ray.remote, and then .remote 147 00:16:00,230 --> 00:16:03,800 when you instantiate the class, and then when you call a method of the class. 148 00:16:04,040 --> 00:16:08,000 By the way, this .remote is actually not needed. 149 00:16:08,300 --> 00:16:13,370 The reason we preserve it in the language is to make the developer, the programmer, 150 00:16:13,370 --> 00:16:18,049 aware that this is going to be, can be, executed remotely, on a different machine, 151 00:16:18,050 --> 00:16:24,350 and it can be slower; it is just to give an extra hint to the developers. 152 00:16:25,040 --> 00:16:39,079 Ray also has the ability, and early on we made this decision, to let the programmer specify resource demands for 153 00:16:39,080 --> 00:16:46,790 a particular function, or for a particular class when it is going to be instantiated, in terms of the number of CPUs, 154 00:16:46,790 --> 00:16:50,360 GPUs, memory requirements, and things like that. 155 00:16:50,840 --> 00:16:59,720 Okay. So that's classes and actors.
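[A sketch of the class-to-actor pattern just described, again assuming the standard Ray API; the Counter class and the num_cpus value are illustrative, not from the talk:]

    import ray

    @ray.remote(num_cpus=1)        # resource demand declared up front, as described
    class Counter:
        def __init__(self):
            self.value = 0         # actor state lives in the actor's own process

        def incr(self):
            self.value += 1
            return self.value

    c = Counter.remote()               # instantiates the actor, possibly on another node
    print(ray.get(c.incr.remote()))    # method call returns a future; ray.get blocks -> 1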
Then, I mentioned that you have a shared in-memory object store. 156 00:17:00,140 --> 00:17:05,450 So this, uh, shared memory, 157 00:17:05,900 --> 00:17:12,230 it's, um, 158 00:17:13,270 --> 00:17:20,700 it is write-once, read-many: you only write and then read, so you don't update. 159 00:17:20,720 --> 00:17:24,820 It doesn't provide the ability to update the data, 160 00:17:24,820 --> 00:17:28,630 and this obviously makes keeping consistency much easier. 161 00:17:29,050 --> 00:17:36,130 Of course, at the programming level you are going to update it, but under the hood you are just going to create a new version, 162 00:17:36,430 --> 00:17:41,020 a new version of the object with the modified data. 163 00:17:41,410 --> 00:17:52,090 Okay. So here is another simple example, where you have two functions, f and g, and, um, 164 00:17:53,240 --> 00:17:58,130 you call f, and then you pass the result from f to g. 165 00:17:58,880 --> 00:18:04,460 Okay. So what happens here is: you first call f.remote; 166 00:18:04,490 --> 00:18:13,490 again, it's a non-blocking call. Eventually Ray is going to instantiate the task associated with f, 167 00:18:14,120 --> 00:18:19,400 and the return is a reference to the result of f, this distributed future, 168 00:18:19,700 --> 00:18:23,990 and that is going to be passed to g. 169 00:18:24,440 --> 00:18:32,450 Right. And when you call g.remote, Ray is going to instantiate a task for g, 170 00:18:32,810 --> 00:18:39,860 maybe on another worker, on another node, and return the reference, the distributed future, for g. 171 00:18:39,860 --> 00:18:42,620 And then you can get the result. 172 00:18:43,070 --> 00:18:52,129 So now you have a dependency between f and g; it's a data dependency. Ray knows that, and basically, before executing g, 173 00:18:52,130 --> 00:18:57,390 it waits for the reference to be resolved, which means that f should 174 00:18:57,390 --> 00:19:01,550 finish and produce the result, in this case x. 175 00:19:01,940 --> 00:19:09,710 And then, now that x is available, Ray will transfer x to the node which runs g, 176 00:19:10,070 --> 00:19:13,850 and now you are going to finally run g. 177 00:19:14,420 --> 00:19:25,850 The main point here is that in this particular case x is transferred only once, between node one and node two, which run f and g respectively. 178 00:19:26,240 --> 00:19:34,250 And the reason I'm giving this example is because I want to contrast it with remote procedure calls. 179 00:19:34,340 --> 00:19:38,870 Right. A remote procedure call is again a way to run a function remotely, 180 00:19:38,930 --> 00:19:47,280 and you have that in many languages and so forth. But in most remote procedure call implementations, the data, 181 00:19:47,290 --> 00:19:50,840 the arguments, are passed by value, right? 182 00:19:51,410 --> 00:19:55,469 And in this particular case, for instance, you would have to execute f, 183 00:19:55,470 --> 00:20:01,400 and then you are going to get the value x back at the caller, 184 00:20:01,910 --> 00:20:05,840 just to pass it to g, okay? 185 00:20:06,470 --> 00:20:10,220 And then g will be executed on 186 00:20:10,220 --> 00:20:16,340 node two. And in this particular case, as you can see, you have two transfers of x in the system instead of one. 187 00:20:16,790 --> 00:20:25,579 But this, again, you knew: that's the advantage of using references, passing the argument by reference to avoid extra copies. 188 00:20:25,580 --> 00:20:28,700 And that's true on a single node, 189 00:20:28,730 --> 00:20:32,450 and it's also true, obviously, for a distributed system.
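[The f-and-g example above as runnable code, assuming the standard Ray API; the function bodies are illustrative:]

    import ray

    @ray.remote
    def f():
        return list(range(1_000_000))  # a large result, kept in the object store

    @ray.remote
    def g(x):                          # receives the resolved object, not the future
        return sum(x)

    x_ref = f.remote()       # non-blocking; returns a distributed future for x
    y_ref = g.remote(x_ref)  # pass by reference: x is shipped at most once,
                             # directly to g's node, never back through the caller
    print(ray.get(y_ref))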
190 00:20:33,620 --> 00:20:42,949 Okay. Now I will also try to mention some ongoing work, and obviously a lot of the ongoing work 191 00:20:42,950 --> 00:20:46,730 is about improving support for GPUs. What we are witnessing today: 192 00:20:47,030 --> 00:20:55,520 we have already moved from a CPU-dominated compute infrastructure to a GPU- or accelerator-dominated infrastructure. 193 00:20:56,210 --> 00:21:02,330 And we are doing a lot of work in Ray to reduce the overhead of GPU-to-GPU communications, 194 00:21:02,810 --> 00:21:07,730 and implementing different collective operations and so forth. 195 00:21:08,000 --> 00:21:17,300 This is one example I can give you, about GPU-to-GPU communication in the original Ray. 196 00:21:17,810 --> 00:21:21,170 Remember, we have this object store, 197 00:21:21,170 --> 00:21:24,470 and the object store is implemented in CPU memory. 198 00:21:24,980 --> 00:21:31,160 So what happens if we want to transfer some data, like a tensor, from one GPU to another? 199 00:21:31,430 --> 00:21:37,249 What happens is that this tensor is moved from the GPU memory to the local RAM of that node, 200 00:21:37,250 --> 00:21:43,370 then it goes through the object store to the RAM of the receiver, 201 00:21:43,730 --> 00:21:49,820 and then from the CPU memory of the receiver it is transferred into the GPU memory. 202 00:21:50,330 --> 00:21:59,300 So it's quite slow, right? There are a lot of copies, a lot of RPC calls. 203 00:21:59,540 --> 00:22:08,540 So what we want, obviously, is direct GPU-to-GPU communication, with memory statically allocated and everything optimised. 204 00:22:08,960 --> 00:22:15,560 One thing I wanted to show you is this figure, because initially, when we did this, and this is still experimental, 205 00:22:15,800 --> 00:22:22,040 we assumed that everyone would only use direct transfer between GPU zero and GPU one. 206 00:22:22,400 --> 00:22:25,850 But there is one case, it turns out, where, surprisingly, people preferred 207 00:22:26,150 --> 00:22:29,900 the one on the left, 208 00:22:29,900 --> 00:22:35,000 the traditional way of transferring the data through the CPU memory. 209 00:22:35,300 --> 00:22:40,730 Do you know what is the case in which you may still prefer 210 00:22:41,850 --> 00:22:52,520 the left, the slow transfer, over the right one? It turns out that when you do GPU-to-GPU transfer, it's a synchronous operation. 211 00:22:53,240 --> 00:22:57,440 Okay? You also need to synchronise the sender and the receivers: 212 00:22:57,440 --> 00:23:03,950 you need to allocate the memory on the receiving GPU to be ready to receive the data from the sender. 213 00:23:04,700 --> 00:23:09,410 Okay. Now, in the other case, it's more asynchronous: 214 00:23:09,800 --> 00:23:16,490 you only need to allocate the receiving GPU 215 00:23:16,950 --> 00:23:28,250 memory when you actually need the tensor; until then, the sender holds the tensor on its local machine, in CPU memory. 216 00:23:29,000 --> 00:23:32,930 So the amount of time for which you keep the GPU memory allocated 217 00:23:33,970 --> 00:23:43,350 to get the tensor is going to be higher when you have direct GPU-to-GPU communication than when you go through the slow CPU memory, right? 218 00:23:43,360 --> 00:23:51,970 Again, because it's synchronous. So if your GPU memory is extremely at a premium and is the bottleneck, you may prefer the left-hand side. 219 00:23:54,470 --> 00:24:01,610 Okay. So, like I mentioned to you, Ray is very flexible; 220 00:24:01,970 --> 00:24:05,510 it also provides support for very heterogeneous infrastructure. 221 00:24:05,900 --> 00:24:11,719 And here you have, you know, batch inference, where you read data, 222 00:24:11,720 --> 00:24:15,980 you pre-process the data, you do the inference, and you write the data. 223 00:24:16,520 --> 00:24:24,559 And here the data pre-processing and post-processing can run on CPUs, while the inference runs on GPUs. 224 00:24:24,560 --> 00:24:30,440 And by doing so you can save a lot of cost, instead of running everything on GPUs. 225 00:24:30,710 --> 00:24:35,930 And there are many workloads you can support here, multimodal search and so forth, 226 00:24:36,440 --> 00:24:42,979 fine-tuning, embedding computation.
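[A sketch of the batch-inference pattern just described, assuming the Ray Data API (ray.data, map_batches); the toy dataset and toy "model" are stand-ins, and exact parameters vary across Ray versions:]

    import ray

    ds = ray.data.from_items([{"text": " Hello "}, {"text": "WORLD"}])  # stand-in for a real read

    def preprocess(batch):               # runs on cheap CPU workers
        batch["text"] = batch["text"].str.strip().str.lower()
        return batch

    class Predictor:                     # stateful worker; in practice holds the model
        def __init__(self):
            self.positive = {"hello"}    # toy "model"
        def __call__(self, batch):       # in practice runs on GPUs (e.g. add num_gpus=1)
            batch["label"] = [t in self.positive for t in batch["text"]]
            return batch

    preds = (ds.map_batches(preprocess, batch_format="pandas")
               .map_batches(Predictor, concurrency=2, batch_format="pandas"))
    print(preds.take_all())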
Um, Ray is also emerging as a standard, you know, 227 00:24:42,980 --> 00:24:48,320 well, not a standard, but the framework of choice for post-training. 228 00:24:48,590 --> 00:24:52,370 There are a lot of post-training frameworks that have been developed over the past year, 229 00:24:52,640 --> 00:24:59,270 many of them using Ray, and also using vLLM, the inference system also developed at Berkeley. 230 00:24:59,630 --> 00:25:03,980 Okay. So, impact: it has been a very popular framework. Actually, 231 00:25:03,980 --> 00:25:12,800 it was used by OpenAI in their training infrastructure to train ChatGPT-4, 232 00:25:13,010 --> 00:25:23,270 and others, and by many other companies to build their AI infrastructure. Very fast growth, especially in the past year. 233 00:25:23,540 --> 00:25:31,610 This is the number of downloads, and these are the GitHub stars, compared with other popular distributed frameworks. 234 00:25:31,910 --> 00:25:35,540 So you can see the growth is accelerating. 235 00:25:36,530 --> 00:25:41,060 So you saw Ray; I have two other projects, and I have to finish quickly. 236 00:25:41,420 --> 00:25:45,500 So next is vLLM. Okay. So vLLM, it's about inference. 237 00:25:47,090 --> 00:25:51,799 And, you know, so what changed? 238 00:25:51,800 --> 00:25:55,940 What was the trend? The trend, obviously, was large language models. 239 00:25:55,940 --> 00:25:59,690 So what was different about these models from the point of view of training and inference? 240 00:26:00,020 --> 00:26:06,200 It used to be that in traditional machine learning, for every task, for every use case, 241 00:26:06,200 --> 00:26:09,710 you are going to train a model, and then you are going to serve that model. 242 00:26:09,800 --> 00:26:18,680 In contrast, large language models are much more general, and they can do multiple tasks. 243 00:26:18,980 --> 00:26:23,090 So you train a large language model, and then you can use it for multiple tasks: 244 00:26:23,090 --> 00:26:26,900 you train once, and you do the inference for many, many tasks. 245 00:26:26,930 --> 00:26:31,850 Okay. So that means that inference actually becomes even more important. 246 00:26:32,330 --> 00:26:38,570 And a lot of this was happening by 2023; 247 00:26:38,780 --> 00:26:50,570 this is when we started this research. And at that time it was very expensive to serve these models. 248 00:26:51,260 --> 00:26:58,310 Okay. If you remember, each GPU could only serve a handful of requests per second. 249 00:26:58,580 --> 00:27:05,540 That was the A100; if you remember, that was the top GPU in 2023. 250 00:27:05,720 --> 00:27:13,520 Um, okay. And so, you know, the cost per request was very high. 251 00:27:14,390 --> 00:27:24,140 And why is that? Everyone here knows this, so I'm going to go quickly: it's because, as you know, these models are autoregressive models, 252 00:27:24,380 --> 00:27:30,620 and when they generate a token, they generate the new token based on all the previous tokens. 253 00:27:31,130 --> 00:27:35,570 Okay. So you have these strong dependencies, this attention mechanism. 254 00:27:36,080 --> 00:27:47,570 And so, generating one token at a time, because you have the dependency, doesn't exploit the GPU parallelism. 255 00:27:47,870 --> 00:27:54,560 Right.
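[A toy illustration of the autoregressive dependency just described; the "model" here is a stand-in, not a real LLM:]

    import random

    def toy_model(tokens):                 # stand-in for one transformer forward pass
        random.seed(sum(tokens))           # pretend the logits depend on the whole prefix
        return [random.random() for _ in range(100)]

    tokens = [1, 2, 3]                     # the prompt
    for _ in range(10):                    # strictly one token per forward pass
        logits = toy_model(tokens)         # must attend over all previous tokens
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    print(tokens)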
The GPUs are fast and have high, you know, compute capabilities, 256 00:27:54,830 --> 00:27:58,740 but they deliver that only when you can exploit their parallelism. Right? 257 00:27:58,830 --> 00:28:02,340 But now you cannot exploit it, because of these dependencies, for one request. 258 00:28:02,700 --> 00:28:07,200 So of course, if you cannot exploit it for one request, then what is the solution? 259 00:28:08,190 --> 00:28:12,450 Well, I'm going to try to serve multiple requests at the same time. 260 00:28:13,140 --> 00:28:16,710 Right. So you batch the requests: you are going to process a batch of requests. 261 00:28:17,910 --> 00:28:21,389 The problem is that now you are going to run out of memory; 262 00:28:21,390 --> 00:28:24,450 the memory becomes a bottleneck. So why is that? 263 00:28:24,780 --> 00:28:32,669 Well, because, again, of this dependency: to generate the next token you need to store state in memory. 264 00:28:32,670 --> 00:28:36,660 And this is how the memory is organised. You have the A100 with 40 GB. 265 00:28:36,940 --> 00:28:40,850 Then there are the weights of, at that time, a 13-billion-parameter LLM; 266 00:28:41,100 --> 00:28:46,860 that was a LLaMA model, and the precision at that time was two bytes per parameter. 267 00:28:47,100 --> 00:28:52,379 So you have 26 GB, which is 65% of the entire GPU memory. 268 00:28:52,380 --> 00:28:57,510 And then you have the activations, a few percent, and then the KV cache. 269 00:28:57,540 --> 00:29:03,390 So, okay, we need to keep this data, the temporary data, to generate the next token. 270 00:29:03,630 --> 00:29:10,680 In this KV cache you need to keep the embeddings, you know, for the entire prefix, right, 271 00:29:10,680 --> 00:29:15,920 for each request. Right. And each token, each embedding, was like one megabyte, 272 00:29:15,930 --> 00:29:19,200 so one request can need several gigabytes. Okay. 273 00:29:19,560 --> 00:29:23,370 So that's a problem. Okay.
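[The memory arithmetic above, spelled out; the layer and hidden-size figures are the published LLaMA-13B ones, consistent with the roughly one megabyte per token mentioned in the talk:]

    gpu_memory = 40e9                       # A100: 40 GB
    weights = 13e9 * 2                      # 13B parameters x 2 bytes (fp16) = 26 GB
    print(weights / gpu_memory)             # ~0.65, i.e. 65% of the GPU memory

    layers, hidden = 40, 5120               # LLaMA-13B architecture
    kv_per_token = 2 * layers * hidden * 2  # keys + values, fp16: ~0.8 MB per token
    print(kv_per_token / 1e6)               # so a 2048-token request needs ~1.7 GB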
274 00:29:25,280 --> 00:29:36,290 And when we looked, actually, this was compounded by the fact that, before, the memory was contiguously allocated. 275 00:29:36,430 --> 00:29:42,050 That's the easiest way to allocate memory: you contiguously allocate the memory you are going to need. 276 00:29:42,320 --> 00:29:45,469 Right. And, because it's contiguous, 277 00:29:45,470 --> 00:29:49,480 you also need to provide space for the future tokens. 278 00:29:49,850 --> 00:29:56,929 Okay. Of course, you know, if you took Systems 101, you know 279 00:29:56,930 --> 00:30:02,300 that contiguous memory allocation can lead to a lot of fragmentation. 280 00:30:02,750 --> 00:30:09,580 You can get internal fragmentation, because I am going to allocate space for a request, 281 00:30:09,590 --> 00:30:12,830 but I don't know, eventually, the length of the output. Right. 282 00:30:12,920 --> 00:30:16,880 And because I don't know, I'm probably going to allocate more than I should. 283 00:30:17,060 --> 00:30:23,270 So you have fragmentation there. You also have, um, and this is an interesting one, 284 00:30:23,930 --> 00:30:31,910 the interesting one is the following. Let's say that you know exactly how long the output is, okay? 285 00:30:32,090 --> 00:30:38,690 So then I can allocate exactly a block of memory to fit all the output and no more. 286 00:30:39,170 --> 00:30:43,220 Well, it turns out that even this is not optimal. So why is it not optimal? 287 00:30:43,490 --> 00:30:46,580 Say I have to allocate 1,000 token slots for the output. 288 00:30:46,940 --> 00:30:50,720 Now, these slots are going to be filled sequentially, one by one, right? 289 00:30:50,960 --> 00:30:59,090 So initially I have 1,000 slots free; I generate the next token, and now one is used, 999 are free, and so forth. 290 00:30:59,480 --> 00:31:07,730 So during the process, you know, dynamically, I am going to waste a lot of token slots, space which is allocated for the future, 291 00:31:08,090 --> 00:31:13,549 right, which is not yet used. Those slots are going to be used eventually, but for a long time they are not used. 292 00:31:13,550 --> 00:31:18,080 So this is a kind of reservation waste. And there is external fragmentation too; 293 00:31:18,080 --> 00:31:22,130 I'm not going to talk much about that. 294 00:31:22,540 --> 00:31:27,080 And at that stage, this was how the memory was used. 295 00:31:27,470 --> 00:31:32,420 The only thing you need to look at here is the green part; 296 00:31:32,480 --> 00:31:36,470 the other colours are the different kinds of fragmentation. 297 00:31:36,650 --> 00:31:41,209 The green is the effective memory which was used by 298 00:31:41,210 --> 00:31:49,790 the inference systems at that time, and for most of it, the largest is 38%. Where you see 38%, 299 00:31:49,790 --> 00:31:57,019 you have this kind of oracle. The oracle is what I told you: assuming that you know the future, 300 00:31:57,020 --> 00:32:03,380 how long the request is, still you are going to effectively use only 38% of the memory. If you don't know the future, 301 00:32:03,390 --> 00:32:07,610 the best at that time was 26.8%. The rest was basically wasted. 302 00:32:09,140 --> 00:32:16,010 So, you know, what is the solution? You have fragmentation and similar challenges with segments. 303 00:32:16,010 --> 00:32:21,510 If you remember your operating systems course: what do you do there? Paging. 304 00:32:21,510 --> 00:32:27,270 You have paging, with pages of fixed size, you know, a few kilobytes and so forth, 305 00:32:27,870 --> 00:32:33,659 and then there is no external fragmentation between these pages. 306 00:32:33,660 --> 00:32:46,530 And then you still provide to the application the illusion of, you know, a nice address space, which is contiguous, 307 00:32:46,800 --> 00:32:50,700 but this address space is going to be mapped to different pages under the hood. 308 00:32:50,790 --> 00:32:56,930 Right? This is exactly what happens here. So this is OS memory management: you have the processes, 309 00:32:56,940 --> 00:33:00,959 and they have their virtual memory, which is contiguous for each of them, 310 00:33:00,960 --> 00:33:08,670 and the virtual memory under the hood is mapped to these pages, which are all over the place in the physical memory. 311 00:33:09,150 --> 00:33:15,240 The same thing we are going to do here. These are the logical KV blocks, 312 00:33:15,240 --> 00:33:18,480 so everything is contiguous, each block after another, 313 00:33:18,930 --> 00:33:27,750 but in the physical memory, the different blocks are going to be spread across different locations.
314 00:33:27,990 --> 00:33:36,960 A block is of fixed size and consists of the KV entries of several tokens, in this case four tokens. 315 00:33:37,560 --> 00:33:44,890 Okay. So then, how do you go from the logical to the physical block space? 316 00:33:44,910 --> 00:33:50,640 Well, you have a mapping: you have a table which indexes 317 00:33:51,330 --> 00:33:54,720 each logical block to a physical block in memory. 318 00:33:54,900 --> 00:34:05,940 And then you have another field, the "filled" field here, which basically says how much of that block has been filled with tokens. 319 00:34:06,420 --> 00:34:09,420 Okay. So in this case, for block one it is two. 320 00:34:09,840 --> 00:34:13,290 And now you are going to generate the next token, in this case, 321 00:34:13,290 --> 00:34:18,350 and you are going to put it in the corresponding block in the physical memory, and so forth. 322 00:34:18,360 --> 00:34:25,140 And when the block is full, you allocate another block in the physical memory. 323 00:34:25,440 --> 00:34:36,000 Right. So it's kind of simple. And again, this also allows, if you have multiple, you know, 324 00:34:36,020 --> 00:34:41,250 requests and so forth that share the same prefix, to share that prefix more efficiently. 325 00:34:41,790 --> 00:34:51,090 So, you know, you can do parallel sampling, or multi-turn, when you have multiple conversations, or anything like that.
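[A toy version of the logical-to-physical block mapping just described; illustrative only, not vLLM's actual data structures:]

    BLOCK_SIZE = 4                          # tokens per KV block, as in the example

    free_blocks = list(range(100))          # ids of free physical blocks
    physical = {}                           # physical block id -> per-token KV entries

    class BlockTable:                       # one per request
        def __init__(self):
            self.blocks = []                # logical block index -> physical block id
            self.filled = 0                 # how many token slots are used

        def append(self, kv):
            if self.filled % BLOCK_SIZE == 0:        # last block full (or none yet):
                bid = free_blocks.pop()              # grab any free physical block
                self.blocks.append(bid)
                physical[bid] = []
            physical[self.blocks[-1]].append(kv)     # fill the next slot
            self.filled += 1

    req = BlockTable()
    for t in range(6):                      # 6 tokens -> two blocks, second half full
        req.append(("k%d" % t, "v%d" % t))
    print(req.blocks, req.filled)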
326 00:34:51,210 --> 00:34:58,950 Okay. Now let me tell you, because it's interesting: it's very much like paging in, kind of, 327 00:34:58,980 --> 00:35:05,879 virtual memory in an operating system, but there are also a few differences. 328 00:35:05,880 --> 00:35:11,010 So, similarities: the KV blocks are very similar to pages. 329 00:35:11,340 --> 00:35:18,270 And you can share pages across processes; similarly here, if you have common prefixes for the, 330 00:35:18,270 --> 00:35:24,479 for the requests, or if you have a system prompt: the system prompt is used for every prompt, right, 331 00:35:24,480 --> 00:35:28,890 so it can be shared across the prompts of all the requests. 332 00:35:29,730 --> 00:35:32,400 But there are some differences. Okay. 333 00:35:34,080 --> 00:35:42,240 So, when you talk about evictions: when you run out of physical memory space, what do you do? 334 00:35:42,270 --> 00:35:46,200 You evict pages, right? To make room to bring a new page in. 335 00:35:46,470 --> 00:35:49,770 Right. And the evicted page is typically stored on, 336 00:35:50,100 --> 00:35:57,150 you know, swap space on the disk, and then you can bring it back later: 337 00:35:57,540 --> 00:36:07,949 a slower storage, but a bigger storage. So in this case there are two differences. You see, because of the dependency, you know, 338 00:36:07,950 --> 00:36:14,040 when you need to generate the next token, you need all the tokens before that. 339 00:36:14,340 --> 00:36:21,510 Okay? So it doesn't make sense to just page out one block; 340 00:36:21,810 --> 00:36:25,260 you have to page out all of them. Right. Okay, 341 00:36:25,410 --> 00:36:29,430 because you still need all of them, and you are going to need them to get the next token, right? 342 00:36:29,460 --> 00:36:33,030 So you do it, I think you can call it, at request level: 343 00:36:33,330 --> 00:36:39,750 request-level eviction versus page-level eviction, which will free more memory for the other requests. The other difference is also interesting. 344 00:36:40,080 --> 00:36:48,300 Like I said, in the operating system you do page-in, page-out, and on page-out you store the page on the disk. 345 00:36:48,930 --> 00:36:56,520 But here, because, you know, the GPUs have so much compute, 346 00:36:56,760 --> 00:37:03,629 sometimes it's more efficient to throw away the block and to 347 00:37:03,630 --> 00:37:05,370 recompute it, to rematerialise it when you need it, 348 00:37:05,880 --> 00:37:12,720 because the recomputation is faster than storing it, in this case in CPU memory, and bringing it back. 349 00:37:13,170 --> 00:37:17,790 Okay, so these are some of the differences. So what was the result? 350 00:37:17,850 --> 00:37:25,170 The result we got was 96% in terms of memory utilisation at that time. 351 00:37:25,350 --> 00:37:28,560 And this converts directly into throughput. Right. 352 00:37:28,800 --> 00:37:32,280 So 96%, compared to this 26.8%, 353 00:37:32,580 --> 00:37:37,020 is almost four times higher throughput, because you can now fit in a batch, 354 00:37:37,020 --> 00:37:40,800 and process at the same time, four times more requests. 355 00:37:41,130 --> 00:37:45,870 Okay. And this is actually one of the fastest-growing 356 00:37:46,970 --> 00:37:52,240 projects I have seen or was part of. This is the GitHub star history of vLLM. 357 00:37:52,610 --> 00:37:59,480 SGLang, unfortunately, I don't have time to go over; it's another project we developed in the lab. 358 00:38:00,020 --> 00:38:03,650 It was a top open-source project on GitHub in 2025. 359 00:38:03,770 --> 00:38:07,190 And I also wanted to show how hard these things are in practice. 360 00:38:07,550 --> 00:38:10,940 Um, and this is about, uh, you know, 361 00:38:10,940 --> 00:38:17,149 what is shown here: the number of models which were supported, 362 00:38:17,150 --> 00:38:22,400 as of September last year, by vLLM, and the number of accelerators it supports. 363 00:38:22,520 --> 00:38:27,950 It's 260 accelerators, and, I don't know, it's like 500 model architectures. 364 00:38:28,130 --> 00:38:33,320 It's very, very complicated, very complex. Um, okay. 365 00:38:33,440 --> 00:38:38,059 So, yeah, what I want to say here is about PagedAttention: 366 00:38:38,060 --> 00:38:42,680 this technique, the artefact behind vLLM, PagedAttention, 367 00:38:42,680 --> 00:38:49,339 is used now by all of these inference systems in the industry, not only vLLM. 368 00:38:49,340 --> 00:38:53,059 So it's nice to see that. So, the final one, and it'll take me ten minutes, 369 00:38:53,060 --> 00:38:58,570 so this is fine. It's Chatbot Arena: rethinking LLM 370 00:38:58,610 --> 00:39:06,169 evaluation. So this is a different one, and, you know, I think it's a nice study, and it's the end of the talk, 371 00:39:06,170 --> 00:39:09,140 so hopefully it's going to be more engaging. 372 00:39:09,410 --> 00:39:20,630 So, look: this is David Patterson, the Turing Award winner, and he has this famous quote: for better or worse, benchmarks shape a field. 373 00:39:20,900 --> 00:39:29,059 Right. And this is absolutely true about AI. You have ImageNet and CIFAR and all of these early benchmarks.
374 00:39:29,060 --> 00:39:35,660 And you have more and more benchmarks now, right, to compare and to see which models are better. 375 00:39:36,230 --> 00:39:49,040 So here is our story. When Meta open-sourced its set of models, the open-source LLaMA, in 2023, 376 00:39:49,040 --> 00:39:52,309 in February, a group of students 377 00:39:52,310 --> 00:39:59,209 then took that model and took this ShareGPT data; ShareGPT is a dataset of the prompts 378 00:39:59,210 --> 00:40:08,300 and the results that people were sharing over the web, and they used that to fine-tune, right, to further train this 379 00:40:08,330 --> 00:40:12,710 LLaMA model, and they called the model Vicuna. 380 00:40:13,130 --> 00:40:20,110 Okay. And this was high-quality data, ShareGPT, because people only share the prompts and the results 381 00:40:20,270 --> 00:40:23,390 they are, you know, excited about, 382 00:40:23,480 --> 00:40:24,440 that they are proud about. 383 00:40:24,860 --> 00:40:33,920 Okay. And the trend here, which we confirmed, is the rise of chatbots, because you are going to do Q&A via a chatbot interface. 384 00:40:34,130 --> 00:40:38,540 Those were the ChatGPT early days. And the question 385 00:40:38,540 --> 00:40:44,989 is about how you are going to evaluate these models. You had these static benchmarks at that time; 386 00:40:44,990 --> 00:40:48,200 I mean, you had many of them, but they are static. 387 00:40:48,650 --> 00:40:52,760 Static means they are also prone to contamination. By contamination, 388 00:40:52,760 --> 00:41:00,350 what I mean is that these large language models, as you know, are trained on all the data which is available on the internet, 389 00:41:00,350 --> 00:41:04,040 and these benchmarks are available on the internet. So models are going to be trained on these benchmarks, 390 00:41:04,250 --> 00:41:09,530 and then they are going to be evaluated on the same benchmarks. And static benchmarks also do not capture human preferences. 391 00:41:09,980 --> 00:41:13,490 Right? It's not only about correctness; it's about the style and so forth, too. 392 00:41:13,790 --> 00:41:18,559 So that's kind of the problem, okay? So how do you do it? Now, about contamination, 393 00:41:18,560 --> 00:41:27,530 here is an example to show you that it's real. This is some message I think was on Twitter. 394 00:41:28,190 --> 00:41:41,240 Um, so, Codeforces. Right. It turns out that GPT was solving ten out of ten correctly, you know, for problems from before 2021, and zero out of ten after that. 395 00:41:41,600 --> 00:41:48,330 And that was, incidentally, the cut-off date for the data used to train GPT, right? 396 00:41:48,350 --> 00:41:54,080 Of course: if you are trained on the solutions, you are going to remember all these solutions. 397 00:41:54,320 --> 00:41:57,590 So this is a real problem. So you need human evaluation. 398 00:41:57,590 --> 00:42:05,380 So we did it. What do you do as faculty, right? Okay, so you get some students, you give them some pizza, and you say: okay, ask, 399 00:42:05,390 --> 00:42:09,200 and here are the prompts, the different models, and see which is better, okay? 400 00:42:09,920 --> 00:42:13,520 So we realised that this doesn't really scale, okay? 401 00:42:14,060 --> 00:42:20,450 And why doesn't it scale? Because the answers and the evaluations take time; the answers are not obvious. 402 00:42:20,450 --> 00:42:24,290 It's not like a math problem, where
this is good, and that's it. And here are two examples. 403 00:42:25,220 --> 00:42:30,590 One of them, the question is: develop a Python program that reads all the text files 404 00:42:31,850 --> 00:42:37,130 under a directory and returns the top five words with the most number of occurrences. 405 00:42:38,030 --> 00:42:43,410 Okay. You know, so everyone here can do that. 406 00:42:44,100 --> 00:42:47,830 And I'm not going to ask you to program it, no, that's not the question. 407 00:42:47,850 --> 00:42:54,240 But I am going to give you two solutions, provided by two large language models, two chatbots, 408 00:42:54,810 --> 00:43:02,790 and I'm going to ask you which one is better. How many of you think it's 409 00:43:04,390 --> 00:43:09,430 assistant A? 410 00:43:09,710 --> 00:43:21,020 And what about assistant B? Okay. Now, I know that Oxford is very strong in all the other departments too, 411 00:43:21,620 --> 00:43:26,420 so here is a biology one. Photosynthesis is a vital 412 00:43:27,640 --> 00:43:33,430 process for life on Earth. Could you outline the two main stages of photosynthesis, 413 00:43:33,760 --> 00:43:40,090 including where they take place within the chloroplast, and the primary inputs and outputs for each stage? 414 00:43:42,510 --> 00:43:49,710 Okay. I'm not going to ask you to answer it; I'm going to give you two answers, and I'm going to ask you again which one is better. 415 00:43:59,860 --> 00:44:03,830 How many say A? How many say B? 416 00:44:05,660 --> 00:44:14,840 Okay. So now we have a challenge, you know, on our hands: how are we going to scale this? 417 00:44:15,060 --> 00:44:23,370 Right. It turns out, well, what we did is: GPT-4 was released two weeks before, right? 418 00:44:23,730 --> 00:44:32,790 And, you know, students are very inventive; they said, why don't you use this? 419 00:44:32,820 --> 00:44:36,350 Use GPT-4 420 00:44:37,050 --> 00:44:43,260 as the judge, right? Say: okay, GPT, this is the question, these are the answers; you know, tell me which one is the best. 421 00:44:43,380 --> 00:44:48,180 And each answer was graded from 1 to 10, or something like that. 422 00:44:49,370 --> 00:44:53,540 Okay. And that was, you know, LLM-as-a-judge, right? 423 00:44:53,570 --> 00:44:57,469 And there is, you know, a paper about that and things like that. 424 00:44:57,470 --> 00:45:03,500 So this is, I think, the first use of an LLM as a judge in the open; 425 00:45:03,500 --> 00:45:08,450 I think Microsoft did it when they did the internal evaluation of GPT, 426 00:45:09,110 --> 00:45:12,410 you know, they had a very strong relationship at that time with OpenAI, 427 00:45:13,100 --> 00:45:16,520 but this was the first application in the open. Okay. 428 00:45:17,060 --> 00:45:23,930 And we applied it. And, you know, for instance, for the first question, the Python question, it turns out 429 00:45:24,290 --> 00:45:32,570 that assistant A didn't handle case sensitivity, punctuation, and so forth, 430 00:45:32,750 --> 00:45:43,110 so in this case B was better. And in the second case, while A's answer looks much more comprehensive and detailed, 431 00:45:43,170 --> 00:45:48,900 actually, again, B is better, because A confuses which are the inputs and outputs of the two stages. 432 00:45:50,280 --> 00:45:53,990 Okay. So when you look at it with this explanation, now it's much easier, right?
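[For reference, a sketch of a correct answer to the first question above; note the lower-casing and punctuation handling that assistant A reportedly missed:]

    import os, re
    from collections import Counter

    def top_words(root, n=5):
        counts = Counter()
        for dirpath, _, files in os.walk(root):         # all text files under root
            for name in files:
                if name.endswith(".txt"):
                    with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                        # lower-case and strip punctuation so "Word" == "word!"
                        counts.update(re.findall(r"[a-z']+", fh.read().lower()))
        return counts.most_common(n)

    print(top_words("."))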
433 00:45:54,000 --> 00:45:58,230 It's much easier to do, okay? So we did that. 434 00:45:58,290 --> 00:46:01,920 But people came back to us and said: hey, yeah, 435 00:46:01,980 --> 00:46:08,940 and this was very early on, it sounds good, but these are anecdotes, you know, anecdotal data points. 436 00:46:09,300 --> 00:46:15,060 But really, how does it perform? You are back at square one: you have to have human evaluation to compare against. 437 00:46:15,330 --> 00:46:18,810 Right. And, again, how to scale it? 438 00:46:19,260 --> 00:46:25,350 So, human evaluation: ideally, for every question you want to rank the LLMs. 439 00:46:25,590 --> 00:46:29,700 I have one question, I have the answers from all of them: rank them, right? 440 00:46:29,700 --> 00:46:38,400 And then you decide which is better. Now, we know that full ranking is hard, and it's easier to pick the best out of n, right? 441 00:46:38,760 --> 00:46:42,540 But even that is hard; you know, this paradox of choice. 442 00:46:42,720 --> 00:46:46,980 So the easiest thing is to pick the best of two: 443 00:46:47,340 --> 00:46:52,530 you cannot do, you know, less than that; it's the least you can do. 444 00:46:52,740 --> 00:46:58,280 So that's the first angle. Okay: pick the best answer between two LLMs. 445 00:46:58,290 --> 00:47:03,180 And now there are many ways you can do it. One: you organise a tournament, right, 446 00:47:03,600 --> 00:47:07,260 for each question, and so forth. 447 00:47:07,920 --> 00:47:11,910 The problem is that it doesn't scale, and it doesn't scale for many, many reasons. 448 00:47:12,270 --> 00:47:16,059 One is because you have to play n-squared games, right: 449 00:47:16,060 --> 00:47:20,430 everyone needs to play everyone. But the other, the bigger problem, is that it's static. 450 00:47:21,580 --> 00:47:25,760 Right. You cannot enter a new team into the competition, right, 451 00:47:25,780 --> 00:47:34,659 unless it ends. And this doesn't fit a world with, you know, new models being released, you know, every month, 452 00:47:34,660 --> 00:47:39,010 every week even. Okay. But fortunately, there is another way to do it, 453 00:47:39,010 --> 00:47:45,030 and this is our key idea here: you know, ratings, right? 454 00:47:45,040 --> 00:47:49,829 Like in chess. And there are many other similar ratings in other sports; 455 00:47:49,830 --> 00:47:53,700 almost every sport has one, in which you can have a meaningful rating 456 00:47:54,000 --> 00:48:00,440 with players not playing everyone against everyone. And it takes into account the strength of the opponent: 457 00:48:00,780 --> 00:48:03,930 if you are going to beat a stronger opponent, you are going to get a bigger bump; 458 00:48:04,170 --> 00:48:07,800 if you are going to beat a weaker opponent, you may get a very little bump. 459 00:48:08,010 --> 00:48:12,470 Okay, so that's the basic idea. So this is what we developed; this is Chatbot Arena. 460 00:48:12,480 --> 00:48:16,530 So what we did: we provided this kind of interface to people. 461 00:48:17,130 --> 00:48:20,940 We provided these models for free: you ask questions, you get two answers. 462 00:48:20,970 --> 00:48:25,200 This is, you know, the very early days, so this question is very simplistic, 463 00:48:25,200 --> 00:48:33,330 and the answers in this example are also simplistic. And then people can say A is better, B is better, it's a tie, or both are bad. 464 00:48:33,900 --> 00:48:38,270 And then you compute a rating. We use a,
465 00:48:38,280 --> 00:48:41,840 sorry, we use a Bradley-Terry model 466 00:48:43,190 --> 00:48:50,590 to compute this rating, the Arena rating, and we rank these models with proper confidence intervals, 467 00:48:50,610 --> 00:48:54,920 something like that. Okay. So that's what we've done. Okay. 468 00:48:54,930 --> 00:49:01,260 There are many categories here for which you can rank these models; 469 00:49:01,500 --> 00:49:04,650 we had a lot of them. Uh, yeah. 470 00:49:04,680 --> 00:49:13,770 And, uh, let me just, okay. Now you have user evaluation, and you have LLM-as-a-judge, right? 471 00:49:13,890 --> 00:49:20,820 So now you can do a proper study. And this was actually in the same paper, the LLM-as-a-judge paper. 472 00:49:21,300 --> 00:49:33,120 And what did we find? The findings are that LLMs are biased when they make judgements, very much like humans. 473 00:49:33,450 --> 00:49:40,200 You have, for instance, positional bias: they prefer the first answer. If you give them a question, answer one, answer two, 474 00:49:40,350 --> 00:49:45,690 they prefer the first one. Verbosity bias: sometimes they prefer longer answers. 475 00:49:46,080 --> 00:49:50,460 Self-enhancement bias: they slightly prefer answers from themselves, 476 00:49:50,520 --> 00:49:53,700 okay, or from others in the same model family. 477 00:49:53,820 --> 00:49:57,720 And at the same time, they have limited reasoning capabilities: 478 00:49:57,810 --> 00:50:05,990 they are not good at math. Okay. And then we looked at the agreement between two humans, right, 479 00:50:06,000 --> 00:50:09,420 experts, and also between humans and large language models. 480 00:50:09,900 --> 00:50:15,150 And, you know, at the end of the day, it was quite similar: 481 00:50:15,450 --> 00:50:23,430 the human-to-human agreement was 81%, and the human-to-GPT-4 agreement was 85%. 482 00:50:24,030 --> 00:50:28,950 Okay, so that's quite close. We also have many other modalities. 483 00:50:29,400 --> 00:50:34,530 This is one question I asked last night: please show an image of Oxford University at 4 p.m. in February. 484 00:50:35,490 --> 00:50:40,750 Okay, it's very, very close to that. I selected A as better. 485 00:50:41,580 --> 00:50:46,210 And, yeah, this works again across many modalities. 486 00:50:46,660 --> 00:50:50,210 So, what can you do with the data? I'm going to be very quick here. 487 00:50:50,230 --> 00:50:54,400 Um, let me just. So here is one idea: 488 00:50:54,430 --> 00:51:00,280 a per-prompt leaderboard. So from this data we developed a model, and I think this is, 489 00:51:00,280 --> 00:51:04,420 uh, you can look at this equation on the slide; I'll tell you what is here. 490 00:51:04,780 --> 00:51:09,760 So what you do here: you have a prompt, and if I have a prompt, I can generate, 491 00:51:10,880 --> 00:51:16,070 you know, on the fly, a rating for the different models for that prompt. 492 00:51:16,340 --> 00:51:23,870 So how do you do that? Maybe we have never seen that prompt; however, we've seen a lot of prompts similar to that prompt. 493 00:51:24,590 --> 00:51:29,950 So you use the votes on the prompts 494 00:51:29,960 --> 00:51:38,950 which are similar to the new prompt as a proxy, to compute the leaderboard for that particular prompt. 495 00:51:38,960 --> 00:51:40,610 And this has many applications.
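[A minimal sketch of fitting a Bradley-Terry model to pairwise votes, in the spirit of the Arena rating described above; this is not the project's actual pipeline, and the Elo-style rescaling at the end is a common convention assumed here:]

    import numpy as np

    def fit_bradley_terry(n_models, battles, iters=2000, lr=0.5):
        """battles: list of (winner_index, loser_index) pairs."""
        s = np.zeros(n_models)                         # log-strength per model
        for _ in range(iters):
            g = np.zeros_like(s)
            for w, l in battles:
                p_win = 1.0 / (1.0 + np.exp(s[l] - s[w]))  # P(w beats l)
                g[w] += 1.0 - p_win                    # gradient of the log-likelihood
                g[l] -= 1.0 - p_win
            s += lr * g / len(battles)                 # gradient ascent
            s -= s.mean()                              # fix the scale
        return 400.0 * s / np.log(10.0) + 1000.0       # Elo-like scale

    # Model 0 beats model 1 twice and loses once, so it gets the higher rating.
    print(fit_bradley_terry(2, [(0, 1), (0, 1), (1, 0)]))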
496 00:51:40,910 --> 00:51:51,470 You can, um, you know, given a cost target, you can maximise the Arena score, or given an Arena score target, you can minimise the cost. 497 00:51:51,680 --> 00:52:05,000 Right. And with the models available at the time we wrote the paper, with this, you know, for the same cost you can get 30 more Elo points, 498 00:52:05,540 --> 00:52:10,820 um, better models, and for the same accuracy, 499 00:52:10,830 --> 00:52:16,130 the Arena score, you can get 2x cheaper by using this prompt leaderboard and picking the right model. 500 00:52:16,430 --> 00:52:25,000 Okay. You know, it has been very rewarding, because right now most foundation labs, 501 00:52:25,270 --> 00:52:35,690 before they release a model, they, um, uh, evaluate these models also on, uh, the Arena. 502 00:52:35,720 --> 00:52:40,360 Now it is called LMArena. Okay. And we have a lot of user votes and things like that. 503 00:52:41,380 --> 00:52:47,830 Yeah, and some, uh, people tweeting when the new results come out. 504 00:52:48,370 --> 00:52:51,880 Okay. So I am going to go very briefly now. 505 00:52:52,540 --> 00:52:55,659 So I am going to have these three, uh, slides, 506 00:52:55,660 --> 00:53:00,260 and then I am going to go to the conclusions. Uh, so what are the lessons we learned? 507 00:53:00,280 --> 00:53:07,450 I think these are important; I do not want to skip any of these. Lesson number one is that trends do matter, right? 508 00:53:07,900 --> 00:53:14,020 For Ray, it was the emergence of complex AI workloads and heterogeneous distributed systems; what was really important for vLLM 509 00:53:14,020 --> 00:53:21,910 was the autoregressive nature of LLMs. So the problem changes, and if the problem changes, in most of the cases the solution is going to change, right? 510 00:53:22,360 --> 00:53:33,909 And so chatbots, you know, coming out of basically all the big labs, and complex AI workloads, are what actually drove us to work on these systems, 511 00:53:33,910 --> 00:53:41,530 and are actually also the drivers behind the problems that had impact. Lesson two: 512 00:53:41,560 --> 00:53:50,680 simple solutions matter, obviously. Um, you know, Ray, as you have seen, has a minimalist API. 513 00:53:50,880 --> 00:53:59,020 In vLLM, PagedAttention is quite simple. Uh, LLM-as-a-judge and head-to-head evaluation using, you know, Elo-like ratings are very simple ideas, 514 00:53:59,020 --> 00:54:07,690 maybe obvious, too obvious in retrospect. And the reason this matters is that it is easier for other people to understand too, right, 515 00:54:07,690 --> 00:54:11,980 and they apply them. Like I showed you, PagedAttention is used now by everyone. 516 00:54:12,670 --> 00:54:17,770 And also, you know, Elo-like ratings: you see a lot of other people doing it, or LLM-as-a-judge. 517 00:54:18,160 --> 00:54:23,560 And this is related to the impact, because if you understand how something is working and how to do it, 518 00:54:23,690 --> 00:54:27,790 it is much more likely you are going to apply it, versus something that is hard to understand. 519 00:54:28,420 --> 00:54:34,390 And, uh, the other one I want to say is that, um, 520 00:54:35,560 --> 00:54:43,360 you have to plan for flexibility and rewriting when you develop these kinds of systems, because you do not have the requirements up front; 521 00:54:43,360 --> 00:54:47,469 the requirements evolve as you develop the system.
522 00:54:47,470 --> 00:54:53,440 So, you know, we already rewrote Ray, uh, four times, uh, and, um, 523 00:54:53,440 --> 00:55:00,640 vLLM we already rewrote once, and it is, again, only a two-and-a-half-year-old project, and now we are at it another time. 524 00:55:00,660 --> 00:55:11,290 Right. And so forth. Um, so I am going to skip over the next one, and I am going to move to the questions. 525 00:55:11,290 --> 00:55:17,049 So, um, the summary is that, um, I do believe in open source. 526 00:55:17,050 --> 00:55:20,530 Everything we have done, virtually, is open source. Actually, in the lab, 527 00:55:20,530 --> 00:55:24,850 I mean, you know, in my lab, we do not file for any patents. 528 00:55:25,090 --> 00:55:33,820 So everything is for public, uh, consumption. And, you know, I talked today about three of these, uh, projects, right: 529 00:55:34,390 --> 00:55:38,230 Ray, vLLM, and Chatbot Arena. And I do believe that, you know, 530 00:55:38,230 --> 00:55:44,800 we are just scratching the surface, and there will be many, many more, uh, challenges we are going to have to address. 531 00:55:45,070 --> 00:55:51,430 And I hope that most of them will be addressed in, uh, the open, using open source technologies. 532 00:55:51,700 --> 00:55:52,210 Thank you.