Okay. So I'm delighted to introduce Éva Tardos, who is going to give this year's Lovelace Lecture. Éva is really well known to many people, and she has so many contributions that if I read them all, it would take all of her time, so I've made a small selection. Éva received the Fulkerson Prize in 1988 for finding minimum cost circulations in strongly polynomial time. Then in 2006 she won the George Dantzig Prize for deep and wide-ranging contributions to mathematical programming, including the first strongly polynomial time algorithm for minimum cost flows, and several other areas: network flows, linear programming, submodularity, circuit complexity, scheduling — the list goes on. She moved on from there to laying the foundations of algorithmic game theory, for which she was a recipient of the Gödel Prize in 2012; the cited contribution was her paper with Tim Roughgarden, "How Bad is Selfish Routing?". And then there is a string of honours: member of the National Academy of Sciences, fellow of the AMS, member of the American Academy of Arts and Sciences, and I noticed just today that there is another award coming up in 2017. So it's a real honour for us to hear about her contributions to game theory, and I think she has a really unique take, combining that with learning. We're really looking forward to the lecture. Thanks.

So, thank you very much, and thank you for the nice introduction. Indeed, what I want to do today is tell you something about not just one-shot games but repeated games, and then about combining that with learning, where I'm thinking of learning as a way to model the agents in these games. I guess the previous lecture, where we heard a lot about how irrational human players are in games, was an interesting introduction. I'm proposing a more mathematical model of how people behave — not quite the classical model — and I'll tell you a little bit more; maybe this is a good moment to try to defend the claim that my rational model is applicable in many situations. For this talk my favourite application is routing, which apparently I pronounce differently than Leslie. You can think of it as traffic, as in cars: people choosing a path to drive to work. Or, if you want, you can think about packet routing, that is, internet packets.
In both cases the basic setup is that congestion slows things down: the more cars are around, or the more packets are around, the slower it gets to reach the destination. From the decision maker's perspective, the two are actually a little bit different, in the sense that in the packet case the decision makers are the routers, so there is an algorithm behind the decision making, and maybe we can believe that algorithms are actually rational — someone coded them up, and presumably they coded them up to do the right thing — whereas with humans driving there might be other decision-making processes, and they might not really be driving on the shortest path. We'll come back to this in a little bit, but for now you can think of it either way. So traffic routing is a place where congestion causes delays, so it's a game: my driving on a given path might help me get someplace, but at the same time it slows down everyone else on the same path.

There is a stream, or an idea, in my research that I at least want to mention, even if it's not going to be a full focus of the talk: sometimes in these situations — internet traffic for sure, and even car traffic to some extent — the situation actually changes. The players in the game repeatedly play the same game, but they are not really facing the same situation; something changes. In the internet case, say something becomes popular and everyone goes to the website to download that content. It's hard for the people running internet traffic to know this ahead of time — they don't — so they're going to have to adjust to the changing traffic. The same thing happens with road traffic: if you're very smart and you know where the football game is happening, and around when the football game tends to end, so traffic will be bad because people are coming out of the game, then maybe you can avoid it; but other people do not, and they get stuck in the bad traffic and have to adjust. So sometimes the situation changes, and I'll come back a little bit to how our theory adjusts to this.

So here is our model of what I think of as a repeated game. There are a bunch of agents, maybe there are n of them, and there are discrete steps: every morning you wake up and maybe drive to work. Each player i takes some action at time t, you do this every day, and then something happens every day — you figure out how many minutes it took to get to work, you might, you know, go a different way, and every day something happens.
But I'm going to assume that you're rational in the sense that you try to minimise the average time it takes to get to work, or to get your packet to the right destination — which again assumes the players are somewhat rational — and that what the players are doing is trying to learn from data. This is maybe more clearly a model of what the routers are trying to do: they're trying to get the packets to the destination, and where is the information coming from? Past data. They know how long the previous packets took, or they know something about previous data, and that is the information they're trying to use.

And what am I going to ask? I guess the talk will consist of two parts. In the beginning I'm going to ask about the research direction that I, and many of us, have been pursuing for a number of years: what can we say about the outcome of these games when the players are learners? I'm going to make the key assumption that every player is trying to learn from past data, and I'm going to ask what I can say about what will happen in the game if everyone is trying to learn.

This led me to think about learning, and to realise that in some ways the online learning I'm going to use is a very well developed theory, but game theory motivates me to think about online learning questions, or about understanding learning behaviour, a little differently than the online learning community does. And I'm going to give you a little taste of some recent work where we ask what natural learning behaviour looks like, or what kinds of algorithms assure us that players are actually learning well online.

The speed at which you learn is, of course, important for online learning — it is a topic studied there — but it's particularly important for game theory, because if I have a system with some transient behaviour, where people stay in the system for a little while and then vanish, and if they're very, very slow at learning, then they were only in the system for a little while before they had learned anything, and I can't say anything about their behaviour. In order to be able to say something, I have to know they stayed in the system long enough to learn, and in particular I have to know how long it takes to learn. So I need some bounds on how fast they learn to be able to say anything meaningful. So that's what I would like to tell you a little bit about: two very related, very intertwined parts.
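To make the repeated-game model concrete, here is a minimal sketch — not from the slides; the two routes, the delay functions and all parameter values are made up for illustration — of the kind of repeated routing game being described:

```python
import random

# Hypothetical repeated routing game: every "day" each of N_PLAYERS agents
# picks one of two routes and experiences a congestion-dependent delay.
N_PLAYERS, N_DAYS = 20, 100

def delay(route, load):
    # Made-up delay functions: route 0 is short but congestible, route 1 is long but wide.
    return load / N_PLAYERS if route == 0 else 0.6

total_time = [0.0] * N_PLAYERS
for day in range(N_DAYS):
    # Placeholder behaviour: each agent picks a route at random.  A "learner"
    # would instead use the delays it experienced on previous days.
    choices = [random.choice([0, 1]) for _ in range(N_PLAYERS)]
    load = {r: choices.count(r) for r in (0, 1)}
    for i, r in enumerate(choices):
        total_time[i] += delay(r, load[r])   # each agent only observes its own delay

# A rational agent, in the sense of the talk, tries to keep
# total_time[i] / N_DAYS small as the days go by.
```

The point of the sketch is only the shape of the interaction: the same game is played every day, each player sees only its own outcome, and the question is what to assume about how the route choices are made.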
So, games. Probably I should start here — this is a computer science audience, I think, for the most part, so telling you that learning is a reasonable model of behaviour in an online setting maybe doesn't need so much defence, but it's still good to say it. Of course, if I go back to what classical game theory always did, the classical thing we all studied, me included, is Nash equilibrium, not learning outcomes. So maybe the natural thing would be to say that I ought to also study Nash equilibria. I guess there are many good reasons not to, but one of them is that Nash equilibrium is hard to find, and there are many variations on what I mean by hard to find. One of them is that in order to find a Nash equilibrium I have to be very well informed about who else is in the system and what on earth they might want to do — an information, or lack-of-information, kind of argument. And of course there is also the famous Daskalakis–Goldberg–Papadimitriou result, a very powerful hardness result, that Nash equilibrium is computationally hard to find, which gives a sort of alternate path to the same conclusion: Nash equilibrium is a problematic model. To get to a Nash equilibrium you simply have to know too much, and even if you knew a lot, it would be computationally hard.

But then I can also offer you very empirical evidence: I can look, and that's not what's happening. Here is my empirical evidence. If I have a repeated game, the simplest model of what it means to be at a Nash equilibrium — and indeed the very first studies of these kinds of games assumed this — is that the players might try stuff, then settle into a Nash equilibrium, and then repeat the very same action every single day, because they have figured out what an equilibrium is. If they could ever figure out such a state, that state should indeed be stable, because Nash equilibrium means exactly the inequality right here: they are doing something which has lower cost than anything else they could try to do. So if they try something else, it's only going to be worse, and the next day they go back to what they were doing. So it's very stable, because it has this property usually referred to as a no-regret inequality: it's Nash because you're not regretting not doing something else, and it's stable once they find this Nash equilibrium. But you look at any data and, sure, life isn't this stable.
I can tell you, whatever data you look at — the data that's easy to show you, which is actually not the main subject of this talk, is auction data. I've been working with this kind of auction data; this one is coming from Bing, Microsoft's advertisement auction data. It's just not this stable. That's not what's happening. So there is a complexity problem with Nash equilibrium, there's an information-theoretic problem — there's not enough information — and then there's an empirical problem: it's just not happening.

Now, one way to look at what the Microsoft advertisers are doing — and these are actual advertisers that change their bid very, very frequently, so frequently that I'm virtually certain they're not humans; they change just too often. This has to be an algorithm doing it, not a human, I'm guessing. With these kinds of changes, I think what best models this behaviour is some sort of online optimisation algorithm. This is what online optimisation does: some form of gradient descent, and, just eyeballing it, this totally looks to me like a gradient descent algorithm. So that's more evidence that maybe learning is a better model of what's really happening in real life.

So here is what I'm going to do: I'm going to assume, as a behavioural model, that people are learning. And what does it mean that they're learning? When someone first shows up in the system — say the beginning of this picture is day one for some user — maybe he's clueless, he's trying out stuff, he's doing wrong things, but as time goes on he gets better and maybe does something reasonable. And then I want to use an assumption that is very, very similar to Nash equilibrium, except it does not use the stability; that is, the players may do something different every day. No regret would mean that, for every fixed action, whatever they did has lower cost than having played that fixed action throughout, with hindsight. Okay, that is the no-regret condition. What I really mean, of course, is not literally no regret, because there was the initial experimentation where you could have easily made a few mistakes; I mean small regret. It's usually referred to as no regret as long as this error doesn't grow linearly with time. For example, having normalised the costs to be between zero and one, the biggest error you can accumulate by time T is T itself. If regret grows roughly as the square root of T, or something not quite linear, then it's called no regret, meaning your regret per step vanishes as time goes on.
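For concreteness, here is roughly what the two conditions just contrasted look like, in my notation rather than the slides'. A Nash equilibrium of the one-shot game satisfies, for every player i and every alternative action x,

\[ \mathrm{cost}_i(a) \;\le\; \mathrm{cost}_i(x, a_{-i}), \]

while the no-regret condition for a learner playing actions \(a^1,\dots,a^T\) asks that, for every fixed action x,

\[ \sum_{t=1}^{T} \mathrm{cost}_i(a^t) \;\le\; \sum_{t=1}^{T} \mathrm{cost}_i(x, a^t_{-i}) \;+\; R(T), \qquad R(T) = o(T), \]

for instance \(R(T) = O(\sqrt{T \log N})\) with N actions and costs normalised to [0,1].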
This is a pretty standard sort of no-regret bound. So I'm actually going to spend at least one slide defending the no-regret assumption, but maybe I should admit that I like using a slightly different no-regret assumption, which I think — at least as the data seems to show — matches human behaviour much better than the exact no-regret assumption. It's an approximate version. The usual no-regret condition says that the cost that you experience, for any player i, is at most the cost of any fixed action with hindsight, plus some sort of additive error. In a somewhat recent paper from last year at NIPS, we tried to argue that a better notion for game theory is this thing called approximate regret. What I'm doing is allowing, instead of a purely additive error, something that algorithms people often like — especially since I, for example, came from approximation algorithms as an area — an approximation-style, multiplicative error, and I'm mixing the two. I'm allowing a one-plus-epsilon multiplicative error, saying that most humans, and even these systems, are insensitive to some low level of noise — looking at empirical data, something like five or ten percent. They're not very sensitive; they're not optimising the last digit. So this would be modelled by an epsilon of a tenth or a twentieth, something like that. And the benefit, which is going to really pay off for us, is that if you are shooting for this sort of approximate regret, first, it models data better, so it seems like a better model of human behaviour, and second, as you might notice, the error went way down: instead of an error that grows with time, I have an error that doesn't grow with time at all. This quality of learning is achievable much, much faster than the exact no-regret quality of learning that the classical analysis shoots for.

If you have studied online learning, you might recognise this bound; what's new here is really its use in game theory rather than the bound itself, and I'll come back to this in the second half of the talk. So my point is that this sort of approximate regret is actually achieved by the classical algorithms many of you might know — multiplicative weights, Hedge, regret matching; almost all of the algorithms you might be aware of have this bound. So there is nothing new here; it is just, maybe, more suited for game theory.
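Written out, the approximate-regret condition being described looks roughly like this (my notation; the constants are only indicative):

\[ \sum_{t=1}^{T} \mathrm{cost}_i(a^t) \;\le\; (1+\epsilon)\,\min_{x}\sum_{t=1}^{T} \mathrm{cost}_i(x, a^t_{-i}) \;+\; O\!\left(\frac{\log N}{\epsilon}\right), \]

with the additive term now independent of T. For example, running multiplicative weights with a step size on the order of \(\epsilon\) is the standard way to get a bound of this shape for costs in [0,1].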
So why do I say it's more suited for game theory? Two points. One, I think it matches human behaviour better than the exact no-regret algorithms. But here is the other one. I like game theory; I got interested in learning because I wanted to study what I said in the title, the quality of outcomes in games. So what do I mean by the quality of outcomes in games? The classical notion, due to Elias Koutsoupias and Christos Papadimitriou, is the price of anarchy: you take the cost of an equilibrium divided by some sort of magical optimal cost. Elias, like all of us, was studying Nash equilibria, not learning outcomes, but this is the price of anarchy — the cost of a Nash equilibrium divided by the optimal cost.

Now, since I want to study learning outcomes, I want to change this to a related concept, sometimes called the price of total anarchy, though I kind of like calling both of them the price of anarchy. The players are no longer at a Nash equilibrium; they are learning. So what does it mean that they're learning? I make the assumption that everyone is a learner: they are all running some sort of learning algorithm with this no-regret property. And what I'm measuring is the average cost over time divided by the optimum. The expression I wrote assumes a stable population, so there is a fixed population and the optimum is fixed; there are T iterations, so the denominator is T times the optimum. Or, if I want to be more general — and this is a topic I mentioned briefly at the beginning but, as I said, will barely touch on — I can consider a dynamic population, where the population changes over time, and then I have to take an even more generous average over these learners, on the top and on the bottom, and the optimum of course now also changes, because the population changes. This is the natural analogue of the price of anarchy, and this is the kind of thing I'm interested in: I want to prove that if everyone is a learner, then in many games these ratios stay very small.

Now, as you may or may not know — and I hope you have seen some of this — there are beautiful results out there that prove price of anarchy bounds, and they often come with a framework championed by Tim Roughgarden, which is in some ways a very natural, simple framework. Here is, roughly speaking, what it does. It says that if the game has a natural property that he called smoothness, then you can prove good price of anarchy bounds. Now, what does the smoothness property say?
I wrote the inequality technically down on the slide, and I'll show you in a second that it actually gives you something, but let me say it in English first. Imagine any solution, any kind of outcome — that is, a vector of actions, one per player — that has horrendously high cost, say 100 times above the optimum. Then what smoothness says is that many of these players have a good alternative action that improves their situation, and that is what the inequality expresses. If the cost of the current solution is incredibly high, then, because mu is a multiplier less than one, the right-hand side is not quite that high, so on average one of the players has a good alternative option to take. So it's very natural that this condition should imply a price of anarchy bound: if the cost is very high, it can't be stable, because someone wants to improve. And indeed the proof is just as easy. If you are at a Nash equilibrium, then no one regrets not playing this alternative action — whatever they were doing satisfies the Nash inequality — and all you do is rearrange the inequality and you get that the cost is bounded. So this is, as you see, a very straightforward proof.

What's interesting about it is that almost all — with very, very few exceptions, all — the price of anarchy bounds follow this framework. For example, the original paper I had with Tim that was mentioned in the introduction, about the price of anarchy of traffic routing, fits this framework, and the extension to atomic, or larger, traffic routing also fits this framework, and many, many others; I guess these are the two most relevant here. They all follow this very simple framework. One thing to keep in mind is the kind of numbers we are getting: we are proving price of anarchy bounds of a factor of two, a factor of four — or, if you want to be generous, a factor of two and a half, or even worse. When I'm interested in these kinds of bounds, I don't much care whether it's 2.5 or 2.6; that's noise. Either you're not at all interested in this, because these constant factors are not relevant to you, or, if you are and I give you one more digit, you'll say, sure, whatever. If this is interesting at all, then the last digit doesn't matter any more, or else it's not interesting. So this is the attitude I want to take to learning as well: I'm not interested in very precise optimisation, because this is what I'm shooting for, and here you either are not interested in it at all, or you're not interested in the last digit of precision.
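As a sketch of the argument just described, in my notation (this is the standard form of Roughgarden's smoothness definition): a cost-minimisation game is \((\lambda,\mu)\)-smooth if for every outcome a and every outcome \(a^*\),

\[ \sum_i \mathrm{cost}_i(a^*_i, a_{-i}) \;\le\; \lambda\,\mathrm{cost}(a^*) + \mu\,\mathrm{cost}(a). \]

At a Nash equilibrium a, each player i has \(\mathrm{cost}_i(a) \le \mathrm{cost}_i(a^*_i, a_{-i})\); summing over players and rearranging gives

\[ \mathrm{cost}(a) \;\le\; \frac{\lambda}{1-\mu}\,\mathrm{cost}(a^*). \]

Applying the same manipulation day by day to a no-regret sequence of play gives the same ratio up to the regret term, and with (1+\(\epsilon\))-approximate regret it becomes roughly \((1+\epsilon)\lambda/(1-\mu')\) for a \(\mu'\) close to \(\mu\) — which is the extension discussed next.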
So that is the attitude I'm going to take about precision. This framework has the beautiful advantage that it also works very nicely for learning outcomes, almost identically — you don't have to change anything. I never used stability in the proof back there; what I used instead is the no-regret condition, which is exactly what a learning outcome gives you. If you want to adjust the proof to learning, it says that if you learned, then you don't regret very much not having done something else — you have a bit of regret, or at least not a horrible amount. If the game is smooth, you can still finish with the same inequality and you get the same sort of bound. One important thing to note is that the regret error, of course, enters the bound. If you have a large regret error, then you're actually not very close to the optimum; if you have no regret error at all, the proof carries over exactly; with a large regret error, you lose something. So the regret error really matters — this is maybe the point where the speed-of-convergence issue comes back in. Remember, I wanted to use approximate regret, so I want to put in this epsilon as an extra term. If I do this and run the same machinery — I didn't put the algebra on the slide, but I hope you believe me, it's the exact same manipulation — here is how I lose the epsilon: it's a multiplier. We used to have lambda over one minus mu as our bound, and it's now one plus epsilon times lambda, over one minus something a little bit different, but not by much. If I keep epsilon small enough, this is noise, a very small change, but with the benefit that I can push the additive error down very significantly.

Okay. So this is my basic introduction of what I wanted to tell you about games. Before I go on and talk about learning, there is a little segment I want to add. I'm sure it's on many of your minds, even though no one is speaking up: how good a model is this learning? Are people actually learning? So I want to show you two slides asking whether this is a good model — I promised a little bit of discussion, and I want to show you some data. Again, the main assumption here is that everyone is learning. I'm OK if almost everyone is learning, of course; that's almost as good. But are people learning at all? Is anyone learning?
There are a couple of things I can offer. Learning is utterly doable. It's doable information-wise: there's no need to know who else is in the game; you're learning from data, the data is available, you can learn — it's doable. And what's better, if your opponents are playing badly, you learn even better outcomes: you take advantage of badly playing opponents. That's what learning does. It feels to me like people should be learning. That is all I'm asking of them: if there's a single strategy that's consistently really good, please catch on to it. I don't know whether Oxford people commute far enough for this to make sense, but if you take, for example, California, where people drive long distances: if there is a highway on which, in the morning, there never is congestion, that's all I'm asking you to notice — please notice it and drive on it. It's a pretty reasonable assumption, and there are many, many algorithms that are actually capable of implementing this. And the ideas behind these algorithms are very natural: try stuff; if something was very bad in the past, only try it with very low probability, since it's probably a bad idea; usually do the things that worked, with the natural probabilities.

Nonetheless, if I look at data, it's so-so. So again, back to the Microsoft data; this paper is with Denis Nekipelov and Vasilis Syrgkanis, from EC'15, where we again looked at the Microsoft ad auction data. The ad data has the problem that, unlike traffic, I don't actually know what each bidder's value is, because that's not in the data — only the behaviour is in the data. What I have to do, and that's the content of this paper, is try to guess the value given the behaviour. We propose that the best guess we can make is the value for which the bidder's regret would be as small as possible. So when I show you a slide of how much regret they have, it is a very optimistic estimate: if I guessed the value wrong, they have bigger regret than what I'm showing; this is the smallest regret they could possibly have. It says that about 30% of the participants have zero or negative regret, so there is a value for which they're doing better than any single fixed bid with hindsight — maybe they're varying the bid in a clever way and doing better than any fixed bid could have done. Another roughly 20% of the participants have less than 10% regret.
The remaining half of the participants have 10% or more regret; in fact, the ones up at the top have an incredible amount of regret — they are doing something really wrong. And again, I can't guarantee that I have their values right, especially not for those participants, but I can guarantee they have at least this much regret; I'm giving you an optimistic estimate of their regret. So this is an assumption that may be interesting mathematically, but I think there's a lot of research to be done in the direction of better understanding what's really going on here. And I guess, when humans are making the decisions — I've talked to some of you about ideas I have here — actually acquiring and understanding the data is a cost that's not in these models, the human work involved in figuring out the right thing to do.

Anyway, the other models one could use — maybe I'm going to skip this, so that I have enough time for the second half I was hoping to talk about. So, what I've tried to tell you so far — and I admit I skipped one important segment — is that this learning is a behavioural model. It's mathematically very nice: it has exactly the Nash-like inequality without the Nash stability assumption; it matches data better than Nash equilibrium, though I admit not at all perfectly; and there are very nice bounds, which I certainly didn't show you beyond the framework — sometimes you can prove good bounds saying that if everyone, or most people, satisfies this no-regret learning property, then the average delays in a network are pretty good, within some small constant times the optimum. And it can be supported by a broad set of learning algorithms, so maybe it works as a behavioural assumption. So hopefully I convinced you that, even though it's interesting to study other behavioural assumptions, this learning assumption is an interesting and good one. But I would like to spend the second half of the talk on recent work actually trying to understand an aspect of learning that I've hidden under the rug until now; I want to pull it out from under the rug and tell you that it is an important issue and I should talk about it. This is upcoming joint work with Karthik Sridharan and Thodoris Lykouris.

Feedback. When I think of learning algorithms — and most of the online learning literature, especially the early literature, was this way too — I made the very unnatural and uncalled-for assumption that whenever you're choosing a path to travel, someone god-like gives you feedback on all the other paths.
That is, an alter ego of yours also went on the other paths: you took an hour to drive to work, but you're also aware that on this other path it would have taken 35 minutes, and on a third path it would have taken 10 minutes. That's called full-information feedback, and in fact much of the results I had on my slides were all full-information feedback: you choose a strategy, you experience the outcome of that strategy, but I feed you full information on everything else you could have done. Now, I'm not aware of many full-information feedback setups. For example, the traffic packets out there certainly don't know what happens on other paths that they're not trying; that's not how it works. When you're driving, maybe you get some information listening to news coverage of traffic delays, but it's not as good as the information about the road you actually drove on. So you're getting some partial information, not full information. So one thing that would be important is to handle partial-information feedback models: if I don't give you full information, can you recover some of the same things?

For this segment of the talk, I'm going to hide the other people's behaviour in the notation: instead of writing the cost of your action x as a function of everyone's actions, it will be just the cost at time t of x. Those things get hidden: I drop the player index i because I'm focusing on one person's learning, and I hide the other people's behaviour by simply allowing the cost to vary with time — it varies with time because of the other players' behaviour. This is a more compact notation. And, like a learning theorist, I would like the costs normalised to be between zero and one, so you take the maximum they can be and normalise by it, and I'll assume you did this.

There are two parts to this. I'm going to start by talking about the classical not-full-information model, which is certainly simple: you only get information about the action you selected. But I also want to mention a more technically difficult extension of what we're doing, to a very beautiful partial-information feedback model that I really like. Here is what it looks like. Every action is a node; so, for example, if you choose action x, that's the one in red. Some actions are connected, by visibility: if you choose an action x, then you also get feedback on its neighbours. So there is a graph on the action space, and you get partial information, which means you get information on some things and not on others.
So in this case you get information on the neighbours of the node you chose. It's a very clean and, I think, elegant partial-information feedback model that we are studying. This is the current state of this line of research, and there are many, many good questions to ask — I'm sure you'll do a good job asking me about other extensions of partial information, not exactly your neighbours but other things; there are many interesting questions about whether what we're doing extends to other forms of partial-information feedback.

So let me first tell you what's known. Again, I know some of you know a lot about online learning, and the first slide or two here might be well-known territory, but let me just remind you in case you don't remember. The fact that you don't get full-information feedback is not news to the learning community; here is the standard algorithm. Remember, you choose an action and you find out its cost, but there are all these other actions and you have no idea what their costs are — you got no feedback. Here is the proposed way of dealing with it. It's called importance sampling, and it's a very, very simple idea. If you haven't seen it — this is not at all my result — this should be your take-home message, because it's so simple. What you should do is pretend that everything you haven't seen is as good as it can be: zero cost, it was ideal, you could have gotten there in no time. But for everything you did see, you scale the cost up by dividing by the probability that you chose that action. So I have a probability distribution over my choices: this one maybe I choose with probability 10%, this one with probability 1%. If I don't see it, I pretend it has cost zero; if I do see it, I scale the cost up by the probability.

Now, what's good about this? What's good is that if I take any fixed action, its estimated cost is precisely right in expectation: with probability p it has this ginormous scaled-up cost, contributing exactly the true cost to the expectation, and with the remaining probability it is zero and contributes nothing. The expected estimated cost is precisely right. Yes, I increased the variance by an awful lot, but the expectation is exactly right. Similarly, if I now imagine that I use these imagined costs — not my real costs but the scaled-up costs — and I run a full-information learning algorithm, treating these estimated costs as if they were the real costs —
then the cost of this full-information algorithm will also be exactly right in expectation. This is a slightly more complicated computation: if I use the probability distribution p to choose among actions with these estimated costs, then, using linearity of expectation, you get that the expected cost is exactly right. Again, I increase the variance by a lot, but the expectation is exactly right. So what I could do is run my full-information algorithm on these estimates, and that's what the proposed method is: just run your full-information algorithm on the estimates. Well, almost.

There is a slight problem here, which I've referred to a couple of times: we do increase the variance. Remember that we liked normalising the costs to be between zero and one. We would have to re-normalise, because dividing by the probability increases the numbers. But what should I re-normalise to? The probabilities could be as low as zero, and if I divide by something close to zero, the number gets pretty damn big. So at the moment I'd have to normalise them to be between zero and infinity, which isn't going to work for me. So here is what I view as a dirty trick of the trade — maybe not dirty, a trick of the trade that learning theory has used for years and years — and I want to say: no, no, no, don't do that. Here is what they do: just put in a little bit of random noise. Don't let the probabilities go down to zero, because then you are scaling things up towards infinity, and that's really painful. Instead, put in a little bit of random noise — say, don't let any probability go below gamma, where gamma is the exploration rate — and for the rest you run the algorithm as before.

This works okay from the classical regret perspective. But it doesn't work for my goal, because with that small probability gamma of random noise, you're choosing horrible paths. Why would you do that? Again, I don't know the right road numbers here — I know the California ones. If you know that a road like the 101 is horrendous, and it's horrendous every day, then even with probability gamma, getting on it seems stupid. It guarantees that you have that much regret, because with probability gamma you're choosing a horrendously bad path. Don't do that. It works for the regret bound, it works for the theorem, but it's obviously a bad idea. People are probably not doing this, and they shouldn't be doing this. It looks like an altogether bad idea.
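As a rough sketch of the classical recipe just described — importance-weighted cost estimates fed into a full-information rule, with a little uniform exploration mixed in so that no probability drops below the noise level — something like the following, where multiplicative weights stands in as the full-information rule and the parameter values and the cost oracle are placeholders:

```python
import math
import random

def bandit_with_exploration(n_actions, n_rounds, cost, eta=0.1, gamma=0.01):
    """EXP3-style sketch: estimate unseen costs as 0, scale the seen cost up by
    1/probability, and mix in uniform exploration.  Assumes gamma <= 1/n_actions."""
    weights = [1.0] * n_actions
    for t in range(n_rounds):
        total = sum(weights)
        # Mixing in exploration keeps every probability at least gamma.
        probs = [(1 - gamma * n_actions) * w / total + gamma for w in weights]
        a = random.choices(range(n_actions), weights=probs)[0]
        c = cost(t, a)                      # only the chosen action's cost is observed
        est = [0.0] * n_actions             # pretend unseen actions were ideal (cost 0)
        est[a] = c / probs[a]               # importance-weighted estimate for the seen action
        weights = [w * math.exp(-eta * e) for w, e in zip(weights, est)]
    return weights
```

The exploration mixing on the `probs` line is exactly the injected noise being objected to above: it forces you, with probability roughly gamma per action, onto options you may already know are terrible.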
So one question one should ask is: is there a better model of learning that avoids this trick? Can I still use importance sampling but avoid the trick — not put in the random noise? And what I'm proposing instead is, I think, an equally simple and much more realistic alternative. Here is the very simple idea: if something goes below probability gamma, whatever the parameter is, just don't do it. I don't like small probabilities, because I don't like dividing by that number — so just don't do it. That's called freezing: I temporarily freeze those options. I discovered that driving on that horrendous California road is really bad — so just don't do it. Don't do it with probability one in a hundred, don't do it with probability one in a thousand: just don't do it.

So it's called freezing, and here is what I technically mean: I'm going to change the probability to zero if it went below a cut-off. I'm slightly cheating, and the cheating is the tilde sign over there, because once I do this the probabilities don't sum to one any more, so I have to scale them up a bit. That's the tilde — and not by much: this is where my one plus epsilon comes in. I'm giving myself a one-plus-epsilon error, so scaling the probabilities up by a little bit won't hurt me. So I do this; that's called freezing. And here is what I would like to do with this freezing. What does it mean? Let's go back to what we studied, the importance sampling. The first thing — which I said in English but forgot to show you — is that I want to make sure the total probability that I froze is not too much, so that when I scale up, I don't have to scale up by a lot. Now, if you look at any fixed action, what am I doing? I used to have that its expected estimated cost is exactly right, and I don't have that any more, because of this artificial freezing: maybe it has a very high cost and I artificially never try it, and in those periods the estimate is zero for sure, even though it should have been something big in expectation. But this estimate of the cost, which is no longer exactly right, is what's called negatively biased: it's guaranteed to be at most what it should be, because sometimes I zero it out when I shouldn't. So the cost I estimate is lower than the actual cost. That's good news, because if I do well compared to this estimate, I surely did well compared to the real thing.
So that's benefit number one so far. Second — and maybe this is the important part — the algorithm itself is not suffering: the expected cost of the algorithm is exactly right, modulo that little tilde over there, which is again the scaling up by one plus epsilon, because I'm not playing the arms on which I'm cheating; I'm only playing the real, unfrozen arms. So I don't have an estimation problem there: the arms I play have the right probability and the right scaling — there's nothing wrong there. So the expected cost of the algorithm I have exactly right, modulo one plus epsilon, whereas my estimates are negatively biased — too low. So if the algorithm does well against the estimates, I'm all set. So again, what I want is to run the full-information algorithm; I now know that no estimated cost will go above one over gamma, because I zeroed out anything below probability gamma. So I solved the variance problem — I limited my variance — and I can run the algorithm. It's a very, very simple idea, and indeed it gives these bounds.

You get bounds that I'm maybe going to show you in a minute, but I want you to take away the technique, or the idea, more than the actual arithmetic of what the bounds come out to. So all I propose is: take the classical method, but instead of injecting noise, take the opposite action — stop playing the option rather than forcing yourself to play it. Use the exact same cut-off as they were using for the noise: anything that goes below the noise level, I just don't play. That causes an error in my estimate, but the error goes in the right direction — the estimate is a lower bound, not an upper bound. Good enough. A very simple idea, with many, many benefits. One is that I even like it as a behavioural assumption: if something is so unlikely to be good, people tend not to do it, so I think this improves the behavioural match of the assumption. It's a black-box reduction: I didn't tell you what the underlying algorithm was; you can take your favourite learning algorithm — of which there are many — and do this to it, and it becomes a learning algorithm in the not-full-information setup. And it recovers the small-loss-style bounds, with no dependence on time in the error term, because the full-information version had that property and I'm not doing anything that loses information.
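A minimal sketch of how this black-box freezing reduction might look — again with multiplicative weights as the underlying full-information rule, and with the cut-off and rescaling choices only illustrative, so the actual scheme in the paper may differ:

```python
import math
import random

def bandit_with_freezing(n_actions, n_rounds, cost, eta=0.1, epsilon=0.1):
    """Freezing sketch: instead of forcing exploration, temporarily stop playing
    any action whose probability falls below the cut-off, and rescale the rest."""
    gamma = epsilon / n_actions              # cut-off; total frozen mass stays below epsilon
    weights = [1.0] * n_actions
    for t in range(n_rounds):
        total = sum(weights)
        probs = [w / total for w in weights]
        live = [p if p >= gamma else 0.0 for p in probs]   # freeze low-probability actions
        live_mass = sum(live)                              # at least 1 - epsilon of the mass survives
        play = [p / live_mass for p in live]               # the (1 + epsilon)-style rescaling
        a = random.choices(range(n_actions), weights=play)[0]
        c = cost(t, a)                                      # only the played action is observed
        est = [0.0] * n_actions                             # frozen/unseen actions: estimate 0 (a negative bias)
        est[a] = c / play[a]                                # importance weight, now at most n_actions / epsilon
        weights = [w * math.exp(-eta * e) for w, e in zip(weights, est)]
    return weights
```

The only change from the earlier sketch is that the exploration mixing is replaced by the freeze-and-rescale step, which is what bounds the importance weights without ever forcing play of a known-bad option.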
And the last benefit — which would have been more relevant for the slides I skipped, but is maybe worth mentioning — is about having to change your behaviour: if, in the middle of driving for many, many months, you discover that something changed, and a road that used to be really bad is now actually pretty good, your algorithm will recover, because it never drives the underlying probability below gamma — an option can get frozen at the cut-off, but it never sinks lower than that. So it's actually good, in a very human sense — learning is good, but not all learning algorithms are good at this — that it adjusts to a changing situation. This relates to the part of the slides I skipped, so I won't spend much time on it.

Instead, the last thing I want to tell you is that it also works for partial information — the more sophisticated version of feedback. Let's go back to the partial-feedback graph that I showed you a little bit before. Remember, the model here is that you not only get feedback on what you did, but you also get feedback on some of the other actions — a subset of the others, namely the neighbours. Now, just to get you calibrated on the simple cases: if the graph is complete, so every edge is in, that's full-information feedback — every time you do anything you see all the neighbours, and that's all the nodes. If the graph is empty, there are no edges; that's what's usually called bandit, or no-information, feedback: you only get feedback on the action you chose, no neighbours. And this model offers you something in between, with the other kinds of graphs.

What we are going to prove — and I'm going to give you a very, very short hint that this freezing technique is good for this too — is that I can replace the number of options in the bounds by the size of the maximum independent set of this graph. And that, I think, is the relevant parameter here. In the full-information feedback case the independent set has size one, and indeed you see everything all the time. If there are no edges, then you have bandit feedback, and indeed the right parameter is the number of options: you kind of have to try them all; there's no way to get the information without trying them. But in general the independent set is a very natural parameter, a very natural bound, and in fact it's easy to see that lower bounds of this form hold, because if there is an independent set where the good things are happening, those you kind of have to try —
466 00:47:49,310 --> 00:47:57,110 They don't give feedback about each other. But it offers you a smooth trade-off between the full information and the partial information. 467 00:47:58,050 --> 00:47:59,520 So what happens in a graph? 468 00:48:02,620 --> 00:48:10,480 Again, I have to quickly remind you that the important thing, this importance sampling, is that every time you see something, 469 00:48:11,170 --> 00:48:17,380 you should update the cost, dividing by the probability that you saw it, to correct for the fact that you don't always see it. 470 00:48:18,170 --> 00:48:21,590 So what matters here is not how often you play the arm. 471 00:48:21,590 --> 00:48:28,990 What matters is how often you see it. That's the divisor you have to correct with: the probability that I saw the arm. 472 00:48:29,260 --> 00:48:32,469 So I'm going to care about how often I see the arm. 473 00:48:32,470 --> 00:48:35,470 If I see it because of a neighbour, that's fine with me. Doesn't matter. 474 00:48:36,950 --> 00:48:38,360 Oh, sorry. Wrong direction. 475 00:48:39,770 --> 00:48:50,660 So what I want to do in my estimated cost is, when I see a node, I update the estimated cost as the cost I see divided by the probability that I saw the item. 476 00:48:51,380 --> 00:48:56,750 So in this estimation, I would like to freeze items that have low probability of being seen. 477 00:48:57,850 --> 00:49:01,000 Very natural. Instead of low probability of being played. 478 00:49:01,900 --> 00:49:03,990 So I'm hoping that I can keep it, 479 00:49:04,420 --> 00:49:11,590 keep a cut-off for low probability of being seen, higher than one over the number of nodes, or epsilon over the number of nodes. 480 00:49:12,580 --> 00:49:26,460 Oh, sorry. I guess this is just saying that if I am able to do this, I can recover all the learning bounds that we had before. 481 00:49:26,480 --> 00:49:33,500 All I have to make sure is that if I want to freeze all of the items that are not seen with sufficiently high probability, 482 00:49:34,340 --> 00:49:39,330 they're not going to, you know, wipe out the whole graph. So how large can I keep this cut-off? 483 00:49:40,390 --> 00:49:45,760 So I claim that I can use a bound that's epsilon over the size of the independent set, 484 00:49:45,970 --> 00:49:51,040 and the total probability I freeze is actually a constant times epsilon. 485 00:49:52,860 --> 00:49:57,390 So it comes in two parts, and maybe I'll give you a hint of what the parts are. 486 00:49:57,840 --> 00:49:59,460 So the first thing I'm going to do: 487 00:49:59,550 --> 00:50:05,730 I look at my network and look at all the nodes that are not seen with sufficiently big probability; say, these are the nodes here. 488 00:50:06,210 --> 00:50:12,150 And I want to argue that the amount of probability I froze is less than epsilon, 489 00:50:12,450 --> 00:50:20,370 if I use this epsilon over the independence number as my cut-off. This is a simple and maybe nice argument. 490 00:50:20,550 --> 00:50:24,060 Say the black nodes here are an independent set. 491 00:50:24,300 --> 00:50:30,330 If I take a maximal independent set, then of course every node is visible from one of the independent set nodes. 492 00:50:30,330 --> 00:50:37,800 That's what's maximal about the independent set. So every node is connected to one of them and therefore it's seen from one of them. 493 00:50:38,010 --> 00:50:44,399 And therefore the probability associated with seeing the black nodes covers everything.
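Here is a rough sketch, under the same illustrative names as before, of one round of freezing by seen-probability: the estimator divides by the probability that an arm was observed (played, or a neighbour played), and arms whose seen-probability falls below a cutoff of roughly epsilon over the independence number get frozen. The cutoff choice and the covering argument are from the talk; the code itself is only an assumed realisation of the idea.

```python
import numpy as np

def seen_probability(p, neighbors):
    """q[i]: probability that arm i's cost is observed this round, i.e. the
    chance of playing i itself or any of its neighbours."""
    return np.array([p[i] + sum(p[j] for j in neighbors[i]) for i in range(len(p))])

def freeze_low_seen(p, neighbors, cutoff):
    """One round of freezing by seen-probability (sketch).

    Arms observed with probability below `cutoff` (roughly epsilon over the
    independence number) get playing probability zero; the rest is
    renormalised.  Covering argument: a maximal independent set inside the
    frozen set sees every frozen arm, so the frozen mass is at most
    (independence number) * cutoff.  Estimates for seen arms are then the
    observed cost divided by q[i].  Freezing can push other arms'
    seen-probability below the cutoff, which is why the talk iterates with
    a decreasing cutoff (discussed next).
    """
    q = seen_probability(p, neighbors)
    frozen = q < cutoff
    p = np.where(frozen, 0.0, p)
    assert p.sum() > 0, "cutoff too aggressive: everything was frozen"
    return p / p.sum(), frozen
```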
494 00:50:44,400 --> 00:50:49,020 And that's the independence number times whatever the cut-off was, the gamma, as I said. 495 00:50:51,050 --> 00:50:56,360 This is nice, but unfortunately there is a technical difficulty, which I'm going to 496 00:50:58,100 --> 00:51:04,930 admit. Unfortunately, once I freeze these nodes, some other nodes will become not seen with high enough probability. 497 00:51:04,940 --> 00:51:06,170 So here is the picture. 498 00:51:06,320 --> 00:51:12,470 I find this set of nodes, which is the bubble here, and its total probability is less than epsilon, as we just agreed. 499 00:51:12,680 --> 00:51:17,780 And unfortunately, there are some new nodes that are not seen enough, because their neighbours got frozen. 500 00:51:18,050 --> 00:51:21,950 So I have to do this iteratively, and as best as I understand, 501 00:51:21,950 --> 00:51:26,450 if I do this iteratively with the same cut-off, I might unfortunately wipe out the whole graph. 502 00:51:26,750 --> 00:51:35,930 But if I do it with a decreased cut-off, and I'm proposing a constant fraction of the previous probability, then I can try to convince you that it will stop. 503 00:51:36,200 --> 00:51:40,010 It will stop, not doing anything worse than doubling my frozen probability. 504 00:51:41,020 --> 00:51:45,550 And here is a very sketchy one-line argument for this. 505 00:51:45,940 --> 00:51:50,950 So look at the next step here: this node here, or this other node, that gets frozen. 506 00:51:53,140 --> 00:52:00,580 These nodes have the property that a second ago they were seen with more than gamma probability, 507 00:52:00,580 --> 00:52:05,250 and now they're seen with less than a third of that gamma. Oh, 508 00:52:05,260 --> 00:52:13,980 how did that happen? That means two thirds of gamma is going towards the previously frozen nodes and only one third is staying. 509 00:52:14,580 --> 00:52:18,240 So a large majority of where it is seen from just got frozen. 510 00:52:18,540 --> 00:52:22,860 This is sort of like a tree structure, a binary tree structure. 511 00:52:23,010 --> 00:52:30,720 Twice as much goes down as up. And as you all know, binary trees have more nodes at the leaves than in the interior. 512 00:52:30,750 --> 00:52:37,200 That is my argument, and I slightly cheated on the details, but that's basically the argument of why this is happening. 513 00:52:38,850 --> 00:52:41,640 So at most, I lose another fraction of the same size. 514 00:52:42,120 --> 00:52:50,550 So I'm going to stop here and maybe give you a sort of high-level picture of what I'm hoping you take away from the talk. 515 00:52:50,910 --> 00:52:59,610 In some way, I did two different things that I think are supporting each other and related to each other, but a little bit different. 516 00:53:00,090 --> 00:53:08,430 So first of all, I hope I convinced you that learning is a reasonable model of behaviour in games, even though it's not 100% matching data. 517 00:53:08,760 --> 00:53:19,260 And probably, you know, having just come after the previous talk, where the main message was that human behaviour is so irrational, 518 00:53:19,590 --> 00:53:24,510 I'm not claiming that this is the right rational model of human behaviour. 519 00:53:24,510 --> 00:53:30,450 It's not perfect. And I even showed you data where half of the participants don't match this. 520 00:53:30,750 --> 00:53:34,110 So one can do more in studying what's a good behaviour model. 521 00:53:34,500 --> 00:53:38,700 But it has many advantages.
It allows the players to adapt to opponents. 522 00:53:38,700 --> 00:53:44,549 They don't need a common prior. They take advantage of opponents playing badly. And very, 523 00:53:44,550 --> 00:53:51,720 very simple strategies, pretty natural strategies, guarantee this. It's certainly a better match, 524 00:53:51,960 --> 00:53:57,300 better than the Nash equilibrium assumption. And one advantage that's not on the slide but probably should have been: 525 00:53:57,570 --> 00:54:06,000 it's beautifully usable in mathematical proofs. It's such a nice condition, you know, an inequality almost like the Nash condition; 526 00:54:06,210 --> 00:54:11,100 it makes the error very explicit and it's very nicely usable in mathematical proofs. 527 00:54:12,170 --> 00:54:15,980 But there's the second part I wanted to also talk about. 528 00:54:16,490 --> 00:54:22,080 It even works in a dynamic environment. Oops. 529 00:54:22,080 --> 00:54:26,580 No, that's not the second part. Okay, the second part, I still need to talk about that, 530 00:54:26,970 --> 00:54:33,480 which is the blue line here: that when I started to do this, 531 00:54:33,480 --> 00:54:40,740 I went to find my colleagues who do more research on online algorithms and tried to ask them: okay, give me algorithms that do this. 532 00:54:41,100 --> 00:54:47,610 But at some point I realised that my game theory perspective is a little bit different than the classical perspective, 533 00:54:48,090 --> 00:54:55,680 and I think there is a lot more that can be done to understand how this all works, or whether it works, with partial information feedback. 534 00:54:56,160 --> 00:55:02,280 So for partial information feedback, there's a very nice and very developed theory of online learning. 535 00:55:02,610 --> 00:55:06,900 But the attitude the game theorists take to this is different: 536 00:55:07,170 --> 00:55:13,290 the form of partial information that I think is natural might not be the same as what they think is natural. 537 00:55:13,620 --> 00:55:23,069 And my interest is not in additive error, but in this sort of mix of multiplicative and additive error in the optimisation. 538 00:55:23,070 --> 00:55:29,370 Humans don't mind a little bit of error here and there, but the speed of learning really, 539 00:55:29,370 --> 00:55:41,639 really matters, both in how fast-changing an environment can be tolerated and also in, you know, how fast they achieve reasonable social welfare. 540 00:55:41,640 --> 00:55:50,760 So the price of anarchy kind of bounds. And for some reason these so-called approximate no-regret conditions, or 541 00:55:50,890 --> 00:55:57,330 algorithms, are much, much less developed in the partial information setting than I wish they were. 542 00:55:57,330 --> 00:56:01,020 And I tried to contribute myself to improving the situation. 543 00:56:01,020 --> 00:56:03,270 So I'm going to stop here and thank you very much.
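For reference, the multiplicative-plus-additive error notion referred to in the closing remarks (approximate no regret) can be written as below. The additive term shown is the standard one for full-information multiplicative-weights-style algorithms, not a bound stated in this talk; the point being made above is that it does not grow with the horizon T.

```latex
% Approximate no-regret: a (1 + epsilon) multiplicative factor against the
% best fixed action, plus an additive error independent of the horizon T.
\sum_{t=1}^{T} c_t(a_t)
  \;\le\; (1+\epsilon)\,\min_{i}\sum_{t=1}^{T} c_t(i)
  \;+\; O\!\left(\frac{\log N}{\epsilon}\right)
```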