Okay. So I'm delighted to introduce Éva Tardos, who is going to give this year's Lovelace Lecture. Éva is really well known to many people, and she has so many contributions that if I read them all, it would take all of her time, so I've made a small selection. Éva received the Fulkerson Prize in 1988 for finding minimum cost circulations in strongly polynomial time. Then in 2006 she won the George Dantzig Prize for deep and wide-ranging contributions to mathematical programming, including the first strongly polynomial time algorithm for minimum cost flows, and several other areas: network flows, linear programming, submodularity, circuit complexity, scheduling — the list goes on. She moved on from there to laying the foundations of algorithmic game theory, for which she was a recipient of the Gödel Prize in 2012; the cited contribution was her paper with Tim Roughgarden, "How Bad is Selfish Routing?". And then there is a string of honours: member of the National Academy of Sciences, fellow of the AMS, member of the American Academy of Arts and Sciences, and I noticed just today that there is another award coming up in 2017. So it's a real honour for us to hear about her contributions to game theory, and I think she has a really unique take, combining that with learning. We're really looking forward to the lecture. Thanks.

So, thank you very much, and thank you for the nice introduction. Indeed, what I want to do today is tell you something about not just one-shot games but repeated games, and then about combining that with learning, where I'm thinking of learning as a way to model the agents in these games. I guess the previous lecture, where we heard a lot about how irrational human players are in games, was an interesting introduction. I'm proposing a more mathematical model of how people behave — not quite the classical model — and I'll tell you a little bit more; maybe this is a good moment to try to defend the claim that my rational model is applicable in many situations. For this talk my favourite application is routing, which apparently I pronounce differently than Leslie. You can think of it as traffic, as in cars: people choosing a path to drive to work. Or, if you want, you can think about packet routing, that is, internet packets.
In both cases the basic setup is that congestion slows things down: the more cars are around, or the more packets are around, the slower it gets to reach the destination. From the decision maker's perspective, the two are actually a little bit different, in the sense that in the packet case the decision makers are the routers, so there is an algorithm behind the decision making, and maybe we can believe that algorithms are actually rational — someone coded them up, and presumably they coded them up to do the right thing — whereas with humans driving there might be other decision-making processes, and they might not really be driving on the shortest path. We'll come back to this in a little bit, but for now you can think of it either way. So traffic routing is a place where congestion causes delays, so it's a game: my driving on a given path might help me get someplace, but at the same time it slows down everyone else on the same path.

There is a stream, or an idea, in my research that I at least want to mention, even if it's not going to be a full focus of the talk: sometimes in these situations — internet traffic for sure, and even car traffic to some extent — the situation actually changes. The players in the game repeatedly play the same game, but they are not really facing the same situation; something changes. In the internet case, say something becomes popular and everyone goes to the website to download that content. It's hard for the people running internet traffic to know this ahead of time — they don't — so they're going to have to adjust to the changing traffic. The same thing happens with road traffic: if you're very smart and you know where the football game is happening, and around when the football game tends to end, so traffic will be bad because people are coming out of the game, then maybe you can avoid it; but other people do not, and they get stuck in the bad traffic and have to adjust. So sometimes the situation changes, and I'll come back a little bit to how our theory adjusts to this.

So here is our model of what I think of as a repeated game. There are a bunch of agents, maybe there are n of them, and there are discrete steps: every morning you wake up and maybe drive to work. Each player i takes some action at time t, you do this every day, and then something happens every day — you figure out how many minutes it took to get to work, you might, you know, go a different way, and every day something happens.
But I'm going to assume that you're rational in the sense that you try to minimise the average time it takes to get to work, or to get your packet to the right destination — which again assumes the players are somewhat rational — and that what the players are doing is trying to learn from data. This is maybe more clearly a model of what the routers are trying to do: they're trying to get the packets to the destination, and where is the information coming from? Past data. They know how long the previous packets took, or they know something about previous data, and that is the information they're trying to use.

And what am I going to ask? I guess the talk will consist of two parts. In the beginning I'm going to ask about the research direction that I, and many of us, have been pursuing for a number of years: what can we say about the outcome of these games when the players are learners? I'm going to make the key assumption that every player is trying to learn from past data, and I'm going to ask what I can say about what will happen in the game if everyone is trying to learn.

This led me to think about learning, and to realise that in some ways the online learning I'm going to use is a very well developed theory, but game theory motivates me to think about online learning questions, or about understanding learning behaviour, a little differently than the online learning community does. And I'm going to give you a little taste of some recent work where we ask what natural learning behaviour looks like, or what kinds of algorithms assure us that players are actually learning well online.

The speed at which you learn is, of course, important for online learning — it is a topic studied there — but it's particularly important for game theory, because if I have a system with some transient behaviour, where people stay in the system for a little while and then vanish, and if they're very, very slow at learning, then they were only in the system for a little while before they had learned anything, and I can't say anything about their behaviour. In order to be able to say something, I have to know they stayed in the system long enough to learn, and in particular I have to know how long it takes to learn. So I need some bounds on how fast they learn to be able to say anything meaningful. So that's what I would like to tell you a little bit about: two very related, very intertwined parts.
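To make the repeated-game model concrete, here is a minimal sketch — not from the slides; the two routes, the delay functions and all parameter values are made up for illustration — of the kind of repeated routing game being described:

```python
import random

# Hypothetical repeated routing game: every "day" each of N_PLAYERS agents
# picks one of two routes and experiences a congestion-dependent delay.
N_PLAYERS, N_DAYS = 20, 100

def delay(route, load):
    # Made-up delay functions: route 0 is short but congestible, route 1 is long but wide.
    return load / N_PLAYERS if route == 0 else 0.6

total_time = [0.0] * N_PLAYERS
for day in range(N_DAYS):
    # Placeholder behaviour: each agent picks a route at random.  A "learner"
    # would instead use the delays it experienced on previous days.
    choices = [random.choice([0, 1]) for _ in range(N_PLAYERS)]
    load = {r: choices.count(r) for r in (0, 1)}
    for i, r in enumerate(choices):
        total_time[i] += delay(r, load[r])   # each agent only observes its own delay

# A rational agent, in the sense of the talk, tries to keep
# total_time[i] / N_DAYS small as the days go by.
```

The point of the sketch is only the shape of the interaction: the same game is played every day, each player sees only its own outcome, and the question is what to assume about how the route choices are made.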
So, games. Probably I should start here — this is a computer science audience, I think, for the most part, so telling you that learning is a reasonable model of behaviour in an online setting maybe doesn't need so much defence, but it's still good to say it. Of course, if I go back to what classical game theory always did, the classical thing we all studied, me included, is Nash equilibrium, not learning outcomes. So maybe the natural thing would be to say that I ought to also study Nash equilibria. I guess there are many good reasons not to, but one of them is that Nash equilibrium is hard to find, and there are many variations on what I mean by hard to find. One of them is that in order to find a Nash equilibrium I have to be very well informed about who else is in the system and what on earth they might want to do — an information, or lack-of-information, kind of argument. And of course there is also the famous Daskalakis–Goldberg–Papadimitriou result, a very powerful hardness result, that Nash equilibrium is computationally hard to find, which gives a sort of alternate path to the same conclusion: Nash equilibrium is a problematic model. To get to a Nash equilibrium you simply have to know too much, and even if you knew a lot, it would be computationally hard.

But then I can also offer you very empirical evidence: I can look, and that's not what's happening. Here is my empirical evidence. If I have a repeated game, the simplest model of what it means to be at a Nash equilibrium — and indeed the very first studies of these kinds of games assumed this — is that the players might try stuff, then settle into a Nash equilibrium, and then repeat the very same action every single day, because they have figured out what an equilibrium is. If they could ever figure out such a state, that state should indeed be stable, because Nash equilibrium means exactly the inequality right here: they are doing something which has lower cost than anything else they could try to do. So if they try something else, it's only going to be worse, and the next day they go back to what they were doing. So it's very stable, because it has this property usually referred to as a no-regret inequality: it's Nash because you're not regretting not doing something else, and it's stable once they find this Nash equilibrium. But you look at any data and, sure, life isn't this stable.
I can tell you, whatever data you look at — the data that's easy to show you, which is actually not the main subject of this talk, is auction data. I've been working with this kind of auction data; this one is coming from Bing, Microsoft's advertisement auction data. It's just not this stable. That's not what's happening. So there is a complexity problem with Nash equilibrium, there's an information-theoretic problem — there's not enough information — and then there's an empirical problem: it's just not happening.

Now, one way to look at what the Microsoft advertisers are doing — and these are actual advertisers that change their bid very, very frequently, so frequently that I'm virtually certain they're not humans; they change just too often. This has to be an algorithm doing it, not a human, I'm guessing. With these kinds of changes, I think what best models this behaviour is some sort of online optimisation algorithm. This is what online optimisation does: some form of gradient descent, and, just eyeballing it, this totally looks to me like a gradient descent algorithm. So that's more evidence that maybe learning is a better model of what's really happening in real life.

So here is what I'm going to do: I'm going to assume, as a behavioural model, that people are learning. And what does it mean that they're learning? When someone first shows up in the system — say the beginning of this picture is day one for some user — maybe he's clueless, he's trying out stuff, he's doing wrong things, but as time goes on he gets better and maybe does something reasonable. And then I want to use an assumption that is very, very similar to Nash equilibrium, except it does not use the stability; that is, the players may do something different every day. No regret would mean that, for every fixed action, whatever they did has lower cost than having played that fixed action throughout, with hindsight. Okay, that is the no-regret condition. What I really mean, of course, is not literally no regret, because there was the initial experimentation where you could have easily made a few mistakes; I mean small regret. It's usually referred to as no regret as long as this error doesn't grow linearly with time. For example, having normalised the costs to be between zero and one, the biggest error you can accumulate by time T is T itself. If regret grows roughly as the square root of T, or something not quite linear, then it's called no regret, meaning your regret per step vanishes as time goes on.
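For concreteness, here is roughly what the two conditions just contrasted look like, in my notation rather than the slides'. A Nash equilibrium of the one-shot game satisfies, for every player i and every alternative action x,

\[ \mathrm{cost}_i(a) \;\le\; \mathrm{cost}_i(x, a_{-i}), \]

while the no-regret condition for a learner playing actions \(a^1,\dots,a^T\) asks that, for every fixed action x,

\[ \sum_{t=1}^{T} \mathrm{cost}_i(a^t) \;\le\; \sum_{t=1}^{T} \mathrm{cost}_i(x, a^t_{-i}) \;+\; R(T), \qquad R(T) = o(T), \]

for instance \(R(T) = O(\sqrt{T \log N})\) with N actions and costs normalised to [0,1].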
This is a pretty standard sort of no-regret bound. So I'm actually going to spend at least one slide defending the no-regret assumption, but maybe I should admit that I like using a slightly different no-regret assumption, which I think — at least as the data seems to show — matches human behaviour much better than the exact no-regret assumption. It's an approximate version. The usual no-regret condition says that the cost that you experience, for any player i, is at most the cost of any fixed action with hindsight, plus some sort of additive error. In a somewhat recent paper from last year at NIPS, we tried to argue that a better notion for game theory is this thing called approximate regret. What I'm doing is allowing, instead of a purely additive error, something that algorithms people often like — especially since I, for example, came from approximation algorithms as an area — an approximation-style, multiplicative error, and I'm mixing the two. I'm allowing a one-plus-epsilon multiplicative error, saying that most humans, and even these systems, are insensitive to some low level of noise — looking at empirical data, something like five or ten percent. They're not very sensitive; they're not optimising the last digit. So this would be modelled by an epsilon of a tenth or a twentieth, something like that. And the benefit, which is going to really pay off for us, is that if you are shooting for this sort of approximate regret, first, it models data better, so it seems like a better model of human behaviour, and second, as you might notice, the error went way down: instead of an error that grows with time, I have an error that doesn't grow with time at all. This quality of learning is achievable much, much faster than the exact no-regret quality of learning that the classical analysis shoots for.

If you have studied online learning, you might recognise this bound; what's new here is really its use in game theory rather than the bound itself, and I'll come back to this in the second half of the talk. So my point is that this sort of approximate regret is actually achieved by the classical algorithms many of you might know — multiplicative weights, Hedge, regret matching; almost all of the algorithms you might be aware of have this bound. So there is nothing new here; it is just, maybe, more suited for game theory.
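Written out, the approximate-regret condition being described looks roughly like this (my notation; the constants are only indicative):

\[ \sum_{t=1}^{T} \mathrm{cost}_i(a^t) \;\le\; (1+\epsilon)\,\min_{x}\sum_{t=1}^{T} \mathrm{cost}_i(x, a^t_{-i}) \;+\; O\!\left(\frac{\log N}{\epsilon}\right), \]

with the additive term now independent of T. For example, running multiplicative weights with a step size on the order of \(\epsilon\) is the standard way to get a bound of this shape for costs in [0,1].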
So why do I say it's more suited for game theory? Two points. One, I think it matches human behaviour better than the exact no-regret algorithms. But here is the other one. I like game theory; I got interested in learning because I wanted to study what I said in the title, the quality of outcomes in games. So what do I mean by the quality of outcomes in games? The classical notion, due to Elias Koutsoupias and Christos Papadimitriou, is the price of anarchy: you take the cost of an equilibrium divided by some sort of magical optimal cost. Elias, like all of us, was studying Nash equilibria, not learning outcomes, but this is the price of anarchy — the cost of a Nash equilibrium divided by the optimal cost.

Now, since I want to study learning outcomes, I want to change this to a related concept, sometimes called the price of total anarchy, though I kind of like calling both of them the price of anarchy. The players are no longer at a Nash equilibrium; they are learning. So what does it mean that they're learning? I make the assumption that everyone is a learner: they are all running some sort of learning algorithm with this no-regret property. And what I'm measuring is the average cost over time divided by the optimum. The expression I wrote assumes a stable population, so there is a fixed population and the optimum is fixed; there are T iterations, so the denominator is T times the optimum. Or, if I want to be more general — and this is a topic I mentioned briefly at the beginning but, as I said, will barely touch on — I can consider a dynamic population, where the population changes over time, and then I have to take an even more generous average over these learners, on the top and on the bottom, and the optimum of course now also changes, because the population changes. This is the natural analogue of the price of anarchy, and this is the kind of thing I'm interested in: I want to prove that if everyone is a learner, then in many games these ratios stay very small.

Now, as you may or may not know — and I hope you have seen some of this — there are beautiful results out there that prove price of anarchy bounds, and they often come with a framework championed by Tim Roughgarden, which is in some ways a very natural, simple framework. Here is, roughly speaking, what it does. It says that if the game has a natural property that he called smoothness, then you can prove good price of anarchy bounds. Now, what does the smoothness property say?
I wrote the inequality technically down on the slide, and I'll show you in a second that it actually gives you something, but let me say it in English first. Imagine any solution, any kind of outcome — that is, a vector of actions, one per player — that has horrendously high cost, say 100 times above the optimum. Then what smoothness says is that many of these players have a good alternative action that improves their situation, and that is what the inequality expresses. If the cost of the current solution is incredibly high, then, because mu is a multiplier less than one, the right-hand side is not quite that high, so on average one of the players has a good alternative option to take. So it's very natural that this condition should imply a price of anarchy bound: if the cost is very high, it can't be stable, because someone wants to improve. And indeed the proof is just as easy. If you are at a Nash equilibrium, then no one regrets not playing this alternative action — whatever they were doing satisfies the Nash inequality — and all you do is rearrange the inequality and you get that the cost is bounded. So this is, as you see, a very straightforward proof.

What's interesting about it is that almost all — with very, very few exceptions, all — the price of anarchy bounds follow this framework. For example, the original paper I had with Tim that was mentioned in the introduction, about the price of anarchy of traffic routing, fits this framework, and the extension to atomic, or larger, traffic routing also fits this framework, and many, many others; I guess these are the two most relevant here. They all follow this very simple framework. One thing to keep in mind is the kind of numbers we are getting: we are proving price of anarchy bounds of a factor of two, a factor of four — or, if you want to be generous, a factor of two and a half, or even worse. When I'm interested in these kinds of bounds, I don't much care whether it's 2.5 or 2.6; that's noise. Either you're not at all interested in this, because these constant factors are not relevant to you, or, if you are and I give you one more digit, you'll say, sure, whatever. If this is interesting at all, then the last digit doesn't matter any more, or else it's not interesting. So this is the attitude I want to take to learning as well: I'm not interested in very precise optimisation, because this is what I'm shooting for, and here you either are not interested in it at all, or you're not interested in the last digit of precision.
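As a sketch of the argument just described, in my notation (this is the standard form of Roughgarden's smoothness definition): a cost-minimisation game is \((\lambda,\mu)\)-smooth if for every outcome a and every outcome \(a^*\),

\[ \sum_i \mathrm{cost}_i(a^*_i, a_{-i}) \;\le\; \lambda\,\mathrm{cost}(a^*) + \mu\,\mathrm{cost}(a). \]

At a Nash equilibrium a, each player i has \(\mathrm{cost}_i(a) \le \mathrm{cost}_i(a^*_i, a_{-i})\); summing over players and rearranging gives

\[ \mathrm{cost}(a) \;\le\; \frac{\lambda}{1-\mu}\,\mathrm{cost}(a^*). \]

Applying the same manipulation day by day to a no-regret sequence of play gives the same ratio up to the regret term, and with (1+\(\epsilon\))-approximate regret it becomes roughly \((1+\epsilon)\lambda/(1-\mu')\) for a \(\mu'\) close to \(\mu\) — which is the extension discussed next.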
So that is the attitude I'm going to take about precision. This framework has the beautiful advantage that it also works very nicely for learning outcomes, almost identically — you don't have to change anything. I never used stability in the proof back there; what I used instead is the no-regret condition, which is exactly what a learning outcome gives you. If you want to adjust the proof to learning, it says that if you learned, then you don't regret very much not having done something else — you have a bit of regret, or at least not a horrible amount. If the game is smooth, you can still finish with the same inequality and you get the same sort of bound. One important thing to note is that the regret error, of course, enters the bound. If you have a large regret error, then you're actually not very close to the optimum; if you have no regret error at all, the proof carries over exactly; with a large regret error, you lose something. So the regret error really matters — this is maybe the point where the speed-of-convergence issue comes back in. Remember, I wanted to use approximate regret, so I want to put in this epsilon as an extra term. If I do this and run the same machinery — I didn't put the algebra on the slide, but I hope you believe me, it's the exact same manipulation — here is how I lose the epsilon: it's a multiplier. We used to have lambda over one minus mu as our bound, and it's now one plus epsilon times lambda, over one minus something a little bit different, but not by much. If I keep epsilon small enough, this is noise, a very small change, but with the benefit that I can push the additive error down very significantly.

Okay. So this is my basic introduction of what I wanted to tell you about games. Before I go on and talk about learning, there is a little segment I want to add. I'm sure it's on many of your minds, even though no one is speaking up: how good a model is this learning? Are people actually learning? So I want to show you two slides asking whether this is a good model — I promised a little bit of discussion, and I want to show you some data. Again, the main assumption here is that everyone is learning. I'm OK if almost everyone is learning, of course; that's almost as good. But are people learning at all? Is anyone learning?
There are a couple of things I can offer. Learning is utterly doable. It's doable information-wise: there's no need to know who else is in the game; you're learning from data, the data is available, you can learn — it's doable. And what's better, if your opponents are playing badly, you learn even better outcomes: you take advantage of badly playing opponents. That's what learning does. It feels to me like people should be learning. That is all I'm asking of them: if there's a single strategy that's consistently really good, please catch on to it. I don't know whether Oxford people commute far enough for this to make sense, but if you take, for example, California, where people drive long distances: if there is a highway on which, in the morning, there never is congestion, that's all I'm asking you to notice — please notice it and drive on it. It's a pretty reasonable assumption, and there are many, many algorithms that are actually capable of implementing this. And the ideas behind these algorithms are very natural: try stuff; if something was very bad in the past, only try it with very low probability, since it's probably a bad idea; usually do the things that worked, with the natural probabilities.

Nonetheless, if I look at data, it's so-so. So again, back to the Microsoft data; this paper is with Denis Nekipelov and Vasilis Syrgkanis, from EC'15, where we again looked at the Microsoft ad auction data. The ad data has the problem that, unlike traffic, I don't actually know what each bidder's value is, because that's not in the data — only the behaviour is in the data. What I have to do, and that's the content of this paper, is try to guess the value given the behaviour. We propose that the best guess we can make is the value for which the bidder's regret would be as small as possible. So when I show you a slide of how much regret they have, it is a very optimistic estimate: if I guessed the value wrong, they have bigger regret than what I'm showing; this is the smallest regret they could possibly have. It says that about 30% of the participants have zero or negative regret, so there is a value for which they're doing better than any single fixed bid with hindsight — maybe they're varying the bid in a clever way and doing better than any fixed bid could have done. Another roughly 20% of the participants have less than 10% regret.
The remaining half of the participants have 10% or more regret; in fact, the ones up at the top have an incredible amount of regret — they are doing something really wrong. And again, I can't guarantee that I have their values right, especially not for those participants, but I can guarantee they have at least this much regret; I'm giving you an optimistic estimate of their regret. So this is an assumption that may be interesting mathematically, but I think there's a lot of research to be done in the direction of better understanding what's really going on here. And I guess, when humans are making the decisions — I've talked to some of you about ideas I have here — actually acquiring and understanding the data is a cost that's not in these models, the human work involved in figuring out the right thing to do.

Anyway, the other models one could use — maybe I'm going to skip this, so that I have enough time for the second half I was hoping to talk about. So, what I've tried to tell you so far — and I admit I skipped one important segment — is that this learning is a behavioural model. It's mathematically very nice: it has exactly the Nash-like inequality without the Nash stability assumption; it matches data better than Nash equilibrium, though I admit not at all perfectly; and there are very nice bounds, which I certainly didn't show you beyond the framework — sometimes you can prove good bounds saying that if everyone, or most people, satisfies this no-regret learning property, then the average delays in a network are pretty good, within some small constant times the optimum. And it can be supported by a broad set of learning algorithms, so maybe it works as a behavioural assumption. So hopefully I convinced you that, even though it's interesting to study other behavioural assumptions, this learning assumption is an interesting and good one. But I would like to spend the second half of the talk on recent work actually trying to understand an aspect of learning that I've hidden under the rug until now; I want to pull it out from under the rug and tell you that it is an important issue and I should talk about it. This is upcoming joint work with Karthik Sridharan and Thodoris Lykouris.

Feedback. When I think of learning algorithms — and most of the online learning literature, especially the early literature, was this way too — I made the very unnatural and uncalled-for assumption that whenever you're choosing a path to travel, someone god-like gives you feedback on all the other paths.
That is, an alter ego of yours also went on the other paths: you took an hour to drive to work, but you're also aware that on this other path it would have taken 35 minutes, and on a third path it would have taken 10 minutes. That's called full-information feedback, and in fact much of the results I had on my slides were all full-information feedback: you choose a strategy, you experience the outcome of that strategy, but I feed you full information on everything else you could have done. Now, I'm not aware of many full-information feedback setups. For example, the traffic packets out there certainly don't know what happens on other paths that they're not trying; that's not how it works. When you're driving, maybe you get some information listening to news coverage of traffic delays, but it's not as good as the information about the road you actually drove on. So you're getting some partial information, not full information. So one thing that would be important is to handle partial-information feedback models: if I don't give you full information, can you recover some of the same things?

For this segment of the talk, I'm going to hide the other people's behaviour in the notation: instead of writing the cost of your action x as a function of everyone's actions, it will be just the cost at time t of x. Those things get hidden: I drop the player index i because I'm focusing on one person's learning, and I hide the other people's behaviour by simply allowing the cost to vary with time — it varies with time because of the other players' behaviour. This is a more compact notation. And, like a learning theorist, I would like the costs normalised to be between zero and one, so you take the maximum they can be and normalise by it, and I'll assume you did this.

There are two parts to this. I'm going to start by talking about the classical not-full-information model, which is certainly simple: you only get information about the action you selected. But I also want to mention a more technically difficult extension of what we're doing, to a very beautiful partial-information feedback model that I really like. Here is what it looks like. Every action is a node; so, for example, if you choose action x, that's the one in red. Some actions are connected, by visibility: if you choose an action x, then you also get feedback on its neighbours. So there is a graph on the action space, and you get partial information, which means you get information on some things and not on others.
So in this case you get information on the neighbours of the node you chose. It's a very clean and, I think, elegant partial-information feedback model that we are studying. This is the current state of this line of research, and there are many, many good questions to ask — I'm sure you'll do a good job asking me about other extensions of partial information, not exactly your neighbours but other things; there are many interesting questions about whether what we're doing extends to other forms of partial-information feedback.

So let me first tell you what's known. Again, I know some of you know a lot about online learning, and the first slide or two here might be well-known territory, but let me just remind you in case you don't remember. The fact that you don't get full-information feedback is not news to the learning community; here is the standard algorithm. Remember, you choose an action and you find out its cost, but there are all these other actions and you have no idea what their costs are — you got no feedback. Here is the proposed way of dealing with it. It's called importance sampling, and it's a very, very simple idea. If you haven't seen it — this is not at all my result — this should be your take-home message, because it's so simple. What you should do is pretend that everything you haven't seen is as good as it can be: zero cost, it was ideal, you could have gotten there in no time. But for everything you did see, you scale the cost up by dividing by the probability that you chose that action. So I have a probability distribution over my choices: this one maybe I choose with probability 10%, this one with probability 1%. If I don't see it, I pretend it has cost zero; if I do see it, I scale the cost up by the probability.

Now, what's good about this? What's good is that if I take any fixed action, its estimated cost is precisely right in expectation: with probability p it has this ginormous scaled-up cost, contributing exactly the true cost to the expectation, and with the remaining probability it is zero and contributes nothing. The expected estimated cost is precisely right. Yes, I increased the variance by an awful lot, but the expectation is exactly right. Similarly, if I now imagine that I use these imagined costs — not my real costs but the scaled-up costs — and I run a full-information learning algorithm, treating these estimated costs as if they were the real costs —
then the cost of this full-information algorithm will also be exactly right in expectation. This is a slightly more complicated computation: if I use the probability distribution p to choose among actions with these estimated costs, then, using linearity of expectation, you get that the expected cost is exactly right. Again, I increase the variance by a lot, but the expectation is exactly right. So what I could do is run my full-information algorithm on these estimates, and that's what the proposed method is: just run your full-information algorithm on the estimates. Well, almost.

There is a slight problem here, which I've referred to a couple of times: we do increase the variance. Remember that we liked normalising the costs to be between zero and one. We would have to re-normalise, because dividing by the probability increases the numbers. But what should I re-normalise to? The probabilities could be as low as zero, and if I divide by something close to zero, the number gets pretty damn big. So at the moment I'd have to normalise them to be between zero and infinity, which isn't going to work for me. So here is what I view as a dirty trick of the trade — maybe not dirty, a trick of the trade that learning theory has used for years and years — and I want to say: no, no, no, don't do that. Here is what they do: just put in a little bit of random noise. Don't let the probabilities go down to zero, because then you are scaling things up towards infinity, and that's really painful. Instead, put in a little bit of random noise — say, don't let any probability go below gamma, where gamma is the exploration rate — and for the rest you run the algorithm as before.

This works okay from the classical regret perspective. But it doesn't work for my goal, because with that small probability gamma of random noise, you're choosing horrible paths. Why would you do that? Again, I don't know the right road numbers here — I know the California ones. If you know that a road like the 101 is horrendous, and it's horrendous every day, then even with probability gamma, getting on it seems stupid. It guarantees that you have that much regret, because with probability gamma you're choosing a horrendously bad path. Don't do that. It works for the regret bound, it works for the theorem, but it's obviously a bad idea. People are probably not doing this, and they shouldn't be doing this. It looks like an altogether bad idea.
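As a rough sketch of the classical recipe just described — importance-weighted cost estimates fed into a full-information rule, with a little uniform exploration mixed in so that no probability drops below the noise level — something like the following, where multiplicative weights stands in as the full-information rule and the parameter values and the cost oracle are placeholders:

```python
import math
import random

def bandit_with_exploration(n_actions, n_rounds, cost, eta=0.1, gamma=0.01):
    """EXP3-style sketch: estimate unseen costs as 0, scale the seen cost up by
    1/probability, and mix in uniform exploration.  Assumes gamma <= 1/n_actions."""
    weights = [1.0] * n_actions
    for t in range(n_rounds):
        total = sum(weights)
        # Mixing in exploration keeps every probability at least gamma.
        probs = [(1 - gamma * n_actions) * w / total + gamma for w in weights]
        a = random.choices(range(n_actions), weights=probs)[0]
        c = cost(t, a)                      # only the chosen action's cost is observed
        est = [0.0] * n_actions             # pretend unseen actions were ideal (cost 0)
        est[a] = c / probs[a]               # importance-weighted estimate for the seen action
        weights = [w * math.exp(-eta * e) for w, e in zip(weights, est)]
    return weights
```

The exploration mixing on the `probs` line is exactly the injected noise being objected to above: it forces you, with probability roughly gamma per action, onto options you may already know are terrible.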
So one question one should ask is: is there a better model of learning that avoids this trick? Can I still use importance sampling but avoid the trick — not put in the random noise? And what I'm proposing instead is, I think, an equally simple and much more realistic alternative. Here is the very simple idea: if something goes below probability gamma, whatever the parameter is, just don't do it. I don't like small probabilities, because I don't like dividing by that number — so just don't do it. That's called freezing: I temporarily freeze those options. I discovered that driving on that horrendous California road is really bad — so just don't do it. Don't do it with probability one in a hundred, don't do it with probability one in a thousand: just don't do it.

So it's called freezing, and here is what I technically mean: I'm going to change the probability to zero if it went below a cut-off. I'm slightly cheating, and the cheating is the tilde sign over there, because once I do this the probabilities don't sum to one any more, so I have to scale them up a bit. That's the tilde — and not by much: this is where my one plus epsilon comes in. I'm giving myself a one-plus-epsilon error, so scaling the probabilities up by a little bit won't hurt me. So I do this; that's called freezing. And here is what I would like to do with this freezing. What does it mean? Let's go back to what we studied, the importance sampling. The first thing — which I said in English but forgot to show you — is that I want to make sure the total probability that I froze is not too much, so that when I scale up, I don't have to scale up by a lot. Now, if you look at any fixed action, what am I doing? I used to have that its expected estimated cost is exactly right, and I don't have that any more, because of this artificial freezing: maybe it has a very high cost and I artificially never try it, and in those periods the estimate is zero for sure, even though it should have been something big in expectation. But this estimate of the cost, which is no longer exactly right, is what's called negatively biased: it's guaranteed to be at most what it should be, because sometimes I zero it out when I shouldn't. So the cost I estimate is lower than the actual cost. That's good news, because if I do well compared to this estimate, I surely did well compared to the real thing.
So that's benefit number one so far. Second — and maybe this is the important part — the algorithm itself is not suffering: the expected cost of the algorithm is exactly right, modulo that little tilde over there, which is again the scaling up by one plus epsilon, because I'm not playing the arms on which I'm cheating; I'm only playing the real, unfrozen arms. So I don't have an estimation problem there: the arms I play have the right probability and the right scaling — there's nothing wrong there. So the expected cost of the algorithm I have exactly right, modulo one plus epsilon, whereas my estimates are negatively biased — too low. So if the algorithm does well against the estimates, I'm all set. So again, what I want is to run the full-information algorithm; I now know that no estimated cost will go above one over gamma, because I zeroed out anything below probability gamma. So I solved the variance problem — I limited my variance — and I can run the algorithm. It's a very, very simple idea, and indeed it gives these bounds.

You get bounds that I'm maybe going to show you in a minute, but I want you to take away the technique, or the idea, more than the actual arithmetic of what the bounds come out to. So all I propose is: take the classical method, but instead of injecting noise, take the opposite action — stop playing the option rather than forcing yourself to play it. Use the exact same cut-off as they were using for the noise: anything that goes below the noise level, I just don't play. That causes an error in my estimate, but the error goes in the right direction — the estimate is a lower bound, not an upper bound. Good enough. A very simple idea, with many, many benefits. One is that I even like it as a behavioural assumption: if something is so unlikely to be good, people tend not to do it, so I think this improves the behavioural match of the assumption. It's a black-box reduction: I didn't tell you what the underlying algorithm was; you can take your favourite learning algorithm — of which there are many — and do this to it, and it becomes a learning algorithm in the not-full-information setup. And it recovers the small-loss-style bounds, with no dependence on time in the error term, because the full-information version had that property and I'm not doing anything that loses information.
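A minimal sketch of how this black-box freezing reduction might look — again with multiplicative weights as the underlying full-information rule, and with the cut-off and rescaling choices only illustrative, so the actual scheme in the paper may differ:

```python
import math
import random

def bandit_with_freezing(n_actions, n_rounds, cost, eta=0.1, epsilon=0.1):
    """Freezing sketch: instead of forcing exploration, temporarily stop playing
    any action whose probability falls below the cut-off, and rescale the rest."""
    gamma = epsilon / n_actions              # cut-off; total frozen mass stays below epsilon
    weights = [1.0] * n_actions
    for t in range(n_rounds):
        total = sum(weights)
        probs = [w / total for w in weights]
        live = [p if p >= gamma else 0.0 for p in probs]   # freeze low-probability actions
        live_mass = sum(live)                              # at least 1 - epsilon of the mass survives
        play = [p / live_mass for p in live]               # the (1 + epsilon)-style rescaling
        a = random.choices(range(n_actions), weights=play)[0]
        c = cost(t, a)                                      # only the played action is observed
        est = [0.0] * n_actions                             # frozen/unseen actions: estimate 0 (a negative bias)
        est[a] = c / play[a]                                # importance weight, now at most n_actions / epsilon
        weights = [w * math.exp(-eta * e) for w, e in zip(weights, est)]
    return weights
```

The only change from the earlier sketch is that the exploration mixing is replaced by the freeze-and-rescale step, which is what bounds the importance weights without ever forcing play of a known-bad option.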
And the last benefit — which would have been more relevant for the slides I skipped, but is maybe worth mentioning — is about having to change your behaviour: if, in the middle of driving for many, many months, you discover that something changed, and a road that used to be really bad is now actually pretty good, your algorithm will recover, because it never drives the underlying probability below gamma — an option can get frozen at the cut-off, but it never sinks lower than that. So it's actually good, in a very human sense — learning is good, but not all learning algorithms are good at this — that it adjusts to a changing situation. This relates to the part of the slides I skipped, so I won't spend much time on it.

Instead, the last thing I want to tell you is that it also works for partial information — the more sophisticated version of feedback. Let's go back to the partial-feedback graph that I showed you a little bit before. Remember, the model here is that you not only get feedback on what you did, but you also get feedback on some of the other actions — a subset of the others, namely the neighbours. Now, just to get you calibrated on the simple cases: if the graph is complete, so every edge is in, that's full-information feedback — every time you do anything you see all the neighbours, and that's all the nodes. If the graph is empty, there are no edges; that's what's usually called bandit, or no-information, feedback: you only get feedback on the action you chose, no neighbours. And this model offers you something in between, with the other kinds of graphs.

What we are going to prove — and I'm going to give you a very, very short hint that this freezing technique is good for this too — is that I can replace the number of options in the bounds by the size of the maximum independent set of this graph. And that, I think, is the relevant parameter here. In the full-information feedback case the independent set has size one, and indeed you see everything all the time. If there are no edges, then you have bandit feedback, and indeed the right parameter is the number of options: you kind of have to try them all; there's no way to get the information without trying them. But in general the independent set is a very natural parameter, a very natural bound, and in fact it's easy to see that lower bounds of this form hold, because if there is an independent set where the good things are happening, those you kind of have to try —
466 00:47:49,310 --> 00:47:57,110 They don't give feedback about each other. But it offers you a smooth trade-off between the full information and the partial information. 467 00:47:58,050 --> 00:47:59,520 So what happens in a graph? 468 00:48:02,620 --> 00:48:10,480 Again, I have to quickly remind you that the important thing, this importance sampling, is that every time you see something, 469 00:48:11,170 --> 00:48:17,380 you should update the cost, dividing by the probability that you saw it, to correct for the fact that you don't always see it. 470 00:48:18,170 --> 00:48:21,590 So what matters here is not how often you play the arm. 471 00:48:21,590 --> 00:48:28,990 What matters is how often you see it. That's the divisor you have to correct with: the probability that I saw the arm. 472 00:48:29,260 --> 00:48:32,469 So I'm going to care about how often I see the arm. 473 00:48:32,470 --> 00:48:35,470 If I see it because of a neighbour, that's fine with me. Doesn't matter. 474 00:48:36,950 --> 00:48:38,360 Oh, sorry. Wrong direction. 475 00:48:39,770 --> 00:48:50,660 So what I want to do in my estimated cost is, when I see a node, I update the estimated cost as the cost I see divided by the probability that I saw the item. 476 00:48:51,380 --> 00:48:56,750 So in this estimation, I would like to freeze items that have low probability of being seen. 477 00:48:57,850 --> 00:49:01,000 Very natural. Instead of low probability of being played. 478 00:49:01,900 --> 00:49:03,990 So I'm hoping that I can keep it, 479 00:49:04,420 --> 00:49:11,590 keep a cut-off for low probability of being seen, higher than one over the number of nodes, or epsilon over the number of nodes. 480 00:49:12,580 --> 00:49:26,460 Oh, sorry. I guess this is just saying that if I am able to do this, I can recover all the learning bounds that we had before. 481 00:49:26,480 --> 00:49:33,500 All I have to make sure is that if I want to freeze all of the items that are not seen with sufficiently high probability, 482 00:49:34,340 --> 00:49:39,330 they're not going to, you know, wipe out the whole graph. So how large can I keep this cut-off? 483 00:49:40,390 --> 00:49:45,760 So I claim that I can use a bound that's epsilon over the size of the independent set, 484 00:49:45,970 --> 00:49:51,040 and the total probability I freeze is actually a constant times epsilon. 485 00:49:52,860 --> 00:49:57,390 So it comes in two parts, and maybe I'll give you a hint of what the parts are. 486 00:49:57,840 --> 00:49:59,460 So the first thing I'm going to do: 487 00:49:59,550 --> 00:50:05,730 I look at my network and look at all the nodes that are not seen with sufficiently big probability; say, these are the nodes here. 488 00:50:06,210 --> 00:50:12,150 And I want to argue that the amount of probability I froze is less than epsilon, 489 00:50:12,450 --> 00:50:20,370 if I use this epsilon over the independence number as my cut-off. This is a simple and maybe nice argument. 490 00:50:20,550 --> 00:50:24,060 Say the black nodes here are an independent set. 491 00:50:24,300 --> 00:50:30,330 If I take a maximal independent set, then of course every node is visible from one of the independent set nodes. 492 00:50:30,330 --> 00:50:37,800 That's what's maximal about the independent set. So every node is connected to one of them and therefore it's seen from one of them. 493 00:50:38,010 --> 00:50:44,399 And therefore the probability associated with seeing the black nodes covers everything.
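Here is a rough sketch, under the same illustrative names as before, of one round of freezing by seen-probability: the estimator divides by the probability that an arm was observed (played, or a neighbour played), and arms whose seen-probability falls below a cutoff of roughly epsilon over the independence number get frozen. The cutoff choice and the covering argument are from the talk; the code itself is only an assumed realisation of the idea.

```python
import numpy as np

def seen_probability(p, neighbors):
    """q[i]: probability that arm i's cost is observed this round, i.e. the
    chance of playing i itself or any of its neighbours."""
    return np.array([p[i] + sum(p[j] for j in neighbors[i]) for i in range(len(p))])

def freeze_low_seen(p, neighbors, cutoff):
    """One round of freezing by seen-probability (sketch).

    Arms observed with probability below `cutoff` (roughly epsilon over the
    independence number) get playing probability zero; the rest is
    renormalised.  Covering argument: a maximal independent set inside the
    frozen set sees every frozen arm, so the frozen mass is at most
    (independence number) * cutoff.  Estimates for seen arms are then the
    observed cost divided by q[i].  Freezing can push other arms'
    seen-probability below the cutoff, which is why the talk iterates with
    a decreasing cutoff (discussed next).
    """
    q = seen_probability(p, neighbors)
    frozen = q < cutoff
    p = np.where(frozen, 0.0, p)
    assert p.sum() > 0, "cutoff too aggressive: everything was frozen"
    return p / p.sum(), frozen
```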
494 00:50:44,400 --> 00:50:49,020 And that's the independence number times whatever the cut-off was, the gamma, as I said. 495 00:50:51,050 --> 00:50:56,360 This is nice, but unfortunately there is a technical difficulty, which I'm going to 496 00:50:58,100 --> 00:51:04,930 admit. Unfortunately, once I freeze these nodes, some other nodes will become not seen with high enough probability. 497 00:51:04,940 --> 00:51:06,170 So here is the picture. 498 00:51:06,320 --> 00:51:12,470 I find this set of nodes, which is the bubble here, and its total probability is less than epsilon, as we just agreed. 499 00:51:12,680 --> 00:51:17,780 And unfortunately, there are some new nodes that are not seen enough, because their neighbours got frozen. 500 00:51:18,050 --> 00:51:21,950 So I have to do this iteratively, and as best as I understand, 501 00:51:21,950 --> 00:51:26,450 if I do this iteratively with the same cut-off, I might unfortunately wipe out the whole graph. 502 00:51:26,750 --> 00:51:35,930 But if I do it with a decreased cut-off, and I'm proposing a constant fraction of the previous probability, then I can try to convince you that it will stop. 503 00:51:36,200 --> 00:51:40,010 It will stop, not doing anything worse than doubling my frozen probability. 504 00:51:41,020 --> 00:51:45,550 And here is a very sketchy one-line argument for this. 505 00:51:45,940 --> 00:51:50,950 So look at the next step here: this node here, or this other node, that gets frozen. 506 00:51:53,140 --> 00:52:00,580 These nodes have the property that a second ago they were seen with more than gamma probability, 507 00:52:00,580 --> 00:52:05,250 and now they're seen with less than a third of that gamma. Oh, 508 00:52:05,260 --> 00:52:13,980 how did that happen? That means two thirds of gamma is going towards the previously frozen nodes and only one third is staying. 509 00:52:14,580 --> 00:52:18,240 So a large majority of where it is seen from just got frozen. 510 00:52:18,540 --> 00:52:22,860 This is sort of like a tree structure, a binary tree structure. 511 00:52:23,010 --> 00:52:30,720 Twice as much goes down as up. And as you all know, binary trees have more nodes at the leaves than in the interior. 512 00:52:30,750 --> 00:52:37,200 That is my argument, and I slightly cheated on the details, but that's basically the argument of why this is happening. 513 00:52:38,850 --> 00:52:41,640 So at most, I lose another fraction of the same size. 514 00:52:42,120 --> 00:52:50,550 So I'm going to stop here and maybe give you a sort of high-level picture of what I'm hoping you take away from the talk. 515 00:52:50,910 --> 00:52:59,610 In some way, I did two different things that I think are supporting each other and related to each other, but a little bit different. 516 00:53:00,090 --> 00:53:08,430 So first of all, I hope I convinced you that learning is a reasonable model of behaviour in games, even though it's not 100% matching data. 517 00:53:08,760 --> 00:53:19,260 And probably, you know, having just come after the previous talk, where the main message was that human behaviour is so irrational, 518 00:53:19,590 --> 00:53:24,510 I'm not claiming that this is the right rational model of human behaviour. 519 00:53:24,510 --> 00:53:30,450 It's not perfect. And I even showed you data where half of the participants don't match this. 520 00:53:30,750 --> 00:53:34,110 So one can do more in studying what's a good behaviour model. 521 00:53:34,500 --> 00:53:38,700 But it has many advantages.
It allows the players to adapt to opponents. 522 00:53:38,700 --> 00:53:44,549 They don't need a common prior. They take advantage of opponents playing badly. And very, 523 00:53:44,550 --> 00:53:51,720 very simple strategies, pretty natural strategies, guarantee this. It's certainly a better match, 524 00:53:51,960 --> 00:53:57,300 better than the Nash equilibrium assumption. And one advantage that's not on the slide but probably should have been: 525 00:53:57,570 --> 00:54:06,000 it's beautifully usable in mathematical proofs. It's such a nice condition, you know, an inequality almost like the Nash condition; 526 00:54:06,210 --> 00:54:11,100 it makes the error very explicit and it's very nicely usable in mathematical proofs. 527 00:54:12,170 --> 00:54:15,980 But there's the second part I wanted to also talk about. 528 00:54:16,490 --> 00:54:22,080 It even works in a dynamic environment. Oops. 529 00:54:22,080 --> 00:54:26,580 No, that's not the second part. Okay, the second part, I still need to talk about that, 530 00:54:26,970 --> 00:54:33,480 which is the blue line here: that when I started to do this, 531 00:54:33,480 --> 00:54:40,740 I went to find my colleagues who do more research on online algorithms and tried to ask them: okay, give me algorithms that do this. 532 00:54:41,100 --> 00:54:47,610 But at some point I realised that my game theory perspective is a little bit different than the classical perspective, 533 00:54:48,090 --> 00:54:55,680 and I think there is a lot more that can be done to understand how this all works, or whether it works, with partial information feedback. 534 00:54:56,160 --> 00:55:02,280 So for partial information feedback, there's a very nice and very developed theory of online learning. 535 00:55:02,610 --> 00:55:06,900 But the attitude the game theorists take to this is different: 536 00:55:07,170 --> 00:55:13,290 the form of partial information that I think is natural might not be the same as what they think is natural. 537 00:55:13,620 --> 00:55:23,069 And my interest is not in additive error, but in this sort of mix of multiplicative and additive error in the optimisation. 538 00:55:23,070 --> 00:55:29,370 Humans don't mind a little bit of error here and there, but the speed of learning really, 539 00:55:29,370 --> 00:55:41,639 really matters, both in how fast-changing an environment can be tolerated and also in, you know, how fast they achieve reasonable social welfare. 540 00:55:41,640 --> 00:55:50,760 So the price of anarchy kind of bounds. And for some reason these so-called approximate no-regret conditions, or 541 00:55:50,890 --> 00:55:57,330 algorithms, are much, much less developed in the partial information setting than I wish they were. 542 00:55:57,330 --> 00:56:01,020 And I tried to contribute myself to improving the situation. 543 00:56:01,020 --> 00:56:03,270 So I'm going to stop here and thank you very much.
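For reference, the multiplicative-plus-additive error notion referred to in the closing remarks (approximate no regret) can be written as below. The additive term shown is the standard one for full-information multiplicative-weights-style algorithms, not a bound stated in this talk; the point being made above is that it does not grow with the horizon T.

```latex
% Approximate no-regret: a (1 + epsilon) multiplicative factor against the
% best fixed action, plus an additive error independent of the horizon T.
\sum_{t=1}^{T} c_t(a_t)
  \;\le\; (1+\epsilon)\,\min_{i}\sum_{t=1}^{T} c_t(i)
  \;+\; O\!\left(\frac{\log N}{\epsilon}\right)
```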