Do my best to do it remotely. OK, so I work in interpretable machine learning, and over the years I've seen the field of machine learning lean more and more toward complicated models, even in cases where they're completely unnecessary. And I didn't just want to give a talk that says "stop using black box models", because that's really not constructive, right? That's just destructive; it's just telling people what not to do rather than what they could do. So I wanted to tell you not just "don't use black box models", but why you don't need them.

Let me give you an example: recidivism prediction in the U.S. criminal justice system. Here they use predictive models to determine people's risk of being arrested, in order to decide whether to release someone on bail or parole, or for social services. In some cases the models are so complicated that it's easy to compute the predictions incorrectly. This is an article from The New York Times about a case where a typographical error in the input to a black box predictive model led to years of extra prison time for someone. The typo was in his criminal history features. He was denied parole, he compared his score sheet to someone else's, and he found that there was a typo in the criminal history features.

Anyway, the model that the justice system was using for his prediction was called COMPAS. You may have heard of it; it's a very famous black box model used in the justice system, and there have been a lot of arguments about it. You'd think that models like COMPAS would be more accurate, because they use over 100 features and they're created by a company whose job it was to do that. So yes, you'd think these models would be more accurate, but they're not.

We did an experiment with data from Florida to test the accuracy of this particular black box model, and by the way, COMPAS is pretty widely used across the justice system. We compared COMPAS to our latest machine learning method in the lab at the time of this experiment, which is called CORELS. CORELS is an optimal decision tree method: it produces really sparse, one-sided decision trees, and it came up with a machine learning model that fits in the bottom of a PowerPoint slide. The model says: if the person is young and they're male, predict arrest within two years of the COMPAS score calculation. Also, if they're a little older and they have two or three prior offences, predict arrest within two years of the COMPAS score calculation.
Or if they have more than three priors, predict arrest; otherwise predict no arrest. We looked at this model and thought, OK, that's pretty simple, but how is it possibly going to be as accurate as COMPAS? And it was. What I'm showing you here is that the models are about equally accurate. This is 10 folds of data, so each colour is a different fold, and the performance is very similar.

But not only did these two models perform the same; as it turns out, no matter which machine learning method we tried, they all performed the same. Some of these are complete black boxes like COMPAS, which is proprietary, and some of them are black boxes just because they're huge formulas you can't fit on a slide, like support vector machines with radial basis function kernels, decision trees, random forests and so on. So, you know, there was this huge debate about the algorithmic fairness of COMPAS, but the truth is that we just don't seem to need COMPAS at all. So why are we still using it, right?

Now, back to my point here. There doesn't seem to be any benefit from complicated models for re-arrest prediction in criminal justice. There's a lot of literature on exactly that problem. There's just no reason to use a black box model for criminal recidivism prediction, as far as I can tell. But it's also true that there's no benefit from complicated models for lots of different problems. I'm listing here a whole bunch of problems that I've worked on in my career, and for none of these problems have we seemed to need a black box model.

Now, it really depends on your data representation, though. If you're working in computer vision, neural networks are really great; they're great when you need to create a good representation for your data, and that's where you want to use a neural network. But if your data naturally come with a good data representation, like in all the problems I've listed here, then all the algorithms tend to perform very, very similarly, as long as you're willing to do some preprocessing on the data.

So why, then, are we still using complicated models? There are some reasons. First of all, we like them, and they're profitable. The COMPAS people are making a profit off the US justice system. It's much easier to sell something like COMPAS than to sell something like the CORELS model I had on the previous slide.
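Just to make the shape of that CORELS model concrete, here is a sketch of the rule list I described, written as plain code. The age cut-offs are illustrative placeholders; the actual CORELS model on the Florida data uses specific age brackets that I'm not reproducing exactly here.

```python
def corels_style_predict(age, is_male, num_priors):
    """A sketch of the kind of sparse, one-sided rule list CORELS produces.

    The age cut-offs below are illustrative placeholders, not the exact
    brackets from the slide; the structure of the rules is the point.
    """
    if age <= 20 and is_male:               # young and male
        return "arrest within 2 years"
    if age <= 23 and 2 <= num_priors <= 3:  # a little older, 2-3 priors
        return "arrest within 2 years"
    if num_priors > 3:                      # more than three prior offences
        return "arrest within 2 years"
    return "no arrest"

# Example: a 19-year-old male with no priors is caught by the first rule.
print(corels_style_predict(age=19, is_male=True, num_priors=0))
```

The whole model is four rules; that is the entire thing that matched COMPAS's accuracy in our experiment.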
Also, it's much easier to construct a black box model than a simple model. To construct a black box, you take your data, throw it into an algorithm, and you get a model, whereas constructing a simpler model is much more difficult: you actually have to optimise for the simplicity of the model. It's kind of ironic, right, that complicated models are much easier to construct and that simple models can be really, really hard to find.

OK, so let's say that we're doing supervised learning, where we want to minimise the loss function to make our models accurate. But now we also want our models to be simple, so we have to constrain them; we have to force them to be simple. And once you get to constrained optimisation, depending on the constraints, the problem becomes much harder.

Now, if the problem on the left is about finding an accurate decision tree, that is much easier than finding an optimal sparse tree with the same level of accuracy; the sparse problem is exponentially harder. Oh, and here are some examples. This is CART, the greedy CART algorithm, on a data set, and that's very easy to run. If you want to find an optimal sparse tree with the same or better accuracy, you have to run a specialised algorithm that solves a much harder computational problem. This algorithm on the right here, GOSDT, is an algorithm that we designed, published in 2020, and it's getting accuracies that are better than CART with much sparser models. But it does a huge amount of work to get there, whereas CART is from the nineteen-eighties or nineteen-nineties, right?

OK, now if that problem on the left is about finding an accurate linear model, well, that's easy: you can do that with regression or logistic regression. But once you add sparsity constraints, we all know that the problem becomes much harder.

So what about if you just unleash the most complexity you have on the problem? We all know how easy it is to construct an accurate neural network or an accurate boosted decision tree. Now my question is, can you get the same accuracy with an accurate yet sparse decision tree, or maybe an accurate yet sparse linear model? Am I going to get the same accuracy when I solve this constrained problem as when I solve that unconstrained one? So do I need to sacrifice accuracy in order to gain interpretability? Can I determine whether this equality is true without actually solving the constrained problem? And what would it take to do that, right?
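On the slide, that question is written as a possible equality between two optimisation problems. Schematically (this is my own rendering, so the symbols are assumptions rather than the slide's notation), it looks like this:

```latex
% Do I lose accuracy by restricting to the simple (for example, sparse) class?
\min_{f \,\in\, \mathcal{F}_{\mathrm{simple}}} \mathrm{loss}(f)
\;\;\overset{?}{\approx}\;\;
\min_{f \,\in\, \mathcal{F}_{\mathrm{complex}}} \mathrm{loss}(f)
% F_simple might be sparse decision trees or sparse linear models,
% F_complex a neural network or boosted-tree class.
```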
What would it take to check whether or not I could get an accurate and simple model, with the same level of accuracy as my black box? In other words, can I determine the existence of a simple yet accurate model without actually finding one? That's what I'm trying to do during this talk. In this talk, I'm going to define a condition under which a simple yet accurate model is likely to exist, and that condition is that the Rashomon set is large. I'm going to tell you what that means in a minute.

OK, so the Rashomon set is the set of models with low true loss; the true Rashomon set is the set of models with low true loss. Here the true loss is your expected loss: it's what you would compute if you knew the whole distribution that the data come from. Then this is an abstract function space, just my hypothesis space. And this is the Rashomon set: the set of models that have expected loss less than some value theta.

Now, I claim that if the true Rashomon set is large, in other words if there are a lot of good models, then a simple yet accurate model is likely to exist. So this is the idea: if there are a lot of good models, then hopefully at least one of them is simple. It's kind of like a big fish theory, right? A big ocean theory: if you have a really big ocean, then you're more likely to find a big fish swimming in there somewhere. So in a sea of equally accurate models, maybe there's a good one in there somewhere; maybe there's at least one simple one.

OK, now I'm just going to change my notation very slightly: instead of the expected loss with all that notation, I'm just going to write L. So L is the expected loss. And, by the way, I drew a really nice, smooth loss function there with one minimum, but it doesn't have to be that way. In fact, the space could look like this, and the Rashomon set could be disjoint. It's even possible for the whole space to be discrete, so that the Rashomon set is just a bunch of points in the space.

OK, so what I want to do now is create the simplest possible abstract setting to show you how this thing at the bottom could possibly happen. And I want to make it precise; I don't want to just wave my hands and say it happens. I want to actually show you an abstract setting where it does happen. So I'm going to take two finite hypothesis spaces, two finite function spaces: F1, which is the set of simple models, and F2, which is all models. And I will say that F1 lives inside F2, so the simple models live within the set of all models.
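Written out (this is my own transcription of the slide, so treat the exact symbols as assumptions rather than the slide's notation), the definitions so far are:

```latex
% True (expected) loss of a model f over the data-generating distribution D:
L(f) \;=\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\,\ell\big(f(x),\,y\big)\,\big]

% The (true) Rashomon set: all models in the hypothesis space F whose
% expected loss is at most some threshold theta (later also called gamma):
\mathcal{R}(\mathcal{F},\theta) \;=\; \big\{\, f \in \mathcal{F} \,:\, L(f) \le \theta \,\big\}

% Setup for the abstract result: a finite class of simple models inside
% a finite class of all models:
\mathcal{F}_1 \subset \mathcal{F}_2
```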
And in fact, I'm going to say that F1 is uniformly drawn from F2 without replacement. I know this is abstract, and I know that simple models are not drawn randomly from a more complex model class. But in reality, as long as each complex model is reasonably close to a simple model, the idea I'm going to show you works out just fine.

Now, let's say that f-star is the best model if I knew everything. So it's the model I'd get if I could use all models to choose from and I knew the whole distribution the data come from. That is, if I know everything: I can use the more complex model class and I know the whole loss function. Whereas this guy, f-hat-1, is what I can get on my data: it's the empirical risk minimiser from the simple function class. So f-star is what I would love to get if I knew everything, and f-hat-1 is what I can get with my data. What I want to know is whether these two achieve the same level of accuracy.

Again, using my simpler notation, the expected loss becomes L and the empirical risk becomes L-hat. I want to know whether the best true risk of the complex class is close to the best empirical risk of the simpler class; so I want to know whether L of f-star is close to L-hat of f-hat-1. In other words, I want to know whether what I compute on my data is close to the best possible thing I could get if I knew everything.

And the bound is going to involve the Rashomon ratio. The Rashomon ratio is the fraction of models that are good: the number of models with low true loss, divided by the total number of models. That fraction is going to appear in my bound.

OK, so I've put all the notation from the previous slide up at the top here, and the bound goes like this. It says: for any epsilon greater than zero, with high probability with respect to all the randomness (and I haven't told you what that probability is yet, but I will), the empirical risk of the function that I get from my data is close to the best possible thing I could get. And the probability with which this holds depends on the Rashomon ratio. So if the Rashomon ratio is larger, so if I have more good models, then this bound is more likely to hold.
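In symbols (again, a schematic rendering of my own; the exact probability expression and constants are in the paper and are not reproduced here), the Rashomon ratio and the shape of the statement are roughly:

```latex
% Rashomon ratio: the fraction of the (finite) class of all models that are good,
% i.e. that have true loss at most theta:
\mathrm{Rratio}(\mathcal{F}_2,\theta)
  \;=\;
  \frac{\big|\{\, f \in \mathcal{F}_2 : L(f) \le \theta \,\}\big|}{|\mathcal{F}_2|}

% Shape of the bound: for any epsilon > 0, with high probability over the random
% draw of F_1 and over the data (a probability that grows with the Rashomon ratio
% and with |F_1|),
\hat{L}\big(\hat{f}_1\big) \;\le\; L\big(f^{\ast}\big) \;+\; \epsilon
% where \hat{f}_1 is the empirical risk minimiser over F_1 and f^{\ast} is the
% true-risk minimiser over F_2.
```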
And then what I can compute on my data is going to be close to the best possible thing I could get if I knew everything. Another nice thing about this bound is that it only depends on the size of F1, the smaller function class; it doesn't depend on the size of the larger function class.

OK, so, cool. This bound is saying that as long as the Rashomon ratio is big enough, then we're good. Now, I notice this probability here is a little inscrutable, so I'm going to give you some examples of its calculation. Let's say that we had 100,000 functions. If at least one percent of them are good, then the bound holds with 99 percent probability when you have at least 526 simple functions. Here's another example: again 100,000 models, and if at least half a percent of them are good, then the bound holds with 99 percent probability when just over a thousand of them are simple.

So, Cynthia, can I ask you something? Absolutely. In your definition of the Rashomon ratio, you have a theta parameter and then you have a gamma parameter; are they related? Yeah, sorry, they're the same: theta is the same as gamma. Sorry about that. It's fine. And what happens if F1 is the same size as F2? Then the situation degenerates, in a sense, and it boils down to our usual understanding of empirical risk minimisation. Well, if F1 is the same as F2, then all of the models are simple and the bound trivially holds, so you're right. And yeah, this is using regular learning theory; this side of it is just regular learning theory, so from just that part you get back to regular learning theory. We're leveraging learning theory here. Yeah, and sorry about the notation issue; there should have been a gamma. I changed recently from theta to gamma.

Sorry, do you mind if I just ask, what do you mean when you say 100,000 models? Obviously you have continuous coefficients. No, no, I'm in an abstract setting where everything is discrete, so I'm in a setting where there are finite hypothesis spaces. You want to think about this as being the case where your data live on a giant hypercube or in a giant categorical space, where even if you have continuous functions, the realisations of them are discrete. So you should just think of this as an abstract setting where everything is discrete. OK. All right.
Thank you. Yeah. And the thing I was going to say after this is that, actually, in general the idea generalises to the case where everything is continuous and smooth, but you have to replace some of the assumptions. In particular, this random draw assumption, which is unrealistic, you would replace with a smoothness assumption over the class of models, and you have to assume that F1 is a good cover for F2. And in that case, the whole idea generalises to more realistic settings.

Yes? Actually, Konstantinos Gutsiness is asking: what does it mean that we uniformly draw from F2? Then we need F2 to be specified, but it seems that you are addressing this point now, right? Yeah, yeah. You guys are one step ahead of me, and that's totally great, because it means people are following this lecture, which makes me happy; that's really hard to do remotely. OK, cool. All right. Great.

So, yeah, I gave some examples of this, and essentially what I was trying to say is that if the Rashomon ratio is sufficiently large, so if you have a large enough set of good models, then with high probability the best empirical risk over the simpler class is close to the best possible true risk of the larger class, and the generalisation guarantee comes from F1. So this is basically the simplest possible abstract setting where a large Rashomon ratio actually gives you a better guarantee on the quality of your performance on data, as opposed to knowing everything.

And as I mentioned, in the paper this is only the first theorem, and there is a series of theorems that I won't have time to get into today. But essentially we replace the random draw assumption with smoothness assumptions, so that everything is nice and smooth, together with the assumption that F1 is a good approximating set for F2. The other assumption you can make is that the Rashomon set contains a really big ball, so that as long as F1 approximates F2 nicely, there's at least one member of F1 in that big Rashomon set ball. Then you don't need the random draw assumption anymore, and it's much more realistic.

OK, so the results I just showed you, and the theorems in the continuous setting that I don't have time to talk about, suggest that as long as F1 is a good approximating set for F2 and the Rashomon set is large, then we might as well work with the simpler class, because we're not getting any benefit from using the more complex class.
You're going to get the same level of accuracy with the simpler class as with the complex class. So in other words, if decision trees, which are piecewise constant functions, approximate neural networks, which are smooth functions, and for my problem the Rashomon set is large, then I can just work with decision trees, because I'm not going to get any benefit from working with neural networks, given that decision trees approximate neural networks.

OK, now I want to point out that we're not doing standard learning theory. We're using standard learning theory, but this is not the same thing. Large Rashomon ratios pertain to the existence of good models, models with good generalisation and good performance. That's different from regular learning theory, right? Regular learning theory compares empirical risk to true risk for the same function, or for a class of functions. Here we're talking about the existence of models from a different class.

So the Rashomon ratio is not the same thing as the geometric margin used in support vector machines and other parts of learning theory, because the margin is measured with respect to one model, whereas the Rashomon ratio is a function of many models. It's not the same thing as the VC dimension: the VC dimension is data independent, it's a property of a function class only, whereas the Rashomon ratio is a property of a specific dataset; the Rashomon ratio is large for a specific dataset and function class. It's not the same thing as algorithmic stability, which talks about the way you search through the space to find a model: stability depends on making changes to a data set, whereas here the data set is fixed. It's not the same thing as Rademacher complexity: Rademacher complexity measures the function class's ability to fit noisy targets, whereas the Rashomon ratio uses fixed labels. And it's not the same thing as a flat minimum, which has become popular in neural networks: here we don't necessarily even have a continuous function space, and the Rashomon set could include many local minima.

OK. So that's what the theory says: large Rashomon sets allow us to use simpler functions without losing accuracy. Now, what happens in practice? What actually happens in reality? Well, usually you can't figure that out, because measuring the Rashomon ratio is not something you could normally do; it would require you to look at the whole model class, which is not practical. But today we're going to do it anyway, just to find out what happens.
OK, so don't do this at home, but we're going to do it today, and we're going to use the empirical Rashomon ratio, because we have data. So we'll use this quantity, which is the fraction of models that are good: the number of functions with low empirical loss, divided by the total number of functions. And again, you'd never calculate this in reality, but we're going to do it.

In particular, I want to do an experiment where I compare two things: the first is the size of the Rashomon set, and the second is the performance of lots of different machine learning models. For the first part of the experiment, I'm going to estimate the size of the Rashomon set, and the way I'll do it is using decision trees of depth seven. Why decision trees of depth seven? Well, because decision trees can be sampled; I can sample decision trees. Also, decision trees are piecewise constant functions, so they're a good approximating set for a much larger function space, and decision trees of depth seven are pretty powerful; they can actually fit a lot of data sets really well. So that's why we're going to estimate the empirical Rashomon ratio that way.

Oh wait, I forgot to mention; I just want to explain this a little more. Let's say we have a function space that we're interested in. That function space includes decision trees, support vector machines, and so on; it's just a big function class that we're interested in. And what we're doing, essentially, is this: here's the Rashomon set, and we're going to approximate that whole function class with decision trees. I claim that decision trees of depth seven are a good cover for this space, because they're piecewise constant functions and so they approximate smooth functions. Also, you know, they approximate random forests and boosted decision trees too, which are essentially combinations of trees. So anyway, I'm going to compute the fraction of depth-seven decision trees that are in the Rashomon set, and that's how I'm going to get this estimate of the size of the Rashomon set. Oh yeah, and these are all my trees; the little green dots are the trees.
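For concreteness, here is a rough sketch of how one might estimate this empirical Rashomon ratio by sampling depth-seven trees. It uses naive random-split trees from scikit-learn rather than the importance-sampling scheme we actually used, and the threshold value is a placeholder, so treat it as an illustration of the quantity rather than as our method.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def empirical_rashomon_ratio(X, y, theta=0.05, n_samples=5000, seed=0):
    """Estimate the fraction of sampled depth-7 decision trees whose
    empirical 0-1 loss is within `theta` of the best sampled tree.

    This is a crude stand-in for the importance-sampling estimator used in
    the actual experiments; random-split trees are just an easy way to get
    a diverse pool of depth-7 trees.
    """
    rng = np.random.RandomState(seed)
    losses = []
    for _ in range(n_samples):
        tree = DecisionTreeClassifier(
            max_depth=7,
            splitter="random",      # random thresholds -> diverse trees
            max_features=1,         # one randomly chosen feature per split
            random_state=rng.randint(1_000_000),
        )
        tree.fit(X, y)
        losses.append(1.0 - tree.score(X, y))   # empirical 0-1 loss
    losses = np.asarray(losses)
    best = losses.min()
    # fraction of sampled trees that land inside the empirical Rashomon set
    return np.mean(losses <= best + theta)
```

The returned number is the empirical Rashomon ratio for this sampled pool of trees: the fraction whose training loss lands within theta of the best sampled tree.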
For the second part of the experiment, I'm going to run a whole bunch of different machine learning methods on the dataset, and I want to know whether they perform similarly, because if they do, it means that they all live in a big Rashomon set. If this is true, it means the Rashomon set can accommodate functions of lots of different types, right, because it can accommodate a support vector machine as well as a decision tree, and so on. So I want to know whether all these methods perform similarly, whether they generalise, and how that correlates with the size of the Rashomon set (there's a small code sketch of this check just below).

All right, so that's the experiment; let me show you the results. The results are that when the Rashomon ratio we measure is large, then all the methods tend to perform similarly, and they generalise. That's the result I'm going to show you in the next couple of slides. And, interestingly, the result isn't always true. That surprised us, but we think it's an artefact of the way we measure the size of the Rashomon set: if features are correlated with each other, it's really easy to misestimate it, and so sometimes our measurement of the Rashomon set comes out too small.

OK, so, great. Let me show you the experiment. We used 64 data sets, so a large number of data sets: categorical data sets, real-valued data sets, regression data sets, synthetic data sets. The number of features ranged between three and 784, and the number of classes varied as well; we basically downloaded a whole repository and also generated synthetic datasets.

Now, when we had a large Rashomon ratio, these are the kinds of results we got: lots of different machine learning methods all perform very similarly. This is for different data sets here, five different machine learning methods; they're all performing very similarly, and they're all generalising between training and test. That's for large Rashomon ratios, and it always happened with large Rashomon ratios. For small Rashomon ratios, that's not always what we got: sometimes the accuracy would be all over the place, with different methods performing differently, and sometimes they wouldn't generalise as well between training and test as what you're seeing over here with the large Rashomon ratios. With small Rashomon ratios there was a variety of different results we could get; also, sometimes, cases where everything generalised really well.
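As an aside, a minimal version of that second part of the experiment, the run-many-methods check, might look like the sketch below. The dataset and the model list here are placeholders for illustration, not the 64 datasets or the exact hyperparameters from our study.

```python
from sklearn.datasets import load_breast_cancer           # stand-in dataset
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A handful of methods spanning simple to complex model classes.
methods = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=5000)),
    "CART (depth 7)":      DecisionTreeClassifier(max_depth=7),
    "RBF SVM":             make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random forest":       RandomForestClassifier(n_estimators=200),
    "boosted trees":       GradientBoostingClassifier(),
}

for name, model in methods.items():
    cv = cross_validate(model, X, y, cv=10, return_train_score=True)
    print(f"{name:20s} train {cv['train_score'].mean():.3f}  "
          f"test {cv['test_score'].mean():.3f} +/- {cv['test_score'].std():.3f}")
```

If the ten-fold test accuracies all land within a percentage point or two of each other and the train/test gaps are small, that is the pattern we associated with a large Rashomon set.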
But our theory really applies to large Rashomon sets, luckily. So the conclusion we're drawing is: if the Rashomon set is large, then you get these nice properties.

Cynthia, how do you define a Rashomon set as being large or small? You're taking the data sets and looking at the quantities of the Rashomon sets, and that's how you define large or small? Yeah. So, believe it or not, the value is around 10 to the negative 37. These values came from importance sampling with the set of decision trees of depth seven. So those are actually the larger values, and anything around 10 to the negative 38 or negative 39 or below counted as a small Rashomon ratio. We were just looking at it relative to what we would get on these different data sets.

Thanks. And another question: do you think it is likely to be shown, say for vision or speech data sets, that they allow for function spaces with large Rashomon sets? That is a great question, and I think they do, but there is a lot I could say about that. So, I work in interpretable machine learning, and we've been trying to design interpretable models for computer vision for a long time, and we've been able to create models for computer vision that are interpretable. But the definition of interpretability is different for computer vision than it is for other types of problems. For example, you would never want to build a decision tree on pixels; that's not interpretable, right? What you'd want to do is maybe case-based reasoning, where you say this part of the image looks like this part of this other image. And for those types of problems, we've been able to design interpretable neural networks that are constrained to reason in this way but still attain the same level of accuracy as regular black box neural networks.

And I think the only reason we're able to do this is because the Rashomon set permits it; the Rashomon set is large enough to permit it. So I can say that about computer vision. I obviously can't say it about every possible application of machine learning to every possible problem; I can only talk about the problems I've worked on. But I even started working in materials science, which is a super complicated domain, and even there we were able to find models that were interpretable to our human materials science colleagues and that were as accurate or more accurate than the black boxes we could construct.
And I say "more accurate" because sometimes the insight you get from the interpretability actually allows you to boost accuracy. So I hope that answers your question. Thanks. Yeah, that was a really good question; thank you for asking.

OK, great. So, as I mentioned, you really can't measure the size of the Rashomon set in practice, but that's OK, because we got a lot of information out of these experiments. In particular, if the Rashomon ratio is large, we found that all the methods performed similarly and generalised as well. Now, if the methods perform differently, it's likely to be a small Rashomon ratio. We're not completely sure about this yet, but we think it is a viable, possible explanation for what's going on. And it does explain why I and a lot of other people have found that algorithms perform similarly across many problems: it's because there's a large Rashomon ratio. So why do simple models perform well? Possibly because there's a large Rashomon ratio.

OK. We found something else, besides the results I just showed you, that we were really surprised about, and we found it on every single dataset that we examined. What we plotted is something called the Rashomon curve; I'll show you a cartoon of it before I show you the real thing. It's a plot of the Rashomon ratio versus the empirical risk. So let's say that you take a hierarchy of hypothesis spaces, from the simplest ones to the more complex ones, like decision trees of depth one, two, three, four, five, six and seven: embedded spaces. Now, when you go down this curve here, when you add more complexity, what should happen to these quantities?

Well, as you add more complexity, the best empirical risk for each function class goes down, because you have more models and you can fit better. So as we increase complexity from here to here, we expect to move this way. What about the Rashomon ratio? Well, the numerator goes up, because you have more good models, but the denominator goes up as well, because you have more models overall. So what happens? Oh, sorry, I just put the Rashomon ratio there for you again: the fraction of models that are good. Both the numerator and the denominator go up, because you have more models in both cases. As it turns out, the ratio goes down: the denominator increases much more quickly than the numerator.
So what happens is that you take your simplest function class, you run it, and then you increase complexity a little bit. The empirical risk goes down, but the Rashomon ratio tends to stay roughly constant. And then, all of a sudden, it just nosedives. Sometimes you overfit a little bit, so you can see the curve sway a little, but most of the time it goes down so quickly that you can't even see the flat part.

We were kind of surprised to see this, because we saw it on every single data set that we examined; we were not expecting to see anything like this. Sometimes you see the whole curve, going across and then down, but sometimes it just goes down; you don't even see the flat part, because if the simpler models already perform pretty well, it just goes down. And like I said, we saw it on all kinds of different data sets: sometimes you see the full little curve, sometimes you just see parts of it, like the vertical part, and there's always some kind of turning point up here, or else it's at the top. We saw it on every single data set that we experimented with.

OK, so that's what happens on the training set. What about the test set? Luckily, statistical learning theory tells us what the difference is between the training and test results. The generalisation is better for smaller function classes than it is for larger function classes, so you would expect to overfit when you have a really big function class, and then your true risk is going to get worse, right? So what the theory kind of tells us is that we should really be looking around this elbow, what we call the Rashomon elbow, because this is the simplest function class that describes the data well, that has low empirical risk. So this elbow seems to be a really good choice for model selection.

Let me show you the results; I'll show you the Rashomon curves for all 64 data sets. As you can see... maybe I should zoom in on some of these, just to show you what's going on. I just want to point out that we're averaging over 10 folds to plot these, both for the training and test empirical risks and for the Rashomon ratio. And what you should see is the curve going across and down, or just down, for all of these data sets. OK, so let me zoom in a little bit. Sometimes the theory was insightful.
Sometimes you really did see the elbow being the best model; that really worked. But, as you know, with randomness and statistical learning theory it's all probabilistic, so sometimes it doesn't really work out. Sometimes everything just always generalised, and the training and test points were right on top of each other. And then sometimes we never generalised, in which case you're just seeing these big uncertainty bands, right, big generalisation gaps between training and test. But regardless of which of these three situations you're in, the elbow just seems to be a good choice for model selection, because again, it's the simplest function class that describes the data well. And in no case did the elbow turn out to be a really bad choice, right? So the elbow model always seems to be a good choice for model selection.

That makes you wonder where you are relative to the elbow. Because in real problems you don't actually see the whole curve: you pick your function class, you run your method, and you don't know where you are on the curve; you could be anywhere on it. So you might want to figure out where you are on the curve, to figure out whether you're close to the elbow. And remember, you usually can't measure any point on this curve, because the curve requires the Rashomon ratio, which is the fraction of good models, and you can't measure that.

So what can you do? Well, if you are in this part of the curve, then different models with different complexity levels perform differently. So if you run a whole bunch of different machine learning methods with different levels of complexity and they all perform differently, then you're probably on this part of the curve, and in that case you probably want to increase your complexity and go toward the elbow. Whereas if you're in this part of the curve, then it doesn't matter which machine learning method you choose: they all perform very, very similarly; you have very similar empirical risk. In that case, you might want to try to make the model simpler, so you can go up toward the elbow, because you probably won't lose performance if you make the model simpler.

So I had been thinking that this might explain some of the things that I and others have been observing across problems and across data. There are some problems, like ImageNet, where the field has been designing more and more complicated models, and that keeps reducing error.
So maybe there we're still in this part of the curve, right? On the other hand, if you think about problems like MNIST, where no matter which method you use you get essentially 100 percent accuracy, then we may be on this part of the curve, and we could use simpler models and still get 100 percent accuracy. And those simpler models might have other properties: they might generalise better outside of MNIST, and they might be more interpretable.

And then there are these kinds of problems, the kind I usually work on, where it doesn't matter which machine learning method you pick; they all perform similarly, like the re-arrest prediction problem. I feel like we're in this part of the Rashomon curve there, because with re-arrest prediction you can get a really simple model that predicts just as well as your super complicated model, and there's just an inherent level of noise; if you try to get more accurate, you'll basically just overfit. So for these types of problems, I think we probably want to be walking up the curve, reducing complexity to get a simpler model that is interpretable but still maintains your level of accuracy, just walking up toward the elbow.

OK, so what I've gotten to is an easy check, a simple check, for the possible presence of a simpler yet accurate model: you pick several of your favourite machine learning methods and you run them all on the data set. If they all perform differently, your model class is maybe too small to include the elbow solution, and you can get a little bit more complex; so, use a more complex model class. If all the machine learning methods perform similarly, your model class might be a little bit bigger than you need, in which case you can try to find specialised models that will move you up the curve, models with special properties like interpretability, by decreasing your complexity.

OK, so, great. I've defined my condition, which is that the Rashomon set is large, and I've shown you that you don't need to calculate the Rashomon ratio; you can just try lots of different machine learning methods, and that gives you a sense of whether simpler solutions might exist.

Now, a lot of people don't believe me about this, or they're not interested, and that's fine. But sometimes it can get kind of silly, so I want to tell you a story that happened a couple of summers ago, when I found out about this explainable machine learning challenge.
It's called the Explainable Machine Learning Challenge, and my group decided we had to enter it. I do a lot of data science competitions; I actually coach Duke's data science competition team, where we enter data science competitions. So when this thing came out, we were like, oh, we've got to do it. But the goal of the competition was to create a black box and explain it. We got the data set; it was a nice big data set from FICO on loan defaults. The dataset had thousands of rows, each one a person with their whole credit history, and we had to decide whether or not they would default on their loan.

We looked at it and thought, you know, this looks like it has a good data representation. And I thought, could I be wrong? Could this be a problem with a good data representation where you still need a black box? So I said to my students, look, I don't know about this competition; just try running a bunch of different machine learning methods on the data set and see whether they all perform the same. A day or so later, they came back and said, yep, all the methods are performing the same. At that point we pretty much knew that the dataset had a large Rashomon set, so we said, OK, we think we can construct an interpretable model for this dataset.

Then we had a debate: should we follow the competition rules and create a black box and explain it, or should we actually try to create an inherently interpretable model? After about two seconds of debate, we decided that for a problem as important as credit risk, we should create an inherently interpretable model. So we did. We created a globally interpretable model, with a beautiful visualisation tool, that had the same accuracy as the best neural network we could construct. In fact, it's all live and you can actually play with it: you can go to the Duke Data Science website, where it's just running on the Duke servers, and you can play around with the FICO dataset and our model.

I'm just showing you a snapshot of it; I don't want to bring up the whole thing. Basically, it had a bunch of subscales, you could click on the subscales, and you would get points for different things. For instance, this is the delinquency subscore, and it's essentially a set of sparse logistic regression models. So you'd get, say, a point for the percentage of trades that were never delinquent; for this person, their trades were actually kind of delinquent, and that's why they got a point.
For the number of months since the most recent delinquency, they get points for that. So you just add up the points, each set of points translates into a little score, and you add up the scores. It was very nice; a nice decomposable model made of little sparse logistic-regression-type models.

So we sent this in to the competition, wondering what the judges would think of it, because I thought they were going to have no idea how to judge it, since it's an inherently interpretable model. And I was right: they had no idea how to judge it, and we totally bombed. We did absolutely terribly; we didn't even place. What actually happened was that the judges weren't allowed to play with the visualisations that people had constructed, so for every team that created a visualisation tool for their model, the judges didn't get to play with it. That put us at a major disadvantage. But luckily the judges realised that their judging criteria weren't very good, and they saw value in what we did, so they gave us an award; they actually created a little award for us, the FICO Recognition Award, acknowledging our submission for going above and beyond expectations with a fully transparent global model and a user-friendly dashboard.

I was really excited about this, and I thought, OK, I'll write a paper about it and send it in to a special issue of a journal on decision making. I was told to email the guest editor of the special issue to see if the paper was appropriate. So I emailed the person: dear esteemed professor at fancy Stanford University, we have this paper and we don't know whether it fits into the scope of the special issue; it's not a traditional methodology paper, it's an analysis of this competition dataset, including a globally interpretable machine learning model that didn't lose accuracy compared with the black boxes, and it won this award. What do you think? And he sent me back this email saying: Dear Cynthia, thanks for reaching out. This is an interesting paper, but I'm afraid it's not a good fit for the special issue. It's also related to my own recent work on explainability of neural nets. Is the FICO data still available? If so, could you share it? And I was like, oh my gosh: I send the guy a paper saying, hey, you don't need a black box for this dataset, and he sends me back an email saying, I don't care about your paper, but can you send me the data so I can create a black box for it and explain it?
452 00:46:04,590 --> 00:46:10,740 And so that email exchange is, unfortunately, the state of where things are at the moment. 453 00:46:10,740 --> 00:46:21,240 OK, so to summarise: I have defined a condition under which a simple yet accurate model is likely to exist, which is that the Rashomon set is large. 454 00:46:21,240 --> 00:46:23,550 I showed a simple check for large Rashomon sets, 455 00:46:23,550 --> 00:46:28,530 which is to run many different machine learning methods on your data to see if they all perform similarly. 456 00:46:28,530 --> 00:46:37,340 If they do, there's a good chance that you have a large Rashomon set and that you can find a simpler model. 457 00:46:37,340 --> 00:46:48,580 I introduced the notion of Rashomon curves, which we found to have that characteristic pattern for every dataset we examined. 458 00:46:48,580 --> 00:46:56,290 And so, now that we know that interpretable yet accurate models tend to exist, we can go find them. 459 00:46:56,290 --> 00:47:02,920 And that's what my lab works on: finding these models. So finally, at the end of the talk, I get to introduce myself. 460 00:47:02,920 --> 00:47:10,720 So, yeah, I lead the Prediction Analysis Lab. Most of my time is dedicated to the problem of optimal decision trees, 461 00:47:10,720 --> 00:47:16,000 so finding really tiny little if-then rule-based models like the Corals model I showed you earlier for recidivism. 462 00:47:16,000 --> 00:47:25,540 We have the fastest code for optimal decision trees right now, by about three orders of magnitude. 463 00:47:25,540 --> 00:47:30,280 I also work on medical scoring systems, which we've used for a lot of medical applications. 464 00:47:30,280 --> 00:47:33,100 This is a model called the 2HELPS2B score, 465 00:47:33,100 --> 00:47:39,220 which is used in intensive care units by doctors to help predict whether a patient will have a seizure. 466 00:47:39,220 --> 00:47:45,540 And that helps the doctors monitor the patient, prevent brain damage, and save lives. 467 00:47:45,540 --> 00:47:49,740 I also work on interpretable neural networks for computer vision. 468 00:47:49,740 --> 00:47:57,900 And as I mentioned earlier, we've shown that you can create interpretable models for computer vision that have the same accuracy as black boxes. 469 00:47:57,900 --> 00:48:09,290 And we're using them now in a collaboration with radiologists to help with reading mammograms, 470 00:48:09,290 --> 00:48:15,120 to provide a computer-aided decision 471 00:48:15,120 --> 00:48:21,770 rather than just an automated one. 472 00:48:21,770 --> 00:48:25,250 I also work on data visualisation and dimension reduction, 473 00:48:25,250 --> 00:48:32,320 where we're trying to project high-dimensional data onto two dimensions so that you can understand 474 00:48:32,320 --> 00:48:41,430 the high-dimensional structure in the data. So we're trying to preserve as much of the high-dimensional structure as possible when projecting onto 2-D. 475 00:48:41,430 --> 00:48:49,170 And then I also work as one of three professors on the Almost Matching Exactly project, 476 00:48:49,170 --> 00:48:57,660 where we're trying to match units almost exactly so that we can do interpretable causal inference.
477 00:48:57,660 --> 00:49:03,270 And then the last one is understanding the set of good models and the importance of variables, 478 00:49:03,270 --> 00:49:06,720 and you heard about one of the projects in this category today. 479 00:49:06,720 --> 00:49:15,210 And then finally, as I mentioned, I coach the Duke data science competition team, where we do things like automated computer poetry 480 00:49:15,210 --> 00:49:25,530 and image super-resolution. This year, we were competing in a citation labelling competition, which was really fun. 481 00:49:25,530 --> 00:49:33,350 And yeah, I love competing in data science competitions, and I've been coaching students for years to do that. 482 00:49:33,350 --> 00:49:39,400 OK, thank you very much. Thanks a lot for this very thought-provoking talk, 483 00:49:39,400 --> 00:49:43,840 Cynthia. Yeah, we have some minutes for questions. 484 00:49:43,840 --> 00:49:51,370 Judith Rousseau was asking something. Do you want to ask it yourself? 485 00:49:51,370 --> 00:50:02,430 Sure. So are there some situations where you would be not quite sure about the accuracy or relevance of your Rashomon curve estimates? 486 00:50:02,430 --> 00:50:08,860 I'm not sure my question makes sense, but do you trust them? 487 00:50:08,860 --> 00:50:16,750 Yeah, we actually don't trust our Rashomon curve estimates that much. We're only trusting them to determine whether the Rashomon set is large, 488 00:50:16,750 --> 00:50:21,940 because it's very difficult to estimate the sizes of really small Rashomon sets. 489 00:50:21,940 --> 00:50:27,520 So if our estimates say that the Rashomon set is small, then we just know it's small. 490 00:50:27,520 --> 00:50:33,580 We don't really know what its value is. And luckily, like I said, in practice you never really need to 491 00:50:33,580 --> 00:50:38,530 construct the Rashomon curve or the Rashomon ratio. 492 00:50:38,530 --> 00:50:43,420 Because the insight we're really gaining from it 493 00:50:43,420 --> 00:50:50,500 is that if you try a lot of different machine learning methods and they all perform similarly, 494 00:50:50,500 --> 00:51:03,880 then you probably have a large Rashomon set, and that's all we really needed to glean from those estimates. 495 00:51:03,880 --> 00:51:09,790 That makes sense. Thanks. OK. 496 00:51:09,790 --> 00:51:20,560 So you mentioned at some point the fact that a large Rashomon ratio is not the same as having flat local minima, 497 00:51:20,560 --> 00:51:27,550 and I understand that the reason is that you may have several small local minima that together 498 00:51:27,550 --> 00:51:36,130 give a large Rashomon set, or you may have a discrete hypothesis space, 499 00:51:36,130 --> 00:51:41,120 so the notion of a local minimum wouldn't really make sense there. 500 00:51:41,120 --> 00:51:46,820 But suppose that situation doesn't apply: 501 00:51:46,820 --> 00:51:53,750 what would then be the relation between the usual narrative in deep learning, 502 00:51:53,750 --> 00:52:02,120 for example, about flat local minima and their good properties, and there being a large Rashomon set? 503 00:52:02,120 --> 00:52:06,020 So if you have a flat minimum, then you do have a large Rashomon set, right? 504 00:52:06,020 --> 00:52:12,140 Right.
Because you would have this flat area, the flat minimum, and then you'd be able to put a ball in there. 505 00:52:12,140 --> 00:52:16,380 It's just that we can have a large Rashomon set without having a flat minimum. 506 00:52:16,380 --> 00:52:20,900 Well, I see. Yes. 507 00:52:20,900 --> 00:52:25,490 I was curious about where the name Rashomon comes from. 508 00:52:25,490 --> 00:52:30,380 Oh yeah. So that name came from Leo Breiman, who got it from the movie Rashomon. 509 00:52:30,380 --> 00:52:36,560 So there's a Japanese movie. I haven't watched it yet. I've been meaning to watch it, but I have children. 510 00:52:36,560 --> 00:52:42,390 And so it's kind of hard, you know, you don't want to watch a movie about violent stuff with the kids. 511 00:52:42,390 --> 00:52:49,790 Alright, so I haven't watched it. But it's a movie about a violent crime that occurred. 512 00:52:49,790 --> 00:52:57,110 And there are four different perspectives on the crime, and in the end, you end up thinking that there's no real truth, 513 00:52:57,110 --> 00:53:01,850 that there are just a lot of different ways of seeing the same thing, but that there's no truth. 514 00:53:01,850 --> 00:53:06,140 And so it's the same thing with models, right? There's no true model. 515 00:53:06,140 --> 00:53:12,110 There's just no underlying truth, right? 516 00:53:12,110 --> 00:53:15,020 We just have a finite dataset. So there's no truth. 517 00:53:15,020 --> 00:53:20,360 There are just a lot of models that perform well, just a lot of good explanations for what actually happened. 518 00:53:20,360 --> 00:53:25,640 And so the Rashomon set is the set of good explanations for the data. 519 00:53:25,640 --> 00:53:34,800 Mm-hmm. You know, I remember a paper that also talks about this; 520 00:53:34,800 --> 00:53:38,890 it mentioned the Rashomon effect and also Occam, and I wonder 521 00:53:38,890 --> 00:53:46,470 whether the Rashomon perspective and the Occam, model-complexity perspective are related or equivalent, 522 00:53:46,470 --> 00:53:50,730 or whether they are pointing to different aspects. What do you think about that? 523 00:53:50,730 --> 00:53:54,240 Well, I think that Rashomon enables Occam, right? 524 00:53:54,240 --> 00:54:02,640 Because a large Rashomon set says that you can find a simpler model that explains the data well. 525 00:54:02,640 --> 00:54:07,180 Mm-hmm. Yes. Yeah. 526 00:54:07,180 --> 00:54:12,530 I'm one of the people who was lucky enough to get a chance to meet Leo Breiman, 527 00:54:12,530 --> 00:54:17,960 although the time I met him, he told me that my paper on boosting wasn't necessary. 528 00:54:17,960 --> 00:54:21,860 He walked up to me during a NIPS poster session, 529 00:54:21,860 --> 00:54:29,000 and I had been trying to prove whether or not AdaBoost maximises the margin. 530 00:54:29,000 --> 00:54:34,910 And he said, well, I already proved that; if you want to have a real thesis, you could do something else. 531 00:54:34,910 --> 00:54:38,720 But, you know, I ended up becoming friends with him, and I remember him, you know, 532 00:54:38,720 --> 00:54:44,120 waving to me at the end of the conference. And I did manage to actually prove the theorem.
533 00:54:44,120 --> 00:54:50,180 You know, in the end, I did prove that AdaBoost does not maximise the margin. 534 00:54:50,180 --> 00:54:55,520 But yeah, it was interesting getting a chance to meet him. 535 00:54:55,520 --> 00:55:01,790 Yeah, just a really outspoken guy who's done amazing things because of his work in industry. 536 00:55:01,790 --> 00:55:10,160 And, you know, just kind of going out in the real world and understanding the value of things like interpretability, 537 00:55:10,160 --> 00:55:17,950 just creating decision trees and the value that they created for people. Hmm. 538 00:55:17,950 --> 00:55:31,970 OK. So it seems there are no more questions. Thanks a lot for your time and for your talk. 539 00:55:31,970 --> 00:55:38,570 Thanks. I wish I could meet all of you in person, but maybe one day. 540 00:55:38,570 --> 00:55:47,900 Yeah. Thank you. 541 00:55:47,900 --> 00:55:56,410 Bye bye. Thank you. Thanks, Michael. 542 00:55:56,410 --> 00:56:01,520 Thank you, bye. Yeah, thanks a lot.