Do my best to do it remotely. OK, so I work in interpretable machine learning, and over the years I've seen the field of machine learning lean more and more toward complicated models, even in cases where they're completely unnecessary. And I didn't just want to give a talk that says "stop using black box models", because that's really not constructive, right? That's just destructive; it's just telling people what not to do rather than what they could do. So I wanted to tell you not just "don't use black box models", but why you don't need them.

Let me give you an example: recidivism prediction in the U.S. criminal justice system. Here they use predictive models to determine people's risk of being arrested, in order to decide whether to release someone on bail or parole, or for social services. In some cases the models are so complicated that it's easy to compute the predictions incorrectly. This is an article from The New York Times about a case where a typographical error in the input to a black box predictive model led to years of extra prison time for someone. The typo was in his criminal history features. He was denied parole, he compared his score sheet to someone else's, and he found that there was a typo in the criminal history features.

Anyway, the model that the justice system was using for his prediction was called COMPAS. You may have heard of it; it's a very famous black box model used in the justice system, and there have been a lot of arguments about it. You'd think that models like COMPAS would be more accurate, because they use over 100 features and they're created by a company whose job it was to do that. So yes, you'd think these models would be more accurate, but they're not.

We did an experiment with data from Florida to test the accuracy of this particular black box model, and by the way, COMPAS is pretty widely used across the justice system. We compared COMPAS to our latest machine learning method in the lab at the time of this experiment, which is called CORELS. CORELS is an optimal decision tree method: it produces really sparse, one-sided decision trees, and it came up with a machine learning model that fits in the bottom of a PowerPoint slide. The model says: if the person is young and they're male, predict arrest within two years of the COMPAS score calculation. Also, if they're a little older and they have two or three prior offences, predict arrest within two years of the COMPAS score calculation.
Or if they have more than three priors, predict arrest; otherwise predict no arrest. We looked at this model and thought, OK, that's pretty simple, but how is it possibly going to be as accurate as COMPAS? And it was. What I'm showing you here is that the models are about equally accurate. This is 10 folds of data, so each colour is a different fold, and the performance is very similar.

But not only did these two models perform the same; as it turns out, no matter which machine learning method we tried, they all performed the same. Some of these are complete black boxes like COMPAS, which is proprietary, and some of them are black boxes just because they're huge formulas you can't fit on a slide, like support vector machines with radial basis function kernels, decision trees, random forests and so on. So, you know, there was this huge debate about the algorithmic fairness of COMPAS, but the truth is that we just don't seem to need COMPAS at all. So why are we still using it, right?

Now, back to my point here. There doesn't seem to be any benefit from complicated models for re-arrest prediction in criminal justice. There's a lot of literature on exactly that problem. There's just no reason to use a black box model for criminal recidivism prediction, as far as I can tell. But it's also true that there's no benefit from complicated models for lots of different problems. I'm listing here a whole bunch of problems that I've worked on in my career, and for none of these problems have we seemed to need a black box model.

Now, it really depends on your data representation, though. If you're working in computer vision, neural networks are really great; they're great when you need to create a good representation for your data, and that's where you want to use a neural network. But if your data naturally come with a good data representation, like in all the problems I've listed here, then all the algorithms tend to perform very, very similarly, as long as you're willing to do some preprocessing on the data.

So why, then, are we still using complicated models? There are some reasons. First of all, we like them, and they're profitable. The COMPAS people are making a profit off the US justice system. It's much easier to sell something like COMPAS than to sell something like the CORELS model I had on the previous slide.
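Just to make the shape of that CORELS model concrete, here is a sketch of the rule list I described, written as plain code. The age cut-offs are illustrative placeholders; the actual CORELS model on the Florida data uses specific age brackets that I'm not reproducing exactly here.

```python
def corels_style_predict(age, is_male, num_priors):
    """A sketch of the kind of sparse, one-sided rule list CORELS produces.

    The age cut-offs below are illustrative placeholders, not the exact
    brackets from the slide; the structure of the rules is the point.
    """
    if age <= 20 and is_male:               # young and male
        return "arrest within 2 years"
    if age <= 23 and 2 <= num_priors <= 3:  # a little older, 2-3 priors
        return "arrest within 2 years"
    if num_priors > 3:                      # more than three prior offences
        return "arrest within 2 years"
    return "no arrest"

# Example: a 19-year-old male with no priors is caught by the first rule.
print(corels_style_predict(age=19, is_male=True, num_priors=0))
```

The whole model is four rules; that is the entire thing that matched COMPAS's accuracy in our experiment.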
Also, it's much easier to construct a black box model than a simple model. To construct a black box, you take your data, throw it into an algorithm, and you get a model, whereas constructing a simpler model is much more difficult: you actually have to optimise for the simplicity of the model. It's kind of ironic, right, that complicated models are much easier to construct and that simple models can be really, really hard to find.

OK, so let's say that we're doing supervised learning, where we want to minimise the loss function to make our models accurate. But now we also want our models to be simple, so we have to constrain them; we have to force them to be simple. And once you get to constrained optimisation, depending on the constraints, the problem becomes much harder.

Now, if the problem on the left is about finding an accurate decision tree, that is much easier than finding an optimal sparse tree with the same level of accuracy; the sparse problem is exponentially harder. Oh, and here are some examples. This is CART, the greedy CART algorithm, on a data set, and that's very easy to run. If you want to find an optimal sparse tree with the same or better accuracy, you have to run a specialised algorithm that solves a much harder computational problem. This algorithm on the right here, GOSDT, is an algorithm that we designed, published in 2020, and it's getting accuracies that are better than CART with much sparser models. But it does a huge amount of work to get there, whereas CART is from the nineteen-eighties or nineteen-nineties, right?

OK, now if that problem on the left is about finding an accurate linear model, well, that's easy: you can do that with regression or logistic regression. But once you add sparsity constraints, we all know that the problem becomes much harder.

So what about if you just unleash the most complexity you have on the problem? We all know how easy it is to construct an accurate neural network or an accurate boosted decision tree. Now my question is, can you get the same accuracy with an accurate yet sparse decision tree, or maybe an accurate yet sparse linear model? Am I going to get the same accuracy when I solve this constrained problem as when I solve that unconstrained one? So do I need to sacrifice accuracy in order to gain interpretability? Can I determine whether this equality is true without actually solving the constrained problem? And what would it take to do that, right?
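On the slide, that question is written as a possible equality between two optimisation problems. Schematically (this is my own rendering, so the symbols are assumptions rather than the slide's notation), it looks like this:

```latex
% Do I lose accuracy by restricting to the simple (for example, sparse) class?
\min_{f \,\in\, \mathcal{F}_{\mathrm{simple}}} \mathrm{loss}(f)
\;\;\overset{?}{\approx}\;\;
\min_{f \,\in\, \mathcal{F}_{\mathrm{complex}}} \mathrm{loss}(f)
% F_simple might be sparse decision trees or sparse linear models,
% F_complex a neural network or boosted-tree class.
```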
What would it take to check whether or not I could get an accurate and simple model, with the same level of accuracy as my black box? In other words, can I determine the existence of a simple yet accurate model without actually finding one? That's what I'm trying to do during this talk. In this talk, I'm going to define a condition under which a simple yet accurate model is likely to exist, and that condition is that the Rashomon set is large. I'm going to tell you what that means in a minute.

OK, so the Rashomon set is the set of models with low true loss; the true Rashomon set is the set of models with low true loss. Here the true loss is your expected loss: it's what you would compute if you knew the whole distribution that the data come from. Then this is an abstract function space, just my hypothesis space. And this is the Rashomon set: the set of models that have expected loss less than some value theta.

Now, I claim that if the true Rashomon set is large, in other words if there are a lot of good models, then a simple yet accurate model is likely to exist. So this is the idea: if there are a lot of good models, then hopefully at least one of them is simple. It's kind of like a big fish theory, right? A big ocean theory: if you have a really big ocean, then you're more likely to find a big fish swimming in there somewhere. So in a sea of equally accurate models, maybe there's a good one in there somewhere; maybe there's at least one simple one.

OK, now I'm just going to change my notation very slightly: instead of the expected loss with all that notation, I'm just going to write L. So L is the expected loss. And, by the way, I drew a really nice, smooth loss function there with one minimum, but it doesn't have to be that way. In fact, the space could look like this, and the Rashomon set could be disjoint. It's even possible for the whole space to be discrete, so that the Rashomon set is just a bunch of points in the space.

OK, so what I want to do now is create the simplest possible abstract setting to show you how this thing at the bottom could possibly happen. And I want to make it precise; I don't want to just wave my hands and say it happens. I want to actually show you an abstract setting where it does happen. So I'm going to take two finite hypothesis spaces, two finite function spaces: F1, which is the set of simple models, and F2, which is all models. And I will say that F1 lives inside F2, so the simple models live within the set of all models.
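Written out (this is my own transcription of the slide, so treat the exact symbols as assumptions rather than the slide's notation), the definitions so far are:

```latex
% True (expected) loss of a model f over the data-generating distribution D:
L(f) \;=\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\,\ell\big(f(x),\,y\big)\,\big]

% The (true) Rashomon set: all models in the hypothesis space F whose
% expected loss is at most some threshold theta (later also called gamma):
\mathcal{R}(\mathcal{F},\theta) \;=\; \big\{\, f \in \mathcal{F} \,:\, L(f) \le \theta \,\big\}

% Setup for the abstract result: a finite class of simple models inside
% a finite class of all models:
\mathcal{F}_1 \subset \mathcal{F}_2
```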
And in fact, I'm going to say that F1 is uniformly drawn from F2 without replacement. I know this is abstract, and I know that simple models are not drawn randomly from a more complex model class. But in reality, as long as each complex model is reasonably close to a simple model, the idea I'm going to show you works out just fine.

Now, let's say that f-star is the best model if I knew everything. So it's the model I'd get if I could use all models to choose from and I knew the whole distribution the data come from. That is, if I know everything: I can use the more complex model class and I know the whole loss function. Whereas this guy, f-hat-1, is what I can get on my data: it's the empirical risk minimiser from the simple function class. So f-star is what I would love to get if I knew everything, and f-hat-1 is what I can get with my data. What I want to know is whether these two achieve the same level of accuracy.

Again, using my simpler notation, the expected loss becomes L and the empirical risk becomes L-hat. I want to know whether the best true risk of the complex class is close to the best empirical risk of the simpler class; so I want to know whether L of f-star is close to L-hat of f-hat-1. In other words, I want to know whether what I compute on my data is close to the best possible thing I could get if I knew everything.

And the bound is going to involve the Rashomon ratio. The Rashomon ratio is the fraction of models that are good: the number of models with low true loss, divided by the total number of models. That fraction is going to appear in my bound.

OK, so I've put all the notation from the previous slide up at the top here, and the bound goes like this. It says: for any epsilon greater than zero, with high probability with respect to all the randomness (and I haven't told you what that probability is yet, but I will), the empirical risk of the function that I get from my data is close to the best possible thing I could get. And the probability with which this holds depends on the Rashomon ratio. So if the Rashomon ratio is larger, so if I have more good models, then this bound is more likely to hold.
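In symbols (again, a schematic rendering of my own; the exact probability expression and constants are in the paper and are not reproduced here), the Rashomon ratio and the shape of the statement are roughly:

```latex
% Rashomon ratio: the fraction of the (finite) class of all models that are good,
% i.e. that have true loss at most theta:
\mathrm{Rratio}(\mathcal{F}_2,\theta)
  \;=\;
  \frac{\big|\{\, f \in \mathcal{F}_2 : L(f) \le \theta \,\}\big|}{|\mathcal{F}_2|}

% Shape of the bound: for any epsilon > 0, with high probability over the random
% draw of F_1 and over the data (a probability that grows with the Rashomon ratio
% and with |F_1|),
\hat{L}\big(\hat{f}_1\big) \;\le\; L\big(f^{\ast}\big) \;+\; \epsilon
% where \hat{f}_1 is the empirical risk minimiser over F_1 and f^{\ast} is the
% true-risk minimiser over F_2.
```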
And then what I can compute on my data is going to be close to the best possible thing I could get if I knew everything. Another nice thing about this bound is that it only depends on the size of F1, the smaller function class; it doesn't depend on the size of the larger function class.

OK, so, cool. This bound is saying that as long as the Rashomon ratio is big enough, then we're good. Now, I notice this probability here is a little inscrutable, so I'm going to give you some examples of its calculation. Let's say that we had 100,000 functions. If at least one percent of them are good, then the bound holds with 99 percent probability when you have at least 526 simple functions. Here's another example: again 100,000 models, and if at least half a percent of them are good, then the bound holds with 99 percent probability when just over a thousand of them are simple.

So, Cynthia, can I ask you something? Absolutely. In your definition of the Rashomon ratio, you have a theta parameter and then you have a gamma parameter; are they related? Yeah, sorry, they're the same: theta is the same as gamma. Sorry about that. It's fine. And what happens if F1 is the same size as F2? Then the situation degenerates, in a sense, and it boils down to our usual understanding of empirical risk minimisation. Well, if F1 is the same as F2, then all of the models are simple and the bound trivially holds, so you're right. And yeah, this is using regular learning theory; this side of it is just regular learning theory, so from just that part you get back to regular learning theory. We're leveraging learning theory here. Yeah, and sorry about the notation issue; there should have been a gamma. I changed recently from theta to gamma.

Sorry, do you mind if I just ask, what do you mean when you say 100,000 models? Obviously you have continuous coefficients. No, no, I'm in an abstract setting where everything is discrete, so I'm in a setting where there are finite hypothesis spaces. You want to think about this as being the case where your data live on a giant hypercube or in a giant categorical space, where even if you have continuous functions, the realisations of them are discrete. So you should just think of this as an abstract setting where everything is discrete. OK. All right.
Thank you. Yeah. And the thing I was going to say after this is that, actually, in general the idea generalises to the case where everything is continuous and smooth, but you have to replace some of the assumptions. In particular, this random draw assumption, which is unrealistic, you would replace with a smoothness assumption over the class of models, and you have to assume that F1 is a good cover for F2. And in that case, the whole idea generalises to more realistic settings.

Yes? Actually, Konstantinos Gutsiness is asking: what does it mean that we uniformly draw from F2? Then we need F2 to be specified, but it seems that you are addressing this point now, right? Yeah, yeah. You guys are one step ahead of me, and that's totally great, because it means people are following this lecture, which makes me happy; that's really hard to do remotely. OK, cool. All right. Great.

So, yeah, I gave some examples of this, and essentially what I was trying to say is that if the Rashomon ratio is sufficiently large, so if you have a large enough set of good models, then with high probability the best empirical risk over the simpler class is close to the best possible true risk of the larger class, and the generalisation guarantee comes from F1. So this is basically the simplest possible abstract setting where a large Rashomon ratio actually gives you a better guarantee on the quality of your performance on data, as opposed to knowing everything.

And as I mentioned, in the paper this is only the first theorem, and there is a series of theorems that I won't have time to get into today. But essentially we replace the random draw assumption with smoothness assumptions, so that everything is nice and smooth, together with the assumption that F1 is a good approximating set for F2. The other assumption you can make is that the Rashomon set contains a really big ball, so that as long as F1 approximates F2 nicely, there's at least one member of F1 in that big Rashomon set ball. Then you don't need the random draw assumption anymore, and it's much more realistic.

OK, so the results I just showed you, and the theorems in the continuous setting that I don't have time to talk about, suggest that as long as F1 is a good approximating set for F2 and the Rashomon set is large, then we might as well work with the simpler class, because we're not getting any benefit from using the more complex class.
You're going to get the same level of accuracy with the simpler class as with the complex class. So in other words, if decision trees, which are piecewise constant functions, approximate neural networks, which are smooth functions, and for my problem the Rashomon set is large, then I can just work with decision trees, because I'm not going to get any benefit from working with neural networks, given that decision trees approximate neural networks.

OK, now I want to point out that we're not doing standard learning theory. We're using standard learning theory, but this is not the same thing. Large Rashomon ratios pertain to the existence of good models, models with good generalisation and good performance. That's different from regular learning theory, right? Regular learning theory compares empirical risk to true risk for the same function, or for a class of functions. Here we're talking about the existence of models from a different class.

So the Rashomon ratio is not the same thing as the geometric margin used in support vector machines and other parts of learning theory, because the margin is measured with respect to one model, whereas the Rashomon ratio is a function of many models. It's not the same thing as the VC dimension: the VC dimension is data independent, it's a property of a function class only, whereas the Rashomon ratio is a property of a specific dataset; the Rashomon ratio is large for a specific dataset and function class. It's not the same thing as algorithmic stability, which talks about the way you search through the space to find a model: stability depends on making changes to a data set, whereas here the data set is fixed. It's not the same thing as Rademacher complexity: Rademacher complexity measures the function class's ability to fit noisy targets, whereas the Rashomon ratio uses fixed labels. And it's not the same thing as a flat minimum, which has become popular in neural networks: here we don't necessarily even have a continuous function space, and the Rashomon set could include many local minima.

OK. So that's what the theory says: large Rashomon sets allow us to use simpler functions without losing accuracy. Now, what happens in practice? What actually happens in reality? Well, usually you can't figure that out, because measuring the Rashomon ratio is not something you could normally do; it would require you to look at the whole model class, which is not practical. But today we're going to do it anyway, just to find out what happens.
OK, so don't do this at home, but we're going to do it today, and we're going to use the empirical Rashomon ratio, because we have data. So we'll use this quantity, which is the fraction of models that are good: the number of functions with low empirical loss, divided by the total number of functions. And again, you'd never calculate this in reality, but we're going to do it.

In particular, I want to do an experiment where I compare two things: the first is the size of the Rashomon set, and the second is the performance of lots of different machine learning models. For the first part of the experiment, I'm going to estimate the size of the Rashomon set, and the way I'll do it is using decision trees of depth seven. Why decision trees of depth seven? Well, because decision trees can be sampled; I can sample decision trees. Also, decision trees are piecewise constant functions, so they're a good approximating set for a much larger function space, and decision trees of depth seven are pretty powerful; they can actually fit a lot of data sets really well. So that's why we're going to estimate the empirical Rashomon ratio that way.

Oh wait, I forgot to mention; I just want to explain this a little more. Let's say we have a function space that we're interested in. That function space includes decision trees, support vector machines, and so on; it's just a big function class that we're interested in. And what we're doing, essentially, is this: here's the Rashomon set, and we're going to approximate that whole function class with decision trees. I claim that decision trees of depth seven are a good cover for this space, because they're piecewise constant functions and so they approximate smooth functions. Also, you know, they approximate random forests and boosted decision trees too, which are essentially combinations of trees. So anyway, I'm going to compute the fraction of depth-seven decision trees that are in the Rashomon set, and that's how I'm going to get this estimate of the size of the Rashomon set. Oh yeah, and these are all my trees; the little green dots are the trees.
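For concreteness, here is a rough sketch of how one might estimate this empirical Rashomon ratio by sampling depth-seven trees. It uses naive random-split trees from scikit-learn rather than the importance-sampling scheme we actually used, and the threshold value is a placeholder, so treat it as an illustration of the quantity rather than as our method.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def empirical_rashomon_ratio(X, y, theta=0.05, n_samples=5000, seed=0):
    """Estimate the fraction of sampled depth-7 decision trees whose
    empirical 0-1 loss is within `theta` of the best sampled tree.

    This is a crude stand-in for the importance-sampling estimator used in
    the actual experiments; random-split trees are just an easy way to get
    a diverse pool of depth-7 trees.
    """
    rng = np.random.RandomState(seed)
    losses = []
    for _ in range(n_samples):
        tree = DecisionTreeClassifier(
            max_depth=7,
            splitter="random",      # random thresholds -> diverse trees
            max_features=1,         # one randomly chosen feature per split
            random_state=rng.randint(1_000_000),
        )
        tree.fit(X, y)
        losses.append(1.0 - tree.score(X, y))   # empirical 0-1 loss
    losses = np.asarray(losses)
    best = losses.min()
    # fraction of sampled trees that land inside the empirical Rashomon set
    return np.mean(losses <= best + theta)
```

The returned number is the empirical Rashomon ratio for this sampled pool of trees: the fraction whose training loss lands within theta of the best sampled tree.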
For the second part of the experiment, I'm going to run a whole bunch of different machine learning methods on the dataset, and I want to know whether they perform similarly, because if they do, it means that they all live in a big Rashomon set. If this is true, it means the Rashomon set can accommodate functions of lots of different types, right, because it can accommodate a support vector machine as well as a decision tree, and so on. So I want to know whether all these methods perform similarly, whether they generalise, and how that correlates with the size of the Rashomon set (there's a small code sketch of this check just below).

All right, so that's the experiment; let me show you the results. The results are that when the Rashomon ratio we measure is large, then all the methods tend to perform similarly, and they generalise. That's the result I'm going to show you in the next couple of slides. And, interestingly, the result isn't always true. That surprised us, but we think it's an artefact of the way we measure the size of the Rashomon set: if features are correlated with each other, it's really easy to misestimate it, and so sometimes our measurement of the Rashomon set comes out too small.

OK, so, great. Let me show you the experiment. We used 64 data sets, so a large number of data sets: categorical data sets, real-valued data sets, regression data sets, synthetic data sets. The number of features ranged between three and 784, and the number of classes varied as well; we basically downloaded a whole repository and also generated synthetic datasets.

Now, when we had a large Rashomon ratio, these are the kinds of results we got: lots of different machine learning methods all perform very similarly. This is for different data sets here, five different machine learning methods; they're all performing very similarly, and they're all generalising between training and test. That's for large Rashomon ratios, and it always happened with large Rashomon ratios. For small Rashomon ratios, that's not always what we got: sometimes the accuracy would be all over the place, with different methods performing differently, and sometimes they wouldn't generalise as well between training and test as what you're seeing over here with the large Rashomon ratios. With small Rashomon ratios there was a variety of different results we could get; also, sometimes, cases where everything generalised really well.
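As an aside, a minimal version of that second part of the experiment, the run-many-methods check, might look like the sketch below. The dataset and the model list here are placeholders for illustration, not the 64 datasets or the exact hyperparameters from our study.

```python
from sklearn.datasets import load_breast_cancer           # stand-in dataset
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A handful of methods spanning simple to complex model classes.
methods = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=5000)),
    "CART (depth 7)":      DecisionTreeClassifier(max_depth=7),
    "RBF SVM":             make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random forest":       RandomForestClassifier(n_estimators=200),
    "boosted trees":       GradientBoostingClassifier(),
}

for name, model in methods.items():
    cv = cross_validate(model, X, y, cv=10, return_train_score=True)
    print(f"{name:20s} train {cv['train_score'].mean():.3f}  "
          f"test {cv['test_score'].mean():.3f} +/- {cv['test_score'].std():.3f}")
```

If the ten-fold test accuracies all land within a percentage point or two of each other and the train/test gaps are small, that is the pattern we associated with a large Rashomon set.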
But our theory really applies to large Rashomon sets, luckily. So the conclusion we're drawing is: if the Rashomon set is large, then you get these nice properties.

Cynthia, how do you define a Rashomon set as being large or small? You're taking the data sets and looking at the quantities of the Rashomon sets, and that's how you define large or small? Yeah. So, believe it or not, the value is around 10 to the negative 37. These values came from importance sampling with the set of decision trees of depth seven. So those are actually the larger values, and anything around 10 to the negative 38 or negative 39 or below counted as a small Rashomon ratio. We were just looking at it relative to what we would get on these different data sets.

Thanks. And another question: do you think it is likely to be shown, say for vision or speech data sets, that they allow for function spaces with large Rashomon sets? That is a great question, and I think they do, but there is a lot I could say about that. So, I work in interpretable machine learning, and we've been trying to design interpretable models for computer vision for a long time, and we've been able to create models for computer vision that are interpretable. But the definition of interpretability is different for computer vision than it is for other types of problems. For example, you would never want to build a decision tree on pixels; that's not interpretable, right? What you'd want to do is maybe case-based reasoning, where you say this part of the image looks like this part of this other image. And for those types of problems, we've been able to design interpretable neural networks that are constrained to reason in this way but still attain the same level of accuracy as regular black box neural networks.

And I think the only reason we're able to do this is because the Rashomon set permits it; the Rashomon set is large enough to permit it. So I can say that about computer vision. I obviously can't say it about every possible application of machine learning to every possible problem; I can only talk about the problems I've worked on. But I even started working in materials science, which is a super complicated domain, and even there we were able to find models that were interpretable to our human materials science colleagues and that were as accurate or more accurate than the black boxes we could construct.
And I say "more accurate" because sometimes the insight you get from the interpretability actually allows you to boost accuracy. So I hope that answers your question. Thanks. Yeah, that was a really good question; thank you for asking.

OK, great. So, as I mentioned, you really can't measure the size of the Rashomon set in practice, but that's OK, because we got a lot of information out of these experiments. In particular, if the Rashomon ratio is large, we found that all the methods performed similarly and generalised as well. Now, if the methods perform differently, it's likely to be a small Rashomon ratio. We're not completely sure about this yet, but we think it is a viable, possible explanation for what's going on. And it does explain why I and a lot of other people have found that algorithms perform similarly across many problems: it's because there's a large Rashomon ratio. So why do simple models perform well? Possibly because there's a large Rashomon ratio.

OK. We found something else, besides the results I just showed you, that we were really surprised about, and we found it on every single dataset that we examined. What we plotted is something called the Rashomon curve; I'll show you a cartoon of it before I show you the real thing. It's a plot of the Rashomon ratio versus the empirical risk. So let's say that you take a hierarchy of hypothesis spaces, from the simplest ones to the more complex ones, like decision trees of depth one, two, three, four, five, six and seven: embedded spaces. Now, when you go down this curve here, when you add more complexity, what should happen to these quantities?

Well, as you add more complexity, the best empirical risk for each function class goes down, because you have more models and you can fit better. So as we increase complexity from here to here, we expect to move this way. What about the Rashomon ratio? Well, the numerator goes up, because you have more good models, but the denominator goes up as well, because you have more models overall. So what happens? Oh, sorry, I just put the Rashomon ratio there for you again: the fraction of models that are good. Both the numerator and the denominator go up, because you have more models in both cases. As it turns out, the ratio goes down: the denominator increases much more quickly than the numerator.
So what happens is that you take your simplest function class, you run it, and then you increase complexity a little bit. The empirical risk goes down, but the Rashomon ratio tends to stay roughly constant. And then, all of a sudden, it just nosedives. Sometimes you overfit a little bit, so you can see the curve sway a little, but most of the time it goes down so quickly that you can't even see the flat part.

We were kind of surprised to see this, because we saw it on every single data set that we examined; we were not expecting to see anything like this. Sometimes you see the whole curve, going across and then down, but sometimes it just goes down; you don't even see the flat part, because if the simpler models already perform pretty well, it just goes down. And like I said, we saw it on all kinds of different data sets: sometimes you see the full little curve, sometimes you just see parts of it, like the vertical part, and there's always some kind of turning point up here, or else it's at the top. We saw it on every single data set that we experimented with.

OK, so that's what happens on the training set. What about the test set? Luckily, statistical learning theory tells us what the difference is between the training and test results. The generalisation is better for smaller function classes than it is for larger function classes, so you would expect to overfit when you have a really big function class, and then your true risk is going to get worse, right? So what the theory kind of tells us is that we should really be looking around this elbow, what we call the Rashomon elbow, because this is the simplest function class that describes the data well, that has low empirical risk. So this elbow seems to be a really good choice for model selection.

Let me show you the results; I'll show you the Rashomon curves for all 64 data sets. As you can see... maybe I should zoom in on some of these, just to show you what's going on. I just want to point out that we're averaging over 10 folds to plot these, both for the training and test empirical risks and for the Rashomon ratio. And what you should see is the curve going across and down, or just down, for all of these data sets. OK, so let me zoom in a little bit. Sometimes the theory was insightful.
Sometimes you really did see the elbow being the best model; that really worked. But, as you know, with randomness and statistical learning theory it's all probabilistic, so sometimes it doesn't really work out. Sometimes everything just always generalised, and the training and test points were right on top of each other. And then sometimes we never generalised, in which case you're just seeing these big uncertainty bands, right, big generalisation gaps between training and test. But regardless of which of these three situations you're in, the elbow just seems to be a good choice for model selection, because again, it's the simplest function class that describes the data well. And in no case did the elbow turn out to be a really bad choice, right? So the elbow model always seems to be a good choice for model selection.

That makes you wonder where you are relative to the elbow. Because in real problems you don't actually see the whole curve: you pick your function class, you run your method, and you don't know where you are on the curve; you could be anywhere on it. So you might want to figure out where you are on the curve, to figure out whether you're close to the elbow. And remember, you usually can't measure any point on this curve, because the curve requires the Rashomon ratio, which is the fraction of good models, and you can't measure that.

So what can you do? Well, if you are in this part of the curve, then different models with different complexity levels perform differently. So if you run a whole bunch of different machine learning methods with different levels of complexity and they all perform differently, then you're probably on this part of the curve, and in that case you probably want to increase your complexity and go toward the elbow. Whereas if you're in this part of the curve, then it doesn't matter which machine learning method you choose: they all perform very, very similarly; you have very similar empirical risk. In that case, you might want to try to make the model simpler, so you can go up toward the elbow, because you probably won't lose performance if you make the model simpler.

So I had been thinking that this might explain some of the things that I and others have been observing across problems and across data. There are some problems, like ImageNet, where the field has been designing more and more complicated models, and that keeps reducing error.
So maybe there we're still in this part of the curve, right? On the other hand, if you think about problems like MNIST, where no matter which method you use you get essentially 100 percent accuracy, then we may be on this part of the curve, and we could use simpler models and still get 100 percent accuracy. And those simpler models might have other properties: they might generalise better outside of MNIST, and they might be more interpretable.

And then there are these kinds of problems, the kind I usually work on, where it doesn't matter which machine learning method you pick; they all perform similarly, like the re-arrest prediction problem. I feel like we're in this part of the Rashomon curve there, because with re-arrest prediction you can get a really simple model that predicts just as well as your super complicated model, and there's just an inherent level of noise; if you try to get more accurate, you'll basically just overfit. So for these types of problems, I think we probably want to be walking up the curve, reducing complexity to get a simpler model that is interpretable but still maintains your level of accuracy, just walking up toward the elbow.

OK, so what I've gotten to is an easy check, a simple check, for the possible presence of a simpler yet accurate model: you pick several of your favourite machine learning methods and you run them all on the data set. If they all perform differently, your model class is maybe too small to include the elbow solution, and you can get a little bit more complex; so, use a more complex model class. If all the machine learning methods perform similarly, your model class might be a little bit bigger than you need, in which case you can try to find specialised models that will move you up the curve, models with special properties like interpretability, by decreasing your complexity.

OK, so, great. I've defined my condition, which is that the Rashomon set is large, and I've shown you that you don't need to calculate the Rashomon ratio; you can just try lots of different machine learning methods, and that gives you a sense of whether simpler solutions might exist.

Now, a lot of people don't believe me about this, or they're not interested, and that's fine. But sometimes it can get kind of silly, so I want to tell you a story that happened a couple of summers ago, when I found out about this explainable machine learning challenge.
It's called the Explainable Machine Learning Challenge, and my group decided we had to enter it. I do a lot of data science competitions; I actually coach Duke's data science competition team, where we enter data science competitions. So when this thing came out, we were like, oh, we've got to do it. But the goal of the competition was to create a black box and explain it. We got the data set; it was a nice big data set from FICO on loan defaults. The dataset had thousands of rows, each one a person with their whole credit history, and we had to decide whether or not they would default on their loan.

We looked at it and thought, you know, this looks like it has a good data representation. And I thought, could I be wrong? Could this be a problem with a good data representation where you still need a black box? So I said to my students, look, I don't know about this competition; just try running a bunch of different machine learning methods on the data set and see whether they all perform the same. A day or so later, they came back and said, yep, all the methods are performing the same. At that point we pretty much knew that the dataset had a large Rashomon set, so we said, OK, we think we can construct an interpretable model for this dataset.

Then we had a debate: should we follow the competition rules and create a black box and explain it, or should we actually try to create an inherently interpretable model? After about two seconds of debate, we decided that for a problem as important as credit risk, we should create an inherently interpretable model. So we did. We created a globally interpretable model, with a beautiful visualisation tool, that had the same accuracy as the best neural network we could construct. In fact, it's all live and you can actually play with it: you can go to the Duke Data Science website, where it's just running on the Duke servers, and you can play around with the FICO dataset and our model.

I'm just showing you a snapshot of it; I don't want to bring up the whole thing. Basically, it had a bunch of subscales, you could click on the subscales, and you would get points for different things. For instance, this is the delinquency subscore, and it's essentially a set of sparse logistic regression models. So you'd get, say, a point for the percentage of trades that were never delinquent; for this person, their trades were actually kind of delinquent, and that's why they got a point.
For the number of months since the most recent delinquency, they get points for that. So you just add up the points, each set of points translates into a little score, and you add up the scores. It was very nice; a nice decomposable model made of little sparse logistic-regression-type models.

So we sent this in to the competition, wondering what the judges would think of it, because I thought they were going to have no idea how to judge it, since it's an inherently interpretable model. And I was right: they had no idea how to judge it, and we totally bombed. We did absolutely terribly; we didn't even place. What actually happened was that the judges weren't allowed to play with the visualisations that people had constructed, so for every team that created a visualisation tool for their model, the judges didn't get to play with it. That put us at a major disadvantage. But luckily the judges realised that their judging criteria weren't very good, and they saw value in what we did, so they gave us an award; they actually created a little award for us, the FICO Recognition Award, acknowledging our submission for going above and beyond expectations with a fully transparent global model and a user-friendly dashboard.

I was really excited about this, and I thought, OK, I'll write a paper about it and send it in to a special issue of a journal on decision making. I was told to email the guest editor of the special issue to see if the paper was appropriate. So I emailed the person: dear esteemed professor at fancy Stanford University, we have this paper and we don't know whether it fits into the scope of the special issue; it's not a traditional methodology paper, it's an analysis of this competition dataset, including a globally interpretable machine learning model that didn't lose accuracy compared with the black boxes, and it won this award. What do you think? And he sent me back this email saying: Dear Cynthia, thanks for reaching out. This is an interesting paper, but I'm afraid it's not a good fit for the special issue. It's also related to my own recent work on explainability of neural nets. Is the FICO data still available? If so, could you share it? And I was like, oh my gosh: I send the guy a paper saying, hey, you don't need a black box for this dataset, and he sends me back an email saying, I don't care about your paper, but can you send me the data so I can create a black box for it and explain it?
452 00:46:04,590 --> 00:46:10,740 And so that email exchange is, unfortunately, the state of where things are at the moment. 453 00:46:10,740 --> 00:46:21,240 OK, so to summarise: I have defined a condition under which a simple yet accurate model is likely to exist, which is that the Rashomon set is large. 454 00:46:21,240 --> 00:46:23,550 I showed a simple check for large Rashomon sets, 455 00:46:23,550 --> 00:46:28,530 which is to run many different machine learning methods on your data to see if they all perform similarly. 456 00:46:28,530 --> 00:46:37,340 If they do, there's a good chance that you have a large Rashomon set and that you can find a simpler model. 457 00:46:37,340 --> 00:46:48,580 I introduced the notion of Rashomon curves, which we found to have that characteristic pattern for every dataset we examined. 458 00:46:48,580 --> 00:46:56,290 And so, now that we know that interpretable yet accurate models tend to exist, we can go find them. 459 00:46:56,290 --> 00:47:02,920 And that's what my lab works on: finding these models. So finally, at the end of the talk, I get to introduce myself. 460 00:47:02,920 --> 00:47:10,720 So, yeah, I lead the Prediction Analysis Lab. Most of my time is dedicated to the problem of optimal decision trees, 461 00:47:10,720 --> 00:47:16,000 so finding really tiny little if-then rule-based models like the Corals model I showed you earlier for recidivism. 462 00:47:16,000 --> 00:47:25,540 We have the fastest code for optimal decision trees right now, by about three orders of magnitude. 463 00:47:25,540 --> 00:47:30,280 I also work on medical scoring systems, which we've used for a lot of medical applications. 464 00:47:30,280 --> 00:47:33,100 This is a model called the 2HELPS2B score, 465 00:47:33,100 --> 00:47:39,220 which is used in intensive care units by doctors to help predict whether a patient will have a seizure. 466 00:47:39,220 --> 00:47:45,540 And that helps the doctors monitor the patient, prevent brain damage, and save lives. 467 00:47:45,540 --> 00:47:49,740 I also work on interpretable neural networks for computer vision. 468 00:47:49,740 --> 00:47:57,900 And as I mentioned earlier, we've shown that you can create interpretable models for computer vision that have the same accuracy as black boxes. 469 00:47:57,900 --> 00:48:09,290 And we're using them now in a collaboration with radiologists to help with reading mammograms, 470 00:48:09,290 --> 00:48:15,120 to provide a computer-aided decision 471 00:48:15,120 --> 00:48:21,770 rather than just an automated one. 472 00:48:21,770 --> 00:48:25,250 I also work on data visualisation and dimension reduction, 473 00:48:25,250 --> 00:48:32,320 where we're trying to project high-dimensional data onto two dimensions so that you can understand 474 00:48:32,320 --> 00:48:41,430 the high-dimensional structure in the data. So we're trying to preserve as much of the high-dimensional structure as possible when projecting onto 2-D. 475 00:48:41,430 --> 00:48:49,170 And then I also work as one of three professors on the Almost Matching Exactly project, 476 00:48:49,170 --> 00:48:57,660 where we're trying to match units almost exactly so that we can do interpretable causal inference.
477 00:48:57,660 --> 00:49:03,270 And then the last one is understanding the set of good models and the importance of variables, 478 00:49:03,270 --> 00:49:06,720 and you heard about one of the projects in this category today. 479 00:49:06,720 --> 00:49:15,210 And then finally, as I mentioned, I coach the Duke data science competition team, where we do things like automated computer poetry 480 00:49:15,210 --> 00:49:25,530 and image super-resolution. This year, we were competing in a citation labelling competition, which was really fun. 481 00:49:25,530 --> 00:49:33,350 And yeah, I love competing in data science competitions, and I've been coaching students for years to do that. 482 00:49:33,350 --> 00:49:39,400 OK, thank you very much. Thanks a lot for this very thought-provoking talk, 483 00:49:39,400 --> 00:49:43,840 Cynthia. Yeah, we have some minutes for questions. 484 00:49:43,840 --> 00:49:51,370 Judith Rousseau was asking something. Do you want to ask it yourself? 485 00:49:51,370 --> 00:50:02,430 Sure. So are there some situations where you would be not quite sure about the accuracy or relevance of your Rashomon curve estimates? 486 00:50:02,430 --> 00:50:08,860 I'm not sure my question makes sense, but do you trust them? 487 00:50:08,860 --> 00:50:16,750 Yeah, we actually don't trust our Rashomon curve estimates that much. We're only trusting them to determine whether the Rashomon set is large, 488 00:50:16,750 --> 00:50:21,940 because it's very difficult to estimate the sizes of really small Rashomon sets. 489 00:50:21,940 --> 00:50:27,520 So if our estimates say that the Rashomon set is small, then we just know it's small. 490 00:50:27,520 --> 00:50:33,580 We don't really know what its value is. And luckily, like I said, in practice you never really need to 491 00:50:33,580 --> 00:50:38,530 construct the Rashomon curve or the Rashomon ratio. 492 00:50:38,530 --> 00:50:43,420 Because the insight we're really gaining from it 493 00:50:43,420 --> 00:50:50,500 is that if you try a lot of different machine learning methods and they all perform similarly, 494 00:50:50,500 --> 00:51:03,880 then you probably have a large Rashomon set, and that's all we really needed to glean from those estimates. 495 00:51:03,880 --> 00:51:09,790 That makes sense. Thanks. OK. 496 00:51:09,790 --> 00:51:20,560 So you mentioned at some point the fact that a large Rashomon ratio is not the same as having flat local minima, 497 00:51:20,560 --> 00:51:27,550 and I understand that the reason is that you may have several small local minima that together 498 00:51:27,550 --> 00:51:36,130 give a large Rashomon set, or you may have a discrete hypothesis space, 499 00:51:36,130 --> 00:51:41,120 so the notion of a local minimum wouldn't really make sense there. 500 00:51:41,120 --> 00:51:46,820 But suppose that situation doesn't apply: 501 00:51:46,820 --> 00:51:53,750 what would then be the relation between the usual narrative in deep learning, 502 00:51:53,750 --> 00:52:02,120 for example, about flat local minima and their good properties, and there being a large Rashomon set? 503 00:52:02,120 --> 00:52:06,020 So if you have a flat minimum, then you do have a large Rashomon set, right? 504 00:52:06,020 --> 00:52:12,140 Right.
Because you would have this flat area, the flat minimum, and then you'd be able to put a ball in there. 505 00:52:12,140 --> 00:52:16,380 It's just that we can have a large Rashomon set without having a flat minimum. 506 00:52:16,380 --> 00:52:20,900 Well, I see. Yes. 507 00:52:20,900 --> 00:52:25,490 I was curious about where the name Rashomon comes from. 508 00:52:25,490 --> 00:52:30,380 Oh yeah. So that name came from Leo Breiman, who got it from the movie Rashomon. 509 00:52:30,380 --> 00:52:36,560 So there's a Japanese movie. I haven't watched it yet. I've been meaning to watch it, but I have children. 510 00:52:36,560 --> 00:52:42,390 And so it's kind of hard, you know, you don't want to watch a movie about violent stuff with the kids. 511 00:52:42,390 --> 00:52:49,790 Alright, so I haven't watched it. But it's a movie about a violent crime that occurred. 512 00:52:49,790 --> 00:52:57,110 And there are four different perspectives on the crime, and in the end, you end up thinking that there's no real truth, 513 00:52:57,110 --> 00:53:01,850 that there are just a lot of different ways of seeing the same thing, but that there's no truth. 514 00:53:01,850 --> 00:53:06,140 And so it's the same thing with models, right? There's no true model. 515 00:53:06,140 --> 00:53:12,110 There's just no underlying truth, right? 516 00:53:12,110 --> 00:53:15,020 We just have a finite dataset. So there's no truth. 517 00:53:15,020 --> 00:53:20,360 There are just a lot of models that perform well, just a lot of good explanations for what actually happened. 518 00:53:20,360 --> 00:53:25,640 And so the Rashomon set is the set of good explanations for the data. 519 00:53:25,640 --> 00:53:34,800 Mm-hmm. You know, I remember a paper that also talks about this; 520 00:53:34,800 --> 00:53:38,890 it mentioned the Rashomon effect and also Occam, and I wonder 521 00:53:38,890 --> 00:53:46,470 whether the Rashomon perspective and the Occam, model-complexity perspective are related or equivalent, 522 00:53:46,470 --> 00:53:50,730 or whether they are pointing to different aspects. What do you think about that? 523 00:53:50,730 --> 00:53:54,240 Well, I think that Rashomon enables Occam, right? 524 00:53:54,240 --> 00:54:02,640 Because a large Rashomon set says that you can find a simpler model that explains the data well. 525 00:54:02,640 --> 00:54:07,180 Mm-hmm. Yes. Yeah. 526 00:54:07,180 --> 00:54:12,530 I'm one of the people who was lucky enough to get a chance to meet Leo Breiman, 527 00:54:12,530 --> 00:54:17,960 although the time I met him, he told me that my paper on boosting wasn't necessary. 528 00:54:17,960 --> 00:54:21,860 He walked up to me during a NIPS poster session, 529 00:54:21,860 --> 00:54:29,000 and I had been trying to prove whether or not AdaBoost maximises the margin. 530 00:54:29,000 --> 00:54:34,910 And he said, well, I already proved that; if you want to have a real thesis, you could do something else. 531 00:54:34,910 --> 00:54:38,720 But, you know, I ended up becoming friends with him, and I remember him, you know, 532 00:54:38,720 --> 00:54:44,120 waving to me at the end of the conference. And I did manage to actually prove the theorem.
533 00:54:44,120 --> 00:54:50,180 You know, in the end, I did prove that AdaBoost does not maximise the margin. 534 00:54:50,180 --> 00:54:55,520 But yeah, it was interesting getting a chance to meet him. 535 00:54:55,520 --> 00:55:01,790 Yeah, just a really outspoken guy who's done amazing things because of his work in industry. 536 00:55:01,790 --> 00:55:10,160 And, you know, just kind of going out in the real world and understanding the value of things like interpretability, 537 00:55:10,160 --> 00:55:17,950 just creating decision trees and the value that they created for people. Hmm. 538 00:55:17,950 --> 00:55:31,970 OK. So it seems there are no more questions. Thanks a lot for your time and for your talk. 539 00:55:31,970 --> 00:55:38,570 Thanks. I wish I could meet all of you in person, but maybe one day. 540 00:55:38,570 --> 00:55:47,900 Yeah. Thank you. 541 00:55:47,900 --> 00:55:56,410 Bye bye. Thank you. Thanks, Michael. 542 00:55:56,410 --> 00:56:01,520 Thank you, bye. Yeah, thanks a lot.