Thank you. Thank you, Leslie, for this very kind introduction, and thank you very much for inviting me to come here. I'm really excited to be here and to tell you a little bit about differential privacy in general, and about my own research in this field specifically.

So the talk, as was said, is about algorithms that protect privacy. Algorithms are everywhere these days, because every computer program, and also every integrated circuit, is actually implementing an algorithm. So if you use your phone, your laptop, your home appliance, your car, your cleaning robot or certain medical devices, you will obviously have algorithms running there. They are usually very useful, but they can also collect data about you. Data is also collected in other ways: governments collect data about their citizens — in the U.S., for example, they do a census every ten years — and if you participate in research studies, data will be collected about you as well.

This data is usually collected for a certain purpose. Governments collect data so they can decide how to distribute funds, and so on; insurers use it to decide how much it costs to insure you; banks use it to make loan decisions; doctors nowadays use AI algorithms to decide whether they should perform certain medical procedures on you or not. On your phone, if you use, for example, Google's keyboard, it suggests word and sentence completions, and that is usually done based on data collected from its users.

So in general this sounds great — there is a certain usefulness in this data, there is utility. But on the other side, immediately comes the concern: what about our privacy? You have two conflicting goals. On the one side, you can use this data to do something useful in the world; on the other side, how can you protect privacy? And this is really the goal of the algorithms I will be talking about: they want you to be able to use the data, but in such a way that the privacy of the individuals is protected. I will try to explain how they achieve that.

Now, what are the challenges for privacy? Let me start with a simple one: statistics, if your database gives exact answers, might already leak a lot of information. For example, if we know the average salary of the people in this room, and then one person leaves, and then I give you the average salary again, you can easily figure out — if you know how many people were here — what the salary of the person leaving was. Just some simple math.
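Spelling out that simple math: with $n$ people in the room before one person leaves, and $\bar{s}_{\mathrm{before}}$, $\bar{s}_{\mathrm{after}}$ the two exact averages, the salary of the person who left is

$$ s_{\mathrm{left}} \;=\; n\,\bar{s}_{\mathrm{before}} \;-\; (n-1)\,\bar{s}_{\mathrm{after}}. $$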
So in general, by giving exact answers, you might already leak a lot of information that you might not want to leak.

Even worse, in machine learning these algorithms very often actually memorise data. For example, there was a paper in 2018 by Carlini et al. where they showed that Smart Compose — the sentence completion in Gmail — was leaking the Social Security number of one of the users, because the training data contained that information and the model happened to learn it by heart. So how can you deal with this?

Well, the first idea people had was to say: why don't we just anonymize the data, isn't that good enough? So I might release data, but I just don't include the names. And there are all kinds of famous examples showing that this was a bad idea. One of them was published by Narayanan and Shmatikov in 2008. What they did was use data that Netflix had published: every row was a user, every column was a movie, and the entry said whether this user liked this movie or not. Netflix had anonymized the data by removing the names of the users.

But what these authors showed was that if you combine this information with some public information from the Internet Movie Database — because many of the users who gave their ratings to Netflix had also published, under their own names, on the Internet Movie Database which movies they watched and whether they liked them — then by combining these two data sources they were able to figure out the names of many of the users. And they showed that, on average, four movies uniquely identify a user.

Okay, so it is a bad idea to just anonymize the data, because if you combine different sources you might be able to de-anonymize it. So the first idea doesn't really work by itself — of course you have to anonymize, but you have to do more. And what this example also shows us is that we need techniques that can handle composition of data sources, that can analyse what happens when you combine different data sources. Here the two data sources are Netflix and the Internet Movie Database.
And then the second idea researchers had was to generalise randomised response. Let me first explain what randomised response is, and then I will explain how they generalised it.

Randomised response is a technique invented by Warner, a sociologist, in the 1960s, and it has been used ever since without any big privacy concerns. It is good to know for later in the talk. It solves the following problem: assume you have some population and you want to find out what percentage p of the population has property A. Property A might be something embarrassing that you don't want to admit to — like snoring at night, for example, or worse things; I leave that to your imagination. So nobody ever wants to admit that they have property A, but you still want to find out what percentage p of people has this property.

And here is the solution that Warner came up with. You have your volunteers who participate, and instead of asking them directly whether they have the property — they would just lie to you — you ask them to run this simple algorithm. They should first flip a coin and not tell anyone the outcome of the coin flip. If it was heads, they should answer truthfully. If it was not heads, they should flip the coin again and answer according to that second flip: if it was heads, say yes, otherwise say no. They should not tell anyone how often they flipped the coin, and they should not tell anyone the outcome of the first coin flip — that is the private information. They just report the yes or no that this algorithm produces.

Then what you do is determine the percentage q of yes answers produced by this algorithm, and you output 2q − 1/2 as your estimate for p, the true answer. It turns out that this is an unbiased estimator for the true answer, so it will give you a good estimate.

Now why does this also preserve privacy? Because you have what is called plausible deniability: everyone answers yes with probability at least one fourth. If you have the property, you answer yes via the truthful branch with probability one half, and with probability one half you go to the second coin flip, where you again answer yes with probability one half — so another one fourth.
So if you have the property, you have a probability of three fourths of answering yes. If you don't have the property, you still have a probability of one fourth of answering yes. So everyone has a probability of at least one fourth of answering yes. That means if you answered yes and I come to you and say, "ah, you have property A", you can say: no, no, I actually don't have it — it was just the coin flips, not me. You can blame the coin flips for your answer, and nobody can prove otherwise.

Okay, so it is private. And note also that you need randomisation — randomisation that you keep private to yourself. Without randomisation this would not work. Randomisation is very important.

So now the question is: how can we make this more formal and more general? And that is exactly what differential privacy does. It is a mathematically rigorous definition — you will see it in a second — and it measures the privacy protection given by an algorithm. The goal is, first of all, to make it very unlikely that user data can be reconstructed, no matter how many data sources are combined. It gives you a way to measure how much impact the combination of data sources has on your privacy, so you can set a threshold and say: I am only allowing so much privacy loss, and if this combination would go above it, then I am not allowing the combination — meaning I am not giving out my data anymore. And it also gives a parameter that tells you, for each algorithm, how much data leakage, how much privacy loss, you incur. There are several current deployments, and I will talk about them later in the talk.

Okay, so let me give you the informal definition before the formal one. The informal definition is as follows. Assume you have a database with a thousand people in it, and you run some randomised algorithm on it — maybe computing the average salary or something. It gives you a noisy answer; noisy, because there was some randomisation involved. Now assume you have the same database except that the last person has changed: person 1000 is different, everything else is the same. You run the same algorithm on it, and you also get a noisy answer. What you want is that the algorithm returns the same answer for D and for D prime with almost the same probability.
Mathematically speaking, the algorithm executed on the first database gives you some probability distribution over the potential outputs, and on the second input it gives you another probability distribution. What you want is that these two probability distributions are very close, very similar. So it is a kind of smoothness requirement on the output of the algorithm: if the inputs are very similar — here they only differ in the last person — then the output distributions should be very similar. And if this is the only thing you remember from this talk, you got it. I will have it again on a later slide.

Now, why is this a good definition? Because then, based on the answer of the mechanism, it is very unlikely that you can distinguish between D and D prime — that you can figure out whether the input was D or D prime. You will not be able to distinguish them, because the output distributions are so close. It would be very unlikely that you can distinguish, I should say. That is the idea behind it.

More formally: we say two inputs are neighbouring if they differ in at most one entry — and I will be more precise on the next slides, because what "one entry" means depends on the application. Epsilon is some number larger than zero. We say a randomised algorithm A — sometimes called a mechanism — with a set P of possible outputs is epsilon-differentially private if for all possible subsets S of P and all possible neighbouring inputs D and D prime, the probability that the mechanism, started on D, outputs an element of S is at most e to the epsilon times the probability that the mechanism with input D prime outputs an element of S.

That is why I gave you the informal definition beforehand — this is somewhat of a mess. But let me try to interpret it. Epsilon you should think of as a small number close to zero, so e to the epsilon is just a little bit larger than one. And instead of "an element of S", for simplicity you can just think of S as a single element of P and of the output being equal to S — that makes it easier. So you can read it as: the probability that the mechanism outputs a given element is almost the same whether the input was D or D prime. That is what the definition says.
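Written out as a formula, the definition just stated is: a randomised mechanism $A$ with output set $P$ is $\varepsilon$-differentially private if

$$ \Pr[A(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in S] \qquad \text{for all } S \subseteq P \text{ and all neighbouring inputs } D, D'. $$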
And the randomness here, I should emphasise: the probability is only over the random choices of the algorithm. The input is worst case — we say "for all possible neighbouring D and D prime" — there is no input distribution. The probability is only over the randomness of the algorithm.

And of course, because we say this for all possible neighbouring D and D prime, by symmetry you also get exactly the inverse statement: the probability when you start on D prime is at most e to the epsilon times the probability when you start on D. So really you get an upper and a lower bound: the probability when you start on D is upper bounded by e to the epsilon times the probability for D prime, and lower bounded by e to the minus epsilon times it. And if epsilon is small, close to zero, then these factors are roughly one, and you get a good privacy guarantee.

One thing you should keep in mind, however, is that in practice epsilon is neither small nor close to zero. Think of epsilon as more like 7 or 8 when used by companies, so e to the epsilon is already pretty big. Even worse, in the US census they used epsilon equal to 19. You might argue, well, this is a probability, and there are 300 million Americans being counted, and only a subset of them is affected anyway — but still, e to the 19 seems like a big number. In theory, though, epsilon is small, close to zero.

Okay. So the crucial thing is this definition of neighbouring: we have this constraint on the output distributions for neighbouring data sets. Here again, the one thing you should remember from this talk: differential privacy means there is a strong requirement on the output distributions for neighbouring inputs — a sort of smoothness requirement, they need to be very close. And this is completely different from anything I had seen in algorithms before; it was not studied before, and that is why this is such an interesting research area, if you have not thought about this kind of algorithm before.

Okay. Now, why is this definition useful? Let me argue informally. Assume that user i's data is used in database D, but not in D prime. Then you could write down the harm for user i if the output is s, multiply it by the probability that the output is s, and sum over all possible s. That gives you the exact expected harm for user i if their data was used in the database. You can do the same thing for D prime, where i's data was not used. And because of the bound above, the expected harm with i's data included is at most e to the epsilon times the expected harm without i's data. So this definition is useful because you can bound the expected harm you incur by being included — and again, epsilon close to zero is good; epsilon 19 might not be quite as good. E to the epsilon is called the risk multiplier.
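In symbols — writing $\mathrm{harm}_i(s)$ for the harm to user $i$ when the output is $s$ (my notation for the quantity just described) — the argument is:

$$ \mathbb{E}\big[\mathrm{harm}_i(A(D))\big] \;=\; \sum_{s} \mathrm{harm}_i(s)\,\Pr[A(D)=s] \;\le\; e^{\varepsilon} \sum_{s} \mathrm{harm}_i(s)\,\Pr[A(D')=s] \;=\; e^{\varepsilon}\,\mathbb{E}\big[\mathrm{harm}_i(A(D'))\big]. $$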
Okay. So now, what is a good definition of neighbouring? Especially in the early days, people in differential privacy mostly looked at databases, which they thought of as tables: every row is a user, and every column tells you something about that user. The standard definition of neighbouring is then: two databases are neighbouring if they differ in at most one row — so one user is different.

But then people said, isn't this maybe too generous? And they restricted it more: two databases are neighbouring if they differ in only one entry of one row. Or, even more restrictive: one entry in one row differs by at most one.

As you notice, these definitions of neighbouring become more and more specific, and thus the requirement becomes less and less strict, because you are only required to have very similar output distributions on pairs that are neighbouring. If two databases are neighbouring under the most restrictive definition, they are also neighbouring under the broader ones, while there are pairs that are neighbouring under the broad definition and not under the restrictive ones. So the broad definition imposes more requirements, because it has more neighbouring pairs than the ones further down. People have studied these different kinds of requirements because, depending on the application, one might be more suitable than another, and with the more restrictive ones you may be able to get better bounds — better bounds on the error, which we will get to in a second. There are epsilon-differentially private algorithms for all of these notions.
But the error — I will formalise error in a second — is better if you have fewer constraints: you will get a smaller error.

You might also look at graphs — think of social networks: each user is a node, and if there is a friendship relationship between two users, there is an edge. Now you could look at vertex neighbouring for graphs: two graphs are neighbouring if they differ in at most one node. Or you can use edge neighbouring, where the nodes are actually known and you want to protect the privacy of the edges: two graphs are neighbouring if they differ in one edge. Or, for weighted graphs where the graph itself is known but the weights are not: two graphs are neighbouring if only the weight of one edge differs.

As you see, the notion of neighbouring is very closely related to what you are protecting. If you are only protecting the weights of the edges, you are not protecting anything else — not the nodes or the edges of the graph — meaning they might just as well be publicly known. If you use edge neighbouring, then the nodes might just as well be publicly known, because you are only protecting the edge relationships: who is in a relation with whom. And if you use vertex neighbouring — the strongest again — then you are protecting even the nodes of the graph. So what you want to protect immediately carries over to your definition of neighbouring, and your definition of neighbouring — whether it is more strict or less strict — carries over to how much error you need. And since there are so many possible applications, and so many combinations and questions about the right definition of neighbouring, there is lots of work to be done.

Now let's go back and analyse randomised response. I claim randomised response is differentially private; the question is just for which value of epsilon. Before doing that we need a definition of neighbouring, because, as I told you, the most important thing in the definition of differential privacy is to specify exactly what neighbouring means. Here, for randomised response, you want to protect the bit of each individual user, so you say two inputs are neighbouring if they differ in the bit of at most one user.
So think of the input as a bit string D; then two inputs D and D prime are neighbouring if one bit is different — it is flipped. That is the neighbouring definition you would use here.

Now, why is differential privacy such a great definition? Because it fulfils two main properties. The first is the post-processing theorem, which says: assume you have a mechanism M that is epsilon-differentially private and maps the data to some set R, and then you have some function F that is applied to R and computes some R prime. Then the combination, where you first run M and then run F on the output of M, is again epsilon-differentially private.

Informally speaking: you had some sensitive data, you ran it through a mechanism that was epsilon-differentially private, and now the data is "clean" — you can do any post-processing on the output of the mechanism. Not on the original data: F is not allowed to touch or even see the original data, it is only allowed to see the output of the mechanism. But as long as it only looks at the output of the mechanism, the whole thing — the combination of first running the mechanism and then the function — is again epsilon-differentially private.

For randomised response this means the following (sorry, this animation was not a good idea). You can really break it into two pieces. The first part is what every participant had to do — the coin flips producing the yes or no answers. This is the mechanism, and we will show that this part is epsilon-differentially private. Then taking all these answers, computing the percentage q of yes answers and outputting 2q − 1/2, is just a function F — just post-processing of the differentially private output we got from the mechanism. So in order to analyse randomised response I only need to analyse the first part. The second part never looks at the initial data of the real users — it has no contact with the users, it only looks at the output of the mechanism.

Okay, so this is one good theorem. And now you can actually do the analysis — I will not do it here; you look at the ratio of the probabilities: if your bit is set or not set, what is the probability that you give a particular answer, and so on. What you find is that this ratio is at most 3, which is e to the ln 3. So for randomised response this is your homework — I am a professor, right, whenever it gets hard it becomes homework. For randomised response, epsilon is ln 3. I should say there are variants of randomised response where you do not use a fair coin but a biased coin for the coin flips, and if you do that you can make it epsilon-differentially private for any epsilon larger than zero. The simple variant I showed you gives ln 3.
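Here is a minimal sketch, in Python, of the fair-coin randomised response mechanism and the 2q − 1/2 post-processing estimator described above; the function and variable names are mine, not from the talk.

```python
import random

def randomized_response(has_property: bool) -> bool:
    """Local randomizer run by each participant (fair coins).

    First coin: heads -> answer truthfully.
    Tails -> flip a second coin and report heads as 'yes', tails as 'no'.
    This local step is ln(3)-differentially private.
    """
    if random.random() < 0.5:          # first coin: heads
        return has_property            # answer truthfully
    return random.random() < 0.5       # second coin decides the answer

def estimate_fraction(answers: list[bool]) -> float:
    """Post-processing: unbiased estimate of the true fraction p.

    Since Pr[yes] = p/2 + 1/4, solving for p gives p_hat = 2*q - 1/2,
    where q is the observed fraction of 'yes' answers.
    """
    q = sum(answers) / len(answers)
    return 2 * q - 0.5

# Toy usage: 100,000 simulated participants, 30% of whom have the property.
if __name__ == "__main__":
    true_bits = [random.random() < 0.3 for _ in range(100_000)]
    noisy_answers = [randomized_response(b) for b in true_bits]
    print(estimate_fraction(noisy_answers))   # should be close to 0.3
```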
Okay, so that was the first useful property. The second useful property of differential privacy is composition. Assume you have a first mechanism M1 that is epsilon-1-differentially private and a second mechanism M2 that is epsilon-2-differentially private, and you want a mechanism that outputs both: the output of M1 and the output of M2. Then the theorem says that this combination is (epsilon 1 + epsilon 2)-differentially private, and that is not too hard to show from the definition of differential privacy. Now add post-processing: once you have output these two things, you can apply any function g to the two outputs — post-processing as before — and that tells you that g applied to the outputs of the two mechanisms is also (epsilon 1 + epsilon 2)-differentially private. So this allows you to analyse the harm, the influence on your privacy, of having multiple mechanisms look at the same data and each output something about it.
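As a formula, composition plus post-processing says: if $M_1$ is $\varepsilon_1$-differentially private and $M_2$ is $\varepsilon_2$-differentially private, then

$$ D \;\mapsto\; \big(M_1(D),\,M_2(D)\big) \quad \text{is } (\varepsilon_1+\varepsilon_2)\text{-differentially private}, $$

and hence so is $g\big(M_1(D),\,M_2(D)\big)$ for any post-processing function $g$.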
Okay. So look back, for example, at the movie data. Let's say D is the true data: for all the users, whatever they watched and whether they liked it or not. Some of that data went to Netflix, so Netflix would be the first mechanism, and it discloses something. As we said before, Netflix only anonymized the data, and you can show that this does not fulfil epsilon-differential privacy for any finite epsilon — the epsilon would be infinite. The second mechanism is the public Internet Movie Database; the users might not write about all the movies they have seen, so this is again just some incomplete subset of the truth, and it is simply disclosed as is — again not at all private. So you have a second mechanism with epsilon being infinite, and it is not surprising that the combination of the two again gives no privacy. But if the first one had had some privacy epsilon 1 and the second one privacy epsilon 2, then the combination — having both of them out there — would have a privacy loss of epsilon 1 plus epsilon 2. Okay, so this is the second very nice property of differential privacy.

Now, there is one thing people usually ask me at this point, so I created a slide for it: there are two different notions of differentially private mechanisms. The one you just saw is called a local differentially private algorithm — randomised response is an example. Here the user has the true data, whether they have property A or not, and they output a noisy variant of it, because they flip these coins; they give the noisy variant to the server, and all the server does is post-processing. As we know, the server cannot do any harm, because the data was already privatised. The other notion is central differential privacy. That is more like what you would see with medical databases, where the users actually give their raw data to the server, to the database. So the database actually has the true answers in it, but then it runs some differentially private mechanism over it, which outputs a differentially private result. So the difference is that here the server actually has all the data, whereas in the local model the server never had the true data, only the privatised data.

In this talk I will talk about the central model — apart from randomised response — just because it is easier and there is more work in the central model. But of course there is also a lot of work in the local model, because that is very often what happens when internet companies collect data about their users: usually it is the local model, where the user's machine or phone sends differentially private data to the server.

Now, this is all great, but are there disadvantages? Yes, there are two main disadvantages. The first one is that differentially private algorithms are usually slower. Before, you just had to say "yes, I have property A" or not; now you have to flip these two coins.
And, you know, I did this with volunteers in some talk, and they had to think about when to do what — so it is not so easy anymore, it is a bit more work. In general, differentially private algorithms are slower than non-differentially-private algorithms, and the question is just by how much.

The second, even bigger problem is the loss in accuracy — which we also call error. Take randomised response, for example. What we are interested in is p, which would have been the true answer. The way we measure the error is as follows: what is the parameter alpha such that, with probability at least two thirds, the output falls into the range between p minus alpha and p plus alpha? So with probability at least two thirds the output must fall into this range, and then we call that alpha the accuracy. There is also a variant where the two thirds is an arbitrary parameter beta, but for this talk I just plugged in two thirds. So the accuracy is the alpha such that, with probability at least two thirds, the answer you get falls into this range — and of course you want alpha to be as small as possible, because you want to be as close as possible to the true answer.

And so, whenever you write a paper about differential privacy, you should of course prove that your mechanism is epsilon-differentially private — or whatever notion of privacy you use — but then please also always prove a bound on the accuracy. That is very important. I can give you an algorithm for any problem that is always epsilon-differentially private: I always output zero. It is definitely private, it does not leak any information, but it has miserable accuracy. So just writing a paper that says "I have an epsilon-differentially private algorithm" does not impress me at all; you had better also show that it has small error — in theory or with experiments. This is a very important point, and I am emphasising it because I see papers where this is not done, and I keep wondering why people would not do that.

Okay, so what do we get for randomised response? Well, you can show, using standard concentration bounds, that alpha is at most some constant times the square root of one over n. Which makes sense: the more volunteers participate, the smaller alpha gets, the better the accuracy. If I have only three of you, it might be very noisy.
If I have ten million, it is much better.

So what have we seen so far? Randomised response is ln 3-differentially private, and it is O(square root of 1/n) accurate. And in general the definition of accuracy is as follows: we say a mechanism is alpha-accurate if for all inputs D, the probability — again over the randomness of the algorithm — that the true output and the noisy output differ by at most alpha is at least two thirds. So there is a good chance that the error is at most alpha.

And the goal in research on differentially private algorithms is always: you have some computational problem — not something as simple as computing an average, something more complicated, say you want to find a maximum cut in a graph, or your favourite problem — and you want an algorithm that is differentially private, that has a small additive error, so small alpha, and whose running time, and also space usage, is almost as good as that of the best non-private algorithm. These are the three things you are trying to optimise, usually in that order — which means people in papers usually stop after the first two, and sometimes already after the first.

Okay — let me stop complaining — there has been a lot of research in differential privacy on a whole bunch of problems. I will only talk about the simplest, binary counting, in this talk. But there is also lots of work on graph properties, clustering, machine learning — differentially private machine learning is of course a very hot topic — and also mechanism design, so auctions, for example.

Before getting into more definitions, let me give you some deployment examples, so that you are motivated to keep listening to me — apart from having to sit here anyway. There have been a few large-scale deployments. The first one was by Google: the RAPPOR system, which was in Google Chrome — these are the authors here. What it did was detect the most frequently chosen homepages and the most frequently chosen running process names in Google Chrome, and it sent that information back in a differentially private manner, so that nobody could come to you afterwards and say that you always visit this particular homepage. Then, a little more recently, it is in Gboard.
That is the Google keyboard on Android. So if you are typing on Android, you might well be using this keyboard. They also did learning for word prediction — this is when you type "Hi" and it suggests the next word; in this example, you type "Hi, how" and it suggests "are you". This is also done based on differential privacy; this is the URL of the paper. They did more things, and they give an evaluation of the whole system.

Then, as I mentioned before, it has been used in the 2020 U.S. census. There was a lot of battle about whether they should use differential privacy in the census or not. In the 2010 census there had been some awkward privacy leaks, and that is why the people running the census said: we need to use some protection, so let's use differential privacy. The statisticians kept saying no, the algorithms people said yes, and the argument went back and forth. The decision that was made was a compromise: differential privacy is used, but epsilon is 19.

The statisticians are still very unhappy. Just recently I read an article about some Native American groups: if you have a small group, the noise that you add makes a lot of difference, and some Native people are now complaining that, based on that data — comparing the 2010 data with the 2020 data — they cannot say how much their population has really grown or where the needs are, and so on. The census data is used to determine funding for Native people, so it is of course important for them to be able to say how many they are. So there are complaints about this, especially for small populations. The statisticians also complain because, according to the released data, some people in Manhattan live in the Hudson River, and that should not happen either — but then the algorithms people say that is just noise, it is not many people. Anyway, there are complaints and there is discussion, and it will be interesting to see what happens in the 2030 census.

And then there was also a deployment at Apple. This is one of those strange papers, because it has no authors except for "Apple Privacy Research group" — very private, I guess.
In it they showed, for example, how you can use differential privacy to determine frequently used emojis. For example, for English-speaking users — if you have an English keyboard — the smiley face with tears of laughter is very popular, and the next emoji is only the second most popular, at a much lower level; for the French there is less laughing but much more love, for example. They figured that out with differential privacy. They also looked at memory usage, at which webpages drain your battery a lot in Safari, and things like that, and at which health topics people are interested in — because they also sell you the Apple Watch and they want to know whether you are someone who might be interested in health. So they also collect which health topics might be interesting.

And finally there was also work at Microsoft, published by these people, and this is in Windows 10. They use it for estimating application usage statistics, and — this is a picture from their paper — they also did a very careful evaluation.

Okay, so you can see that many big companies are already using differential privacy when they collect data about their users. So now let me switch back and tell you a little bit more about the theory behind it.

In the static setting, the simplest problem we are going to look at is binary counting. Here you have a binary string D of length n — think of n users — and you just want to figure out the sum, the number of ones in the string. This is basically the same as randomised response before: finding the percentage of the population with the property is the same as finding how many have it and then dividing by n. But I will now give you a different mechanism that is more efficient — meaning it has lower error, so better accuracy.

So D and D prime are neighbouring if they differ in one bit. And now it turns out there is this very nice distribution called the Laplace distribution. It existed before, but it looks as if it was specifically designed for differential privacy — which of course it was not, but it has only become really popular through differential privacy. So this is the density: the Laplace distribution has a parameter, which here I set to one over epsilon, and then you get this function.
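For reference, the density on the slide — the Laplace distribution with scale parameter $b = 1/\varepsilon$ — is the standard

$$ f(x) \;=\; \frac{1}{2b}\, e^{-|x|/b}, \qquad b = \frac{1}{\varepsilon}, $$

which is symmetric around zero and has variance $2b^2 = 2/\varepsilon^2$.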
So at each value x this is the density you get, and the distribution has a certain variance. If the parameter — and with it the variance — is very large, like the light blue curve here, you get a distribution that is very spread out: the probability that you draw minus ten, or plus ten, is still not negligible. The smaller the parameter, the smaller the variance, the more peaked the distribution is. I should say it is symmetric around zero — that is very important. The red one here, for example, is very peaked: if you sample from it, with probability almost one half you get zero.

Okay, so what is the Laplace mechanism? The Laplace mechanism is very simple: you take the true answer, the number of ones, and you add a random variable drawn from this Laplace distribution. And it turns out you can show that this is epsilon-differentially private and that it is ln 3 over epsilon accurate.

So what is the intuition? Well, if you want good privacy — small epsilon — then one over epsilon is large, which means the variance of your distribution is large, so you are more like the light blue curve. That means it is not too unlikely that you add a number like ten to your count. But remember, the inputs differed only by one: you had two input strings that differ in one bit. Say they differ in the first bit: the first string has a zero there, the second string has a one. Now consider the probability that for the first string you draw the noise value ten — so you output its true count plus ten — and the probability that for the second string you draw nine, in which case you output exactly the same number, because its true count is one higher. These two probabilities are very close; the only difference is between drawing a ten or a nine. And you can show that by taking the parameter one over epsilon for the Laplace distribution you get exactly epsilon-differential privacy. It seems as if it was perfectly made for that.
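A minimal sketch, in Python, of the Laplace mechanism for binary counting as just described; numpy's Laplace sampler is used for convenience, and the names are mine.

```python
import numpy as np

def laplace_count(bits: list[int], epsilon: float) -> float:
    """epsilon-DP count of ones: true sum plus Laplace noise of scale 1/epsilon.

    Changing one bit changes the true sum by at most 1 (sensitivity 1),
    which is why scale 1/epsilon suffices for epsilon-differential privacy.
    With probability 2/3 the additive error is at most ln(3)/epsilon.
    """
    true_sum = sum(bits)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_sum + noise

# Toy usage: 1,000 users, epsilon = 1.
if __name__ == "__main__":
    data = np.random.binomial(1, 0.4, size=1000).tolist()
    print(sum(data), laplace_count(data, epsilon=1.0))
```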
And for the accuracy — well, you need to do a little bit of work, but nothing too bad, just some properties of the Laplace distribution. And this is very nice: it is independent of n. The accuracy is independent of the size of the input. That is the nice thing here, so that is good.

But what if the data is dynamic? This is now finally where my own work comes in, because I always like to look at the dynamic setting, where the input changes. Now, people in the field cannot agree on what to call the dynamic setting — they certainly don't call it dynamic. They call it differential privacy under continual observation, or under continual release, or on streaming data, or sometimes on data streams. Very creative, but it is all the same: you get the data one piece after the other.

So in the general model you have a stream of input data from some universe U, of some length, let's say T. The privacy definition needs to be adapted — I will talk about that in a second — but the key point is that every time you get some input, you need to give some output again. Before, in the static setting, you had one input and you gave one output, so you had just one chance of leaking information. Now you have T chances of leaking information, because you need to give an output after each input: when the next piece of input arrives, you need to give an output again. For binary counting, say, you need to keep outputting the sum so far. So you have T chances to leak information, and intuitively the error should go up — you need to add more noise.

Okay, so one problem — which is actually still open, and I will tell you what is open about it — is continual binary counting. You have a stream of input bits, zeros and ones, of length T, and after the t-th bit is given, for every t between 1 and capital T, you need to output the sum of all the bits so far.
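Stated compactly, the continual binary counting problem is: given a stream $\sigma = (\sigma_1,\dots,\sigma_T) \in \{0,1\}^T$, after the $t$-th bit arrives the mechanism must output an estimate of the prefix sum

$$ c_t \;=\; \sum_{i=1}^{t} \sigma_i \qquad \text{for every } t = 1,\dots,T. $$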
492 00:44:34,500 --> 00:44:40,890 So before we said over all neighbouring databases, and now neighbouring means differing in one time step. 493 00:44:41,330 --> 00:44:45,900 Before we also had over all possible outputs; the outputs are now T-dimensional vectors. 494 00:44:46,260 --> 00:44:49,650 And then for any neighbouring inputs you have this definition again. 495 00:44:51,040 --> 00:44:54,519 And as it is late in the talk and I don't want to bore you too much, 496 00:44:54,520 --> 00:44:59,020 I will skip the definition of user-level privacy, but there is a more 497 00:44:59,020 --> 00:45:04,090 advanced, more stringent, harder version, which is called user-level privacy. 498 00:45:04,360 --> 00:45:07,960 So everything that I'm going to say now is about event-level privacy. 499 00:45:09,810 --> 00:45:16,790 And now you care again about the accuracy. So what is the proper definition of accuracy for the continual setting? 500 00:45:16,800 --> 00:45:20,940 Well, now you are outputting a T-dimensional vector, right, because you are outputting T numbers. 501 00:45:21,660 --> 00:45:27,750 And the error definition that is most popular is that you take the maximum over all time steps. 502 00:45:28,020 --> 00:45:34,800 So if you think of this as being a vector: f of sigma is the T-dimensional vector with the true answers, 503 00:45:35,220 --> 00:45:39,180 and you compare it with the vector of outputs that the algorithm gives you. 504 00:45:40,050 --> 00:45:48,360 Okay. And we will see this in a second. Then we take the max, the infinity norm, which is the maximum over all entries of the difference of these two vectors. 505 00:45:49,540 --> 00:45:53,139 Okay. So we say the mechanism is alpha-accurate 506 00:45:53,140 --> 00:46:00,850 if, for all input streams of length T, the error of the true answers versus the outputs of the mechanism, 507 00:46:00,850 --> 00:46:02,980 that is, the maximum over all time steps of this difference, 508 00:46:03,340 --> 00:46:08,920 is at most alpha with probability at least two thirds. 509 00:46:11,050 --> 00:46:14,170 Now, what is known about continual binary counting? 510 00:46:15,130 --> 00:46:21,850 Well, you could do it in a very naive way. The naive way would be to just, at every time step, 511 00:46:21,860 --> 00:46:26,200 add Laplace noise. This is like a static algorithm: 512 00:46:26,200 --> 00:46:31,060 run the static algorithm every time. But now you are outputting T times. 513 00:46:31,610 --> 00:46:39,400 And if you do composition, then the differential privacy of these T outputs is the sum of the epsilon values. 514 00:46:40,240 --> 00:46:45,910 Okay. Since you are outputting T of them, you have to make sure that each of them is epsilon over T differentially private. 515 00:46:47,040 --> 00:46:47,370 Okay. 516 00:46:47,820 --> 00:46:55,200 So you have to require that each output is epsilon over T differentially private, which means you need to use Laplace noise of scale T over epsilon. 517 00:46:56,690 --> 00:46:59,810 And then you can just use the composition that I showed you before, 518 00:47:00,110 --> 00:47:07,700 and you sum over all these T outputs, each time epsilon over T, 519 00:47:07,850 --> 00:47:11,510 so you get that you are epsilon-differentially private. So that's good. 520 00:47:11,630 --> 00:47:19,070 But what is your accuracy?
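Before getting to that, the two definitions just sketched can be written out in standard notation; the symbol M for the mechanism and f(sigma) for the vector of true prefix sums are labels of my own choosing that match the spoken description.

```latex
% Event-level neighbouring: \sigma \sim \sigma' iff they differ in exactly one time step.
\Pr[M(\sigma) \in S] \le e^{\varepsilon}\,\Pr[M(\sigma') \in S]
    \quad \text{for all neighbouring } \sigma, \sigma' \text{ and all } S \subseteq \mathbb{R}^{T},
\qquad
M \text{ is } \alpha\text{-accurate if }
\Pr\Big[\max_{1 \le t \le T} \big| f(\sigma)_t - M(\sigma)_t \big| \le \alpha \Big] \ge \tfrac{2}{3},
\quad \text{where } f(\sigma)_t = \sum_{i=1}^{t} \sigma_i .
```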
If you use Laplace noise of scale T over epsilon, your accuracy is T over epsilon, which is really not good. 521 00:47:19,100 --> 00:47:22,730 It is the number of time steps over epsilon. Okay. 522 00:47:23,240 --> 00:47:29,120 And the question is: can you do better? And the answer is yes, but not as well as you would like to. 523 00:47:29,990 --> 00:47:37,730 So what can you do? You can get an accuracy of log squared T over epsilon. 524 00:47:39,170 --> 00:47:44,810 That's good, much better than T. But there is also a lower bound of log T. 525 00:47:45,290 --> 00:47:51,560 This was shown by Dwork, Naor, Pitassi, and Rothblum, and by Chan, Shi, and Song. 526 00:47:51,710 --> 00:47:58,700 They actually showed slightly worse bounds, but you can basically get these bounds. Okay, so there is actually a gap, 527 00:47:59,000 --> 00:48:00,470 yeah, between the lower bound 528 00:48:00,590 --> 00:48:07,430 (you cannot do better than log T; no matter which mechanism you pick, there cannot be a mechanism better than log T) 529 00:48:08,240 --> 00:48:11,630 and this upper bound of log squared T. 530 00:48:12,810 --> 00:48:16,290 Now, this lower bound of log T is actually really a disaster, right? 531 00:48:16,440 --> 00:48:23,190 Because T goes to infinity: think of how you use your phone every day, probably hundreds of times per day. 532 00:48:23,280 --> 00:48:31,080 So if every time it sends some information back, whether you looked at health-related data or not, for example, 533 00:48:31,110 --> 00:48:39,490 then your T very quickly becomes very big. And this says that the error has to grow by a factor of log T, 534 00:48:40,270 --> 00:48:44,310 and the companies who are collecting data about you don't like this, right? 535 00:48:44,330 --> 00:48:46,120 They don't want to have such a bad accuracy. 536 00:48:46,870 --> 00:48:54,670 So what they do, supposedly (that's what they say), is that every time period, let's say every month, they give every user a certain privacy budget. 537 00:48:55,390 --> 00:49:01,450 And whenever they use your data in some computation, they subtract something from your privacy budget. 538 00:49:02,200 --> 00:49:09,910 And if your privacy budget reaches zero, they don't use your data anymore, for whatever training or whatever algorithms they run. 539 00:49:10,270 --> 00:49:18,780 They don't use your data anymore until the next month, when they reset your privacy budget to the initial value again. 540 00:49:19,470 --> 00:49:22,590 Which basically means that they are just delaying 541 00:49:22,650 --> 00:49:29,100 the privacy loss, right. They are just delaying it a little bit, because for a while they are not using your data. 542 00:49:30,000 --> 00:49:35,900 But that is how they deal with it. And on the other side, you know, if they don't want to have such an accuracy loss, 543 00:49:35,910 --> 00:49:40,050 that is really the only thing they can do, because otherwise, you know, it is too bad. 544 00:49:40,830 --> 00:49:46,980 Now, it turns out there is a slightly relaxed version of differential privacy called epsilon-delta differential privacy, which I don't want to define. 545 00:49:47,400 --> 00:49:50,400 And there you can actually do a bit better: 546 00:49:50,640 --> 00:49:55,560 there you can get log T to the 1.5, which was shown by Tung Xi and others.
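As a rough illustration of the gap just described, here is a hedged Python sketch contrasting the naive per-step Laplace approach with the standard binary-tree (dyadic interval) counter behind the log-squared-type upper bound; this is the textbook construction, not the exact algorithms of the cited papers, and all function names and parameters are my own.

```python
import numpy as np

def naive_counter(bits, epsilon, rng=None):
    """Fresh Laplace noise at every step. Basic composition over T releases
    forces each release to be (epsilon/T)-DP, i.e. noise of scale T/epsilon,
    so the per-step error is on the order of T/epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(bits)
    return np.cumsum(bits) + rng.laplace(scale=T / epsilon, size=T)

def tree_counter(bits, epsilon, rng=None):
    """Binary-tree counter: keep one noisy sum per dyadic block.
    Each bit lies in at most L = ceil(log2 T) + 1 blocks, so noise of scale
    L/epsilon per block keeps all releases jointly epsilon-DP, and every
    prefix sum combines at most L noisy blocks, so the error is roughly
    log^2 T / epsilon instead of T/epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(bits)
    L = int(np.ceil(np.log2(max(T, 2)))) + 1
    noisy = {}  # (start, size) -> noisy sum of bits[start:start+size]
    for k in range(L):
        size = 1 << k
        for start in range(0, T, size):
            noisy[(start, size)] = bits[start:start + size].sum() + rng.laplace(scale=L / epsilon)
    out = []
    for t in range(1, T + 1):
        total, pos, remaining = 0.0, 0, t
        while remaining > 0:            # greedy dyadic decomposition of [0, t)
            size = 1 << int(np.floor(np.log2(remaining)))
            total += noisy[(pos, size)]
            pos, remaining = pos + size, remaining - size
        out.append(total)
    return np.array(out)

rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=1024)
truth = np.cumsum(bits)
print(np.max(np.abs(naive_counter(bits, 1.0) - truth)))  # much larger max error
print(np.max(np.abs(tree_counter(bits, 1.0) - truth)))   # much smaller max error
```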
547 00:49:56,220 --> 00:50:00,780 And so here we did some work together with Baga and Apache. 548 00:50:01,050 --> 00:50:04,740 What we did is we actually figured out the constant. 549 00:50:04,740 --> 00:50:10,530 The earlier work had a really bad constant. We gave a different algorithm where we show the exact constant. 550 00:50:11,010 --> 00:50:15,910 And the other advantage of our approach concerns the accuracy loss: 551 00:50:15,930 --> 00:50:21,970 our noise is very smooth, while for the other algorithm the noise was very non-smooth, as you can see. 552 00:50:21,990 --> 00:50:26,990 So for each point in t this plot puts a dot, and you see that it was very jumpy. 553 00:50:27,000 --> 00:50:34,379 The reason being that the amount of noise, the number of Laplace noise terms that you add up, depended on the time step. 554 00:50:34,380 --> 00:50:38,070 It might be just one Laplace noise term, or it might be log T many Laplace noise terms. 555 00:50:38,070 --> 00:50:41,340 And so you got very non-smooth noise in that setting. 556 00:50:41,340 --> 00:50:46,230 And in practice that is a disaster. Say you want to count how many Covid cases there are, right? 557 00:50:46,470 --> 00:50:49,920 You don't want something where you have to say: I don't know whether the cases went up or not. 558 00:50:49,920 --> 00:50:53,670 Maybe it was the noise that made the count go up, or maybe the data really goes up. 559 00:50:53,670 --> 00:50:57,450 So you want noise that behaves in a smooth way and not in such a jumpy way. 560 00:50:58,630 --> 00:51:03,490 Now let me, because I promised it, give a very high-level sketch of how this algorithm works. 561 00:51:03,820 --> 00:51:05,320 Okay. It is actually simple. 562 00:51:05,680 --> 00:51:11,480 Even though you might not believe it, our algorithm is very simple. Because, you see, what do you want to output? 563 00:51:11,500 --> 00:51:13,600 You want to output the sum of the ones. 564 00:51:14,260 --> 00:51:21,549 Now, I can write this down as a linear algebra problem where you have a matrix A that is lower triangular, with all ones down here 565 00:51:21,550 --> 00:51:28,340 and all zeros up here. And then the output that you want is simply A times sigma, where sigma is your input vector. 566 00:51:29,230 --> 00:51:35,400 Yeah, that is exactly the output that you want. And even better, if you are at time step t: 567 00:51:35,410 --> 00:51:40,210 this product gives you the answers for all time steps, 568 00:51:40,900 --> 00:51:46,889 and if I am at time step t, lower-case t, I can just take the t-by-t principal submatrix, 569 00:51:46,890 --> 00:51:56,440 the small version of A, call it A sub t, where I just take the first t rows and columns, and take just the first t entries of my input into this product. 570 00:51:57,070 --> 00:52:01,120 And the t-th coordinate of this product is the answer I want. 571 00:52:01,390 --> 00:52:05,590 So this is the correct answer. And now, how do we make this differentially private? 572 00:52:06,160 --> 00:52:11,470 What we do is we show that there is a factorization of A into two matrices L and R, 573 00:52:11,680 --> 00:52:14,739 and they are both lower triangular as well, okay. 574 00:52:14,740 --> 00:52:19,510 That is the crucial thing: they both need to be lower triangular to work in this continual setting. 575 00:52:20,260 --> 00:52:29,500 And they also have this nice property:
If you restrict them to the top-left t-by-t blocks, L sub t and R sub t, then their product is still A sub t. 576 00:52:29,800 --> 00:52:36,520 So everything is beautifully well-behaved. And so now what our mechanism does at time step t is this: 577 00:52:36,670 --> 00:52:40,630 it takes R sub t and multiplies it with the input sigma sub t, 578 00:52:40,840 --> 00:52:44,170 and then it adds a noise vector to it. Okay. 579 00:52:44,590 --> 00:52:49,930 So it adds a noise vector here, and then it does post-processing by multiplying with L sub t. 580 00:52:50,900 --> 00:52:54,890 So our mechanism is this very simple thing here. Okay. 581 00:52:55,130 --> 00:53:00,620 You can prove that this is an unbiased estimator, and you can show that this is differentially private. 582 00:53:00,800 --> 00:53:04,670 So what we get is 583 00:53:05,000 --> 00:53:09,670 an epsilon-delta differentially private mechanism that has this accuracy. 584 00:53:09,680 --> 00:53:12,920 And then the whole proof is linear algebra. Okay. 585 00:53:13,190 --> 00:53:16,520 Good. So this is what we did here. 586 00:53:16,700 --> 00:53:21,340 Now, there has been work in the continual setting in these five cases. 587 00:53:21,350 --> 00:53:23,960 We have data streams, which are sequences of numbers. 588 00:53:24,350 --> 00:53:31,540 We have databases, which are sequences of rows that are analysed; data structures, where you have sequences of operations, 589 00:53:31,550 --> 00:53:38,690 for example heaps; graph algorithms, where you again have sequences of operations, insertions and deletions of edges; 590 00:53:39,320 --> 00:53:45,110 and clustering algorithms, where you have insertions and deletions of points and you want to maintain some clustering of them. 591 00:53:45,590 --> 00:53:52,970 All of this has been studied, but of course it is still at the beginning, meaning we don't necessarily have the best answers for them. 592 00:53:53,840 --> 00:54:00,470 So what are future directions? One is to look at user-level differential privacy, which I just mentioned before. 593 00:54:00,890 --> 00:54:04,520 Another one is local differential privacy, which I defined earlier, 594 00:54:05,030 --> 00:54:09,320 and there has been very little work on local differential privacy under continual observation. 595 00:54:10,160 --> 00:54:16,520 And then also look at the running time, actually implement these algorithms, and see how fast they are and how they actually perform. 596 00:54:16,850 --> 00:54:18,680 Yeah, so actually do algorithm engineering. 597 00:54:19,460 --> 00:54:26,990 And then there are specific problems that I'm still interested in: come up with better algorithms for clustering or graph properties, and also 598 00:54:26,990 --> 00:54:32,200 close the gap between the upper bound, which is log T to the 1.5 or log squared T 599 00:54:32,240 --> 00:54:37,450 for continual binary counting, and the log T lower bound. 600 00:54:37,460 --> 00:54:41,540 These are the two specific problems I want to look at. And then I want to just thank my co-authors.
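To close the technical part, here is a toy numerical sketch of the factorization mechanism described above, which releases L(R sigma + z). As one convenient lower-triangular factorization I take L = R = the matrix square root of the counting matrix A, whose entries are the Taylor coefficients of (1-x)^(-1/2), and I use Gaussian noise with an unspecified scale as a stand-in for the calibrated (epsilon, delta) noise; whether this matches the exact factorization and calibration in the paper is not something the transcript confirms.

```python
import numpy as np

def sqrt_counting_matrix(T):
    """Lower-triangular Toeplitz matrix C with C @ C = A, where A is the
    all-ones lower-triangular counting matrix. Its first column holds the
    Taylor coefficients of (1 - x)^(-1/2): c_k = binom(2k, k) / 4^k."""
    c = np.empty(T)
    c[0] = 1.0
    for k in range(1, T):
        c[k] = c[k - 1] * (2 * k - 1) / (2 * k)
    C = np.zeros((T, T))
    for j in range(T):
        C[j:, j] = c[: T - j]
    return C

def factorization_counter(bits, noise_scale, rng=None):
    """Continual counting via A = L @ R with L = R = A^(1/2), both lower
    triangular. Releasing L @ (R @ sigma + z) is an unbiased estimate of the
    prefix sums; in the streaming setting one would use only the top-left
    t-by-t blocks at time t. `noise_scale` stands in for the (eps, delta)
    noise calibration, which is omitted here."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.asarray(bits, dtype=float)
    C = sqrt_counting_matrix(len(sigma))
    z = rng.normal(scale=noise_scale, size=len(sigma))
    return C @ (C @ sigma + z)

bits = np.array([1, 0, 1, 1, 0, 1, 1, 0])
print(np.cumsum(bits))                                # true prefix sums
print(np.round(factorization_counter(bits, 0.1), 2))  # private estimate
```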