Thank you. Thank you, Leslie, for this very kind introduction, and thank you very much for inviting me to come here. I'm really excited to be here and to tell you a little bit about differential privacy in general, and about my own research in this field specifically.

So the talk, as was said, is about algorithms that protect privacy. Algorithms are everywhere these days, because every computer program, and also every integrated circuit, is actually implementing an algorithm. So if you use your phone, your laptop, your home appliance, your car, your cleaning robot or certain medical devices, you will obviously have algorithms running there. They are usually very useful, but they can also collect data about you. Data is also collected in other ways: governments collect data about their citizens — in the U.S., for example, they do a census every ten years — and if you participate in research studies, data will be collected about you as well.

This data is usually collected for a certain purpose. Governments collect data so they can decide how to distribute funds, and so on; insurers use it to decide how much it costs to insure you; banks use it to make loan decisions; doctors nowadays use AI algorithms to decide whether they should perform certain medical procedures on you or not. On your phone, if you use, for example, Google's keyboard, it suggests word and sentence completions, and that is usually done based on data collected from its users.

So in general this sounds great — there is a certain usefulness in this data, there is utility. But on the other side, immediately comes the concern: what about our privacy? You have two conflicting goals. On the one side, you can use this data to do something useful in the world; on the other side, how can you protect privacy? And this is really the goal of the algorithms I will be talking about: they want you to be able to use the data, but in such a way that the privacy of the individuals is protected. I will try to explain how they achieve that.

Now, what are the challenges for privacy? Let me start with a simple one: statistics, if your database gives exact answers, might already leak a lot of information. For example, if we know the average salary of the people in this room, and then one person leaves, and then I give you the average salary again, you can easily figure out — if you know how many people were here — what the salary of the person leaving was. Just some simple math.
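Spelling out that simple math: with $n$ people in the room before one person leaves, and $\bar{s}_{\mathrm{before}}$, $\bar{s}_{\mathrm{after}}$ the two exact averages, the salary of the person who left is

$$ s_{\mathrm{left}} \;=\; n\,\bar{s}_{\mathrm{before}} \;-\; (n-1)\,\bar{s}_{\mathrm{after}}. $$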
So in general, by giving exact answers, you might already leak a lot of information that you might not want to leak.

Even worse, in machine learning these algorithms very often actually memorise data. For example, there was a paper in 2018 by Carlini et al. where they showed that Smart Compose — the sentence completion in Gmail — was leaking the Social Security number of one of the users, because the training data contained that information and the model happened to learn it by heart. So how can you deal with this?

Well, the first idea people had was to say: why don't we just anonymize the data, isn't that good enough? So I might release data, but I just don't include the names. And there are all kinds of famous examples showing that this was a bad idea. One of them was published by Narayanan and Shmatikov in 2008. What they did was use data that Netflix had published: every row was a user, every column was a movie, and the entry said whether this user liked this movie or not. Netflix had anonymized the data by removing the names of the users.

But what these authors showed was that if you combine this information with some public information from the Internet Movie Database — because many of the users who gave their ratings to Netflix had also published, under their own names, on the Internet Movie Database which movies they watched and whether they liked them — then by combining these two data sources they were able to figure out the names of many of the users. And they showed that, on average, four movies uniquely identify a user.

Okay, so it is a bad idea to just anonymize the data, because if you combine different sources you might be able to de-anonymize it. So the first idea doesn't really work by itself — of course you have to anonymize, but you have to do more. And what this example also shows us is that we need techniques that can handle composition of data sources, that can analyse what happens when you combine different data sources. Here the two data sources are Netflix and the Internet Movie Database.
And then the second idea researchers had was to generalise randomised response. Let me first explain what randomised response is, and then I will explain how they generalised it.

Randomised response is a technique invented by Warner, a sociologist, in the 1960s, and it has been used ever since without any big privacy concerns. It is good to know for later in the talk. It solves the following problem: assume you have some population and you want to find out what percentage p of the population has property A. Property A might be something embarrassing that you don't want to admit to — like snoring at night, for example, or worse things; I leave that to your imagination. So nobody ever wants to admit that they have property A, but you still want to find out what percentage p of people has this property.

And here is the solution that Warner came up with. You have your volunteers who participate, and instead of asking them directly whether they have the property — they would just lie to you — you ask them to run this simple algorithm. They should first flip a coin and not tell anyone the outcome of the coin flip. If it was heads, they should answer truthfully. If it was not heads, they should flip the coin again and answer according to that second flip: if it was heads, say yes, otherwise say no. They should not tell anyone how often they flipped the coin, and they should not tell anyone the outcome of the first coin flip — that is the private information. They just report the yes or no that this algorithm produces.

Then what you do is determine the percentage q of yes answers produced by this algorithm, and you output 2q − 1/2 as your estimate for p, the true answer. It turns out that this is an unbiased estimator for the true answer, so it will give you a good estimate.

Now why does this also preserve privacy? Because you have what is called plausible deniability: everyone answers yes with probability at least one fourth. If you have the property, you answer yes via the truthful branch with probability one half, and with probability one half you go to the second coin flip, where you again answer yes with probability one half — so another one fourth.
So if you have the property, you have a probability of three fourths of answering yes. If you don't have the property, you still have a probability of one fourth of answering yes. So everyone has a probability of at least one fourth of answering yes. That means if you answered yes and I come to you and say, "ah, you have property A", you can say: no, no, I actually don't have it — it was just the coin flips, not me. You can blame the coin flips for your answer, and nobody can prove otherwise.

Okay, so it is private. And note also that you need randomisation — randomisation that you keep private to yourself. Without randomisation this would not work. Randomisation is very important.

So now the question is: how can we make this more formal and more general? And that is exactly what differential privacy does. It is a mathematically rigorous definition — you will see it in a second — and it measures the privacy protection given by an algorithm. The goal is, first of all, to make it very unlikely that user data can be reconstructed, no matter how many data sources are combined. It gives you a way to measure how much impact the combination of data sources has on your privacy, so you can set a threshold and say: I am only allowing so much privacy loss, and if this combination would go above it, then I am not allowing the combination — meaning I am not giving out my data anymore. And it also gives a parameter that tells you, for each algorithm, how much data leakage, how much privacy loss, you incur. There are several current deployments, and I will talk about them later in the talk.

Okay, so let me give you the informal definition before the formal one. The informal definition is as follows. Assume you have a database with a thousand people in it, and you run some randomised algorithm on it — maybe computing the average salary or something. It gives you a noisy answer; noisy, because there was some randomisation involved. Now assume you have the same database except that the last person has changed: person 1000 is different, everything else is the same. You run the same algorithm on it, and you also get a noisy answer. What you want is that the algorithm returns the same answer for D and for D prime with almost the same probability.
Mathematically speaking, the algorithm executed on the first database gives you some probability distribution over the potential outputs, and on the second input it gives you another probability distribution. What you want is that these two probability distributions are very close, very similar. So it is a kind of smoothness requirement on the output of the algorithm: if the inputs are very similar — here they only differ in the last person — then the output distributions should be very similar. And if this is the only thing you remember from this talk, you got it. I will have it again on a later slide.

Now, why is this a good definition? Because then, based on the answer of the mechanism, it is very unlikely that you can distinguish between D and D prime — that you can figure out whether the input was D or D prime. You will not be able to distinguish them, because the output distributions are so close. It would be very unlikely that you can distinguish, I should say. That is the idea behind it.

More formally: we say two inputs are neighbouring if they differ in at most one entry — and I will be more precise on the next slides, because what "one entry" means depends on the application. Epsilon is some number larger than zero. We say a randomised algorithm A — sometimes called a mechanism — with a set P of possible outputs is epsilon-differentially private if for all possible subsets S of P and all possible neighbouring inputs D and D prime, the probability that the mechanism, started on D, outputs an element of S is at most e to the epsilon times the probability that the mechanism with input D prime outputs an element of S.

That is why I gave you the informal definition beforehand — this is somewhat of a mess. But let me try to interpret it. Epsilon you should think of as a small number close to zero, so e to the epsilon is just a little bit larger than one. And instead of "an element of S", for simplicity you can just think of S as a single element of P and of the output being equal to S — that makes it easier. So you can read it as: the probability that the mechanism outputs a given element is almost the same whether the input was D or D prime. That is what the definition says.
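Written out as a formula, the definition just stated is: a randomised mechanism $A$ with output set $P$ is $\varepsilon$-differentially private if

$$ \Pr[A(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in S] \qquad \text{for all } S \subseteq P \text{ and all neighbouring inputs } D, D'. $$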
And the randomness here, I should emphasise: the probability is only over the random choices of the algorithm. The input is worst case — we say "for all possible neighbouring D and D prime" — there is no input distribution. The probability is only over the randomness of the algorithm.

And of course, because we say this for all possible neighbouring D and D prime, by symmetry you also get exactly the inverse statement: the probability when you start on D prime is at most e to the epsilon times the probability when you start on D. So really you get an upper and a lower bound: the probability when you start on D is upper bounded by e to the epsilon times the probability for D prime, and lower bounded by e to the minus epsilon times it. And if epsilon is small, close to zero, then these factors are roughly one, and you get a good privacy guarantee.

One thing you should keep in mind, however, is that in practice epsilon is neither small nor close to zero. Think of epsilon as more like 7 or 8 when used by companies, so e to the epsilon is already pretty big. Even worse, in the US census they used epsilon equal to 19. You might argue, well, this is a probability, and there are 300 million Americans being counted, and only a subset of them is affected anyway — but still, e to the 19 seems like a big number. In theory, though, epsilon is small, close to zero.

Okay. So the crucial thing is this definition of neighbouring: we have this constraint on the output distributions for neighbouring data sets. Here again, the one thing you should remember from this talk: differential privacy means there is a strong requirement on the output distributions for neighbouring inputs — a sort of smoothness requirement, they need to be very close. And this is completely different from anything I had seen in algorithms before; it was not studied before, and that is why this is such an interesting research area, if you have not thought about this kind of algorithm before.

Okay. Now, why is this definition useful? Let me argue informally. Assume that user i's data is used in database D, but not in D prime. Then you could write down the harm for user i if the output is s, multiply it by the probability that the output is s, and sum over all possible s. That gives you the exact expected harm for user i if their data was used in the database. You can do the same thing for D prime, where i's data was not used. And because of the bound above, the expected harm with i's data included is at most e to the epsilon times the expected harm without i's data. So this definition is useful because you can bound the expected harm you incur by being included — and again, epsilon close to zero is good; epsilon 19 might not be quite as good. E to the epsilon is called the risk multiplier.
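In symbols — writing $\mathrm{harm}_i(s)$ for the harm to user $i$ when the output is $s$ (my notation for the quantity just described) — the argument is:

$$ \mathbb{E}\big[\mathrm{harm}_i(A(D))\big] \;=\; \sum_{s} \mathrm{harm}_i(s)\,\Pr[A(D)=s] \;\le\; e^{\varepsilon} \sum_{s} \mathrm{harm}_i(s)\,\Pr[A(D')=s] \;=\; e^{\varepsilon}\,\mathbb{E}\big[\mathrm{harm}_i(A(D'))\big]. $$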
Okay. So now, what is a good definition of neighbouring? Especially in the early days, people in differential privacy mostly looked at databases, which they thought of as tables: every row is a user, and every column tells you something about that user. The standard definition of neighbouring is then: two databases are neighbouring if they differ in at most one row — so one user is different.

But then people said, isn't this maybe too generous? And they restricted it more: two databases are neighbouring if they differ in only one entry of one row. Or, even more restrictive: one entry in one row differs by at most one.

As you notice, these definitions of neighbouring become more and more specific, and thus the requirement becomes less and less strict, because you are only required to have very similar output distributions on pairs that are neighbouring. If two databases are neighbouring under the most restrictive definition, they are also neighbouring under the broader ones, while there are pairs that are neighbouring under the broad definition and not under the restrictive ones. So the broad definition imposes more requirements, because it has more neighbouring pairs than the ones further down. People have studied these different kinds of requirements because, depending on the application, one might be more suitable than another, and with the more restrictive ones you may be able to get better bounds — better bounds on the error, which we will get to in a second. There are epsilon-differentially private algorithms for all of these notions.
But the error — I will formalise error in a second — is better if you have fewer constraints: you will get a smaller error.

You might also look at graphs — think of social networks: each user is a node, and if there is a friendship relationship between two users, there is an edge. Now you could look at vertex neighbouring for graphs: two graphs are neighbouring if they differ in at most one node. Or you can use edge neighbouring, where the nodes are actually known and you want to protect the privacy of the edges: two graphs are neighbouring if they differ in one edge. Or, for weighted graphs where the graph itself is known but the weights are not: two graphs are neighbouring if only the weight of one edge differs.

As you see, the notion of neighbouring is very closely related to what you are protecting. If you are only protecting the weights of the edges, you are not protecting anything else — not the nodes or the edges of the graph — meaning they might just as well be publicly known. If you use edge neighbouring, then the nodes might just as well be publicly known, because you are only protecting the edge relationships: who is in a relation with whom. And if you use vertex neighbouring — the strongest again — then you are protecting even the nodes of the graph. So what you want to protect immediately carries over to your definition of neighbouring, and your definition of neighbouring — whether it is more strict or less strict — carries over to how much error you need. And since there are so many possible applications, and so many combinations and questions about the right definition of neighbouring, there is lots of work to be done.

Now let's go back and analyse randomised response. I claim randomised response is differentially private; the question is just for which value of epsilon. Before doing that we need a definition of neighbouring, because, as I told you, the most important thing in the definition of differential privacy is to specify exactly what neighbouring means. Here, for randomised response, you want to protect the bit of each individual user, so you say two inputs are neighbouring if they differ in the bit of at most one user.
So think of the input as a bit string D; then two inputs D and D prime are neighbouring if one bit is different — it is flipped. That is the neighbouring definition you would use here.

Now, why is differential privacy such a great definition? Because it fulfils two main properties. The first is the post-processing theorem, which says: assume you have a mechanism M that is epsilon-differentially private and maps the data to some set R, and then you have some function F that is applied to R and computes some R prime. Then the combination, where you first run M and then run F on the output of M, is again epsilon-differentially private.

Informally speaking: you had some sensitive data, you ran it through a mechanism that was epsilon-differentially private, and now the data is "clean" — you can do any post-processing on the output of the mechanism. Not on the original data: F is not allowed to touch or even see the original data, it is only allowed to see the output of the mechanism. But as long as it only looks at the output of the mechanism, the whole thing — the combination of first running the mechanism and then the function — is again epsilon-differentially private.

For randomised response this means the following (sorry, this animation was not a good idea). You can really break it into two pieces. The first part is what every participant had to do — the coin flips producing the yes or no answers. This is the mechanism, and we will show that this part is epsilon-differentially private. Then taking all these answers, computing the percentage q of yes answers and outputting 2q − 1/2, is just a function F — just post-processing of the differentially private output we got from the mechanism. So in order to analyse randomised response I only need to analyse the first part. The second part never looks at the initial data of the real users — it has no contact with the users, it only looks at the output of the mechanism.

Okay, so this is one good theorem. And now you can actually do the analysis — I will not do it here; you look at the ratio of the probabilities: if your bit is set or not set, what is the probability that you give a particular answer, and so on. What you find is that this ratio is at most 3, which is e to the ln 3. So for randomised response this is your homework — I am a professor, right, whenever it gets hard it becomes homework. For randomised response, epsilon is ln 3. I should say there are variants of randomised response where you do not use a fair coin but a biased coin for the coin flips, and if you do that you can make it epsilon-differentially private for any epsilon larger than zero. The simple variant I showed you gives ln 3.
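Here is a minimal sketch, in Python, of the fair-coin randomised response mechanism and the 2q − 1/2 post-processing estimator described above; the function and variable names are mine, not from the talk.

```python
import random

def randomized_response(has_property: bool) -> bool:
    """Local randomizer run by each participant (fair coins).

    First coin: heads -> answer truthfully.
    Tails -> flip a second coin and report heads as 'yes', tails as 'no'.
    This local step is ln(3)-differentially private.
    """
    if random.random() < 0.5:          # first coin: heads
        return has_property            # answer truthfully
    return random.random() < 0.5       # second coin decides the answer

def estimate_fraction(answers: list[bool]) -> float:
    """Post-processing: unbiased estimate of the true fraction p.

    Since Pr[yes] = p/2 + 1/4, solving for p gives p_hat = 2*q - 1/2,
    where q is the observed fraction of 'yes' answers.
    """
    q = sum(answers) / len(answers)
    return 2 * q - 0.5

# Toy usage: 100,000 simulated participants, 30% of whom have the property.
if __name__ == "__main__":
    true_bits = [random.random() < 0.3 for _ in range(100_000)]
    noisy_answers = [randomized_response(b) for b in true_bits]
    print(estimate_fraction(noisy_answers))   # should be close to 0.3
```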
Okay, so that was the first useful property. The second useful property of differential privacy is composition. Assume you have a first mechanism M1 that is epsilon-1-differentially private and a second mechanism M2 that is epsilon-2-differentially private, and you want a mechanism that outputs both: the output of M1 and the output of M2. Then the theorem says that this combination is (epsilon 1 + epsilon 2)-differentially private, and that is not too hard to show from the definition of differential privacy. Now add post-processing: once you have output these two things, you can apply any function g to the two outputs — post-processing as before — and that tells you that g applied to the outputs of the two mechanisms is also (epsilon 1 + epsilon 2)-differentially private. So this allows you to analyse the harm, the influence on your privacy, of having multiple mechanisms look at the same data and each output something about it.
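As a formula, composition plus post-processing says: if $M_1$ is $\varepsilon_1$-differentially private and $M_2$ is $\varepsilon_2$-differentially private, then

$$ D \;\mapsto\; \big(M_1(D),\,M_2(D)\big) \quad \text{is } (\varepsilon_1+\varepsilon_2)\text{-differentially private}, $$

and hence so is $g\big(M_1(D),\,M_2(D)\big)$ for any post-processing function $g$.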
Okay. So look back, for example, at the movie data. Let's say D is the true data: for all the users, whatever they watched and whether they liked it or not. Some of that data went to Netflix, so Netflix would be the first mechanism, and it discloses something. As we said before, Netflix only anonymized the data, and you can show that this does not fulfil epsilon-differential privacy for any finite epsilon — the epsilon would be infinite. The second mechanism is the public Internet Movie Database; the users might not write about all the movies they have seen, so this is again just some incomplete subset of the truth, and it is simply disclosed as is — again not at all private. So you have a second mechanism with epsilon being infinite, and it is not surprising that the combination of the two again gives no privacy. But if the first one had had some privacy epsilon 1 and the second one privacy epsilon 2, then the combination — having both of them out there — would have a privacy loss of epsilon 1 plus epsilon 2. Okay, so this is the second very nice property of differential privacy.

Now, there is one thing people usually ask me at this point, so I created a slide for it: there are two different notions of differentially private mechanisms. The one you just saw is called a local differentially private algorithm — randomised response is an example. Here the user has the true data, whether they have property A or not, and they output a noisy variant of it, because they flip these coins; they give the noisy variant to the server, and all the server does is post-processing. As we know, the server cannot do any harm, because the data was already privatised. The other notion is central differential privacy. That is more like what you would see with medical databases, where the users actually give their raw data to the server, to the database. So the database actually has the true answers in it, but then it runs some differentially private mechanism over it, which outputs a differentially private result. So the difference is that here the server actually has all the data, whereas in the local model the server never had the true data, only the privatised data.

In this talk I will talk about the central model — apart from randomised response — just because it is easier and there is more work in the central model. But of course there is also a lot of work in the local model, because that is very often what happens when internet companies collect data about their users: usually it is the local model, where the user's machine or phone sends differentially private data to the server.

Now, this is all great, but are there disadvantages? Yes, there are two main disadvantages. The first one is that differentially private algorithms are usually slower. Before, you just had to say "yes, I have property A" or not; now you have to flip these two coins.
And, you know, I did this with volunteers in some talk, and they had to think about when to do what — so it is not so easy anymore, it is a bit more work. In general, differentially private algorithms are slower than non-differentially-private algorithms, and the question is just by how much.

The second, even bigger problem is the loss in accuracy — which we also call error. Take randomised response, for example. What we are interested in is p, which would have been the true answer. The way we measure the error is as follows: what is the parameter alpha such that, with probability at least two thirds, the output falls into the range between p minus alpha and p plus alpha? So with probability at least two thirds the output must fall into this range, and then we call that alpha the accuracy. There is also a variant where the two thirds is an arbitrary parameter beta, but for this talk I just plugged in two thirds. So the accuracy is the alpha such that, with probability at least two thirds, the answer you get falls into this range — and of course you want alpha to be as small as possible, because you want to be as close as possible to the true answer.

And so, whenever you write a paper about differential privacy, you should of course prove that your mechanism is epsilon-differentially private — or whatever notion of privacy you use — but then please also always prove a bound on the accuracy. That is very important. I can give you an algorithm for any problem that is always epsilon-differentially private: I always output zero. It is definitely private, it does not leak any information, but it has miserable accuracy. So just writing a paper that says "I have an epsilon-differentially private algorithm" does not impress me at all; you had better also show that it has small error — in theory or with experiments. This is a very important point, and I am emphasising it because I see papers where this is not done, and I keep wondering why people would not do that.

Okay, so what do we get for randomised response? Well, you can show, using standard concentration bounds, that alpha is at most some constant times the square root of one over n. Which makes sense: the more volunteers participate, the smaller alpha gets, the better the accuracy. If I have only three of you, it might be very noisy.
If I have ten million, it is much better.

So what have we seen so far? Randomised response is ln 3-differentially private, and it is O(square root of 1/n) accurate. And in general the definition of accuracy is as follows: we say a mechanism is alpha-accurate if for all inputs D, the probability — again over the randomness of the algorithm — that the true output and the noisy output differ by at most alpha is at least two thirds. So there is a good chance that the error is at most alpha.

And the goal in research on differentially private algorithms is always: you have some computational problem — not something as simple as computing an average, something more complicated, say you want to find a maximum cut in a graph, or your favourite problem — and you want an algorithm that is differentially private, that has a small additive error, so small alpha, and whose running time, and also space usage, is almost as good as that of the best non-private algorithm. These are the three things you are trying to optimise, usually in that order — which means people in papers usually stop after the first two, and sometimes already after the first.

Okay — let me stop complaining — there has been a lot of research in differential privacy on a whole bunch of problems. I will only talk about the simplest, binary counting, in this talk. But there is also lots of work on graph properties, clustering, machine learning — differentially private machine learning is of course a very hot topic — and also mechanism design, so auctions, for example.

Before getting into more definitions, let me give you some deployment examples, so that you are motivated to keep listening to me — apart from having to sit here anyway. There have been a few large-scale deployments. The first one was by Google: the RAPPOR system, which was in Google Chrome — these are the authors here. What it did was detect the most frequently chosen homepages and the most frequently chosen running process names in Google Chrome, and it sent that information back in a differentially private manner, so that nobody could come to you afterwards and say that you always visit this particular homepage. Then, a little more recently, it is in Gboard.
That is the Google keyboard on Android. So if you are typing on Android, you might well be using this keyboard. They also did learning for word prediction — this is when you type "Hi" and it suggests the next word; in this example, you type "Hi, how" and it suggests "are you". This is also done based on differential privacy; this is the URL of the paper. They did more things, and they give an evaluation of the whole system.

Then, as I mentioned before, it has been used in the 2020 U.S. census. There was a lot of battle about whether they should use differential privacy in the census or not. In the 2010 census there had been some awkward privacy leaks, and that is why the people running the census said: we need to use some protection, so let's use differential privacy. The statisticians kept saying no, the algorithms people said yes, and the argument went back and forth. The decision that was made was a compromise: differential privacy is used, but epsilon is 19.

The statisticians are still very unhappy. Just recently I read an article about some Native American groups: if you have a small group, the noise that you add makes a lot of difference, and some Native people are now complaining that, based on that data — comparing the 2010 data with the 2020 data — they cannot say how much their population has really grown or where the needs are, and so on. The census data is used to determine funding for Native people, so it is of course important for them to be able to say how many they are. So there are complaints about this, especially for small populations. The statisticians also complain because, according to the released data, some people in Manhattan live in the Hudson River, and that should not happen either — but then the algorithms people say that is just noise, it is not many people. Anyway, there are complaints and there is discussion, and it will be interesting to see what happens in the 2030 census.

And then there was also a deployment at Apple. This is one of those strange papers, because it has no authors except for "Apple Privacy Research group" — very private, I guess.
In it they showed, for example, how you can use differential privacy to determine frequently used emojis. For example, for English-speaking users — if you have an English keyboard — the smiley face with tears of laughter is very popular, and the next emoji is only the second most popular, at a much lower level; for the French there is less laughing but much more love, for example. They figured that out with differential privacy. They also looked at memory usage, at which webpages drain your battery a lot in Safari, and things like that, and at which health topics people are interested in — because they also sell you the Apple Watch and they want to know whether you are someone who might be interested in health. So they also collect which health topics might be interesting.

And finally there was also work at Microsoft, published by these people, and this is in Windows 10. They use it for estimating application usage statistics, and — this is a picture from their paper — they also did a very careful evaluation.

Okay, so you can see that many big companies are already using differential privacy when they collect data about their users. So now let me switch back and tell you a little bit more about the theory behind it.

In the static setting, the simplest problem we are going to look at is binary counting. Here you have a binary string D of length n — think of n users — and you just want to figure out the sum, the number of ones in the string. This is basically the same as randomised response before: finding the percentage of the population with the property is the same as finding how many have it and then dividing by n. But I will now give you a different mechanism that is more efficient — meaning it has lower error, so better accuracy.

So D and D prime are neighbouring if they differ in one bit. And now it turns out there is this very nice distribution called the Laplace distribution. It existed before, but it looks as if it was specifically designed for differential privacy — which of course it was not, but it has only become really popular through differential privacy. So this is the density: the Laplace distribution has a parameter, which here I set to one over epsilon, and then you get this function.
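For reference, the density on the slide — the Laplace distribution with scale parameter $b = 1/\varepsilon$ — is the standard

$$ f(x) \;=\; \frac{1}{2b}\, e^{-|x|/b}, \qquad b = \frac{1}{\varepsilon}, $$

which is symmetric around zero and has variance $2b^2 = 2/\varepsilon^2$.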
So at each value x this is the density you get, and the distribution has a certain variance. If the parameter — and with it the variance — is very large, like the light blue curve here, you get a distribution that is very spread out: the probability that you draw minus ten, or plus ten, is still not negligible. The smaller the parameter, the smaller the variance, the more peaked the distribution is. I should say it is symmetric around zero — that is very important. The red one here, for example, is very peaked: if you sample from it, with probability almost one half you get zero.

Okay, so what is the Laplace mechanism? The Laplace mechanism is very simple: you take the true answer, the number of ones, and you add a random variable drawn from this Laplace distribution. And it turns out you can show that this is epsilon-differentially private and that it is ln 3 over epsilon accurate.

So what is the intuition? Well, if you want good privacy — small epsilon — then one over epsilon is large, which means the variance of your distribution is large, so you are more like the light blue curve. That means it is not too unlikely that you add a number like ten to your count. But remember, the inputs differed only by one: you had two input strings that differ in one bit. Say they differ in the first bit: the first string has a zero there, the second string has a one. Now consider the probability that for the first string you draw the noise value ten — so you output its true count plus ten — and the probability that for the second string you draw nine, in which case you output exactly the same number, because its true count is one higher. These two probabilities are very close; the only difference is between drawing a ten or a nine. And you can show that by taking the parameter one over epsilon for the Laplace distribution you get exactly epsilon-differential privacy. It seems as if it was perfectly made for that.
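A minimal sketch, in Python, of the Laplace mechanism for binary counting as just described; numpy's Laplace sampler is used for convenience, and the names are mine.

```python
import numpy as np

def laplace_count(bits: list[int], epsilon: float) -> float:
    """epsilon-DP count of ones: true sum plus Laplace noise of scale 1/epsilon.

    Changing one bit changes the true sum by at most 1 (sensitivity 1),
    which is why scale 1/epsilon suffices for epsilon-differential privacy.
    With probability 2/3 the additive error is at most ln(3)/epsilon.
    """
    true_sum = sum(bits)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_sum + noise

# Toy usage: 1,000 users, epsilon = 1.
if __name__ == "__main__":
    data = np.random.binomial(1, 0.4, size=1000).tolist()
    print(sum(data), laplace_count(data, epsilon=1.0))
```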
And for the accuracy — well, you need to do a little bit of work, but nothing too bad, just some properties of the Laplace distribution. And this is very nice: it is independent of n. The accuracy is independent of the size of the input. That is the nice thing here, so that is good.

But what if the data is dynamic? This is now finally where my own work comes in, because I always like to look at the dynamic setting, where the input changes. Now, people in the field cannot agree on what to call the dynamic setting — they certainly don't call it dynamic. They call it differential privacy under continual observation, or under continual release, or on streaming data, or sometimes on data streams. Very creative, but it is all the same: you get the data one piece after the other.

So in the general model you have a stream of input data from some universe U, of some length, let's say T. The privacy definition needs to be adapted — I will talk about that in a second — but the key point is that every time you get some input, you need to give some output again. Before, in the static setting, you had one input and you gave one output, so you had just one chance of leaking information. Now you have T chances of leaking information, because you need to give an output after each input: when the next piece of input arrives, you need to give an output again. For binary counting, say, you need to keep outputting the sum so far. So you have T chances to leak information, and intuitively the error should go up — you need to add more noise.

Okay, so one problem — which is actually still open, and I will tell you what is open about it — is continual binary counting. You have a stream of input bits, zeros and ones, of length T, and after the t-th bit is given, for every t between 1 and capital T, you need to output the sum of all the bits so far.
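Stated compactly, the continual binary counting problem is: given a stream $\sigma = (\sigma_1,\dots,\sigma_T) \in \{0,1\}^T$, after the $t$-th bit arrives the mechanism must output an estimate of the prefix sum

$$ c_t \;=\; \sum_{i=1}^{t} \sigma_i \qquad \text{for every } t = 1,\dots,T. $$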
492 00:44:34,500 --> 00:44:40,890 So before we said over all neighbouring databases, and now neighbouring means differing in one time step. 493 00:44:41,330 --> 00:44:45,900 Before we also had over all possible outputs; the outputs are now T-dimensional vectors. 494 00:44:46,260 --> 00:44:49,650 And then for any neighbouring inputs you have this definition again. 495 00:44:51,040 --> 00:44:54,519 And as it is late in the talk and I don't want to bore you too much, 496 00:44:54,520 --> 00:44:59,020 I will skip the definition of user-level privacy, but there is a more 497 00:44:59,020 --> 00:45:04,090 advanced, more stringent, harder version, which is called user-level privacy. 498 00:45:04,360 --> 00:45:07,960 So everything that I'm going to say now is about event-level privacy. 499 00:45:09,810 --> 00:45:16,790 And now you care again about the accuracy. So what is the proper definition of accuracy for the continual setting? 500 00:45:16,800 --> 00:45:20,940 Well, now you are outputting a T-dimensional vector, right, because you are outputting T numbers. 501 00:45:21,660 --> 00:45:27,750 And the error definition that is most popular is that you take the maximum over all time steps. 502 00:45:28,020 --> 00:45:34,800 So if you think of this as being a vector: f of sigma is the T-dimensional vector with the true answers, 503 00:45:35,220 --> 00:45:39,180 and you compare it with the vector of outputs that the algorithm gives you. 504 00:45:40,050 --> 00:45:48,360 Okay. And we will see this in a second. Then we take the max, the infinity norm, which is the maximum over all entries of the difference of these two vectors. 505 00:45:49,540 --> 00:45:53,139 Okay. So we say the mechanism is alpha-accurate 506 00:45:53,140 --> 00:46:00,850 if, for all input streams of length T, the error of the true answers versus the outputs of the mechanism, 507 00:46:00,850 --> 00:46:02,980 that is, the maximum over all time steps of this difference, 508 00:46:03,340 --> 00:46:08,920 is at most alpha with probability at least two thirds. 509 00:46:11,050 --> 00:46:14,170 Now, what is known about continual binary counting? 510 00:46:15,130 --> 00:46:21,850 Well, you could do it in a very naive way. The naive way would be to just, at every time step, 511 00:46:21,860 --> 00:46:26,200 add Laplace noise. This is like a static algorithm: 512 00:46:26,200 --> 00:46:31,060 run the static algorithm every time. But now you are outputting T times. 513 00:46:31,610 --> 00:46:39,400 And if you do composition, then the differential privacy of these T outputs is the sum of the epsilon values. 514 00:46:40,240 --> 00:46:45,910 Okay. Since you are outputting T of them, you have to make sure that each of them is epsilon over T differentially private. 515 00:46:47,040 --> 00:46:47,370 Okay. 516 00:46:47,820 --> 00:46:55,200 So you have to require that each output is epsilon over T differentially private, which means you need to use Laplace noise of scale T over epsilon. 517 00:46:56,690 --> 00:46:59,810 And then you can just use the composition that I showed you before, 518 00:47:00,110 --> 00:47:07,700 and you sum over all these T outputs, each time epsilon over T, 519 00:47:07,850 --> 00:47:11,510 so you get that you are epsilon-differentially private. So that's good. 520 00:47:11,630 --> 00:47:19,070 But what is your accuracy?
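Before getting to that, the two definitions just sketched can be written out in standard notation; the symbol M for the mechanism and f(sigma) for the vector of true prefix sums are labels of my own choosing that match the spoken description.

```latex
% Event-level neighbouring: \sigma \sim \sigma' iff they differ in exactly one time step.
\Pr[M(\sigma) \in S] \le e^{\varepsilon}\,\Pr[M(\sigma') \in S]
    \quad \text{for all neighbouring } \sigma, \sigma' \text{ and all } S \subseteq \mathbb{R}^{T},
\qquad
M \text{ is } \alpha\text{-accurate if }
\Pr\Big[\max_{1 \le t \le T} \big| f(\sigma)_t - M(\sigma)_t \big| \le \alpha \Big] \ge \tfrac{2}{3},
\quad \text{where } f(\sigma)_t = \sum_{i=1}^{t} \sigma_i .
```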
If you use Laplace noise of scale T over epsilon, your accuracy is T over epsilon, which is really not good. 521 00:47:19,100 --> 00:47:22,730 It is the number of time steps over epsilon. Okay. 522 00:47:23,240 --> 00:47:29,120 And the question is: can you do better? And the answer is yes, but not as well as you would like to. 523 00:47:29,990 --> 00:47:37,730 So what can you do? You can get an accuracy of log squared T over epsilon. 524 00:47:39,170 --> 00:47:44,810 That's good, much better than T. But there is also a lower bound of log T. 525 00:47:45,290 --> 00:47:51,560 This was shown by Dwork, Naor, Pitassi, and Rothblum, and by Chan, Shi, and Song. 526 00:47:51,710 --> 00:47:58,700 They actually showed slightly worse bounds, but you can basically get these bounds. Okay, so there is actually a gap, 527 00:47:59,000 --> 00:48:00,470 yeah, between the lower bound 528 00:48:00,590 --> 00:48:07,430 (you cannot do better than log T; no matter which mechanism you pick, there cannot be a mechanism better than log T) 529 00:48:08,240 --> 00:48:11,630 and this upper bound of log squared T. 530 00:48:12,810 --> 00:48:16,290 Now, this lower bound of log T is actually really a disaster, right? 531 00:48:16,440 --> 00:48:23,190 Because T goes to infinity: think of how you use your phone every day, probably hundreds of times per day. 532 00:48:23,280 --> 00:48:31,080 So if every time it sends some information back, whether you looked at health-related data or not, for example, 533 00:48:31,110 --> 00:48:39,490 then your T very quickly becomes very big. And this says that the error has to grow by a factor of log T, 534 00:48:40,270 --> 00:48:44,310 and the companies who are collecting data about you don't like this, right? 535 00:48:44,330 --> 00:48:46,120 They don't want to have such a bad accuracy. 536 00:48:46,870 --> 00:48:54,670 So what they do, supposedly (that's what they say), is that every time period, let's say every month, they give every user a certain privacy budget. 537 00:48:55,390 --> 00:49:01,450 And whenever they use your data in some computation, they subtract something from your privacy budget. 538 00:49:02,200 --> 00:49:09,910 And if your privacy budget reaches zero, they don't use your data anymore, for whatever training or whatever algorithms they run. 539 00:49:10,270 --> 00:49:18,780 They don't use your data anymore until the next month, when they reset your privacy budget to the initial value again. 540 00:49:19,470 --> 00:49:22,590 Which basically means that they are just delaying 541 00:49:22,650 --> 00:49:29,100 the privacy loss, right. They are just delaying it a little bit, because for a while they are not using your data. 542 00:49:30,000 --> 00:49:35,900 But that is how they deal with it. And on the other side, you know, if they don't want to have such an accuracy loss, 543 00:49:35,910 --> 00:49:40,050 that is really the only thing they can do, because otherwise, you know, it is too bad. 544 00:49:40,830 --> 00:49:46,980 Now, it turns out there is a slightly relaxed version of differential privacy called epsilon-delta differential privacy, which I don't want to define. 545 00:49:47,400 --> 00:49:50,400 And there you can actually do a bit better: 546 00:49:50,640 --> 00:49:55,560 there you can get log T to the 1.5, which was shown by Tung Xi and others.
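As a rough illustration of the gap just described, here is a hedged Python sketch contrasting the naive per-step Laplace approach with the standard binary-tree (dyadic interval) counter behind the log-squared-type upper bound; this is the textbook construction, not the exact algorithms of the cited papers, and all function names and parameters are my own.

```python
import numpy as np

def naive_counter(bits, epsilon, rng=None):
    """Fresh Laplace noise at every step. Basic composition over T releases
    forces each release to be (epsilon/T)-DP, i.e. noise of scale T/epsilon,
    so the per-step error is on the order of T/epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(bits)
    return np.cumsum(bits) + rng.laplace(scale=T / epsilon, size=T)

def tree_counter(bits, epsilon, rng=None):
    """Binary-tree counter: keep one noisy sum per dyadic block.
    Each bit lies in at most L = ceil(log2 T) + 1 blocks, so noise of scale
    L/epsilon per block keeps all releases jointly epsilon-DP, and every
    prefix sum combines at most L noisy blocks, so the error is roughly
    log^2 T / epsilon instead of T/epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(bits)
    L = int(np.ceil(np.log2(max(T, 2)))) + 1
    noisy = {}  # (start, size) -> noisy sum of bits[start:start+size]
    for k in range(L):
        size = 1 << k
        for start in range(0, T, size):
            noisy[(start, size)] = bits[start:start + size].sum() + rng.laplace(scale=L / epsilon)
    out = []
    for t in range(1, T + 1):
        total, pos, remaining = 0.0, 0, t
        while remaining > 0:            # greedy dyadic decomposition of [0, t)
            size = 1 << int(np.floor(np.log2(remaining)))
            total += noisy[(pos, size)]
            pos, remaining = pos + size, remaining - size
        out.append(total)
    return np.array(out)

rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=1024)
truth = np.cumsum(bits)
print(np.max(np.abs(naive_counter(bits, 1.0) - truth)))  # much larger max error
print(np.max(np.abs(tree_counter(bits, 1.0) - truth)))   # much smaller max error
```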
547 00:49:56,220 --> 00:50:00,780 And so here we did some work together with Baga and Apache. 548 00:50:01,050 --> 00:50:04,740 What we did is we actually figured out the constant. 549 00:50:04,740 --> 00:50:10,530 The earlier work had a really bad constant. We gave a different algorithm where we show the exact constant. 550 00:50:11,010 --> 00:50:15,910 And the other advantage of our approach concerns the accuracy loss: 551 00:50:15,930 --> 00:50:21,970 our noise is very smooth, while for the other algorithm the noise was very non-smooth, as you can see. 552 00:50:21,990 --> 00:50:26,990 So for each point in t this plot puts a dot, and you see that it was very jumpy. 553 00:50:27,000 --> 00:50:34,379 The reason being that the amount of noise, the number of Laplace noise terms that you add up, depended on the time step. 554 00:50:34,380 --> 00:50:38,070 It might be just one Laplace noise term, or it might be log T many Laplace noise terms. 555 00:50:38,070 --> 00:50:41,340 And so you got very non-smooth noise in that setting. 556 00:50:41,340 --> 00:50:46,230 And in practice that is a disaster. Say you want to count how many Covid cases there are, right? 557 00:50:46,470 --> 00:50:49,920 You don't want something where you have to say: I don't know whether the cases went up or not. 558 00:50:49,920 --> 00:50:53,670 Maybe it was the noise that made the count go up, or maybe the data really goes up. 559 00:50:53,670 --> 00:50:57,450 So you want noise that behaves in a smooth way and not in such a jumpy way. 560 00:50:58,630 --> 00:51:03,490 Now let me, because I promised it, give a very high-level sketch of how this algorithm works. 561 00:51:03,820 --> 00:51:05,320 Okay. It is actually simple. 562 00:51:05,680 --> 00:51:11,480 Even though you might not believe it, our algorithm is very simple. Because, you see, what do you want to output? 563 00:51:11,500 --> 00:51:13,600 You want to output the sum of the ones. 564 00:51:14,260 --> 00:51:21,549 Now, I can write this down as a linear algebra problem where you have a matrix A that is lower triangular, with all ones down here 565 00:51:21,550 --> 00:51:28,340 and all zeros up here. And then the output that you want is simply A times sigma, where sigma is your input vector. 566 00:51:29,230 --> 00:51:35,400 Yeah, that is exactly the output that you want. And even better, if you are at time step t: 567 00:51:35,410 --> 00:51:40,210 this product gives you the answers for all time steps, 568 00:51:40,900 --> 00:51:46,889 and if I am at time step t, lower-case t, I can just take the t-by-t principal submatrix, 569 00:51:46,890 --> 00:51:56,440 the small version of A, call it A sub t, where I just take the first t rows and columns, and take just the first t entries of my input into this product. 570 00:51:57,070 --> 00:52:01,120 And the t-th coordinate of this product is the answer I want. 571 00:52:01,390 --> 00:52:05,590 So this is the correct answer. And now, how do we make this differentially private? 572 00:52:06,160 --> 00:52:11,470 What we do is we show that there is a factorization of A into two matrices L and R, 573 00:52:11,680 --> 00:52:14,739 and they are both lower triangular as well, okay. 574 00:52:14,740 --> 00:52:19,510 That is the crucial thing: they both need to be lower triangular to work in this continual setting. 575 00:52:20,260 --> 00:52:29,500 And they also have this nice property:
If you restrict them to the top-left t-by-t blocks, L sub t and R sub t, then their product is still A sub t. 576 00:52:29,800 --> 00:52:36,520 So everything is beautifully well-behaved. And so now what our mechanism does at time step t is this: 577 00:52:36,670 --> 00:52:40,630 it takes R sub t and multiplies it with the input sigma sub t, 578 00:52:40,840 --> 00:52:44,170 and then it adds a noise vector to it. Okay. 579 00:52:44,590 --> 00:52:49,930 So it adds a noise vector here, and then it does post-processing by multiplying with L sub t. 580 00:52:50,900 --> 00:52:54,890 So our mechanism is this very simple thing here. Okay. 581 00:52:55,130 --> 00:53:00,620 You can prove that this is an unbiased estimator, and you can show that this is differentially private. 582 00:53:00,800 --> 00:53:04,670 So what we get is 583 00:53:05,000 --> 00:53:09,670 an epsilon-delta differentially private mechanism that has this accuracy. 584 00:53:09,680 --> 00:53:12,920 And then the whole proof is linear algebra. Okay. 585 00:53:13,190 --> 00:53:16,520 Good. So this is what we did here. 586 00:53:16,700 --> 00:53:21,340 Now, there has been work in the continual setting in these five cases. 587 00:53:21,350 --> 00:53:23,960 We have data streams, which are sequences of numbers. 588 00:53:24,350 --> 00:53:31,540 We have databases, which are sequences of rows that are analysed; data structures, where you have sequences of operations, 589 00:53:31,550 --> 00:53:38,690 for example heaps; graph algorithms, where you again have sequences of operations, insertions and deletions of edges; 590 00:53:39,320 --> 00:53:45,110 and clustering algorithms, where you have insertions and deletions of points and you want to maintain some clustering of them. 591 00:53:45,590 --> 00:53:52,970 All of this has been studied, but of course it is still at the beginning, meaning we don't necessarily have the best answers for them. 592 00:53:53,840 --> 00:54:00,470 So what are future directions? One is to look at user-level differential privacy, which I just mentioned before. 593 00:54:00,890 --> 00:54:04,520 Another one is local differential privacy, which I defined earlier, 594 00:54:05,030 --> 00:54:09,320 and there has been very little work on local differential privacy under continual observation. 595 00:54:10,160 --> 00:54:16,520 And then also look at the running time, actually implement these algorithms, and see how fast they are and how they actually perform. 596 00:54:16,850 --> 00:54:18,680 Yeah, so actually do algorithm engineering. 597 00:54:19,460 --> 00:54:26,990 And then there are specific problems that I'm still interested in: come up with better algorithms for clustering or graph properties, and also 598 00:54:26,990 --> 00:54:32,200 close the gap between the upper bound, which is log T to the 1.5 or log squared T 599 00:54:32,240 --> 00:54:37,450 for continual binary counting, and the log T lower bound. 600 00:54:37,460 --> 00:54:41,540 These are the two specific problems I want to look at. And then I want to just thank my co-authors.
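To close the technical part, here is a toy numerical sketch of the factorization mechanism described above, which releases L(R sigma + z). As one convenient lower-triangular factorization I take L = R = the matrix square root of the counting matrix A, whose entries are the Taylor coefficients of (1-x)^(-1/2), and I use Gaussian noise with an unspecified scale as a stand-in for the calibrated (epsilon, delta) noise; whether this matches the exact factorization and calibration in the paper is not something the transcript confirms.

```python
import numpy as np

def sqrt_counting_matrix(T):
    """Lower-triangular Toeplitz matrix C with C @ C = A, where A is the
    all-ones lower-triangular counting matrix. Its first column holds the
    Taylor coefficients of (1 - x)^(-1/2): c_k = binom(2k, k) / 4^k."""
    c = np.empty(T)
    c[0] = 1.0
    for k in range(1, T):
        c[k] = c[k - 1] * (2 * k - 1) / (2 * k)
    C = np.zeros((T, T))
    for j in range(T):
        C[j:, j] = c[: T - j]
    return C

def factorization_counter(bits, noise_scale, rng=None):
    """Continual counting via A = L @ R with L = R = A^(1/2), both lower
    triangular. Releasing L @ (R @ sigma + z) is an unbiased estimate of the
    prefix sums; in the streaming setting one would use only the top-left
    t-by-t blocks at time t. `noise_scale` stands in for the (eps, delta)
    noise calibration, which is omitted here."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.asarray(bits, dtype=float)
    C = sqrt_counting_matrix(len(sigma))
    z = rng.normal(scale=noise_scale, size=len(sigma))
    return C @ (C @ sigma + z)

bits = np.array([1, 0, 1, 1, 0, 1, 1, 0])
print(np.cumsum(bits))                                # true prefix sums
print(np.round(factorization_counter(bits, 0.1), 2))  # private estimate
```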