This is the 2pm session, and the livestream is now on, so hello if anyone's watching. This is the session today on digital trace data. You already know me, so I don't need to introduce myself. We're going to have two sessions this afternoon, and this is the first part, where I'm going to talk about digital trace data. The purpose of this session is really to think about digital trace data broadly, as a category of data, and to think about its strengths and weaknesses, but also to think about the different kinds of research designs that one might use when trying to adopt digital trace data for the purpose of social science research. In the second part of my session, after the coffee or tea break, I'll talk a bit about the tools and techniques we might use for working with digital trace data, and then conclude with some of the wider implications when we think about ethical and access-based limitations to using these kinds of data for social science research.

I want to start off, actually, at the point where Matt Salganik started yesterday, which is that we're living in the digital age, and the digital age is one where information storage has increased dramatically. There's a lot of data out there in the world and a lot of it is digital; there are billions of gigabytes of data. Accompanying that is a remarkable, exponential increase in computing power that has come along with this expansion in information storage. This has now been talked about a lot, and in some senses it is the premise for this community gathering here: we're living in this digital age, sometimes called the big data era. And I like to think that this big data era has certain defining features in relation to the way data exist in the world now. Of course, there's the explosion in the volume of data, which is what I showed in the previous slide: we just have a lot more data and a lot of it is digitally recorded. There's also the idea that the data that are produced have higher velocity, a certain kind of speediness about them that was perhaps not possible in the past with other kinds of data. Nick earlier today was talking a bit about censuses and how the DRC had its census in 1984. Now, 1984 was a long time ago, and lots of things have changed since then. But how do we know about those changes, and how can we measure them? That's temporality.
Old data sources were big, they had volume: if we think about a census, a census is a very high-volume undertaking, there's a largeness to it. But they were big and slow. Now, potentially, we are entering a paradigm where data sources could be both big and fast. So there's that important dimension of velocity that I think we have to be mindful of. Another aspect which I think is key to understanding this big data era is what I think of as a diversification of the production of data. Who are the producers of data? It's now not just administrative data, or governments or states, that are interested in and have a lot of data about people. Companies have a lot of data about people too. So there's, in some sense, a diversification of the different kinds of agencies that hold data about individuals, that have harvested data about individuals, and in some sense that's what I like to think of as a decentralisation of data production. Of course, you hold a lot of data about yourself too, as Nick was showing with those location histories: those JSON files, which were sometimes gigabytes. I have disabled my location history, so I'm not contributing to the big data cause. But when I did have my location history in the past, before I disabled it, when it was tracked for about two years, it was, I think, three gigabytes of information. So that's a lot of information that I as an individual had generated. There's a decentralisation of data, which means that different kinds of actors have data about you, but you also have a lot of data about yourself, which is also interesting to think about from the data production dimension.

And there's also the variety of data. In the past, when social scientists did research, most quantitative social analysis relied on some kind of rectangular data frame, which usually had numbers in it, discrete quantities. We now have a plethora of data sources: we have images available, we have audio available, we also have text, and Taylor will talk about text tomorrow. We have different kinds of data being generated and new technologies generating data. We have mobile phones; again, Nick talked about call detail records, which are a very novel kind of data source. There are satellite data sources. But there's also the Internet of Things, the fact that there are physical sensors that interact with our everyday world.
So we have devices like Amazon Alexa out in the world now that are capturing a lot of information about people's physical environments. Andrew Dilnot, who is the warden of Nuffield College and is also now the chair of the Geospatial Commission in this country, likes to use an example which I think is very pertinent here: a lot of cameras are now fitted on cars. Of all the cars driving around on the streets, a lot of them have dash cams, reverse cams and other kinds of cameras, and if we relied on all of the data accumulated from all of these cars, we would know where most of the potholes are in the city of Oxford, and Banbury Road has a lot of them, because I commute there every day and it's bad. We would know that if we had all of this data centralised somewhere, and it exists. This is not data that needs to be generated; it exists, but it just needs to be, in some sense, repurposed.

And this is where I like Matt Salganik's definition: a lot of the data that exist in this big data era are, in some sense, readymade. They're already there in the world. But at the same time, they are not custom-made for research. They are made, they exist, but they are not custom-made for research. This is the analogy he used yesterday, where we have, on one side, the urinal that was repurposed as art by Marcel Duchamp, and on the other the elegant, gorgeous David. No one would say David was a repurposed piece by any means; it was not a piece of marble lying around, it is very much marble that was made pristine by the work of an artist. So we live in a world where we have a lot of these readymades lying around, and while it is very clear that the urinal is not David, potentially we also have to find ways to think about the urinal and repurpose it as a work of art.

So let's start by thinking about this. As I said, I've been talking about the proliferation of different kinds of data sources and the fact that we now live in a world of data abundance, and today I said I'm going to talk about digital trace data. So what are these digital trace data? The definition that I like to use is that digital trace data are the data by-products of the digitisation of our lives and the adoption of digital technologies and platforms such as, for example, social media.
So if we think about them as these data by-products, we can think of them in different ways. In one sense, digital trace data are themselves the expression of digital interactions and purely digital phenomena. If you think about tweeting, tweeting is a digital thing to do. I don't know if there's a physical equivalent of, like, writing a 140-character note; no one writes notes to anyone with 140 characters anymore. So there isn't really a physical equivalent of tweeting; it's a purely digital phenomenon. However, we can also have digital by-products of physical activity, for example through the use of sensors like the cameras I was telling you about, or the accelerometers in the phones that we carry around with us. That's how those location histories that you were using were generated: they're essentially digital by-products created by actual physical activity. So that's also a type of digital trace. And one feature that is often pointed to as unique about these kinds of data is that, unlike self-reported measures, they're essentially much more behavioural measures. They're actually just capturing activity as it's occurring, which is different from someone telling you that this morning they went from this place to this place; your location history will reveal that indirectly. This behavioural aspect of the data can also make them, in a sense, sometimes messier as well, and we'll come back to that when we think about the weaknesses of these kinds of data sources.

In the next few slides, what I want to do is give you examples of published papers out there that have tried to use digital trace data. Social media sites are perhaps the most widely known source of digital trace data, and Twitter in particular has been widely used in social science papers to try and analyse discourses or discussions around events. It's also been used to measure things like public opinion, and it's been used to forecast elections, for instance; there's a large literature on forecasting elections using Twitter. As a social media data source, Twitter is probably the most widely used because it's the most widely accessible, although that is actually becoming, I would say, a little bit more complicated now; there are more authentication processes involved in accessing Twitter data.
But one of the reasons why I think Twitter has been so big in the world of social media and computational social science research is partly because it is one of the data sources that researchers can actually access through an API. So this paper here, and I know one of the authors, Pablo Barberá, it's by Jost and colleagues, and Pablo Barberá will be speaking next week, on Monday, here at the Summer Institute. What they do in this paper is essentially look at Twitter accounts that were created daily in Ukraine before and after some protests. They look at these peaks around certain key events over the period they're studying, and they interpret them as a form of political mobilisation around these time points occurring in the real world. Of course, there can be an active debate about what this really means, whether it is a good measure of political mobilisation and so on. But the fact that they're capturing these peaks occurring at very fine time scales, here on a weekly basis, is quite interesting.

Here's another example, from a recently published paper in Psychological Science by David Garcia and colleagues, looking at the time around the Paris terrorist attacks. They look at emotional valence, essentially discourse in terms of positive and negative sentiment around the time of the attacks. They see this shooting up of negative sentiment, and they argue that we can actually see public, collective emotions through these data sources and, in some sense, observe collective emotion and social resilience occurring in real time.

In some of my own work, I've been using the data source that Francesco Rampazzo talked about yesterday, which is the Facebook Marketing API, to nowcast global digital gender gaps. This is, I would say, a very important social development indicator. It's important for us to know where women have access to the internet and mobile phones, and it's actually something we know remarkably little about, because of the lack of widespread survey data measuring ICT use by the gender of the user, especially in low- and middle-income countries. In this context, looking at the number of Facebook users by gender, as well as other characteristics such as device type or age, can be a useful proxy for capturing digital gender inequalities on a very regular basis.
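To make this concrete, here is a minimal sketch, not the actual Digital Gender Gaps pipeline, of how one might query aggregate audience counts by gender from the Facebook Marketing API and turn them into a female-to-male ratio. The endpoint, parameter and field names below are from memory and may differ across API versions, and the access token and ad account ID are placeholders, so treat this as an illustration of the idea rather than working production code.

```python
# Sketch: query Facebook's Marketing API for estimated audience sizes by gender
# and compute a female-to-male ratio for one country. Endpoint and field names
# are assumptions that should be checked against the current Marketing API docs.
import json
import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"      # placeholder
AD_ACCOUNT_ID = "act_0000000000"        # placeholder ad account
API_URL = f"https://graph.facebook.com/v14.0/{AD_ACCOUNT_ID}/delivery_estimate"

def audience_estimate(country, gender):
    """Estimated monthly audience for one country/gender.
    In the targeting spec, gender 1 = male, 2 = female."""
    targeting = {
        "geo_locations": {"countries": [country]},
        "genders": [gender],
        "age_min": 18,
        "age_max": 65,
    }
    resp = requests.get(API_URL, params={
        "optimization_goal": "REACH",
        "targeting_spec": json.dumps(targeting),
        "access_token": ACCESS_TOKEN,
    })
    resp.raise_for_status()
    # Field name is an assumption; some API versions return bounds instead.
    return resp.json()["data"][0]["estimate_mau"]

def facebook_gender_gap(country):
    """Female-to-male ratio of estimated Facebook audiences for a country."""
    return audience_estimate(country, 2) / audience_estimate(country, 1)

# Values well below 1 suggest women are under-represented on the platform.
print(facebook_gender_gap("IN"))
```

Because the API only reports current audiences, a sketch like this has to be run repeatedly, for example daily, to build up a time series, which connects to the point I make later about this source only letting you look forward, not back.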
So on our website, Digital Gender Gaps, we essentially reissue this map every day, because we can query the data source very frequently. We can't run a census every day, and we certainly can't run a survey every day, but we can ask the API every day how many men and women are monthly active users or daily active users on Facebook.

So those are examples from social media sites. Search queries have also been used a lot, and I would say these were among the earliest examples of digital trace data being heralded with a lot of excitement and seen as the new telescope with which to view and understand human social behaviour. There was a famous quote in a paper by Duncan Watts, where he said that now we can directly observe behaviour. One of the earliest examples of this is a paper by Ginsberg and colleagues published in Nature in 2009, in which they essentially tried to match the CDC's (the Centers for Disease Control's) estimates of influenza with Google search queries. The black line is the Google search queries and the red is the CDC estimates, and you can see the CDC estimates have a bit of a lag. The Google searches essentially predict what's happening, and then the CDC follows and seems to be generally well calibrated to what the search queries predicted. So this was back in 2009, when they first published this, and there was a lot of excitement about the potential of sources such as these web search queries for being able to track and nowcast (to predict the present, essentially) key social development, health and other indicators.

Another aspect of Google search queries that has been touted is that they might capture behaviour that people might not want to report, so they might have the ability to capture phenomena that are prone to social desirability bias, things people might not want to report in the context of a survey. In some of my work with Nicola and others, we've been trying to see if we could use Google search queries, for example, to study sex-selective abortion in the context of India, which is a behaviour that we as demographers know exists. But of course, people don't want to talk about it or report that they are doing it themselves. They might be OK with telling an interviewer or a survey that other people are doing it, but they don't want to report their own behaviour in relation to these kinds of things.
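As a small illustration of the search-query nowcasting idea behind Google Flu Trends and similar studies, here is a minimal sketch that lines up a weekly search-interest series (say, exported from Google Trends) with an official surveillance series and checks how well the searches track it. The file and column names are hypothetical placeholders, and the simple linear fit stands in for the much more careful modelling in the published work.

```python
# Sketch: compare a weekly search-interest series with an official series
# (Google Flu Trends style). File and column names are hypothetical:
# flu_searches.csv has columns [week, searches]; cdc_ili.csv has [week, ili_rate].
import numpy as np
import pandas as pd

trends = pd.read_csv("flu_searches.csv", parse_dates=["week"])   # search interest
cdc = pd.read_csv("cdc_ili.csv", parse_dates=["week"])           # official ILI rates

df = trends.merge(cdc, on="week").sort_values("week")

# How strongly do this week's searches correlate with the official rate,
# both for the same week and for weeks the official series hasn't reported yet?
for lead in range(0, 4):
    r = df["searches"].corr(df["ili_rate"].shift(-lead))
    print(f"searches vs. official rate {lead} week(s) later: r = {r:.2f}")

# A crude "predict the present": fit on an early period, nowcast later weeks.
train = df[df["week"] < "2012-01-01"]
test = df[df["week"] >= "2012-01-01"].copy()
slope, intercept = np.polyfit(train["searches"], train["ili_rate"], 1)
test["nowcast"] = intercept + slope * test["searches"]
print(test[["week", "ili_rate", "nowcast"]].head())
```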
An early example of using Google search to study abortion is Reis and Brownstein in 2010, where on the x-axis you see the proportion of pregnancies ending in abortion, and you look at the relative internet search volume for abortion in those countries. What they show in this paper is that the places where people tend to search for abortion much more are also the places where there are very significant restrictions on abortion. So in some sense there is a demand here which is not being met, and people are resorting to the internet and other online resources to try and seek information about these things. By the way, I should say: please feel free to ask questions at any time, whether you want clarification on something or even have a more substantive question.

Another source of data that I've seen being used in the literature is blogs and internet forums. This is a paper that looks at how people provide social support online via Reddit, and they look at different types of support: whether it is emotional support, whether it is informational support, and in the paper they describe the other categories of support that people provide on Reddit forums where people talk about mental health. They're also interested in how much of the support is given anonymously and how much is given non-anonymously, and they find that a lot of emotional support is actually provided anonymously, whereas more practical, informational support is provided non-anonymously.

Then, Nick talked about this already extensively, so I perhaps don't have to dwell on it so much, but this is again a paper that Matt Salganik talked about yesterday, which uses call detail records. This is one of the earliest examples, a very well-known paper, of trying to do computational social science by leveraging a very novel source of data, at least at the time: call detail records. On the left, you have the predicted wealth index for districts in Rwanda, computed in 2009 using the call detail record (CDR) data, and on the right-hand side you have the same thing estimated using a very high quality but expensive survey, the Demographic and Health Survey in Rwanda. You can see that there is a correspondence in the colours across the two.
The CDR-based map was predicted by matching the CDRs to a survey in which they could actually verify and validate the information on respondents. So, using a much smaller dataset relative to the DHS, as Matt described yesterday, they were able to generate pretty similar measures. That's an example of, in some sense, reducing cost, improving the frequency of measurement and potentially also enabling more effective scalability.

The last example I want to use is sensor data. This is a paper published by Newing and colleagues, who I believe were actually working on this as part of the Office for National Statistics' big data and official statistics work. What they look at here is electricity smart meters, so this is again an example of the Internet of Things phenomenon. A lot of homes in the UK now have electricity smart meters, and you see that there are these very distinct profiles of household energy consumption, and they vary by household composition. Families that have two adults and three or more children tend to have a lot of activity around five, six, seven, and then early in the morning as well, probably around the time of school. They make the argument that we already have these smart meter data, and since we see such clear patterns of consumption based on household characteristics, we could maybe start inferring something about the structure of households in the UK just by looking at electricity consumption itself, rather than waiting for censuses and so on to come around.

And here is another example, which I like very much and which Matt talks about in his book Bit by Bit, using essentially the electronic meters in cabs in New York. This is Farber: he had information, I believe, on all rides that were taken. The New York City Taxi and Limousine Commission, the agency charged with regulating the industry, requires all taxis to be equipped with electronic devices that record all trip information, including times and locations, and the two companies that currently supply these devices report all this information to the TLC on a regular basis.
Farber obtained full information for all trips taken in New York City taxicabs for the five years from 2009 to 2013, and using these data, what he then does is actually test two competing theories in labour economics, neoclassical versus more behavioural economic approaches. He tries to see: on days when taxi drivers are getting higher fares, do they work more or do they work less? Neoclassical economics would tell you they should work more, because they're going to earn more, whereas behavioural economics would tell you, well, maybe they just have a kind of target, they want to earn two hundred dollars a day, and if they can do it in three hours, they'll stop. So what did he find? He actually found that the neoclassical approach better fit the data he had. But that's exactly taking a very detailed measurement, having a dataset of very detailed measurements, and testing a theory with it in a way that was perhaps not possible with previous kinds of data sources, because it allowed for greater disaggregation and exploration of heterogeneity.

So, as I said, I've shown you diverse examples here, and I will come back to what I think these different kinds of examples are trying to do. But there are definitely some promises that we can already see in these examples of research with digital trace data in action. In talking about this I'm going to rely a lot on Bit by Bit, so if you haven't read that chapter already, I suggest you read it afterwards. The first big promise here is the bigness, the fact that there is volume. Is bigger in and of itself better? In this context, I think it's not entirely clear whether having so much bigger data is always necessary. But when we have questions surrounding, say, a rare event, or we really want to explore certain forms of heterogeneity, where we might be interested, for example, in disaggregating to a certain geographical extent, to get some disaggregation by geography or time, there's an argument to be made that the volume is an asset and a strength. Also, from the perspective of velocity, the fact that there's higher-frequency or more regular measurement can be seen as a key strength of several sources of digital trace data, and so can the fact that they are always on.
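One simple way to see why the always-on property matters for research design is the pre/post comparison around an event, which I come back to with the Twitter example in a moment. The toy sketch below uses synthetic daily counts, so the numbers are made up; a real analysis would also need to worry about whether the composition of users changed around the event, a point I return to under drift.

```python
# Toy sketch: compare daily counts from an always-on source before and after
# a known event date. The data are synthetic; real analyses must also consider
# whether the user population itself changed around the event (drift).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = pd.date_range("2014-01-01", periods=60, freq="D")
event_day = pd.Timestamp("2014-01-31")

# Synthetic daily counts (e.g. new accounts created) with a jump after the event
counts = rng.poisson(200, size=len(days)) + np.where(days >= event_day, 80, 0)
series = pd.Series(counts, index=days)

window = 14  # days on either side of the event
pre = series[event_day - pd.Timedelta(days=window): event_day - pd.Timedelta(days=1)]
post = series[event_day: event_day + pd.Timedelta(days=window - 1)]

print(f"pre-event mean:  {pre.mean():.1f} per day")
print(f"post-event mean: {post.mean():.1f} per day")
print(f"estimated jump:  {post.mean() - pre.mean():.1f} per day")
```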
So if we go back to the Twitter example that I was talking about: the fact that Twitter is always on means you can see these peaks occurring around the time of certain kinds of events. Many times in the real world there are natural experiments that occur, but we don't have pre and post, we don't have that perfect dataset where we can actually go and observe pre and post, exploit that nice design and look at what happened as a result. But in this setting, potentially, we could use that kind of difference-in-differences design, or we could look at a change occurring. Of course, there might be problems if, for example, the composition of users changed across these time points, and then we might want to think about that; that's the broader problem of population drift, or just drift, in these kinds of data sources, and I'll come back to that later. But in general, most of the time our surveys aren't perfectly calibrated in time to capture some of these events occurring in the real world, so this always-on characteristic of some social media data sources could be valuable. And Twitter, I think, is a good example here because, unlike some other social media APIs, it allows you at least to go back in the past. The Facebook Marketing API, for example, which I'll talk about a little later, doesn't allow you to go back in the past; you can only look forward, which could be helpful if you identify something that you want to track over time. But if you wanted to go and see what happened, for example, when Sudan, as we were discussing yesterday, shut off the internet, you might not be able to go back and see how that changed over time with that source, whereas you potentially could with Twitter data.

Another aspect is that for a lot of digital trace or social media data sources, there are sometimes topics or geographies that we might capture in these data sources that we might not capture in others. And another aspect which can be seen as a promise is that they are, again, non-reactive, so they might capture behaviour that might otherwise be difficult to measure. This is something I already talked about in the context of social desirability bias, Google search and abortion. So if we think here, coming back again to the gender gaps in internet use, they're being computed here using the Facebook ad audience estimates.
What we see here is that there are some parts of the world where this ratio, which is a female-to-male ratio, so higher values correspond to greater gender equality in internet use and a lower value corresponds to women being less represented online, shows significant gender inequality online: we see that in parts of southern Asia and also in sub-Saharan Africa. Now, this is interesting for a number of reasons. The first reason is that if being online, if the internet, is a valuable way to access information and other kinds of resources relevant to people's health, education, say access to contraception, then if women aren't online, if they're systematically less online, it's important for us to know that, and we might not know that if we were relying only on survey data to measure it. So in itself, I think there's intrinsic value in mapping and capturing this development indicator. But another important aspect of trying to understand who is online and who is not, and who is captured in, in this case, the social media data source, is that if we want to use this to make wider claims, and we want to think about issues such as who is represented and who is not represented, then the fact that women are significantly underrepresented on Facebook in some parts of the world is important for us to know, if we want to use these kinds of data sources for the kinds of research questions we may have. Now, this might not be a problem if our research question lends itself to a more difference-in-differences kind of design, if our research question is, say, how does censorship affect women's internet access: if we felt that this sample was fairly stable over time in terms of its composition, maybe we could study that using this data source. But if we wanted to say what, for example, the population-level estimate is for a particular indicator of interest, we might want to take a step back and think about who we are capturing and who we are not capturing.

The same Facebook Marketing API has been used by a number of researchers; it's probably the most widely used of these sources. There's a question, yes? The question: so, if I understand correctly from Francesco's presentation yesterday, you could potentially estimate this at an even lower level as well, even at the zip-code level? Yes. Yeah, I mean, this makes sense, right?
Because in certain countries, you might have zero people in some of those cells. Yeah, absolutely, exactly. I mean, the reason why I'm showing this map is because, first, in order to understand the value of these data, the first step we undertook was to collect and create a measure of just the female-to-male ratio of Facebook users. Then we tried to test and validate it against survey-based measures of internet use from more trustworthy probability sample surveys, and we found that the correlation was about 0.8, so it was pretty good. The purpose here was, similar to the Blumenstock paper, to say that we have a smaller dataset where we have both Facebook and a survey-based measure, so let's train a simple model to make a prediction and then expand our geographical coverage to the places we don't actually have this indicator for. In other words, we've now tried to go down subnationally. But then a limitation we have is that we can measure Facebook use at the subnational level, but of course we don't have good survey data to validate exactly what it is we're capturing, so then we might use it in a different kind of way. This is being done in the context of a project where we got a grant from the UN Foundation to try and assess the value of big data sources for measuring development indicators. So in that context, this was first done where we had both data sources, to test the validity of the online data source.

There's a question: I have a question, or a speculation, about this one. I also like what you were saying about the fact that a correlation of 0.8 is pretty good, but it's not perfect. And for some applications of digital trace data, like elections, the difference between having a seven percent absolute error and a four or three percent absolute error is massive for the people who want to use it. So what would you say to a critique that says, well, this is interesting, very innovative, but there's no way this is useful at the moment? Yeah. So I think Roberto's point about how much error we can tolerate is a valid one, and I think it's very context and domain specific. Again, this is something that Nick was talking about earlier today: when you're actually interested in thinking about, OK,
does this mean we have to use a different quantity, or a different data source, altogether? If the decisions hinge on something like the second decimal point, then yes, perhaps we might want to be cautious, and we might then want to think about how these estimates could be validated, or we might just want to run a survey instead. But in a context like this one, where we're interested in simply mapping inequalities and actually showing that there are parts of the world where this inequality wasn't something people were really aware of before, that is an aspect we can still draw attention to. We're not saying that we shouldn't do any ICT surveys. We're just saying, look, if you did an ICT survey, you would realise that there might be some issues in other parts of the world too, not just in the handful of countries for which we have surveys available. So it could be useful as a first step towards drawing attention to a matter, and then saying that while we can't be precise, we can at least orient the discussion around it. So yeah, your point is well taken: the extent of error we can tolerate is domain specific. But I think there may still be value, depending on what your goal is.

This is another example from what I think is the burgeoning field of digital demography. Emilio Zagheni, who is a director of the Max Planck Institute for Demographic Research in Rostock in Germany, and his colleagues have also been trying to use the same Facebook Marketing API, which essentially provides information on aggregate counts of Facebook users broken down by different kinds of characteristics. They've been using the category of expats, and they've been trying to see how well this category, as inferred by Facebook, can help us track stocks of migrants in different parts of the world. In this paper, they look at the US in particular, across different states, and compare their estimates to the American Community Survey's estimates. What you see here is the fraction of immigrants from a particular country, and the Facebook measure seems to be fairly strongly correlated with the fraction of immigrants according to the World Bank. And they make an argument similar to the one I was making in the previous slide in relation to gender gaps.
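The workflow behind both the gender-gap and the migration examples is essentially the same: where a digital proxy and a survey benchmark both exist, check how well they agree and fit a simple model, then use that model to predict the indicator for places without a survey. Here is a minimal sketch with synthetic data; the published papers of course use richer models and additional covariates, so this is only meant to convey the logic.

```python
# Sketch of the validate-then-predict workflow: correlate a digital proxy with
# a survey benchmark where both exist, fit a simple model, and predict the
# indicator for units that lack a survey. All data here are synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "country": [f"C{i:02d}" for i in range(n)],
    "fb_ratio": rng.uniform(0.3, 1.1, n),   # digital proxy (e.g. Facebook F/M ratio)
})
# Survey benchmark observed for only half the countries
df["survey_ratio"] = 0.1 + 0.85 * df["fb_ratio"] + rng.normal(0, 0.05, n)
df.loc[df.index >= n // 2, "survey_ratio"] = np.nan

observed = df.dropna(subset=["survey_ratio"])
r = observed["fb_ratio"].corr(observed["survey_ratio"])
print(f"correlation on the validation set: {r:.2f}")

# Fit on countries with both measures, predict where the survey is missing
model = LinearRegression().fit(observed[["fb_ratio"]], observed["survey_ratio"])
missing = df["survey_ratio"].isna()
df.loc[missing, "predicted_ratio"] = model.predict(df.loc[missing, ["fb_ratio"]])
print(df.loc[missing, ["country", "fb_ratio", "predicted_ratio"]].head())
```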
Migration data, in demography: of the three demographic processes, births, deaths and migration or mobility, migration is the one we are worst able to measure, and even in high-income countries, migration or mobility isn't tracked as well as vital registration tracks births and deaths. So they were making the argument that, at least in the in-between years when other data sources might not be available, this could be a data innovation that might help fill this data gap. The timeliness and the coverage of these sources could be valuable here.

I also talked a bit about what I called non-reactivity. This is a paper by Stephens-Davidowitz in the Journal of Public Economics, and what he's looking at here, from Google Trends, is searches for the [INAUDIBLE] in the US. He takes this as a measure of racial bias, and he makes the argument in this paper that, essentially, if we just looked at surveys, we would underestimate how much racism cost Obama in the election. I suggest you read the paper, I don't know it that well, but the argument is effectively that if we look at this racial search term, it is actually very highly predictive of Obama's lost vote share. So he makes the argument that this is exactly the kind of measure we would not capture in a survey; he argues, of course, that we are not in a post-racial world, and that this is exactly the kind of non-reactive measure, perhaps free of social desirability bias, that might enable us to capture a phenomenon we might not capture in survey research.

So those are the promises. What, now, are the pitfalls of digital trace data? I think one key aspect of working with and using digital trace data is that they're quite dirty, and this is true whether you're using social media data or dealing with sensor-based data from different kinds of instruments out there in the world, such as accelerometers. About two years ago, I was a subject in a study, contributing to science by being a subject, at the John Radcliffe Hospital in Oxford, and I had a number of measurements taken of me; too many measurements, I think, they created too many variables.
At some point I wondered what they would do with so many variables. But anyway, they collected a lot of information from me: they gave me a fitness tracker, and they gave me a blood pressure monitor that would just buzz every hour automatically, the cuff would swell up and I would be measured. The blood pressure monitor was only for three or four days in the year, but I had a fitness tracker for a full year that they were collecting information from. On my final visit, I asked them: how much of the data are you using, have you found anything interesting? And they said, well, actually, half the people didn't switch on their fitness trackers. I said, OK, that's not helpful, but I've been using mine, so I'm sure some people have been using theirs. And they said, yes, but basically, so far we've deduced that about 60 to 65 percent of our data are just completely unusable. Now, this is a genuine concern that I think a lot of studies trying to adopt novel forms of measurement are going to face: a lot of the data they collect might not be usable, because devices weren't switched on or worn properly, or there was just too much noise to parse the data properly. It's an aspect of data collection you have to be mindful of when trying to use digital trace data sources.

Another aspect, and I think this is something we as a computational social science community collectively need to think about and have a broader public discussion about, is the inaccessibility of a lot of interesting and meaningful digital trace data sources. Nick, earlier in the morning, talked a lot about call detail records, and I think Chris asked a question about how accessible these data really are, because, perhaps rightfully so, people's call information is not shared widely through some kind of open-access public infrastructure. A lot of existing data sources that are held by private companies, but also by governments, are not really accessible to researchers. And part of that, especially when we think about it from the perspective of companies, is that they just have very different incentives and goals from those of researchers.
Right, their goals are often business and profit: they're interested in generating products that appeal to customers, in making sure their customers don't get angry with them, and in protecting themselves against backlash, especially in the aftermath of events such as the Cambridge Analytica story early last year. Often, companies will restrict access and make it harder to access data sources. There are some well-known cases. I think Matt yesterday already talked about the emotional contagion experiment. There's also the example of AOL providing access to people's search histories. They had claimed it was perfectly anonymised, but it turned out that some people could be identified. This resulted in very senior officials losing their jobs, and it's an episode that not only generated a lot of public backlash but also effectively made the business more inward-looking in relation to data accessibility and sharing in the aftermath.

There's also another example, which Matt talks about in his book, which is the Netflix challenge. It was a challenge in which a large dataset of what films people watched was released. Netflix essentially wanted to improve its recommender system so that it could better predict what films people might want to watch. It used the common task framework, this machine learning challenge set-up where people could make predictions, and later on a lawsuit was filed against Netflix because it emerged that specific individuals could be identified through their film-watching habits. Now, you would think, why is this sensitive? It's just the films people watch; do you care if someone knows what films you're watching? Well, for some people this might be sensitive information, and it's hard for us to predict what might be sensitive for some people and not so sensitive for others. The lawsuit was brought forward by a lesbian woman who said that the kinds of things people watch reveal real, intimate things about their identities, things they might not want other people to know about, especially if they belong to families that might not accept them for this.
So from one perspective you think, oh, this is films, why should we care? But for some people, that might be highly sensitive information. And I guess your ethics activity perhaps gave you a sense that people might have contrasting perspectives on what is sensitive and what is not. Yeah, there's a question about using these data in the long term: it might be that these companies restrict access more and more each year, especially when stories like these come out. Yeah, I think that's a genuine concern. This is something that Deen Freelon, he's a communication scholar, has written about: he says that we're now living in a post-API age, where in the early 2000s we saw this big proliferation of web resources that researchers were given access to, but now we actually see companies clamping down and restricting what's available. And one of the challenges with this, as I was saying, is that we have, I think, a very limited understanding so far of what people's notions and conceptions of privacy are. On one side, we increasingly live our lives online, we have accounts for different things and we share a lot of information online. But it turns out that people still have certain norms and notions of privacy associated even with that kind of life online, with the sharing that occurs online, and trying to parse out what is sensitive, and what people consider to be things they would want to consent to before they're used, is challenging. That being said, on the other end there's the wider discussion that a lot of these data could also be put to good public use, in terms of understanding phenomena we might not necessarily know very much about, or mapping, as Nick was showing, malaria and movement and mobility. So trying to find the balance between good public use and, at the same time, recognising issues such as privacy and the importance of what people find to be sensitive information is something I don't think we have a good answer to at the moment, but we need to have a bigger public discussion about it.
448 00:48:29,270 --> 00:48:37,160 some of them would not be possible to do today, right? 449 00:48:37,160 --> 00:48:41,780 I think all of the studies — I intentionally chose them because I think it would still be possible to do them. 450 00:48:41,780 --> 00:48:42,560 Yeah, yeah. 451 00:48:42,560 --> 00:48:50,540 I intentionally chose these examples because they're using data sources that are still there. It might be that, for example with Twitter, 452 00:48:50,540 --> 00:48:55,040 there's a more detailed credentialing process now, 453 00:48:55,040 --> 00:49:03,470 where you have to write about why you want to access these data, but the Twitter API is still accessible. 454 00:49:03,470 --> 00:49:08,570 So with all of these data sources, it would be possible to do this research, at least to date, right? 455 00:49:08,570 --> 00:49:14,600 A lot of these are also aggregated, so search volumes are aggregated. 456 00:49:14,600 --> 00:49:18,200 So anyway, that was intentionally part of my reasoning. 457 00:49:18,200 --> 00:49:22,430 But I think the inaccessibility problem is still there. While 458 00:49:22,430 --> 00:49:26,060 call detail records are inaccessible to a public researcher, 459 00:49:26,060 --> 00:49:30,890 if you have a relationship, 460 00:49:30,890 --> 00:49:40,640 then you can sign an agreement with a mobile phone provider and potentially, within the context of a project, leverage some of those data. 461 00:49:40,640 --> 00:49:47,210 So I already talked a bit about the sensitivity aspect. There's also the aspect of incompleteness. 462 00:49:47,210 --> 00:49:51,590 I think with most digital data sources, you're never going to have everything, 463 00:49:51,590 --> 00:49:59,360 because again, they're not custom-made for research. So we won't know, for instance, the demographic characteristics of our users. 464 00:49:59,360 --> 00:50:10,340 And I've obviously seen papers that will still infer those, using some existing classifiers or algorithms. 465 00:50:10,340 --> 00:50:18,110 So they'll try to infer people's gender or their race or their ethnicity by looking at images and so on and so forth. 466 00:50:18,110 --> 00:50:24,620 And I think there are also important ethical questions about whether we should be doing that. 467 00:50:24,620 --> 00:50:34,970 But that's an aspect of the fact that they're incomplete, and we could try to overcome some issues of incompleteness by doing things 468 00:50:34,970 --> 00:50:42,980 that might themselves be sensitive. So, you know, there are tensions there that I think are important to bear in mind. 469 00:50:42,980 --> 00:50:50,990 I already talked a bit about this, but there is the issue of data sources being non-representative. Now, in and of itself, 470 00:50:50,990 --> 00:50:59,300 as I said, whether something being non-representative is a problem depends on your research question, 471 00:50:59,300 --> 00:51:02,900 right, and what exactly it is that you're hoping to 472 00:51:02,900 --> 00:51:12,080 answer. And there are maybe ways around it, as Roberto will talk about on Friday and as Matt will talk about on the livestream on Thursday. 473 00:51:12,080 --> 00:51:24,380 But at the same time, I don't think that just because something is non-representative, we shouldn't use it.
474 00:51:24,380 --> 00:51:35,180 So this is an example here, again coming back to using Facebook to monitor stocks of migrants, the example by Emilio Zagheni and colleagues. 475 00:51:35,180 --> 00:51:42,860 One interesting aspect of this is that they use this 476 00:51:42,860 --> 00:51:52,700 behaviour or category that Facebook classifies people into, as expats — whether someone is an expat or not. 477 00:51:52,700 --> 00:52:00,320 So you can, for example, go and see what Facebook classifies you as, in relation to what kind of ads you should be 478 00:52:00,320 --> 00:52:05,930 seeing, based on what Facebook has inferred about you or what you've reported to Facebook. 479 00:52:05,930 --> 00:52:14,540 And if you're interested in that, I can share a link for it later on. But what's interesting here is that they're using this category of expat, 480 00:52:14,540 --> 00:52:20,270 but we don't really know who exactly is an expat or how that's been inferred. 481 00:52:20,270 --> 00:52:25,640 And behind this, in some sense, is a sort of black-box algorithm, because this is not a survey. 482 00:52:25,640 --> 00:52:31,370 We don't really have extensive metadata about it in relation to knowing who 483 00:52:31,370 --> 00:52:38,750 is included, how this category is defined, and how it might potentially change over time. 484 00:52:38,750 --> 00:52:43,460 So if we think about 485 00:52:43,460 --> 00:52:54,470 how it might change over time, that's where I think thinking about drift — population drift and usage drift — is really important. 486 00:52:54,470 --> 00:52:59,300 So the composition of users on these platforms might change over time, 487 00:52:59,300 --> 00:53:06,050 and that might have implications for what it is that we are measuring. 488 00:53:06,050 --> 00:53:12,830 But the system itself might also change over time, because these are businesses that have different rationales and motivations. 489 00:53:12,830 --> 00:53:14,570 So, for example, 490 00:53:14,570 --> 00:53:22,250 they might implement an improvement to an algorithm which might not be great from a social research perspective because it messes up the measurement, 491 00:53:22,250 --> 00:53:26,390 but it might make them millions of dollars. So then, who's complaining? 492 00:53:26,390 --> 00:53:36,200 So there are trade-offs here — distinctions between what the goals of businesses often are and what the goals of 493 00:53:36,200 --> 00:53:44,150 researchers are. And I think this was very nicely exemplified by the Google Flu example. 494 00:53:44,150 --> 00:53:51,800 So I first showed you the Ginsberg and colleagues paper from 2009, where we had Google Flu estimates effectively being matched 495 00:53:51,800 --> 00:53:59,060 two weeks or a month later by the CDC estimates, and Google Flu was doing really well with the prediction then. 496 00:53:59,060 --> 00:54:08,630 And this became such an exciting project that Google even had an in-house team running Google Flu Trends. 497 00:54:08,630 --> 00:54:14,720 So they were trying to match and predict CDC estimates of influenza before the CDC — 498 00:54:14,720 --> 00:54:19,490 they were trying to predict levels of the flu before the CDC estimates came out. 499 00:54:19,490 --> 00:54:25,250 But then, sometime around 2012, we started seeing that
500 00:54:25,250 --> 00:54:32,750 actually, Google was significantly over-predicting flu, almost double the CDC estimates, 501 00:54:32,750 --> 00:54:42,230 and it kept estimating flu to be much higher than it actually was over the rest of the year. 502 00:54:42,230 --> 00:54:45,320 So why is it that this happened, right? 503 00:54:45,320 --> 00:54:51,680 David Lazer and colleagues have a very interesting paper where they try to diagnose some of these problems. 504 00:54:51,680 --> 00:54:57,620 It's actually probably one of the better-known papers in this literature, The Parable of Google Flu, 505 00:54:57,620 --> 00:55:03,500 published in Science in 2014. And one of the things that happened — 506 00:55:03,500 --> 00:55:07,820 so think back to the idea of drift. 507 00:55:07,820 --> 00:55:16,280 If we think about platform drift: in 2011, Google made changes to its search algorithm so that 508 00:55:16,280 --> 00:55:20,990 people were essentially told that if you're searching for a cough or a cold, 509 00:55:20,990 --> 00:55:24,820 you might be interested in the flu or you might be interested in something else. 510 00:55:24,820 --> 00:55:34,340 So essentially we had an algorithm that encouraged certain kinds of search behaviour and pushed people towards different kinds of terms, 511 00:55:34,340 --> 00:55:38,090 irrespective of whether they actually had the flu or not. 512 00:55:38,090 --> 00:55:42,620 So that's an example of the algorithm, or the system, changing — 513 00:55:42,620 --> 00:55:53,180 creating a drift that might make measurement a little bit trickier. 514 00:55:53,180 --> 00:56:05,510 Right, so what I've talked about so far is the notion that digital trace data sources exist, 515 00:56:05,510 --> 00:56:09,680 and they have promises and they have pitfalls. 516 00:56:09,680 --> 00:56:17,480 And I've talked a bit about some of these shortcomings in the context of some of the papers that have been published in this literature. 517 00:56:17,480 --> 00:56:22,340 But if we think about the kinds of approaches that people are adopting 518 00:56:22,340 --> 00:56:29,630 when they're using some form of digital trace data to answer social research questions, 519 00:56:29,630 --> 00:56:37,280 I see them, in some sense, as falling into three categories of research projects. 520 00:56:37,280 --> 00:56:48,620 So the first of these is what I see as measurement papers — papers that use different forms of digital trace data for operationalising constructs. 521 00:56:48,620 --> 00:56:54,020 Often they're trying to operationalise constructs at the macro level, because they believe that this 522 00:56:54,020 --> 00:57:00,500 is able to provide valuable information that might not be captured in other data sources. 523 00:57:00,500 --> 00:57:09,230 So if you think back to the Garcia paper about collective sentiment in the aftermath of a terrorist attack, they are implicitly relying on the notion 524 00:57:09,230 --> 00:57:19,310 that they're able to capture this kind of macro-level collective sentiment through Twitter in a way that might not be captured otherwise. 525 00:57:19,310 --> 00:57:29,660 There are a number of papers that have tried to see whether we can measure mood or sentiment for populations through Twitter.
526 00:57:29,660 --> 00:57:34,940 And I think implicitly the rationale there is that we are capturing or operationalising some kind of 527 00:57:34,940 --> 00:57:42,800 macro-level construct that we might not be able to in another setting. 528 00:57:42,800 --> 00:57:51,140 Another way of thinking about the use of these data sources for the purposes of measurement, in relation 529 00:57:51,140 --> 00:58:00,860 to what's already out there in the literature, is that these data sources have been used a lot for nowcasting and for filling data gaps. 530 00:58:00,860 --> 00:58:08,570 Now, the Google Flu example is an example of that, where they are trying to nowcast levels of the flu — essentially to beat official 531 00:58:08,570 --> 00:58:12,860 statistics, because official statistics come with a lag. 532 00:58:12,860 --> 00:58:22,490 So we're trying to essentially "predict the present", as Hal Varian called it, because with official or more conventional data sources there will be 533 00:58:22,490 --> 00:58:27,740 an inevitable lag — we won't know about what's happened until well after it's happened. 534 00:58:27,740 --> 00:58:35,870 And in that respect, I think nowcasting has been done in relation to health surveillance in the public health literature. 535 00:58:35,870 --> 00:58:39,350 It's also been done — or that's kind of the rationale — 536 00:58:39,350 --> 00:58:46,880 behind our work on digital gender gaps: we are trying to say what's happening now in relation to digital gender inequalities. 537 00:58:46,880 --> 00:58:55,970 I also see that as a rationale behind work in digital demography that's trying to examine levels of migration or stocks of migrants. 538 00:58:55,970 --> 00:58:57,860 And one thing that, if anything, 539 00:58:57,860 --> 00:59:05,420 I've learnt from this kind of measurement exercise is that to motivate these data sources for measurement, 540 00:59:05,420 --> 00:59:14,000 we always have to justify what we're gaining by using them. Similar to what Roberto was asking before, we have to think about: 541 00:59:14,000 --> 00:59:18,530 what do we gain from using this that we might not gain if we were not using it? 542 00:59:18,530 --> 00:59:23,270 So I like to think of that as comparing against an offline benchmark. 543 00:59:23,270 --> 00:59:29,090 So in some of our digital gender gaps work, one of the things we often tried to do was to say: 544 00:59:29,090 --> 00:59:35,570 what if we didn't have the Facebook data and we tried to build some kind of prediction model to predict 545 00:59:35,570 --> 00:59:40,520 digital gender inequality just from other information that we have, from other development indicators? 546 00:59:40,520 --> 00:59:46,280 How well would we do with that? And then we compared that with what we do with just Facebook. 547 00:59:46,280 --> 00:59:51,740 And then we tried a hybrid approach where we combine measures from Facebook with those that 548 00:59:51,740 --> 00:59:57,650 are available from other kinds of survey data sources or other kinds of development indicators. 549 00:59:57,650 --> 01:00:04,220 So either finding a way to compare against an offline benchmark or having some kind of hybrid approach might be valuable, 550 01:00:04,220 --> 01:00:08,690 also to motivate why this work is useful and important.
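To make that offline-benchmark idea concrete, here is a minimal sketch in Python. The file name, the column names and the choice of a simple linear regression are hypothetical illustrations, not the actual models used in the digital gender gaps work; the point is only that an offline-indicators-only model, a Facebook-only model and a hybrid model are compared on the same cross-validated error.

```python
# Minimal sketch (hypothetical file and column names): compare an offline
# benchmark model, a Facebook-only model and a hybrid model for predicting
# a survey-based outcome, using the same cross-validated error metric.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("country_indicators.csv")      # one row per country (hypothetical)
outcome = "survey_internet_gender_gap"          # survey-based ground truth (hypothetical)

feature_sets = {
    "offline benchmark": ["gdp_per_capita", "mean_years_schooling"],
    "facebook only":     ["fb_female_male_user_ratio"],
    "hybrid":            ["gdp_per_capita", "mean_years_schooling",
                          "fb_female_male_user_ratio"],
}

for name, cols in feature_sets.items():
    scores = cross_val_score(LinearRegression(), df[cols], df[outcome],
                             scoring="neg_mean_absolute_error", cv=5)
    # cross_val_score returns negative MAE, so flip the sign for readability.
    print(f"{name}: cross-validated MAE = {-scores.mean():.3f}")
```

Whichever specification wins, reporting all three side by side is what makes the case that the digital trace source adds something beyond the offline benchmark.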
551 01:00:08,690 --> 01:00:16,010 I also see that, for example, as a shortcoming of the original Google Flu paper, in that 552 01:00:16,010 --> 01:00:21,620 if you read the 2009 paper, they sort of say: oh, 553 01:00:21,620 --> 01:00:25,910 we can predict flu really well with Google. But actually 554 01:00:25,910 --> 01:00:31,790 they don't compare themselves at all with a simple time series model where they just use lag one and lag two. 555 01:00:31,790 --> 01:00:38,090 They're not doing anything where we could actually just use past flu to predict current flu. 556 01:00:38,090 --> 01:00:44,330 And why don't we do that instead? Why do we have to do pyrotechnics with Google? 557 01:00:44,330 --> 01:00:51,370 So one way to justify your research, if you are interested in measurement or 558 01:00:51,370 --> 01:00:56,710 nowcasting, would be to say: well, actually, there are some pros and some cons of these data sources, 559 01:00:56,710 --> 01:01:03,220 and by doing a comparison explicitly against some kind of offline benchmark or some other kind of model, 560 01:01:03,220 --> 01:01:11,620 we can in some sense rationalise our approach and say that there are some things that are potentially gained from this. 561 01:01:11,620 --> 01:01:18,610 Another line of research that I see emerging with the use of digital trace 562 01:01:18,610 --> 01:01:27,430 data is papers that see digital platforms themselves as microcosms of society. 563 01:01:27,430 --> 01:01:34,510 So, as a lot of life is now lived online and there are purely digital phenomena, as I was talking about before, 564 01:01:34,510 --> 01:01:46,510 could we potentially think of digital spaces as microcosms of society in which to go and test certain kinds of theories that we may have? 565 01:01:46,510 --> 01:01:52,930 I think there was an example in one of the lightning talks yesterday which was using moral foundations theory — 566 01:01:52,930 --> 01:01:56,560 I think it was reviewing some of the moral foundations theory. And that's, to me, 567 01:01:56,560 --> 01:02:01,270 an example of thinking of Twitter as a kind of digital space where we might test a 568 01:02:01,270 --> 01:02:07,420 theory about how people's moral intuitions are configured. And perhaps, 569 01:02:07,420 --> 01:02:07,630 you know, 570 01:02:07,630 --> 01:02:19,300 you might have a theory that you want to take and test in a digital space yourself, and think about how that might play out. 571 01:02:19,300 --> 01:02:26,680 And the third strand of papers, as I see it, that try to leverage different kinds of digital trace 572 01:02:26,680 --> 01:02:36,670 data are those that are actually interested in thinking about the implications of digital technologies for social processes. 573 01:02:36,670 --> 01:02:39,460 Now, this is something that I myself am very interested in. 574 01:02:39,460 --> 01:02:46,570 I'm very interested in thinking about, for example, now that we are measuring gender inequalities in internet and mobile phone access, 575 01:02:46,570 --> 01:02:51,940 what they actually mean for the attainment of other social development outcomes, 576 01:02:51,940 --> 01:03:00,430 such as access to contraception, or access to information about HIV and antenatal screening.
577 01:03:00,430 --> 01:03:08,140 These might be examples of how access to digital technology empowers, or provides new forms of resources and 578 01:03:08,140 --> 01:03:15,430 information, that could have implications for how the world works and for other kinds of social inequalities. 579 01:03:15,430 --> 01:03:21,940 And I think this is something that a lot of us — at least, I can speak more for demographers and sociologists and less for other disciplines, 580 01:03:21,940 --> 01:03:28,270 and I know you are from many different disciplines, so I don't want to overstep my claims — 581 01:03:28,270 --> 01:03:32,110 but I do want to say that I think a lot of social scientists, 582 01:03:32,110 --> 01:03:39,940 a lot of sociologists and demographers, haven't thought about digital inequality enough. 583 01:03:39,940 --> 01:03:50,110 They haven't thought about its implications for other forms of social inequality, for processes of social stratification. 584 01:03:50,110 --> 01:03:56,020 And that's a dimension where different conversations need to be had, 585 01:03:56,020 --> 01:04:02,470 and potentially the use of digital trace data could be helpful in that regard. 586 01:04:02,470 --> 01:04:06,820 Is that a question, or no? Yeah, I do have a question. 587 01:04:06,820 --> 01:04:11,860 It's actually more like an observation. Yeah, it's something I've certainly worried about. 588 01:04:11,860 --> 01:04:20,170 So, when you use traditional surveys, you usually have a lot of documentation on how the data were gathered. You usually have this 589 01:04:20,170 --> 01:04:25,070 technical documentation that says if something came up in the data collection process, 590 01:04:25,070 --> 01:04:28,480 if there's an anomaly in a question and so on. 591 01:04:28,480 --> 01:04:35,100 And that's good, because when you're analysing or exploring the data, you can actually explain why you see stuff that doesn't make sense. 592 01:04:35,100 --> 01:04:43,180 But we don't have that when we're analysing most social media data or digital trace data that we use. 593 01:04:43,180 --> 01:04:50,020 Because, for example, when we're using Facebook data or Twitter data, the absence of something might be telling, 594 01:04:50,020 --> 01:04:54,280 or there might be a column which means something that we're not aware of. 595 01:04:54,280 --> 01:05:00,820 We don't really know what everything is about, and we can't really know how the data were gathered, in some sense. 596 01:05:00,820 --> 01:05:06,340 So I was just thinking whether we should start to uphold some sort of standards in terms of which 597 01:05:06,340 --> 01:05:13,330 data we use, and whether we should have further information, or what we expect of private companies. 598 01:05:13,330 --> 01:05:21,370 Right? Yeah. I mean, I think this ties in with something I was saying earlier — that we have, in some sense, the issues of accessibility. 599 01:05:21,370 --> 01:05:26,470 But to some extent, you're right: it's not a survey. 600 01:05:26,470 --> 01:05:31,270 It's not meant to be used in the same way as a survey. 601 01:05:31,270 --> 01:05:37,210 And in fact, in my own experience, in the work that I've been doing with the Facebook Marketing API, 602 01:05:37,210 --> 01:05:44,380 one common complaint that we often encounter from Facebook is this:
603 01:05:44,380 --> 01:05:51,180 we describe the Facebook Marketing API as providing a digital census of the Facebook population, and they dislike 604 01:05:51,180 --> 01:05:58,110 that phrase. They say: we don't want to be seen as a digital census, because we are not that. 605 01:05:58,110 --> 01:06:03,330 We are providing audience targets for advertisers, and that's what we care about. 606 01:06:03,330 --> 01:06:08,940 So there is definitely a tension there, that we might not know what 607 01:06:08,940 --> 01:06:13,890 we're measuring because of the lack of appropriate metadata, or we might not have information. 608 01:06:13,890 --> 01:06:18,390 But at the same time, rather than saying we shouldn't be using these data, 609 01:06:18,390 --> 01:06:22,770 I think we should be saying we should be using them, and then asking for 610 01:06:22,770 --> 01:06:30,690 clarity from the providers of these data by motivating their importance for research. 611 01:06:30,690 --> 01:06:36,390 And I think that is, 612 01:06:36,390 --> 01:06:40,680 for me, the way forward. Are there some frameworks out there 613 01:06:40,680 --> 01:06:43,890 that already have something to say on this? Yeah, 614 01:06:43,890 --> 01:06:51,940 I mean, there are some attempts to try and do this now, such as the Social Science One framework, 615 01:06:51,940 --> 01:06:59,640 this kind of attempt at forming a partnership between academia and private companies. 616 01:06:59,640 --> 01:07:04,710 And there's also the Open Algorithms (OPAL) project, 617 01:07:04,710 --> 01:07:10,380 which is trying to create a kind of secure data infrastructure so that 618 01:07:10,380 --> 01:07:14,880 other kinds of users might also be able to use anonymised call detail records 619 01:07:14,880 --> 01:07:20,070 beyond just small projects that are very specific to particular individuals. 620 01:07:20,070 --> 01:07:27,420 So with these kinds of partnerships there is an attempt now, I think, to have this discussion and move forward in that space. 621 01:07:27,420 --> 01:07:35,010 But it's still early stages, and often it's hampered by events such as Cambridge Analytica, 622 01:07:35,010 --> 01:07:39,960 which completely pivot the dialogue in a very different direction. 623 01:07:39,960 --> 01:07:48,930 And I think that's kind of the challenge: all the discussion then is about misuse, and we forget about the potential good uses as a result. 624 01:07:48,930 --> 01:07:55,990 So anyway — sorry, David, you had something to say about this discussion of how the data are created by the company, or what? 625 01:07:55,990 --> 01:07:59,340 Yeah, there is the part about how the data are gathered by the company, 626 01:07:59,340 --> 01:08:07,860 and then there is the step of us as researchers, which I think is complicated by the technology, but also really complicated by the companies. 627 01:08:07,860 --> 01:08:13,500 So, looking more broadly, we don't have standards for reporting with this sort of data yet. 628 01:08:13,500 --> 01:08:22,730 Yeah. You want to come up with something like CONSORT, so that when you're reviewing a kind of randomised controlled trial or something, 629 01:08:22,730 --> 01:08:28,890 you know what should be there. Yeah,
you can say that it was well done, that the reporting is transparent. 630 01:08:28,890 --> 01:08:38,590 And I've had pushback with that. Just the other day, I got back an article that I had reviewed, where I had pushed 631 01:08:38,590 --> 01:08:44,770 the authors to try and be a bit more transparent about the steps, about the selection process, 632 01:08:44,770 --> 01:08:51,610 which we just don't have the norms for reporting. And their response was: well, nobody else does that, 633 01:08:51,610 --> 01:08:56,940 and I was setting the bar too high for them because nobody else has to do that. 634 01:08:56,940 --> 01:09:02,740 At some point it has to change. There is the company side of it, but there's also just us: 635 01:09:02,740 --> 01:09:07,470 we don't have a blueprint for what we should report in terms of a code book. 636 01:09:07,470 --> 01:09:17,350 I mean, there are also legitimate tensions here. So there were some messages on Slack earlier about reproducibility and openness, 637 01:09:17,350 --> 01:09:23,160 and while I think as a community we have to be moving in that direction, 638 01:09:23,160 --> 01:09:31,980 there are, again, tensions, if we're using private company data, about what we can actually reveal publicly online. 639 01:09:31,980 --> 01:09:38,490 For example, this was an explicit decision when we put our digitalgendergaps.org site online. 640 01:09:38,490 --> 01:09:43,360 We were told that we shouldn't be releasing the raw counts from Facebook online. 641 01:09:43,360 --> 01:09:49,290 We should only be producing model estimates and putting those online, because, you know, 642 01:09:49,290 --> 01:09:58,350 we've had discussions with Facebook as a result of that about what they're OK with us sharing and what not. 643 01:09:58,350 --> 01:10:00,330 And then, you know, 644 01:10:00,330 --> 01:10:05,340 there's a tension, because we want to be open and reproducible and we want other people to be able to do what we're doing, 645 01:10:05,340 --> 01:10:12,180 but at the same time, if we're not able to provide the data, then we can't advance that agenda in the same way. 646 01:10:12,180 --> 01:10:14,010 This is not related to digital trace data, 647 01:10:14,010 --> 01:10:19,090 but I know some of you have also expressed an interest in agent-based modelling. 648 01:10:19,090 --> 01:10:28,710 There's been this interesting movement towards adopting the ODD protocol — I'm now forgetting exactly what ODD stands for — 649 01:10:28,710 --> 01:10:34,500 but it's a rubric for what, when you're writing up and describing an agent-based model, 650 01:10:34,500 --> 01:10:43,410 you should be looking to very clearly define. And that kind of rubric emerged because one of the big critiques of agent- 651 01:10:43,410 --> 01:10:48,330 based modelling initially was that people can model whatever they want and they can just, 652 01:10:48,330 --> 01:10:54,630 you know, put whatever they want in a system and then recreate and generate any pattern and say that their theory works. 653 01:10:54,630 --> 01:11:02,640 And this was a big critique of the fact that there was just this unstructured, Wild West style of ABM programming going on.
654 01:11:02,640 --> 01:11:10,590 And one response to that was: let's generate a set of protocols about what needs to be reported and how it needs to be documented. 655 01:11:10,590 --> 01:11:17,190 And I think we could potentially be moving in that direction with this too. 656 01:11:17,190 --> 01:11:27,030 Yeah. Yeah, that could actually be a very useful thing, especially in relation to thinking about it from an ethics perspective as well. 657 01:11:27,030 --> 01:11:31,400 Yeah. Yeah. 658 01:11:31,400 --> 01:11:37,790 Yeah. So the bigger problem with this specific thing is that obviously the data you sample on a 659 01:11:37,790 --> 01:11:42,440 single day could, two months later, be a completely different set. 660 01:11:42,440 --> 01:11:43,890 Yeah, that's the drift problem. 661 01:11:43,890 --> 01:11:50,570 Yeah, and especially with the marketing side it's to do with the algorithms — there are multiple estimates of the same stuff. 662 01:11:50,570 --> 01:11:59,040 Yeah. A simple check would be: if you had to run the same collection process three times, would you get the same thing? 663 01:11:59,040 --> 01:12:04,890 Yeah. Yeah, yeah. 664 01:12:04,890 --> 01:12:09,630 Oh my goodness, yeah. Or you might want to think about that again — 665 01:12:09,630 --> 01:12:13,920 it would probably be very specific to what kind of question you have and what kind of data. 666 01:12:13,920 --> 01:12:20,730 Yeah. But if you have a lot of variability — I mean, you notice that the counts will change if you query repeatedly. 667 01:12:20,730 --> 01:12:24,090 So for example, again with our Facebook Marketing API work, 668 01:12:24,090 --> 01:12:29,280 we find that the daily active users change a lot, while the monthly active users tend to be much more stable. 669 01:12:29,280 --> 01:12:33,150 But part of the back end of the project is, by collecting the data 670 01:12:33,150 --> 01:12:37,770 every day, to actually be able to understand what it is that changes and how it changes. 671 01:12:37,770 --> 01:12:44,040 Because I think most researchers using these kinds of data sources will use them once, 672 01:12:44,040 --> 01:12:51,780 or as a one-off collection, and then they won't necessarily think about it from a longer-term perspective. 673 01:12:51,780 --> 01:12:58,770 But when they do want to use them longer term, you're right that they have to think a little bit more about how the drift affects them. 674 01:12:58,770 --> 01:13:00,870 And there's also 675 01:13:00,870 --> 01:13:11,520 the marketplace nature of this, because, yeah, I put myself on the marketplace online and you can harvest that. [partly inaudible] 676 01:13:11,520 --> 01:13:15,990 Yeah, there's obviously a lot in how many people are there at the same time. 677 01:13:15,990 --> 01:13:23,550 Yeah. [partly inaudible] 678 01:13:23,550 --> 01:13:31,710 [partly inaudible] Yeah, I see. 679 01:13:31,710 --> 01:13:36,540 Yeah, because most of the stuff — yeah. [partly inaudible] 680 01:13:36,540 --> 01:13:45,270 Yeah, yeah. I mean — you mean the platform is changing, but also that, say, Nike is starting to advertise. 681 01:13:45,270 --> 01:13:50,050 Yeah, yeah. Well, that's — yeah.
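On that point about collecting the counts every day to be able to see what changes: here is a minimal sketch, in Python, of what that kind of logging and drift check might look like. The fetch function is a simulated placeholder rather than a real Facebook Marketing API call, and the segment labels and file name are hypothetical.

```python
# Minimal sketch: log audience-count estimates once per collection run and
# summarise how much they drift across runs. fetch_audience_count() is a
# simulated placeholder, NOT a real Facebook Marketing API call.
import datetime as dt
import os
import random

import pandas as pd

SEGMENTS = ["women_18plus", "men_18plus"]   # hypothetical targeting segments
LOG_FILE = "audience_counts.csv"

def fetch_audience_count(segment: str) -> int:
    # Placeholder: replace with your actual collection routine.
    return random.randint(900_000, 1_100_000)

def collect_once() -> None:
    """Append today's counts for each segment to the log file."""
    rows = [{"date": dt.date.today().isoformat(),
             "segment": s,
             "count": fetch_audience_count(s)} for s in SEGMENTS]
    pd.DataFrame(rows).to_csv(LOG_FILE, mode="a", index=False,
                              header=not os.path.exists(LOG_FILE))

def summarise_drift() -> pd.DataFrame:
    """Coefficient of variation per segment: a rough instability indicator."""
    log = pd.read_csv(LOG_FILE)
    summary = log.groupby("segment")["count"].agg(["mean", "std"])
    summary["cv"] = summary["std"] / summary["mean"]
    return summary

if __name__ == "__main__":
    collect_once()           # run daily, e.g. from a scheduled job
    print(summarise_drift())
```

Running the collection on a schedule, rather than as a one-off pull, is what makes it possible to say whether a change in the estimates reflects the population or the platform.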
682 01:13:50,050 --> 01:13:55,920 Yeah, because you don't know about the other kinds of — yeah, who knows who is interacting with your stuff. 683 01:13:55,920 --> 01:14:01,980 Yeah, using their mobiles too. Yeah. So yeah. 684 01:14:01,980 --> 01:14:06,030 No, I mean, I think there is definitely that dimension of the unknown as well, 685 01:14:06,030 --> 01:14:12,900 where the algorithm itself from the companies could be changing, but also the environment in which, you know, you might be putting something up. 686 01:14:12,900 --> 01:14:17,610 But yeah, in some sense the kinds of issues that I'm raising are very general, 687 01:14:17,610 --> 01:14:26,760 and you might have very specific kinds of issues in relation to a specific project that would then come up too. 688 01:14:26,760 --> 01:14:36,270 So those were, as I said, designs that essentially rely on digital trace data by themselves, 689 01:14:36,270 --> 01:14:43,080 sometimes in combination with other kinds of existing observational or survey data sources. 690 01:14:43,080 --> 01:14:55,050 But I think the way forward — and this is an argument that Chris Bail makes very actively in the context of SICSS; he actually has 691 01:14:55,050 --> 01:15:00,960 a tutorial or some slides in which he makes the argument for hybrid research designs, so I would recommend you go and look at those — 692 01:15:00,960 --> 01:15:09,180 may be hybrid research designs for answering some questions, also 693 01:15:09,180 --> 01:15:15,780 with digital trace data. So the first example, and this is something that's already come up, 694 01:15:15,780 --> 01:15:22,140 I've noticed, a few times, is: how could we combine digital trace data with conventional data sources like surveys? 695 01:15:22,140 --> 01:15:26,850 So there was someone — I think Clemens yesterday — who in a sense is doing that: 696 01:15:26,850 --> 01:15:34,770 relying on a data set where, within the context of a survey, there are browser histories embedded in it. 697 01:15:34,770 --> 01:15:44,610 So he has information from both the survey and the browser history, and he is using both to answer a question. 698 01:15:44,610 --> 01:15:52,890 And that, I think, could be a great way forward: if we are designing or piloting a survey, 699 01:15:52,890 --> 01:16:00,840 we could also ask for different kinds of measures, potentially things such as, you know, browsing history, as one example. 700 01:16:00,840 --> 01:16:06,030 But also, we might be willing to share information from our social media pages; 701 01:16:06,030 --> 01:16:14,970 we might make our Twitter handles available to someone who's doing a survey, depending on what the questions are. 702 01:16:14,970 --> 01:16:20,520 So I think that could be a hybrid research design that could work particularly 703 01:16:20,520 --> 01:16:25,560 well and might help overcome some of the weaknesses that I talked about. 704 01:16:25,560 --> 01:16:34,560 Then there are apps for data generation and extraction. Now, I know Chris and, I think, Taylor, you as well worked on a project with social media apps.
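On the survey-plus-trace idea just described — consented browser histories or Twitter handles collected alongside a questionnaire — here is a minimal sketch of the linkage step in Python. The file and column names are hypothetical; the point is only the normalise-then-join pattern that attaches behavioural summaries to each respondent.

```python
# Minimal sketch (hypothetical files and columns): link survey responses to
# digital trace data via an identifier respondents consented to share,
# e.g. a Twitter handle collected in the survey.
import pandas as pd

survey = pd.read_csv("survey_responses.csv")   # includes a 'twitter_handle' column
traces = pd.read_csv("collected_tweets.csv")   # one row per tweet, with a 'handle' column

def normalise(handles: pd.Series) -> pd.Series:
    """Strip whitespace and a leading '@', lowercase, so the join keys match."""
    return handles.str.strip().str.lstrip("@").str.lower()

survey["handle"] = normalise(survey["twitter_handle"])
traces["handle"] = normalise(traces["handle"])

# Summarise the traces per respondent (here: simple tweet volume), then join
# so each row holds both self-reported and behavioural measures.
tweet_counts = traces.groupby("handle").size().rename("n_tweets").reset_index()
linked = survey.merge(tweet_counts, on="handle", how="left")
linked["n_tweets"] = linked["n_tweets"].fillna(0).astype(int)
```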
705 01:16:34,560 --> 01:16:40,740 So I think Taylor is a good person to talk about how social media apps could potentially be used 706 01:16:40,740 --> 01:16:46,860 as a way to collect information in a way that is ethically sound — 707 01:16:46,860 --> 01:16:50,670 so not repeating Cambridge Analytica. 708 01:16:50,670 --> 01:17:04,110 And that, I think, is another way in which digital trace data could be used in a kind of hybrid design that would also work well. 709 01:17:04,110 --> 01:17:11,830 While in the past the idea that you might design an app was quite forbidding — 710 01:17:11,830 --> 01:17:16,260 I know, I think Aiden is designing an app, right? 711 01:17:16,260 --> 01:17:23,070 I don't know what platform you're using, but in the context of R Shiny, making apps has become much easier now, 712 01:17:23,070 --> 01:17:32,220 and deploying them to web interfaces has become much easier. And there are some really good tutorials online on R Shiny if you're interested in that. 713 01:17:32,220 --> 01:17:39,900 So I think that could be a model where the barrier to entry has become much lower than it was in the past, 714 01:17:39,900 --> 01:17:45,270 and designing apps could be a way forward for some kinds of hybrid research designs. 715 01:17:45,270 --> 01:17:53,070 And on Saturday we'll also talk a bit about how we might use bots — basically automated accounts that 716 01:17:53,070 --> 01:17:57,630 might, say, tweet certain kinds of messages at regular intervals in the context of, 717 01:17:57,630 --> 01:18:01,710 say, an experimental design, if you were interested in looking at political opinion 718 01:18:01,710 --> 01:18:08,310 formation or things like political polarisation. And that is, again, an example of 719 01:18:08,310 --> 01:18:14,880 a sort of hybrid research design. With all of these questions, of course, right, 720 01:18:14,880 --> 01:18:22,140 or these kinds of designs, we always have to think about the ethics. Some of these designs might be quite new, 721 01:18:22,140 --> 01:18:31,950 and maybe there might not be a precedent in relation to them in your ethics board at the university 722 01:18:31,950 --> 01:18:39,180 that you're working at. So you might probably want to take a higher standard and think about: well, what are we doing? 723 01:18:39,180 --> 01:18:45,120 Is that ethical? And how would someone view this maybe a few years from now? 724 01:18:45,120 --> 01:18:53,430 So you might want to adopt a standard which is perhaps more critical, and think about, 725 01:18:53,430 --> 01:18:57,330 for example, if you were using a bot: 726 01:18:57,330 --> 01:19:01,470 under what circumstances is it OK to use a bot, 727 01:19:01,470 --> 01:19:06,750 and when is it not? Anyway, this is where I want to end this session. 728 01:19:06,750 --> 01:19:15,330 We're going to have a break until three forty-five and then we'll come back and do part two of this session. 729 01:19:15,330 --> 01:19:23,328 So, yeah, for now, it's a break and I want to terminate the livestream.