This is the 2pm session, and the livestream is now on, so hello if anyone's watching. This is the session today on digital trace data. You already know me, so I don't need to introduce myself. We're going to have two sessions this afternoon, and this is the first part, where I'm going to talk about digital trace data. The purpose of this session is really to think about digital trace data broadly, as a category of data, and to think about its strengths and weaknesses, but also to think about the different kinds of research designs that one might use when trying to adopt digital trace data for the purpose of social science research. In the second part of my session, after the coffee or tea break, I'll talk a bit about the tools and techniques we might use for working with digital trace data, and then conclude with some of the wider implications when we think about ethical and access-based limitations to using these kinds of data for social science research.

I want to start off, actually, at the point where Matt Salganik started yesterday, which is that we're living in the digital age, and the digital age is one where information storage has increased dramatically. There's a lot of data out there in the world and a lot of it is digital; there are billions of gigabytes of data. Accompanying that is a remarkable, exponential increase in computing power that has come along with this expansion in information storage. This has now been talked about a lot, and in some senses it is the premise for this community gathering here: we're living in this digital age, sometimes called the big data era. And I like to think that this big data era has certain defining features in relation to the way data exist in the world now. Of course, there's the explosion in the volume of data, which is what I showed in the previous slide: we just have a lot more data and a lot of it is digitally recorded. There's also the idea that the data that are produced have higher velocity, a certain kind of speediness about them that was perhaps not possible in the past with other kinds of data. Nick earlier today was talking a bit about censuses and how the DRC had its census in 1984. Now, 1984 was a long time ago, and lots of things have changed since then. But how do we know about those changes, and how can we measure them? That's temporality.
Old data sources were big, they had volume: if we think about a census, a census is a very high-volume undertaking, there's a largeness to it. But they were big and slow. Now, potentially, we are entering a paradigm where data sources could be both big and fast. So there's that important dimension of velocity that I think we have to be mindful of. Another aspect which I think is key to understanding this big data era is what I think of as a diversification of the production of data. Who are the producers of data? It's now not just administrative data, or governments or states, that are interested in and have a lot of data about people. Companies have a lot of data about people too. So there's, in some sense, a diversification of the different kinds of agencies that hold data about individuals, that have harvested data about individuals, and in some sense that's what I like to think of as a decentralisation of data production. Of course, you hold a lot of data about yourself too, as Nick was showing with those location histories: those JSON files, which were sometimes gigabytes. I have disabled my location history, so I'm not contributing to the big data cause. But when I did have my location history in the past, before I disabled it, when it was tracked for about two years, it was, I think, three gigabytes of information. So that's a lot of information that I as an individual had generated. There's a decentralisation of data, which means that different kinds of actors have data about you, but you also have a lot of data about yourself, which is also interesting to think about from the data production dimension.

And there's also the variety of data. In the past, when social scientists did research, most quantitative social analysis relied on some kind of rectangular data frame, which usually had numbers in it, discrete quantities. We now have a plethora of data sources: we have images available, we have audio available, we also have text, and Taylor will talk about text tomorrow. We have different kinds of data being generated and new technologies generating data. We have mobile phones; again, Nick talked about call detail records, which are a very novel kind of data source. There are satellite data sources. But there's also the Internet of Things, the fact that there are physical sensors that interact with our everyday world.
So we have devices like Amazon Alexa out in the world now that are capturing a lot of information about people's physical environments. Andrew Dilnot, who is the warden of Nuffield College and is also now the chair of the Geospatial Commission in this country, likes to use an example which I think is very pertinent here: a lot of cameras are now fitted on cars. Of all the cars driving around on the streets, a lot of them have dash cams, reverse cams and other kinds of cameras, and if we relied on all of the data accumulated from all of these cars, we would know where most of the potholes are in the city of Oxford, and Banbury Road has a lot of them, because I commute there every day and it's bad. We would know that if we had all of this data centralised somewhere, and it exists. This is not data that needs to be generated; it exists, but it just needs to be, in some sense, repurposed.

And this is where I like Matt Salganik's definition: a lot of the data that exist in this big data era are, in some sense, readymade. They're already there in the world. But at the same time, they are not custom-made for research. They are made, they exist, but they are not custom-made for research. This is the analogy he used yesterday, where we have, on one side, the urinal that was repurposed as art by Marcel Duchamp, and on the other the elegant, gorgeous David. No one would say David was a repurposed piece by any means; it was not a piece of marble lying around, it is very much marble that was made pristine by the work of an artist. So we live in a world where we have a lot of these readymades lying around, and while it is very clear that the urinal is not David, potentially we also have to find ways to think about the urinal and repurpose it as a work of art.

So let's start by thinking about this. As I said, I've been talking about the proliferation of different kinds of data sources and the fact that we now live in a world of data abundance, and today I said I'm going to talk about digital trace data. So what are these digital trace data? The definition that I like to use is that digital trace data are the data by-products of the digitisation of our lives and the adoption of digital technologies and platforms such as, for example, social media.
So if we think about them as these data by-products, we can think of them in different ways. In one sense, digital trace data are themselves the expression of digital interactions and purely digital phenomena. If you think about tweeting, tweeting is a digital thing to do. I don't know if there's a physical equivalent of, like, writing a 140-character note; no one writes notes to anyone with 140 characters anymore. So there isn't really a physical equivalent of tweeting; it's a purely digital phenomenon. However, we can also have digital by-products of physical activity, for example through the use of sensors like the cameras I was telling you about, or the accelerometers in the phones that we carry around with us. That's how those location histories that you were using were generated: they're essentially digital by-products created by actual physical activity. So that's also a type of digital trace. And one feature that is often pointed to as unique about these kinds of data is that, unlike self-reported measures, they're essentially much more behavioural measures. They're actually just capturing activity as it's occurring, which is different from someone telling you that this morning they went from this place to this place; your location history will reveal that indirectly. This behavioural aspect of the data can also make them, in a sense, sometimes messier as well, and we'll come back to that when we think about the weaknesses of these kinds of data sources.

In the next few slides, what I want to do is give you examples of published papers out there that have tried to use digital trace data. Social media sites are perhaps the most widely known source of digital trace data, and Twitter in particular has been widely used in social science papers to try and analyse discourses or discussions around events. It's also been used to measure things like public opinion, and it's been used to forecast elections, for instance; there's a large literature on forecasting elections using Twitter. As a social media data source, Twitter is probably the most widely used because it's the most widely accessible, although that is actually becoming, I would say, a little bit more complicated now; there are more authentication processes involved in accessing Twitter data.
But one of the reasons why I think Twitter has been so big in the world of social media and computational social science research is partly because it is one of the data sources that researchers can actually access through an API. So this paper here, and I know one of the authors, Pablo Barberá, it's by Jost and colleagues, and Pablo Barberá will be speaking next week, on Monday, here at the Summer Institute. What they do in this paper is essentially look at Twitter accounts that were created daily in Ukraine before and after some protests. They look at these peaks around certain key events over the period they're studying, and they interpret them as a form of political mobilisation around these time points occurring in the real world. Of course, there can be an active debate about what this really means, whether it is a good measure of political mobilisation and so on. But the fact that they're capturing these peaks occurring at very fine time scales, here on a weekly basis, is quite interesting.

Here's another example, from a recently published paper in Psychological Science by David Garcia and colleagues, looking at the time around the Paris terrorist attacks. They look at emotional valence, essentially discourse in terms of positive and negative sentiment around the time of the attacks. They see this shooting up of negative sentiment, and they argue that we can actually see public, collective emotions through these data sources and, in some sense, observe collective emotion and social resilience occurring in real time.

In some of my own work, I've been using the data source that Francesco Rampazzo talked about yesterday, which is the Facebook Marketing API, to nowcast global digital gender gaps. This is, I would say, a very important social development indicator. It's important for us to know where women have access to the internet and mobile phones, and it's actually something we know remarkably little about, because of the lack of widespread survey data measuring ICT use by the gender of the user, especially in low- and middle-income countries. In this context, looking at the number of Facebook users by gender, as well as other characteristics such as device type or age, can be a useful proxy for capturing digital gender inequalities on a very regular basis.
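To make this concrete, here is a minimal sketch, not the actual Digital Gender Gaps pipeline, of how one might query aggregate audience counts by gender from the Facebook Marketing API and turn them into a female-to-male ratio. The endpoint, parameter and field names below are from memory and may differ across API versions, and the access token and ad account ID are placeholders, so treat this as an illustration of the idea rather than working production code.

```python
# Sketch: query Facebook's Marketing API for estimated audience sizes by gender
# and compute a female-to-male ratio for one country. Endpoint and field names
# are assumptions that should be checked against the current Marketing API docs.
import json
import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"      # placeholder
AD_ACCOUNT_ID = "act_0000000000"        # placeholder ad account
API_URL = f"https://graph.facebook.com/v14.0/{AD_ACCOUNT_ID}/delivery_estimate"

def audience_estimate(country, gender):
    """Estimated monthly audience for one country/gender.
    In the targeting spec, gender 1 = male, 2 = female."""
    targeting = {
        "geo_locations": {"countries": [country]},
        "genders": [gender],
        "age_min": 18,
        "age_max": 65,
    }
    resp = requests.get(API_URL, params={
        "optimization_goal": "REACH",
        "targeting_spec": json.dumps(targeting),
        "access_token": ACCESS_TOKEN,
    })
    resp.raise_for_status()
    # Field name is an assumption; some API versions return bounds instead.
    return resp.json()["data"][0]["estimate_mau"]

def facebook_gender_gap(country):
    """Female-to-male ratio of estimated Facebook audiences for a country."""
    return audience_estimate(country, 2) / audience_estimate(country, 1)

# Values well below 1 suggest women are under-represented on the platform.
print(facebook_gender_gap("IN"))
```

Because the API only reports current audiences, a sketch like this has to be run repeatedly, for example daily, to build up a time series, which connects to the point I make later about this source only letting you look forward, not back.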
So on our website, Digital Gender Gaps, we essentially reissue this map every day, because we can query the data source very frequently. We can't run a census every day, and we certainly can't run a survey every day, but we can ask the API every day how many men and women are monthly active users or daily active users on Facebook.

So those are examples from social media sites. Search queries have also been used a lot, and I would say these were among the earliest examples of digital trace data being heralded with a lot of excitement and seen as the new telescope with which to view and understand human social behaviour. There was a famous quote in a paper by Duncan Watts, where he said that now we can directly observe behaviour. One of the earliest examples of this is a paper by Ginsberg and colleagues published in Nature in 2009, in which they essentially tried to match the CDC's (the Centers for Disease Control's) estimates of influenza with Google search queries. The black line is the Google search queries and the red is the CDC estimates, and you can see the CDC estimates have a bit of a lag. The Google searches essentially predict what's happening, and then the CDC follows and seems to be generally well calibrated to what the search queries predicted. So this was back in 2009, when they first published this, and there was a lot of excitement about the potential of sources such as these web search queries for being able to track and nowcast (to predict the present, essentially) key social development, health and other indicators.

Another aspect of Google search queries that has been touted is that they might capture behaviour that people might not want to report, so they might have the ability to capture phenomena that are prone to social desirability bias, things people might not want to report in the context of a survey. In some of my work with Nicola and others, we've been trying to see if we could use Google search queries, for example, to study sex-selective abortion in the context of India, which is a behaviour that we as demographers know exists. But of course, people don't want to talk about it or report that they are doing it themselves. They might be OK with telling an interviewer or a survey that other people are doing it, but they don't want to report their own behaviour in relation to these kinds of things.
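As a small illustration of the search-query nowcasting idea behind Google Flu Trends and similar studies, here is a minimal sketch that lines up a weekly search-interest series (say, exported from Google Trends) with an official surveillance series and checks how well the searches track it. The file and column names are hypothetical placeholders, and the simple linear fit stands in for the much more careful modelling in the published work.

```python
# Sketch: compare a weekly search-interest series with an official series
# (Google Flu Trends style). File and column names are hypothetical:
# flu_searches.csv has columns [week, searches]; cdc_ili.csv has [week, ili_rate].
import numpy as np
import pandas as pd

trends = pd.read_csv("flu_searches.csv", parse_dates=["week"])   # search interest
cdc = pd.read_csv("cdc_ili.csv", parse_dates=["week"])           # official ILI rates

df = trends.merge(cdc, on="week").sort_values("week")

# How strongly do this week's searches correlate with the official rate,
# both for the same week and for weeks the official series hasn't reported yet?
for lead in range(0, 4):
    r = df["searches"].corr(df["ili_rate"].shift(-lead))
    print(f"searches vs. official rate {lead} week(s) later: r = {r:.2f}")

# A crude "predict the present": fit on an early period, nowcast later weeks.
train = df[df["week"] < "2012-01-01"]
test = df[df["week"] >= "2012-01-01"].copy()
slope, intercept = np.polyfit(train["searches"], train["ili_rate"], 1)
test["nowcast"] = intercept + slope * test["searches"]
print(test[["week", "ili_rate", "nowcast"]].head())
```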
An early example of using Google search to study abortion is Reis and Brownstein in 2010, where on the x-axis you see the proportion of pregnancies ending in abortion, and you look at the relative internet search volume for abortion in those countries. What they show in this paper is that the places where people tend to search for abortion much more are also the places where there are very significant restrictions on abortion. So in some sense there is a demand here which is not being met, and people are resorting to the internet and other online resources to try and seek information about these things. By the way, I should say: please feel free to ask questions at any time, whether you want clarification on something or even have a more substantive question.

Another source of data that I've seen being used in the literature is blogs and internet forums. This is a paper that looks at how people provide social support online via Reddit, and they look at different types of support: whether it is emotional support, whether it is informational support, and in the paper they describe the other categories of support that people provide on Reddit forums where people talk about mental health. They're also interested in how much of the support is given anonymously and how much is given non-anonymously, and they find that a lot of emotional support is actually provided anonymously, whereas more practical, informational support is provided non-anonymously.

Then, Nick talked about this already extensively, so I perhaps don't have to dwell on it so much, but this is again a paper that Matt Salganik talked about yesterday, which uses call detail records. This is one of the earliest examples, a very well-known paper, of trying to do computational social science by leveraging a very novel source of data, at least at the time: call detail records. On the left, you have the predicted wealth index for districts in Rwanda, computed in 2009 using the call detail record (CDR) data, and on the right-hand side you have the same thing estimated using a very high quality but expensive survey, the Demographic and Health Survey in Rwanda. You can see that there is a correspondence in the colours across the two.
The CDR-based map was predicted by matching the CDRs to a survey in which they could actually verify and validate the information on respondents. So, using a much smaller dataset relative to the DHS, as Matt described yesterday, they were able to generate pretty similar measures. That's an example of, in some sense, reducing cost, improving the frequency of measurement and potentially also enabling more effective scalability.

The last example I want to use is sensor data. This is a paper published by Newing and colleagues, who I believe were actually working on this as part of the Office for National Statistics' big data and official statistics work. What they look at here is electricity smart meters, so this is again an example of the Internet of Things phenomenon. A lot of homes in the UK now have electricity smart meters, and you see that there are these very distinct profiles of household energy consumption, and they vary by household composition. Families that have two adults and three or more children tend to have a lot of activity around five, six, seven, and then early in the morning as well, probably around the time of school. They make the argument that we already have these smart meter data, and since we see such clear patterns of consumption based on household characteristics, we could maybe start inferring something about the structure of households in the UK just by looking at electricity consumption itself, rather than waiting for censuses and so on to come around.

And here is another example, which I like very much and which Matt talks about in his book Bit by Bit, using essentially the electronic meters in cabs in New York. This is Farber: he had information, I believe, on all rides that were taken. The New York City Taxi and Limousine Commission, the agency charged with regulating the industry, requires all taxis to be equipped with electronic devices that record all trip information, including times and locations, and the two companies that currently supply these devices report all this information to the TLC on a regular basis.
Farber obtained full information for all trips taken in New York City taxicabs for the five years from 2009 to 2013, and using these data, what he then does is actually test two competing theories in labour economics, neoclassical versus more behavioural economic approaches. He tries to see: on days when taxi drivers are getting higher fares, do they work more or do they work less? Neoclassical economics would tell you they should work more, because they're going to earn more, whereas behavioural economics would tell you, well, maybe they just have a kind of target, they want to earn two hundred dollars a day, and if they can do it in three hours, they'll stop. So what did he find? He actually found that the neoclassical approach better fit the data he had. But that's exactly taking a very detailed measurement, having a dataset of very detailed measurements, and testing a theory with it in a way that was perhaps not possible with previous kinds of data sources, because it allowed for greater disaggregation and exploration of heterogeneity.

So, as I said, I've shown you diverse examples here, and I will come back to what I think these different kinds of examples are trying to do. But there are definitely some promises that we can already see in these examples of research with digital trace data in action. In talking about this I'm going to rely a lot on Bit by Bit, so if you haven't read that chapter already, I suggest you read it afterwards. The first big promise here is the bigness, the fact that there is volume. Is bigger in and of itself better? In this context, I think it's not entirely clear whether having so much bigger data is always necessary. But when we have questions surrounding, say, a rare event, or we really want to explore certain forms of heterogeneity, where we might be interested, for example, in disaggregating to a certain geographical extent, to get some disaggregation by geography or time, there's an argument to be made that the volume is an asset and a strength. Also, from the perspective of velocity, the fact that there's higher-frequency or more regular measurement can be seen as a key strength of several sources of digital trace data, and so can the fact that they are always on.
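One simple way to see why the always-on property matters for research design is the pre/post comparison around an event, which I come back to with the Twitter example in a moment. The toy sketch below uses synthetic daily counts, so the numbers are made up; a real analysis would also need to worry about whether the composition of users changed around the event, a point I return to under drift.

```python
# Toy sketch: compare daily counts from an always-on source before and after
# a known event date. The data are synthetic; real analyses must also consider
# whether the user population itself changed around the event (drift).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = pd.date_range("2014-01-01", periods=60, freq="D")
event_day = pd.Timestamp("2014-01-31")

# Synthetic daily counts (e.g. new accounts created) with a jump after the event
counts = rng.poisson(200, size=len(days)) + np.where(days >= event_day, 80, 0)
series = pd.Series(counts, index=days)

window = 14  # days on either side of the event
pre = series[event_day - pd.Timedelta(days=window): event_day - pd.Timedelta(days=1)]
post = series[event_day: event_day + pd.Timedelta(days=window - 1)]

print(f"pre-event mean:  {pre.mean():.1f} per day")
print(f"post-event mean: {post.mean():.1f} per day")
print(f"estimated jump:  {post.mean() - pre.mean():.1f} per day")
```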
So if we go back to the Twitter example that I was talking about: the fact that Twitter is always on means you can see these peaks occurring around the time of certain kinds of events. Many times in the real world there are natural experiments that occur, but we don't have pre and post, we don't have that perfect dataset where we can actually go and observe pre and post, exploit that nice design and look at what happened as a result. But in this setting, potentially, we could use that kind of difference-in-differences design, or we could look at a change occurring. Of course, there might be problems if, for example, the composition of users changed across these time points, and then we might want to think about that; that's the broader problem of population drift, or just drift, in these kinds of data sources, and I'll come back to that later. But in general, most of the time our surveys aren't perfectly calibrated in time to capture some of these events occurring in the real world, so this always-on characteristic of some social media data sources could be valuable. And Twitter, I think, is a good example here because, unlike some other social media APIs, it allows you at least to go back in the past. The Facebook Marketing API, for example, which I'll talk about a little later, doesn't allow you to go back in the past; you can only look forward, which could be helpful if you identify something that you want to track over time. But if you wanted to go and see what happened, for example, when Sudan, as we were discussing yesterday, shut off the internet, you might not be able to go back and see how that changed over time with that source, whereas you potentially could with Twitter data.

Another aspect is that for a lot of digital trace or social media data sources, there are sometimes topics or geographies that we might capture in these data sources that we might not capture in others. And another aspect which can be seen as a promise is that they are, again, non-reactive, so they might capture behaviour that might otherwise be difficult to measure. This is something I already talked about in the context of social desirability bias, Google search and abortion. So if we think here, coming back again to the gender gaps in internet use, they're being computed here using the Facebook ad audience estimates.
What we see here is that there are some parts of the world where this ratio, which is a female-to-male ratio, so higher values correspond to greater gender equality in internet use and a lower value corresponds to women being less represented online, shows significant gender inequality online: we see that in parts of southern Asia and also in sub-Saharan Africa. Now, this is interesting for a number of reasons. The first reason is that if being online, if the internet, is a valuable way to access information and other kinds of resources relevant to people's health, education, say access to contraception, then if women aren't online, if they're systematically less online, it's important for us to know that, and we might not know that if we were relying only on survey data to measure it. So in itself, I think there's intrinsic value in mapping and capturing this development indicator. But another important aspect of trying to understand who is online and who is not, and who is captured in, in this case, the social media data source, is that if we want to use this to make wider claims, and we want to think about issues such as who is represented and who is not represented, then the fact that women are significantly underrepresented on Facebook in some parts of the world is important for us to know, if we want to use these kinds of data sources for the kinds of research questions we may have. Now, this might not be a problem if our research question lends itself to a more difference-in-differences kind of design, if our research question is, say, how does censorship affect women's internet access: if we felt that this sample was fairly stable over time in terms of its composition, maybe we could study that using this data source. But if we wanted to say what, for example, the population-level estimate is for a particular indicator of interest, we might want to take a step back and think about who we are capturing and who we are not capturing.

The same Facebook Marketing API has been used by a number of researchers; it's probably the most widely used of these sources. There's a question, yes? The question: so, if I understand correctly from Francesco's presentation yesterday, you could potentially estimate this at an even lower level as well, even at the zip-code level? Yes. Yeah, I mean, this makes sense, right?
Because in certain countries, you might have zero people in some of those cells. Yeah, absolutely, exactly. I mean, the reason why I'm showing this map is because, first, in order to understand the value of these data, the first step we undertook was to collect and create a measure of just the female-to-male ratio of Facebook users. Then we tried to test and validate it against survey-based measures of internet use from more trustworthy probability sample surveys, and we found that the correlation was about 0.8, so it was pretty good. The purpose here was, similar to the Blumenstock paper, to say that we have a smaller dataset where we have both Facebook and a survey-based measure, so let's train a simple model to make a prediction and then expand our geographical coverage to the places we don't actually have this indicator for. In other words, we've now tried to go down subnationally. But then a limitation we have is that we can measure Facebook use at the subnational level, but of course we don't have good survey data to validate exactly what it is we're capturing, so then we might use it in a different kind of way. This is being done in the context of a project where we got a grant from the UN Foundation to try and assess the value of big data sources for measuring development indicators. So in that context, this was first done where we had both data sources, to test the validity of the online data source.

There's a question: I have a question, or a speculation, about this one. I also like what you were saying about the fact that a correlation of 0.8 is pretty good, but it's not perfect. And for some applications of digital trace data, like elections, the difference between having a seven percent absolute error and a four or three percent absolute error is massive for the people who want to use it. So what would you say to a critique that says, well, this is interesting, very innovative, but there's no way this is useful at the moment? Yeah. So I think Roberto's point about how much error we can tolerate is a valid one, and I think it's very context and domain specific. Again, this is something that Nick was talking about earlier today: when you're actually interested in thinking about, OK,
does this mean we have to use a different quantity, or a different data source, altogether? If the decisions hinge on something like the second decimal point, then yes, perhaps we might want to be cautious, and we might then want to think about how these estimates could be validated, or we might just want to run a survey instead. But in a context like this one, where we're interested in simply mapping inequalities and actually showing that there are parts of the world where this inequality wasn't something people were really aware of before, that is an aspect we can still draw attention to. We're not saying that we shouldn't do any ICT surveys. We're just saying, look, if you did an ICT survey, you would realise that there might be some issues in other parts of the world too, not just in the handful of countries for which we have surveys available. So it could be useful as a first step towards drawing attention to a matter, and then saying that while we can't be precise, we can at least orient the discussion around it. So yeah, your point is well taken: the extent of error we can tolerate is domain specific. But I think there may still be value, depending on what your goal is.

This is another example from what I think is the burgeoning field of digital demography. Emilio Zagheni, who is a director of the Max Planck Institute for Demographic Research in Rostock in Germany, and his colleagues have also been trying to use the same Facebook Marketing API, which essentially provides information on aggregate counts of Facebook users broken down by different kinds of characteristics. They've been using the category of expats, and they've been trying to see how well this category, as inferred by Facebook, can help us track stocks of migrants in different parts of the world. In this paper, they look at the US in particular, across different states, and compare their estimates to the American Community Survey's estimates. What you see here is the fraction of immigrants from a particular country, and the Facebook measure seems to be fairly strongly correlated with the fraction of immigrants according to the World Bank. And they make an argument similar to the one I was making in the previous slide in relation to gender gaps.
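The workflow behind both the gender-gap and the migration examples is essentially the same: where a digital proxy and a survey benchmark both exist, check how well they agree and fit a simple model, then use that model to predict the indicator for places without a survey. Here is a minimal sketch with synthetic data; the published papers of course use richer models and additional covariates, so this is only meant to convey the logic.

```python
# Sketch of the validate-then-predict workflow: correlate a digital proxy with
# a survey benchmark where both exist, fit a simple model, and predict the
# indicator for units that lack a survey. All data here are synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "country": [f"C{i:02d}" for i in range(n)],
    "fb_ratio": rng.uniform(0.3, 1.1, n),   # digital proxy (e.g. Facebook F/M ratio)
})
# Survey benchmark observed for only half the countries
df["survey_ratio"] = 0.1 + 0.85 * df["fb_ratio"] + rng.normal(0, 0.05, n)
df.loc[df.index >= n // 2, "survey_ratio"] = np.nan

observed = df.dropna(subset=["survey_ratio"])
r = observed["fb_ratio"].corr(observed["survey_ratio"])
print(f"correlation on the validation set: {r:.2f}")

# Fit on countries with both measures, predict where the survey is missing
model = LinearRegression().fit(observed[["fb_ratio"]], observed["survey_ratio"])
missing = df["survey_ratio"].isna()
df.loc[missing, "predicted_ratio"] = model.predict(df.loc[missing, ["fb_ratio"]])
print(df.loc[missing, ["country", "fb_ratio", "predicted_ratio"]].head())
```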
Migration data, in demography: of the three demographic processes, births, deaths and migration or mobility, migration is the one we are worst able to measure, and even in high-income countries, migration or mobility isn't tracked as well as vital registration tracks births and deaths. So they were making the argument that, at least in the in-between years when other data sources might not be available, this could be a data innovation that might help fill this data gap. The timeliness and the coverage of these sources could be valuable here.

I also talked a bit about what I called non-reactivity. This is a paper by Stephens-Davidowitz in the Journal of Public Economics, and what he's looking at here, from Google Trends, is searches for the [INAUDIBLE] in the US. He takes this as a measure of racial bias, and he makes the argument in this paper that, essentially, if we just looked at surveys, we would underestimate how much racism cost Obama in the election. I suggest you read the paper, I don't know it that well, but the argument is effectively that if we look at this racial search term, it is actually very highly predictive of Obama's lost vote share. So he makes the argument that this is exactly the kind of measure we would not capture in a survey; he argues, of course, that we are not in a post-racial world, and that this is exactly the kind of non-reactive measure, perhaps free of social desirability bias, that might enable us to capture a phenomenon we might not capture in survey research.

So those are the promises. What, now, are the pitfalls of digital trace data? I think one key aspect of working with and using digital trace data is that they're quite dirty, and this is true whether you're using social media data or dealing with sensor-based data from different kinds of instruments out there in the world, such as accelerometers. About two years ago, I was a subject in a study, contributing to science by being a subject, at the John Radcliffe Hospital in Oxford, and I had a number of measurements taken of me; too many measurements, I think, they created too many variables.
At some point I wondered what they would do with so many variables. But anyway, they collected a lot of information from me: they gave me a fitness tracker, and they gave me a blood pressure monitor that would just buzz every hour automatically, the cuff would swell up and I would be measured. The blood pressure monitor was only for three or four days in the year, but I had a fitness tracker for a full year that they were collecting information from. On my final visit, I asked them: how much of the data are you using, have you found anything interesting? And they said, well, actually, half the people didn't switch on their fitness trackers. I said, OK, that's not helpful, but I've been using mine, so I'm sure some people have been using theirs. And they said, yes, but basically, so far we've deduced that about 60 to 65 percent of our data are just completely unusable. Now, this is a genuine concern that I think a lot of studies trying to adopt novel forms of measurement are going to face: a lot of the data they collect might not be usable, because devices weren't switched on or worn properly, or there was just too much noise to parse the data properly. It's an aspect of data collection you have to be mindful of when trying to use digital trace data sources.

Another aspect, and I think this is something we as a computational social science community collectively need to think about and have a broader public discussion about, is the inaccessibility of a lot of interesting and meaningful digital trace data sources. Nick, earlier in the morning, talked a lot about call detail records, and I think Chris asked a question about how accessible these data really are, because, perhaps rightfully so, people's call information is not shared widely through some kind of open-access public infrastructure. A lot of existing data sources that are held by private companies, but also by governments, are not really accessible to researchers. And part of that, especially when we think about it from the perspective of companies, is that they just have very different incentives and goals from those of researchers.
Right, their goals are often business and profit: they're interested in generating products that appeal to customers, in making sure their customers don't get angry with them, and in protecting themselves against backlash, especially in the aftermath of events such as the Cambridge Analytica story early last year. Often, companies will restrict access and make it harder to access data sources. There are some well-known cases. I think Matt yesterday already talked about the emotional contagion experiment. There's also the example of AOL providing access to people's search histories. They had claimed it was perfectly anonymised, but it turned out that some people could be identified. This resulted in very senior officials losing their jobs, and it's an episode that not only generated a lot of public backlash but also effectively made the business more inward-looking in relation to data accessibility and sharing in the aftermath.

There's also another example, which Matt talks about in his book, which is the Netflix challenge. It was a challenge in which a large dataset of what films people watched was released. Netflix essentially wanted to improve its recommender system so that it could better predict what films people might want to watch. It used the common task framework, this machine learning challenge set-up where people could make predictions, and later on a lawsuit was filed against Netflix because it emerged that specific individuals could be identified through their film-watching habits. Now, you would think, why is this sensitive? It's just the films people watch; do you care if someone knows what films you're watching? Well, for some people this might be sensitive information, and it's hard for us to predict what might be sensitive for some people and not so sensitive for others. The lawsuit was brought forward by a lesbian woman who said that the kinds of things people watch reveal real, intimate things about their identities, things they might not want other people to know about, especially if they belong to families that might not accept them for this.
So from one perspective you think, oh, this is films, why should we care? But for some people, that might be highly sensitive information. And I guess your ethics activity perhaps gave you a sense that people might have contrasting perspectives on what is sensitive and what is not. Yeah, there's a question about using these data in the long term: it might be that these companies restrict access more and more each year, especially when stories like these come out. Yeah, I think that's a genuine concern. This is something that Deen Freelon, he's a communication scholar, has written about: he says that we're now living in a post-API age, where in the early 2000s we saw this big proliferation of web resources that researchers were given access to, but now we actually see companies clamping down and restricting what's available. And one of the challenges with this, as I was saying, is that we have, I think, a very limited understanding so far of what people's notions and conceptions of privacy are. On one side, we increasingly live our lives online, we have accounts for different things and we share a lot of information online. But it turns out that people still have certain norms and notions of privacy associated even with that kind of life online, with the sharing that occurs online, and trying to parse out what is sensitive, and what people consider to be things they would want to consent to before they're used, is challenging. That being said, on the other end there's the wider discussion that a lot of these data could also be put to good public use, in terms of understanding phenomena we might not necessarily know very much about, or mapping, as Nick was showing, malaria and movement and mobility. So trying to find the balance between good public use and, at the same time, recognising issues such as privacy and the importance of what people find to be sensitive information is something I don't think we have a good answer to at the moment, but we need to have a bigger public discussion about it.
448 00:48:29,270 --> 00:48:37,160 some of them would not be possible to do today, right? 449 00:48:37,160 --> 00:48:41,780 I think all of the studies — I intentionally chose them because I think it would still be possible to do them. 450 00:48:41,780 --> 00:48:42,560 Yeah, yeah. 451 00:48:42,560 --> 00:48:50,540 I intentionally chose these examples because they're using data sources that are still there. It might be that, for example with Twitter, 452 00:48:50,540 --> 00:48:55,040 there's a more detailed credentialing process now, 453 00:48:55,040 --> 00:49:03,470 where you have to write about why you want to access these data, but the Twitter API is still accessible. 454 00:49:03,470 --> 00:49:08,570 So with all of these data sources, it would be possible to do this research, at least to date, right? 455 00:49:08,570 --> 00:49:14,600 A lot of these are also aggregated, so search volumes are aggregated. 456 00:49:14,600 --> 00:49:18,200 So anyway, that was intentionally part of my reasoning. 457 00:49:18,200 --> 00:49:22,430 But I think the inaccessibility problem is still there. While 458 00:49:22,430 --> 00:49:26,060 call detail records are inaccessible to a public researcher, 459 00:49:26,060 --> 00:49:30,890 if you have a relationship, 460 00:49:30,890 --> 00:49:40,640 then you can sign an agreement with a mobile phone provider and potentially, within the context of a project, leverage some of those data. 461 00:49:40,640 --> 00:49:47,210 So I already talked a bit about the sensitivity aspect. There's also the aspect of incompleteness. 462 00:49:47,210 --> 00:49:51,590 I think with most digital data sources, you're never going to have everything, 463 00:49:51,590 --> 00:49:59,360 because again, they're not custom-made for research. So we won't know, for instance, the demographic characteristics of our users. 464 00:49:59,360 --> 00:50:10,340 And I've obviously seen papers that will still infer those, using some existing classifiers or algorithms. 465 00:50:10,340 --> 00:50:18,110 So they'll try to infer people's gender or their race or their ethnicity by looking at images and so on and so forth. 466 00:50:18,110 --> 00:50:24,620 And I think there are also important ethical questions about whether we should be doing that. 467 00:50:24,620 --> 00:50:34,970 But that's an aspect of the fact that they're incomplete, and we could try to overcome some issues of incompleteness by doing things 468 00:50:34,970 --> 00:50:42,980 that might themselves be sensitive. So, you know, there are tensions there that I think are important to bear in mind. 469 00:50:42,980 --> 00:50:50,990 I already talked a bit about this, but there is the issue of data sources being non-representative. Now, in and of itself, 470 00:50:50,990 --> 00:50:59,300 as I said, whether something being non-representative is a problem depends on your research question, 471 00:50:59,300 --> 00:51:02,900 right, and what exactly it is that you're hoping to 472 00:51:02,900 --> 00:51:12,080 answer. And there are maybe ways around it, as Roberto will talk about on Friday and as Matt will talk about on the livestream on Thursday. 473 00:51:12,080 --> 00:51:24,380 But at the same time, I don't think that just because something is non-representative, we shouldn't use it.
474 00:51:24,380 --> 00:51:35,180 So this is an example here, again coming back to using Facebook to monitor stocks of migrants, the example by Emilio Zagheni and colleagues. 475 00:51:35,180 --> 00:51:42,860 One interesting aspect of this is that they use this 476 00:51:42,860 --> 00:51:52,700 behaviour or category that Facebook classifies people into, as expats — whether someone is an expat or not. 477 00:51:52,700 --> 00:52:00,320 So you can, for example, go and see what Facebook classifies you as, in relation to what kind of ads you should be 478 00:52:00,320 --> 00:52:05,930 seeing, based on what Facebook has inferred about you or what you've reported to Facebook. 479 00:52:05,930 --> 00:52:14,540 And if you're interested in that, I can share a link for it later on. But what's interesting here is that they're using this category of expat, 480 00:52:14,540 --> 00:52:20,270 but we don't really know who exactly is an expat or how that's been inferred. 481 00:52:20,270 --> 00:52:25,640 And behind this, in some sense, is a sort of black-box algorithm, because this is not a survey. 482 00:52:25,640 --> 00:52:31,370 We don't really have extensive metadata about it in relation to knowing who 483 00:52:31,370 --> 00:52:38,750 is included, how this category is defined, and how it might potentially change over time. 484 00:52:38,750 --> 00:52:43,460 So if we think about 485 00:52:43,460 --> 00:52:54,470 how it might change over time, that's where I think thinking about drift — population drift and usage drift — is really important. 486 00:52:54,470 --> 00:52:59,300 So the composition of users on these platforms might change over time, 487 00:52:59,300 --> 00:53:06,050 and that might have implications for what it is that we are measuring. 488 00:53:06,050 --> 00:53:12,830 But the system itself might also change over time, because these are businesses that have different rationales and motivations. 489 00:53:12,830 --> 00:53:14,570 So, for example, 490 00:53:14,570 --> 00:53:22,250 they might implement an improvement to an algorithm which might not be great from a social research perspective because it messes up the measurement, 491 00:53:22,250 --> 00:53:26,390 but it might make them millions of dollars. So then, who's complaining? 492 00:53:26,390 --> 00:53:36,200 So there are trade-offs here — distinctions between what the goals of businesses often are and what the goals of 493 00:53:36,200 --> 00:53:44,150 researchers are. And I think this was very nicely exemplified by the Google Flu example. 494 00:53:44,150 --> 00:53:51,800 So I first showed you the Ginsberg and colleagues paper from 2009, where we had Google Flu estimates effectively being matched 495 00:53:51,800 --> 00:53:59,060 two weeks or a month later by the CDC estimates, and Google Flu was doing really well with the prediction then. 496 00:53:59,060 --> 00:54:08,630 And this became such an exciting project that Google even had an in-house team running Google Flu Trends. 497 00:54:08,630 --> 00:54:14,720 So they were trying to match and predict CDC estimates of influenza before the CDC — 498 00:54:14,720 --> 00:54:19,490 they were trying to predict levels of the flu before the CDC estimates came out. 499 00:54:19,490 --> 00:54:25,250 But then, sometime around 2012, we started seeing that
500 00:54:25,250 --> 00:54:32,750 actually, Google was significantly over-predicting flu, almost double the CDC estimates, 501 00:54:32,750 --> 00:54:42,230 and it kept estimating flu to be much higher than it actually was over the rest of the year. 502 00:54:42,230 --> 00:54:45,320 So why is it that this happened, right? 503 00:54:45,320 --> 00:54:51,680 David Lazer and colleagues have a very interesting paper where they try to diagnose some of these problems. 504 00:54:51,680 --> 00:54:57,620 It's actually probably one of the better-known papers in this literature, The Parable of Google Flu, 505 00:54:57,620 --> 00:55:03,500 published in Science in 2014. And one of the things that happened — 506 00:55:03,500 --> 00:55:07,820 so think back to the idea of drift. 507 00:55:07,820 --> 00:55:16,280 If we think about platform drift: in 2011, Google made changes to its search algorithm so that 508 00:55:16,280 --> 00:55:20,990 people were essentially told that if you're searching for a cough or a cold, 509 00:55:20,990 --> 00:55:24,820 you might be interested in the flu or you might be interested in something else. 510 00:55:24,820 --> 00:55:34,340 So essentially we had an algorithm that encouraged certain kinds of search behaviour and pushed people towards different kinds of terms, 511 00:55:34,340 --> 00:55:38,090 irrespective of whether they actually had the flu or not. 512 00:55:38,090 --> 00:55:42,620 So that's an example of the algorithm, or the system, changing — 513 00:55:42,620 --> 00:55:53,180 creating a drift that might make measurement a little bit trickier. 514 00:55:53,180 --> 00:56:05,510 Right, so what I've talked about so far is the notion that digital trace data sources exist, 515 00:56:05,510 --> 00:56:09,680 and they have promises and they have pitfalls. 516 00:56:09,680 --> 00:56:17,480 And I've talked a bit about some of these shortcomings in the context of some of the papers that have been published in this literature. 517 00:56:17,480 --> 00:56:22,340 But if we think about the kinds of approaches that people are adopting 518 00:56:22,340 --> 00:56:29,630 when they're using some form of digital trace data to answer social research questions, 519 00:56:29,630 --> 00:56:37,280 I see them, in some sense, as falling into three categories of research projects. 520 00:56:37,280 --> 00:56:48,620 So the first of these is what I see as measurement papers — papers that use different forms of digital trace data for operationalising constructs. 521 00:56:48,620 --> 00:56:54,020 Often they're trying to operationalise constructs at the macro level, because they believe that this 522 00:56:54,020 --> 00:57:00,500 is able to provide valuable information that might not be captured in other data sources. 523 00:57:00,500 --> 00:57:09,230 So if you think back to the Garcia paper about collective sentiment in the aftermath of a terrorist attack, they are implicitly relying on the notion 524 00:57:09,230 --> 00:57:19,310 that they're able to capture this kind of macro-level collective sentiment through Twitter in a way that might not be captured otherwise. 525 00:57:19,310 --> 00:57:29,660 There are a number of papers that have tried to see whether we can measure mood or sentiment for populations through Twitter.
526 00:57:29,660 --> 00:57:34,940 And I think implicitly the rationale there is that we are capturing or operationalising some kind of 527 00:57:34,940 --> 00:57:42,800 macro-level construct that we might not be able to in another setting. 528 00:57:42,800 --> 00:57:51,140 Another way of thinking about the use of these data sources for the purposes of measurement, in relation 529 00:57:51,140 --> 00:58:00,860 to what's already out there in the literature, is that these data sources have been used a lot for nowcasting and for filling data gaps. 530 00:58:00,860 --> 00:58:08,570 Now, the Google Flu example is an example of that, where they are trying to nowcast levels of the flu — essentially to beat official 531 00:58:08,570 --> 00:58:12,860 statistics, because official statistics come with a lag. 532 00:58:12,860 --> 00:58:22,490 So we're trying to essentially "predict the present", as Hal Varian called it, because with official or more conventional data sources there will be 533 00:58:22,490 --> 00:58:27,740 an inevitable lag — we won't know about what's happened until well after it's happened. 534 00:58:27,740 --> 00:58:35,870 And in that respect, I think nowcasting has been done in relation to health surveillance in the public health literature. 535 00:58:35,870 --> 00:58:39,350 It's also been done — or that's kind of the rationale — 536 00:58:39,350 --> 00:58:46,880 behind our work on digital gender gaps: we are trying to say what's happening now in relation to digital gender inequalities. 537 00:58:46,880 --> 00:58:55,970 I also see that as a rationale behind work in digital demography that's trying to examine levels of migration or stocks of migrants. 538 00:58:55,970 --> 00:58:57,860 And one thing that, if anything, 539 00:58:57,860 --> 00:59:05,420 I've learnt from this kind of measurement exercise is that to motivate these data sources for measurement, 540 00:59:05,420 --> 00:59:14,000 we always have to justify what we're gaining by using them. Similar to what Roberto was asking before, we have to think about: 541 00:59:14,000 --> 00:59:18,530 what do we gain from using this that we might not gain if we were not using it? 542 00:59:18,530 --> 00:59:23,270 So I like to think of that as comparing against an offline benchmark. 543 00:59:23,270 --> 00:59:29,090 So in some of our digital gender gaps work, one of the things we often tried to do was to say: 544 00:59:29,090 --> 00:59:35,570 what if we didn't have the Facebook data and we tried to build some kind of prediction model to predict 545 00:59:35,570 --> 00:59:40,520 digital gender inequality just from other information that we have, from other development indicators? 546 00:59:40,520 --> 00:59:46,280 How well would we do with that? And then we compared that with what we do with just Facebook. 547 00:59:46,280 --> 00:59:51,740 And then we tried a hybrid approach where we combine measures from Facebook with those that 548 00:59:51,740 --> 00:59:57,650 are available from other kinds of survey data sources or other kinds of development indicators. 549 00:59:57,650 --> 01:00:04,220 So either finding a way to compare against an offline benchmark or having some kind of hybrid approach might be valuable, 550 01:00:04,220 --> 01:00:08,690 also to motivate why this work is useful and important.
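To make that offline-benchmark idea concrete, here is a minimal sketch in Python. The file name, the column names and the choice of a simple linear regression are hypothetical illustrations, not the actual models used in the digital gender gaps work; the point is only that an offline-indicators-only model, a Facebook-only model and a hybrid model are compared on the same cross-validated error.

```python
# Minimal sketch (hypothetical file and column names): compare an offline
# benchmark model, a Facebook-only model and a hybrid model for predicting
# a survey-based outcome, using the same cross-validated error metric.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("country_indicators.csv")      # one row per country (hypothetical)
outcome = "survey_internet_gender_gap"          # survey-based ground truth (hypothetical)

feature_sets = {
    "offline benchmark": ["gdp_per_capita", "mean_years_schooling"],
    "facebook only":     ["fb_female_male_user_ratio"],
    "hybrid":            ["gdp_per_capita", "mean_years_schooling",
                          "fb_female_male_user_ratio"],
}

for name, cols in feature_sets.items():
    scores = cross_val_score(LinearRegression(), df[cols], df[outcome],
                             scoring="neg_mean_absolute_error", cv=5)
    # cross_val_score returns negative MAE, so flip the sign for readability.
    print(f"{name}: cross-validated MAE = {-scores.mean():.3f}")
```

Whichever specification wins, reporting all three side by side is what makes the case that the digital trace source adds something beyond the offline benchmark.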
551 01:00:08,690 --> 01:00:16,010 I also see that, for example, as a shortcoming of the original Google Flu paper, in that 552 01:00:16,010 --> 01:00:21,620 if you read the 2009 paper, they sort of say: oh, 553 01:00:21,620 --> 01:00:25,910 we can predict flu really well with Google. But actually 554 01:00:25,910 --> 01:00:31,790 they don't compare themselves at all with a simple time series model where they just use lag one and lag two. 555 01:00:31,790 --> 01:00:38,090 They're not doing anything where we could actually just use past flu to predict current flu. 556 01:00:38,090 --> 01:00:44,330 And why don't we do that instead? Why do we have to do pyrotechnics with Google? 557 01:00:44,330 --> 01:00:51,370 So one way to justify your research, if you are interested in measurement or 558 01:00:51,370 --> 01:00:56,710 nowcasting, would be to say: well, actually, there are some pros and some cons of these data sources, 559 01:00:56,710 --> 01:01:03,220 and by doing a comparison explicitly against some kind of offline benchmark or some other kind of model, 560 01:01:03,220 --> 01:01:11,620 we can in some sense rationalise our approach and say that there are some things that are potentially gained from this. 561 01:01:11,620 --> 01:01:18,610 Another line of research that I see emerging with the use of digital trace 562 01:01:18,610 --> 01:01:27,430 data is papers that see digital platforms themselves as microcosms of society. 563 01:01:27,430 --> 01:01:34,510 So, as a lot of life is now lived online and there are purely digital phenomena, as I was talking about before, 564 01:01:34,510 --> 01:01:46,510 could we potentially think of digital spaces as microcosms of society in which to go and test certain kinds of theories that we may have? 565 01:01:46,510 --> 01:01:52,930 I think there was an example in one of the lightning talks yesterday which was using moral foundations theory — 566 01:01:52,930 --> 01:01:56,560 I think it was reviewing some of the moral foundations theory. And that's, to me, 567 01:01:56,560 --> 01:02:01,270 an example of thinking of Twitter as a kind of digital space where we might test a 568 01:02:01,270 --> 01:02:07,420 theory about how people's moral intuitions are configured. And perhaps, 569 01:02:07,420 --> 01:02:07,630 you know, 570 01:02:07,630 --> 01:02:19,300 you might have a theory that you want to take and test in a digital space yourself, and think about how that might play out. 571 01:02:19,300 --> 01:02:26,680 And the third strand of papers, as I see it, that try to leverage different kinds of digital trace 572 01:02:26,680 --> 01:02:36,670 data are those that are actually interested in thinking about the implications of digital technologies for social processes. 573 01:02:36,670 --> 01:02:39,460 Now, this is something that I myself am very interested in. 574 01:02:39,460 --> 01:02:46,570 I'm very interested in thinking about, for example, now that we are measuring gender inequalities in internet and mobile phone access, 575 01:02:46,570 --> 01:02:51,940 what they actually mean for the attainment of other social development outcomes, 576 01:02:51,940 --> 01:03:00,430 such as access to contraception, or access to information about HIV and antenatal screening.
577 01:03:00,430 --> 01:03:08,140 These might be examples of how access to digital technology empowers, or provides new forms of resources and 578 01:03:08,140 --> 01:03:15,430 information, that could have implications for how the world works and for other kinds of social inequalities. 579 01:03:15,430 --> 01:03:21,940 And I think this is something that a lot of us — at least, I can speak more for demographers and sociologists and less for other disciplines, 580 01:03:21,940 --> 01:03:28,270 and I know you are from many different disciplines, so I don't want to overstep my claims — 581 01:03:28,270 --> 01:03:32,110 but I do want to say that I think a lot of social scientists, 582 01:03:32,110 --> 01:03:39,940 a lot of sociologists and demographers, haven't thought about digital inequality enough. 583 01:03:39,940 --> 01:03:50,110 They haven't thought about its implications for other forms of social inequality, for processes of social stratification. 584 01:03:50,110 --> 01:03:56,020 And that's a dimension where different conversations need to be had, 585 01:03:56,020 --> 01:04:02,470 and potentially the use of digital trace data could be helpful in that regard. 586 01:04:02,470 --> 01:04:06,820 Is that a question, or no? Yeah, I do have a question. 587 01:04:06,820 --> 01:04:11,860 It's actually more like an observation. Yeah, it's something I've certainly worried about. 588 01:04:11,860 --> 01:04:20,170 So, when you use traditional surveys, you usually have a lot of documentation on how the data were gathered. You usually have this 589 01:04:20,170 --> 01:04:25,070 technical documentation that says if something came up in the data collection process, 590 01:04:25,070 --> 01:04:28,480 if there's an anomaly in a question and so on. 591 01:04:28,480 --> 01:04:35,100 And that's good, because when you're analysing or exploring the data, you can actually explain why you see stuff that doesn't make sense. 592 01:04:35,100 --> 01:04:43,180 But we don't have that when we're analysing most social media data or digital trace data that we use. 593 01:04:43,180 --> 01:04:50,020 Because, for example, when we're using Facebook data or Twitter data, the absence of something might be telling, 594 01:04:50,020 --> 01:04:54,280 or there might be a column which means something that we're not aware of. 595 01:04:54,280 --> 01:05:00,820 We don't really know what everything is about, and we can't really know how the data were gathered, in some sense. 596 01:05:00,820 --> 01:05:06,340 So I was just thinking whether we should start to uphold some sort of standards in terms of which 597 01:05:06,340 --> 01:05:13,330 data we use, and whether we should have further information, or what we expect of private companies. 598 01:05:13,330 --> 01:05:21,370 Right? Yeah. I mean, I think this ties in with something I was saying earlier — that we have, in some sense, the issues of accessibility. 599 01:05:21,370 --> 01:05:26,470 But to some extent, you're right: it's not a survey. 600 01:05:26,470 --> 01:05:31,270 It's not meant to be used in the same way as a survey. 601 01:05:31,270 --> 01:05:37,210 And in fact, in my own experience, in the work that I've been doing with the Facebook Marketing API, 602 01:05:37,210 --> 01:05:44,380 one common complaint that we often encounter from Facebook is this:
603 01:05:44,380 --> 01:05:51,180 we describe the Facebook Marketing API as providing a digital census of the Facebook population, and they dislike 604 01:05:51,180 --> 01:05:58,110 that phrase. They say: we don't want to be seen as a digital census, because we are not that. 605 01:05:58,110 --> 01:06:03,330 We are providing audience targets for advertisers, and that's what we care about. 606 01:06:03,330 --> 01:06:08,940 So there is definitely a tension there, that we might not know what 607 01:06:08,940 --> 01:06:13,890 we're measuring because of the lack of appropriate metadata, or we might not have information. 608 01:06:13,890 --> 01:06:18,390 But at the same time, rather than saying we shouldn't be using these data, 609 01:06:18,390 --> 01:06:22,770 I think we should be saying we should be using them, and then asking for 610 01:06:22,770 --> 01:06:30,690 clarity from the providers of these data by motivating their importance for research. 611 01:06:30,690 --> 01:06:36,390 And I think that is, 612 01:06:36,390 --> 01:06:40,680 for me, the way forward. Are there some frameworks out there 613 01:06:40,680 --> 01:06:43,890 that already have something to say on this? Yeah, 614 01:06:43,890 --> 01:06:51,940 I mean, there are some attempts to try and do this now, such as the Social Science One framework, 615 01:06:51,940 --> 01:06:59,640 this kind of attempt at forming a partnership between academia and private companies. 616 01:06:59,640 --> 01:07:04,710 And there's also the Open Algorithms (OPAL) project, 617 01:07:04,710 --> 01:07:10,380 which is trying to create a kind of secure data infrastructure so that 618 01:07:10,380 --> 01:07:14,880 other kinds of users might also be able to use anonymised call detail records 619 01:07:14,880 --> 01:07:20,070 beyond just small projects that are very specific to particular individuals. 620 01:07:20,070 --> 01:07:27,420 So with these kinds of partnerships there is an attempt now, I think, to have this discussion and move forward in that space. 621 01:07:27,420 --> 01:07:35,010 But it's still early stages, and often it's hampered by events such as Cambridge Analytica, 622 01:07:35,010 --> 01:07:39,960 which completely pivot the dialogue in a very different direction. 623 01:07:39,960 --> 01:07:48,930 And I think that's kind of the challenge: all the discussion then is about misuse, and we forget about the potential good uses as a result. 624 01:07:48,930 --> 01:07:55,990 So anyway — sorry, David, you had something to say about this discussion of how the data are created by the company, or what? 625 01:07:55,990 --> 01:07:59,340 Yeah, there is the part about how the data are gathered by the company, 626 01:07:59,340 --> 01:08:07,860 and then there is the step of us as researchers, which I think is complicated by the technology, but also really complicated by the companies. 627 01:08:07,860 --> 01:08:13,500 So, looking more broadly, we don't have standards for reporting with this sort of data yet. 628 01:08:13,500 --> 01:08:22,730 Yeah. You want to come up with something like CONSORT, so that when you're reviewing a kind of randomised controlled trial or something, 629 01:08:22,730 --> 01:08:28,890 you know what should be there. Yeah,
you can say that it was well done, that the reporting is transparent. 630 01:08:28,890 --> 01:08:38,590 And I've had pushback with that. Just the other day, I got back an article that I had reviewed, where I had pushed 631 01:08:38,590 --> 01:08:44,770 the authors to try and be a bit more transparent about the steps, about the selection process, 632 01:08:44,770 --> 01:08:51,610 which we just don't have the norms for reporting. And their response was: well, nobody else does that, 633 01:08:51,610 --> 01:08:56,940 and I was setting the bar too high for them because nobody else has to do that. 634 01:08:56,940 --> 01:09:02,740 At some point it has to change. There is the company side of it, but there's also just us: 635 01:09:02,740 --> 01:09:07,470 we don't have a blueprint for what we should report in terms of a code book. 636 01:09:07,470 --> 01:09:17,350 I mean, there are also legitimate tensions here. So there were some messages on Slack earlier about reproducibility and openness, 637 01:09:17,350 --> 01:09:23,160 and while I think as a community we have to be moving in that direction, 638 01:09:23,160 --> 01:09:31,980 there are, again, tensions, if we're using private company data, about what we can actually reveal publicly online. 639 01:09:31,980 --> 01:09:38,490 For example, this was an explicit decision when we put our digitalgendergaps.org site online. 640 01:09:38,490 --> 01:09:43,360 We were told that we shouldn't be releasing the raw counts from Facebook online. 641 01:09:43,360 --> 01:09:49,290 We should only be producing model estimates and putting those online, because, you know, 642 01:09:49,290 --> 01:09:58,350 we've had discussions with Facebook as a result of that about what they're OK with us sharing and what not. 643 01:09:58,350 --> 01:10:00,330 And then, you know, 644 01:10:00,330 --> 01:10:05,340 there's a tension, because we want to be open and reproducible and we want other people to be able to do what we're doing, 645 01:10:05,340 --> 01:10:12,180 but at the same time, if we're not able to provide the data, then we can't advance that agenda in the same way. 646 01:10:12,180 --> 01:10:14,010 This is not related to digital trace data, 647 01:10:14,010 --> 01:10:19,090 but I know some of you have also expressed an interest in agent-based modelling. 648 01:10:19,090 --> 01:10:28,710 There's been this interesting movement towards adopting the ODD protocol — I'm now forgetting exactly what ODD stands for — 649 01:10:28,710 --> 01:10:34,500 but it's a rubric for what, when you're writing up and describing an agent-based model, 650 01:10:34,500 --> 01:10:43,410 you should be looking to very clearly define. And that kind of rubric emerged because one of the big critiques of agent- 651 01:10:43,410 --> 01:10:48,330 based modelling initially was that people can model whatever they want and they can just, 652 01:10:48,330 --> 01:10:54,630 you know, put whatever they want in a system and then recreate and generate any pattern and say that their theory works. 653 01:10:54,630 --> 01:11:02,640 And this was a big critique of the fact that there was just this unstructured, Wild West style of ABM programming going on.
654 01:11:02,640 --> 01:11:10,590 And one response to that was: let's generate a set of protocols about what needs to be reported and how it needs to be documented. 655 01:11:10,590 --> 01:11:17,190 And I think we could potentially be moving in that direction with this too. 656 01:11:17,190 --> 01:11:27,030 Yeah. Yeah, that could actually be a very useful thing, especially in relation to thinking about it from an ethics perspective as well. 657 01:11:27,030 --> 01:11:31,400 Yeah. Yeah. 658 01:11:31,400 --> 01:11:37,790 Yeah. So the bigger problem with this specific thing is that obviously the data you sample on a 659 01:11:37,790 --> 01:11:42,440 single day could, two months later, be a completely different set. 660 01:11:42,440 --> 01:11:43,890 Yeah, that's the drift problem. 661 01:11:43,890 --> 01:11:50,570 Yeah, and especially with the marketing side it's to do with the algorithms — there are multiple estimates of the same stuff. 662 01:11:50,570 --> 01:11:59,040 Yeah. A simple check would be: if you had to run the same collection process three times, would you get the same thing? 663 01:11:59,040 --> 01:12:04,890 Yeah. Yeah, yeah. 664 01:12:04,890 --> 01:12:09,630 Oh my goodness, yeah. Or you might want to think about that again — 665 01:12:09,630 --> 01:12:13,920 it would probably be very specific to what kind of question you have and what kind of data. 666 01:12:13,920 --> 01:12:20,730 Yeah. But if you have a lot of variability — I mean, you notice that the counts will change if you query repeatedly. 667 01:12:20,730 --> 01:12:24,090 So for example, again with our Facebook Marketing API work, 668 01:12:24,090 --> 01:12:29,280 we find that the daily active users change a lot, while the monthly active users tend to be much more stable. 669 01:12:29,280 --> 01:12:33,150 But part of the back end of the project is, by collecting the data 670 01:12:33,150 --> 01:12:37,770 every day, to actually be able to understand what it is that changes and how it changes. 671 01:12:37,770 --> 01:12:44,040 Because I think most researchers using these kinds of data sources will use them once, 672 01:12:44,040 --> 01:12:51,780 or as a one-off collection, and then they won't necessarily think about it from a longer-term perspective. 673 01:12:51,780 --> 01:12:58,770 But when they do want to use them longer term, you're right that they have to think a little bit more about how the drift affects them. 674 01:12:58,770 --> 01:13:00,870 And there's also 675 01:13:00,870 --> 01:13:11,520 the marketplace nature of this, because, yeah, I put myself on the marketplace online and you can harvest that. [partly inaudible] 676 01:13:11,520 --> 01:13:15,990 Yeah, there's obviously a lot in how many people are there at the same time. 677 01:13:15,990 --> 01:13:23,550 Yeah. [partly inaudible] 678 01:13:23,550 --> 01:13:31,710 [partly inaudible] Yeah, I see. 679 01:13:31,710 --> 01:13:36,540 Yeah, because most of the stuff — yeah. [partly inaudible] 680 01:13:36,540 --> 01:13:45,270 Yeah, yeah. I mean — you mean the platform is changing, but also that, say, Nike is starting to advertise. 681 01:13:45,270 --> 01:13:50,050 Yeah, yeah. Well, that's — yeah.
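On that point about collecting the counts every day to be able to see what changes: here is a minimal sketch, in Python, of what that kind of logging and drift check might look like. The fetch function is a simulated placeholder rather than a real Facebook Marketing API call, and the segment labels and file name are hypothetical.

```python
# Minimal sketch: log audience-count estimates once per collection run and
# summarise how much they drift across runs. fetch_audience_count() is a
# simulated placeholder, NOT a real Facebook Marketing API call.
import datetime as dt
import os
import random

import pandas as pd

SEGMENTS = ["women_18plus", "men_18plus"]   # hypothetical targeting segments
LOG_FILE = "audience_counts.csv"

def fetch_audience_count(segment: str) -> int:
    # Placeholder: replace with your actual collection routine.
    return random.randint(900_000, 1_100_000)

def collect_once() -> None:
    """Append today's counts for each segment to the log file."""
    rows = [{"date": dt.date.today().isoformat(),
             "segment": s,
             "count": fetch_audience_count(s)} for s in SEGMENTS]
    pd.DataFrame(rows).to_csv(LOG_FILE, mode="a", index=False,
                              header=not os.path.exists(LOG_FILE))

def summarise_drift() -> pd.DataFrame:
    """Coefficient of variation per segment: a rough instability indicator."""
    log = pd.read_csv(LOG_FILE)
    summary = log.groupby("segment")["count"].agg(["mean", "std"])
    summary["cv"] = summary["std"] / summary["mean"]
    return summary

if __name__ == "__main__":
    collect_once()           # run daily, e.g. from a scheduled job
    print(summarise_drift())
```

Running the collection on a schedule, rather than as a one-off pull, is what makes it possible to say whether a change in the estimates reflects the population or the platform.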
682 01:13:50,050 --> 01:13:55,920 Yeah, because you don't know about the other kinds of — yeah, who knows who is interacting with your stuff. 683 01:13:55,920 --> 01:14:01,980 Yeah, using their mobiles too. Yeah. So yeah. 684 01:14:01,980 --> 01:14:06,030 No, I mean, I think there is definitely that dimension of the unknown as well, 685 01:14:06,030 --> 01:14:12,900 where the algorithm itself from the companies could be changing, but also the environment in which, you know, you might be putting something up. 686 01:14:12,900 --> 01:14:17,610 But yeah, in some sense the kinds of issues that I'm raising are very general, 687 01:14:17,610 --> 01:14:26,760 and you might have very specific kinds of issues in relation to a specific project that would then come up too. 688 01:14:26,760 --> 01:14:36,270 So those were, as I said, designs that essentially rely on digital trace data by themselves, 689 01:14:36,270 --> 01:14:43,080 sometimes in combination with other kinds of existing observational or survey data sources. 690 01:14:43,080 --> 01:14:55,050 But I think the way forward — and this is an argument that Chris Bail makes very actively in the context of SICSS; he actually has 691 01:14:55,050 --> 01:15:00,960 a tutorial or some slides in which he makes the argument for hybrid research designs, so I would recommend you go and look at those — 692 01:15:00,960 --> 01:15:09,180 may be hybrid research designs for answering some questions, also 693 01:15:09,180 --> 01:15:15,780 with digital trace data. So the first example, and this is something that's already come up, 694 01:15:15,780 --> 01:15:22,140 I've noticed, a few times, is: how could we combine digital trace data with conventional data sources like surveys? 695 01:15:22,140 --> 01:15:26,850 So there was someone — I think Clemens yesterday — who in a sense is doing that: 696 01:15:26,850 --> 01:15:34,770 relying on a data set where, within the context of a survey, there are browser histories embedded in it. 697 01:15:34,770 --> 01:15:44,610 So he has information from both the survey and the browser history, and he is using both to answer a question. 698 01:15:44,610 --> 01:15:52,890 And that, I think, could be a great way forward: if we are designing or piloting a survey, 699 01:15:52,890 --> 01:16:00,840 we could also ask for different kinds of measures, potentially things such as, you know, browsing history, as one example. 700 01:16:00,840 --> 01:16:06,030 But also, we might be willing to share information from our social media pages; 701 01:16:06,030 --> 01:16:14,970 we might make our Twitter handles available to someone who's doing a survey, depending on what the questions are. 702 01:16:14,970 --> 01:16:20,520 So I think that could be a hybrid research design that could work particularly 703 01:16:20,520 --> 01:16:25,560 well and might help overcome some of the weaknesses that I talked about. 704 01:16:25,560 --> 01:16:34,560 Then there are apps for data generation and extraction. Now, I know Chris and, I think, Taylor, you as well worked on a project with social media apps.
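On the survey-plus-trace idea just described — consented browser histories or Twitter handles collected alongside a questionnaire — here is a minimal sketch of the linkage step in Python. The file and column names are hypothetical; the point is only the normalise-then-join pattern that attaches behavioural summaries to each respondent.

```python
# Minimal sketch (hypothetical files and columns): link survey responses to
# digital trace data via an identifier respondents consented to share,
# e.g. a Twitter handle collected in the survey.
import pandas as pd

survey = pd.read_csv("survey_responses.csv")   # includes a 'twitter_handle' column
traces = pd.read_csv("collected_tweets.csv")   # one row per tweet, with a 'handle' column

def normalise(handles: pd.Series) -> pd.Series:
    """Strip whitespace and a leading '@', lowercase, so the join keys match."""
    return handles.str.strip().str.lstrip("@").str.lower()

survey["handle"] = normalise(survey["twitter_handle"])
traces["handle"] = normalise(traces["handle"])

# Summarise the traces per respondent (here: simple tweet volume), then join
# so each row holds both self-reported and behavioural measures.
tweet_counts = traces.groupby("handle").size().rename("n_tweets").reset_index()
linked = survey.merge(tweet_counts, on="handle", how="left")
linked["n_tweets"] = linked["n_tweets"].fillna(0).astype(int)
```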
705 01:16:34,560 --> 01:16:40,740 So I think Taylor is a good person to talk about how social media apps could potentially be used 706 01:16:40,740 --> 01:16:46,860 as a way to collect information in a way that is ethically sound — 707 01:16:46,860 --> 01:16:50,670 so not repeating Cambridge Analytica. 708 01:16:50,670 --> 01:17:04,110 And that, I think, is another way in which digital trace data could be used in a kind of hybrid design that would also work well. 709 01:17:04,110 --> 01:17:11,830 While in the past the idea that you might design an app was quite forbidding — 710 01:17:11,830 --> 01:17:16,260 I know, I think Aiden is designing an app, right? 711 01:17:16,260 --> 01:17:23,070 I don't know what platform you're using, but in the context of R Shiny, making apps has become much easier now, 712 01:17:23,070 --> 01:17:32,220 and deploying them to web interfaces has become much easier. And there are some really good tutorials online on R Shiny if you're interested in that. 713 01:17:32,220 --> 01:17:39,900 So I think that could be a model where the barrier to entry has become much lower than it was in the past, 714 01:17:39,900 --> 01:17:45,270 and designing apps could be a way forward for some kinds of hybrid research designs. 715 01:17:45,270 --> 01:17:53,070 And on Saturday we'll also talk a bit about how we might use bots — basically automated accounts that 716 01:17:53,070 --> 01:17:57,630 might, say, tweet certain kinds of messages at regular intervals in the context of, 717 01:17:57,630 --> 01:18:01,710 say, an experimental design, if you were interested in looking at political opinion 718 01:18:01,710 --> 01:18:08,310 formation or things like political polarisation. And that is, again, an example of 719 01:18:08,310 --> 01:18:14,880 a sort of hybrid research design. With all of these questions, of course, right, 720 01:18:14,880 --> 01:18:22,140 or these kinds of designs, we always have to think about the ethics. Some of these designs might be quite new, 721 01:18:22,140 --> 01:18:31,950 and maybe there might not be a precedent in relation to them in your ethics board at the university 722 01:18:31,950 --> 01:18:39,180 that you're working at. So you might probably want to take a higher standard and think about: well, what are we doing? 723 01:18:39,180 --> 01:18:45,120 Is that ethical? And how would someone view this maybe a few years from now? 724 01:18:45,120 --> 01:18:53,430 So you might want to adopt a standard which is perhaps more critical, and think about, 725 01:18:53,430 --> 01:18:57,330 for example, if you were using a bot: 726 01:18:57,330 --> 01:19:01,470 under what circumstances is it OK to use a bot, 727 01:19:01,470 --> 01:19:06,750 and when is it not? Anyway, this is where I want to end this session. 728 01:19:06,750 --> 01:19:15,330 We're going to have a break until three forty-five and then we'll come back and do part two of this session. 729 01:19:15,330 --> 01:19:23,328 So, yeah, for now, it's a break and I want to terminate the livestream.