All right, I'm going to start again for part two of digital trace data, and I hope you're all caffeinated and sugared up. So, quickly, before I start talking, I want to make a plug: I believe Taylor has put out a link on the Slack for proposing topics for dinner discussions, and we are not doing well. We are not doing well because we have only one that's been proposed. Although, heads up first: tonight, Roberto is going to be leading a dinner discussion on the ethics and the challenges involved with scraping web data. And of course, he did talk about his personal experiences with that. But I recall from the lightning talks that many, many of you mentioned you had been, or are, using web-scraped data. And I think it would be a great opportunity for us to just have a discussion about some of the challenges you might have faced, or some of the bigger outstanding issues. And then on Friday, Charlie will be giving the workshop on digital data agreements, or something like this. I don't know if Charlie is in the building, but I might message him and ask him to join us. Charlie, if you're watching the livestream, come to dinner. Anyway, so that's tonight. But it would really be great; I mean, all of you are brimming with really exciting and interesting ideas, or just have questions you want to discuss.
And maybe your question is: is computational social science even a thing, and is it worth it? And that's also a very good and germane question that we can have a discussion about. And I think that the dinners are a great place to be able to do that. We also have some slots tonight: you know, Roberto's already taken one slot, but we can also have two or three discussions going on at the same time if there is enough interest. So, you know, just please fill in that sheet that Taylor sent out.

The second thing is the feedback form from yesterday. We got four responses. They're not compulsory, but they're really helpful for us, especially if we want to do this again, and to inform the other SICSS locations; we've all learnt a lot from that one. And it's just open-ended questions. So there'll be another one tonight about today, and you can also go back and answer about yesterday, things like that. Just keep an eye out.

Yeah, OK. So, dinner discussions. And also, if you're interested in any particular topics, there's the proposed-topics channel, so please feel free to start channels to discuss these kinds of things on the Slack. You just write a word or a little description of what you're interested in, anybody who is interested in that can like it, and then you have a sense of interest and you can start your own channel to talk about it. OK, great.
So I want to start now with part two of digital trace data. The point, or the purpose, of part two is this: I've given you a conceptual overview, talked a bit about how social researchers use digital trace data for answering social questions in terms of different kinds of research designs, and about the strengths and weaknesses. And that was meant to get us thinking very generally about this kind of data source. Now I want to talk a little bit about the tools and techniques that we might use for working with web data. Not necessarily just digital trace data, but web data. We might adopt some techniques such as web scraping, or screen scraping. I'll also talk a bit about APIs, and give the example of the Facebook marketing API, since I talked a lot about research involving the marketing API. And then we have a group exercise that I hope we'll start off today, which will at first be more individual: I'll ask you to work with an API individually, an API of your choice, or maybe the marketing API, or the Twitter API. And then tomorrow morning, we're going to dedicate the whole morning session to the group exercise, just like we did today. OK, so: the different techniques and tools.
And this, again, is not meant to be an exhaustive lecture on every possible way you might deal with web data, or on all the APIs in the world, because there are more than twenty thousand web APIs, so I can't go through them all. I mean, it's like saying there are so many surveys in the world: I can't teach you how to use every survey in the world, but I can tell you general principles of survey research. So think of it in the same way, right?

So, web scraping is simply the process of automatically extracting data from web pages. Now, this is perhaps familiar to many of you already, because you've been doing some kind of web scraping already. And the slides that I'm going to use now rely on Chris's tutorial on screen scraping, which is on the SICSS main page, the Princeton website. So if you want to follow along and actually try some of the code yourself, you can just go to the tutorial and follow along, if you're interested.

Anyway, as I said, web scraping, or screen scraping, is just identifying a web page that you can, as you see on the slide, legally scrape, and then downloading the source code, which of course is what a web page looks like to a computer. It's the HTML at the back end, or it could be an XML file. And then you essentially parse that source code to create a kind of data set that you can use.
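Chris's tutorial does all of this in R with rvest, but the core idea, download the page source and then parse it into usable data, can be sketched with nothing beyond Python's standard library. The HTML string below is a made-up stand-in for a downloaded page; a real scrape would fetch the source first, after checking the terms of service and robots.txt.

```python
from html.parser import HTMLParser

# Minimal sketch: extract the text of every <h2> heading from already-downloaded
# HTML. A real scrape would first fetch the page (e.g. with urllib.request).
class HeadingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

html = "<html><body><h2>Rankings</h2><p>text</p><h2>Methods</h2></body></html>"
scraper = HeadingScraper()
scraper.feed(html)
print(scraper.headings)  # ['Rankings', 'Methods']
```

In a real project the parser would collect table cells or links rather than headings, but the mechanics are the same: walk the tags and keep the data you want.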
And I think before we go further, it's important to take a pause and ask: is web scraping legal? I've come to learn that there are a number of opinions about this. There are website terms of service: are you allowed to do this? And often, if you read the terms of service for a lot of websites, especially large websites like Facebook or The New York Times, they will actually say, no, you are not allowed to do this. You're not allowed to automatically extract data. You are allowed to use the site for individual use; you know, you're allowed to read these things, but you can't just systematically have a spider or crawler going around using these pages. But I've also heard views about how terms of service are not legally binding agreements, and that, as a result, there's not really a contract that needs to be obeyed. I personally, as a researcher, like to err on the side of caution. And if people ask me, I always say: err on the side of caution, because you don't want to do something that might make things difficult for you, especially if, at some point, it turns out that there is some contention about the usability of that particular data source. So there's no general prescription here that I want to give, other than to say: make sure you read and understand the terms of service, and also ask yourself:
think about it in the context of what you are doing. Is this something that, you know, you should do? How does it fit in with your goals and your plans?

Websites these days also tend to have a robots.txt policy that specifies rules about automated data collection. So, for example, this is Wikipedia's robots.txt policy. I know Pablo here is an expert on Wikipedia, so you've probably seen this policy. Have you read it? No? Well, you don't have to read the whole policy. Wikipedia is OK with most crawlers, you know; they're an open-source, open encyclopaedia. But of course, there are some web crawlers that they don't like, because they've been observed sending large amounts of requests and ignoring rate limits, and that's disrespectful, and so on. So, as a result, some of these are disallowed. So even on Wikipedia, there are some bots, or some crawlers, that are disallowed.

So anyway, most websites these days have these kinds of robots.txt files, and you should check them if you are planning to use some kind of crawler. By a crawler, I mean essentially some kind of computer programme that is crawling, or going through, many different kinds of web pages and extracting information from them, usually quite regularly. And usually crawlers work with different kinds of rules.
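These robots.txt rules can also be checked programmatically before crawling. A minimal sketch using Python's standard-library parser; the rules file and the user-agent names here are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone is barred from /private/, and one
# hypothetical crawler ("BadBot") is barred from the whole site.
robots_txt = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("MyResearchBot", "/wiki/Health"))   # True
print(rp.can_fetch("MyResearchBot", "/private/page"))  # False
print(rp.can_fetch("BadBot", "/wiki/Health"))          # False
```

In practice you would point the parser at the live file with `set_url("https://example.org/robots.txt")` followed by `read()`, and call `can_fetch` before every request.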
You know, one crawler might go to one page, and then go to all the pages linked from that page, and then so on and so forth, so it can quickly become an operation that is quite large.

[Audience question, inaudible.] Yes, I'm referring to the crawler as the programme, the computer programme. Right, so a crawler in this case would be the computer programme that is going around collecting, but obviously it's usually designed by an individual or a team or something. So, yeah. And these crawlers here, for example: most of the ones that Wikipedia doesn't like tend to be advertising or media crawlers, Google and so on and so forth.

Anyway, this is a straightforward Wikipedia page; it's just the World Health Organisation's ranking of health systems. And this is what the back end looks like: the HTML of this Wikipedia page. And really, there's sometimes good information on Wikipedia; you might want some tables from Wikipedia. And as far as scraping goes, Wikipedia is fairly manageable; it's quite an easy resource to scrape. It generally has the same structure throughout. And there's the rvest package in R. Have any of you used rvest? [Audience comment, partly inaudible:] It's not really being used now; it's switched to xml2.
[Audience, partly inaudible:] xml2 does pretty much anything that you'd do with rvest. OK, so the suggestion is that better than rvest is xml2. I mean, it does the same kinds of things, yeah. [Partly inaudible exchange about maintenance.] No, I mean, you can still use rvest; it was, and I think still is, one of the most widely used packages for scraping simple, static web pages. You would just read the HTML page into R, and it would tell you it's an XML document. This is what the output looks like if you just read the HTML in; that's the object you get.

And then, so, this is back to that web page, and this is it open in Google Chrome. You would go to your developer tools here, in order to be able to scrape this web page. So I assume that I want this table on the web page: the table with the rankings. Let's say that this is the information that I want, right? This table is what I'm trying to extract information from. So I would go, through the developer tools, inspect, and find the XPath for this. I would just go and copy the XPath for where this table starts. And then it's quite straightforward: I would just use the html_node command and specify the XPath.
And as a result, using html_table, I would essentially be able to extract that table. So this is a pretty easy web page to scrape: with essentially three commands, using the rvest package, we can import and quite nicely analyse this table.

Of course, most web pages are not Wikipedia, and they are not as straightforward to extract tables from. In my experience, sites that have a lot of tables where you might want information can often be quite interactive and very difficult to extract information from. But often they'll have, on some part of their web page, a JSON file. And JSON, which I'll come to as well later, is just a data storage format with lots and lots of curly brackets, and it's often the way in which a lot of information is stored, especially on the web. Using the JSON format, we might be able to collect a lot of the information that is available on these websites somewhere, even when the tables themselves can't be scraped directly. So, for example, for a project I was interested in using the Global Gender Gap Report. The website is very interactive and was not easy to scrape, but hidden on their website I actually found all the tables in JSON, so it was much easier to use them that way.
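The three-step rvest flow just described (read the HTML, locate the node by its XPath, convert it into a table) can be mimicked in a stdlib-only Python sketch. ElementTree requires well-formed markup and supports only a subset of XPath, and the page below is a toy stand-in for the Wikipedia rankings table.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a page with a rankings table; a real page would be
# downloaded first and, if necessary, tidied into well-formed markup.
page = """
<html><body>
  <table id="rankings">
    <tr><th>Rank</th><th>Country</th></tr>
    <tr><td>1</td><td>France</td></tr>
    <tr><td>2</td><td>Italy</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(page)                          # step 1: read the HTML
table = root.find(".//table[@id='rankings']")       # step 2: locate the node, XPath-style
rows = [[cell.text for cell in tr] for tr in table.findall("tr")]  # step 3: to a table
print(rows)  # [['Rank', 'Country'], ['1', 'France'], ['2', 'Italy']]
```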
So it might be useful to check, for the kinds of data sources you are interested in, whether there is a JSON file somewhere on the site.

Anyway, for more complex pages, as I said, this kind of static, simple code doesn't work, especially on pages with a lot of these stylesheet elements: Cascading Style Sheets, or CSS. So we might need other approaches. We might use something called SelectorGadget, which is a plugin you can put into Chrome, which then allows you to point and click and find the parts of the page that you want to scrape. So, for example, here, this is the part identified in the example: if I were interested in just getting this part, I would just point and click on the different CSS elements, and then SelectorGadget would identify the selector for me. And similarly to what we were doing before, using html_nodes: instead of the XPath, we'd now specify the CSS selector, and that would then be used to extract this information. We would still probably need to clean this data set, because, you know, we would get some information, but the strings might be messy, or they might not be in the perfect format. So there would still be steps in which we would need to do some data wrangling. But at the same time, this could be an alternative approach for doing some kind of web scraping, if we wanted to.
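SelectorGadget typically hands you a CSS selector such as a class name. Python's standard library has no CSS-selector engine, but a simple class selector maps onto an attribute test, and the result then needs exactly the kind of string cleaning just mentioned. The page, the class name and the values here are all invented.

```python
import xml.etree.ElementTree as ET

# Invented snippet: SelectorGadget might have given us the selector ".price".
page = """
<div>
  <span class="price">  $1,299  </span>
  <span class="note">free shipping</span>
  <span class="price">$849
</span>
</div>
"""

root = ET.fromstring(page)
# A class selector like ".price" becomes an attribute match in ElementTree.
raw = [el.text for el in root.findall(".//span[@class='price']")]
# The scraped strings are messy, so some wrangling is still needed.
clean = [float(s.strip().lstrip("$").replace(",", "")) for s in raw]
print(clean)  # [1299.0, 849.0]
```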
Anyway: in practice, this is web scraping, and it can be messy, it can be difficult, and it can yield data that aren't always straightforward to use. But again, this is problem-specific. In some cases, that might be the only thing you can do: you have to just clean messy data, and that's what you have to do. But sometimes you have an alternative, right? And maybe Roberto will talk about this in the dinner discussion as well. One approach, if you have a complex web page and you're interested in extracting information from it, is that you might want to outsource this to crowd workers. So you might want to use Amazon Mechanical Turk as a way to send people to different links and say: extract this information from this web page for me. So you might want to outsource it to a crowd-working platform as a way to collect information from a web page, rather than trying to automate the process.

Another approach, of course, is that you could use a web-based API, if an API were available, because some websites do make their content available through application programming interfaces. Just by show of hands: how many people here have ever used or interacted with an API? OK, so the vast majority of you have some experience using APIs already. Can I get a sense of what kinds of APIs you've used?
[Audience, at the back, partly inaudible:] ...either directly, or through an R package, or something akin to a Python package. The Translation one. [Speaker:] Which Facebook API? [Audience:] The Graph API. [Speaker:] Yeah.

[Audience:] I used one for names. It gives you, like, the genealogy of names: you find out people's age, gender or race based on their names. [Speaker:] Yeah, and that one has a very large sample. And you just query the URL? [Audience:] Yeah, with whatever content inside. [Speaker:] Yeah.

By the way, I'm asking also because I think this is an opportunity for you to know what other people have done, which is also helpful from the perspective of your projects. [Audience answer, partly inaudible.]

Excellent. OK, so we have social media APIs, we have air quality, we have genealogies now.

[Audience:] There's also the API for the bicycles, the live bike-share data. They have a live feed: every second, when you take or return a bike, it's actually recorded, and you can see it. [Speaker:] Oh, that's very interesting. That's cool. [Audience:] Yeah, I actually built a sort of automatic bot to grab the data every day and record it. And they have this for all public data in Spain, all the different open data. [Speaker:] Yeah, OK. So we have transport, and also public open-data APIs. Any other experiences?
[Audience:] Yeah, the Google Perspective API. [Speaker:] Ah. So, can you tell us a bit more about what it does? [Audience, partly inaudible:] It's an API they've developed, I think mostly for content moderation... it allows users to... some politicians... and they have been developing algorithms to classify speech... [Speaker:] Yeah, so it's a data analysis API. And that's also an interesting aspect here, because so far we've had a lot of APIs where you're just retrieving information from them, but in your case it's also doing some additional analysis, right? It's analysing the input. So you can have APIs that are storing information that you can collect, right? You can pull information from an API, but you can also make a request to an API to analyse something for you, which I think is also an interesting dimension here. By the way, I would also suggest, for those of you who have talked about the APIs you're using: if you don't mind, could you put links on the Slack, so that other people also have a sense of what these APIs are? [Audience:] There's also the OpenCorporates API; OpenCorporates allows you to query any firm name. [Speaker:] Yeah. Mm-hmm.
[Audience, partly inaudible:] ...so a lot of the problems with matching firm names, where the same firm might be written in different ways, go away, because it allows you to do fuzzy matching on the name... [Speaker:] OK, so that's... [inaudible]. Oh, wow.

[Audience:] Also, the YouTube API. You can retrieve videos, even with the subtitles or captions. [Speaker:] Sure, sure. And have you used that in your own work? [Audience, partly inaudible:] Yes... [exchange about research using the API]. [Speaker:] Right, right. So this is an example where the YouTube API is actually still quite rich in terms of providing you a lot of information. I haven't used it myself, but what is the credentialing process like? Is it fairly straightforward? Can anyone with a YouTube account do it? [Audience:] Yeah, yeah. [Speaker:] And does anyone else have any other interesting examples from their own work? Oh, the Google Translate API. Google Translate, you could very easily fetch that, right? Yeah.
[A reminder about repeating questions for the livestream.] Yes, yes, OK, thank you. Anyway, it's hard to be spontaneous and have a discussion while at the same time being on a livestream.

So, the reason I asked all of you, in addition to this being a community-sharing exercise, is to also illustrate that web APIs exist both for data extraction and for data analysis, right? So we had Translate, and there was the Perspective API. The other thing that's cool about them is that they exist for all sorts of domains, right? We have social media APIs, we have APIs that are providing air quality data, we have genealogies. So we have a real range. And they also provide all sorts of formats: when you make GET requests to them and ask for some information, you can get information in all sorts of forms. Often you get some kind of JSON, which will be, as I said, this format where things are text with lots of curly brackets and lots of quotation marks. But at the same time, sometimes you can also get different kinds of formats.

But what is an API, for those who haven't used one? An API is simply a software intermediary that allows two applications to talk to each other. That is it, very generally: an API is just allowing two applications to talk to each other. And to give an example, think about a web API.
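Those curly brackets and quotation marks parse straight into ordinary data structures. A minimal sketch, with an invented air-quality-style JSON payload:

```python
import json

# An invented API response body: JSON is just text with curly brackets and
# quotation marks, which loads into plain dictionaries and lists.
body = '{"city": "Paris", "aqi": 42, "readings": [{"hour": 0, "pm25": 12.5}, {"hour": 1, "pm25": 14.0}]}'

data = json.loads(body)
print(data["city"])                 # Paris
print(data["readings"][1]["pm25"])  # 14.0
```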
More specifically, a web API allows a client, a computer, to ask another computer, usually a server, for some kind of resource over the internet. So think about, for example, these aggregator websites like Momondo or Expedia or Kayak, right? They themselves don't actually have the availability of flights on a particular date, and all this information, on their own systems, on their own servers. But at the same time, they still access that information and show it to you when you make a query to them. So how are those two websites interacting? Through an API. There's a structured bit of exchange going on between, for example, Momondo and British Airways when you make that query, and that's usually happening through some kind of API. That kind of exchange is then what you see; your final product has been enabled through that API.

What's cool about a lot of modern APIs is that they adhere to standards that make data exchange between them programmatically quite extensible, structured and generally safe. And another thing that I think is kind of interesting and cool about APIs is that they generally require some kind of credentialing process, right? So they have some kind of key that you need, or some kind of authentication process that you have to go through, which in some sense gives this more legitimacy than you would have if you were just crawling or scraping a web page.
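To make the credentialing concrete, here is a sketch of where a key typically travels in a request. The endpoint, the parameter names and the Bearer token scheme are all assumptions for illustration (every real API documents its own conventions), and nothing is actually sent; the request object is only constructed.

```python
import urllib.parse
import urllib.request

# Hypothetical credentialed request: the provider issues the key during the
# credentialing process, and it travels with every request, here as a header.
API_KEY = "YOUR-KEY-HERE"  # placeholder; never hard-code a real key
params = urllib.parse.urlencode({"country": "DE", "metric": "air_quality"})
url = f"https://api.example.com/v1/data?{params}"

req = urllib.request.Request(url, headers={"Authorization": f"Bearer {API_KEY}"})
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending it would be a single `urllib.request.urlopen(req)` call; some APIs instead expect the key as a query parameter, so check the provider's documentation.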
So sometimes people like to draw a distinction between, you know, an API-based data strategy and one where you've just scraped something off the web. [Audience question, inaudible.] Well, we'll get to the exercise, OK.

So, that was my bit about what an API is. Really, starting around the 2000s, there was this whole development of standards and an explosion in web APIs; there was really an expansion in the number of APIs that are available. And a good resource is programmableweb.com/apis, where you can get a sense of all the existing web APIs that are available for all sorts of purposes. According to Chris, this lists about twenty thousand and counting, and it tells you about all the categories of APIs. And you'll see a lot of them are from the mid-2000s, which is when this kind of boom in APIs occurred. And now, I would think, we're sort of reaching a plateau, where there isn't really an expansion happening at the same rate. And maybe, we don't know, but there might even be, as I was saying earlier, a post-API age, where there might be more challenges to data access.

So, most web APIs tend to use Hypertext Transfer Protocol, or HTTP, methods, which you are familiar with, because when you are making any query, when you type in a URL, that's going to be HTTP; that's what it starts with.
261 00:28:23,090 --> 00:28:32,600 And HTTP is the network protocol that delivers virtually all files over the internet, over the World Wide Web. 262 00:28:32,600 --> 00:28:42,830 So HTML files, image queries, everything. And the nice thing about HTTP methods is that there are these verbs, GET, 263 00:28:42,830 --> 00:28:52,450 POST, DELETE and a few others, that are used to interact and make requests with servers. 264 00:28:52,450 --> 00:29:00,260 So I think two of the most commonly used ones are GET and POST. GET is used when you are requesting a page, and POST 265 00:29:00,260 --> 00:29:05,030 is usually when you're submitting something to a server, when you are submitting a form to a server, 266 00:29:05,030 --> 00:29:14,180 that's a POST request. And the httr package in R is, in my experience, a great way to work with this. 267 00:29:14,180 --> 00:29:21,560 It's very useful to work with all kinds of APIs, and you can make a request to a URL and get a response. 268 00:29:21,560 --> 00:29:31,400 And the verb for that is GET. And the response, all responses, will have a status, a header and a body. 269 00:29:31,400 --> 00:29:36,020 So here, for example, I made a simple request with httr. 270 00:29:36,020 --> 00:29:40,220 I'll show you in the next slide what this is. 271 00:29:40,220 --> 00:29:45,710 But I made a request to this http.cat API. 272 00:29:45,710 --> 00:29:49,910 And then I just queried, what is its 273 00:29:49,910 --> 00:29:53,570 status code? And it said 200. 274 00:29:53,570 --> 00:29:59,210 And for those of you who have used HTTP, 200 means that this was a successful request. 275 00:29:59,210 --> 00:30:06,860 So 200 is the status code for success, and 400 is like bad, right? 400 means no: 276 00:30:06,860 --> 00:30:13,400 somehow something went wrong.
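[Editor's note: the request just described can be sketched with the httr package. This is a minimal sketch, assuming the http.cat status-cat site mentioned in the lecture; the exact command from the slides is not reproduced in the transcript.]

```r
# A simple GET request with httr, then inspecting the response's
# status, header and body, as described above.
library(httr)

resp <- GET("https://http.cat/200")   # assumed URL for the cat API

status_code(resp)              # 200 means the request succeeded
headers(resp)[["content-type"]]  # the body here is an image, not JSON
```

A 4xx status code would instead signal a problem such as missing authorisation.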
You didn't get access; a 400-level error usually means unauthorised, or some problem with your credentials. 277 00:30:13,400 --> 00:30:19,160 Something's wrong. And then I just asked, well, what is the content of this? 278 00:30:19,160 --> 00:30:22,910 And I get these kinds of numbers, because this is what this is, right? 279 00:30:22,910 --> 00:30:28,110 I made a query to the cat API. And what's cool about this is, so this is a joke, right? 280 00:30:28,110 --> 00:30:36,170 They have all the error codes with cat photos. It's the cat API. Has anyone else seen this before? No? OK. 281 00:30:36,170 --> 00:30:37,880 They also have a dog API. 282 00:30:37,880 --> 00:30:43,310 So yesterday, when I was having dinner, my partner said, why didn't you have a dog picture instead? Dogs are cuter than cats. 283 00:30:43,310 --> 00:30:45,920 And I was like, no, I know the cat API. Anyway. 284 00:30:45,920 --> 00:30:54,500 So the nice thing about this, my point in showing you the cat API, is first: 285 00:30:54,500 --> 00:31:00,290 that was a very simple command, but you're making requests for different kinds of data in different formats. 286 00:31:00,290 --> 00:31:05,030 So this is not a JSON object, but you got a cat. 287 00:31:05,030 --> 00:31:15,790 And that cat is an image. Anyway. So. There's the Facebook Marketing API, which is an example, again, of a social media API that I use; 288 00:31:15,790 --> 00:31:20,280 I describe some papers that have used this. 289 00:31:20,280 --> 00:31:28,890 And that's an example, again, of a social media API that we might want to use and could interact with using the httr package in R. 290 00:31:28,890 --> 00:31:33,090 And in order to do it, the credentialing involved is: you need to have a Facebook account. 291 00:31:33,090 --> 00:31:38,730 You need to have a marketing app with a token and an ad account number, or "act".
292 00:31:38,730 --> 00:31:43,530 So rather than opening my Facebook account here and showing you how to get these things, 293 00:31:43,530 --> 00:31:50,020 which would involve me also showing you my own credentials, I suggest, if you want to get these credentials: 294 00:31:50,020 --> 00:31:54,780 Sofia Gil-Clavel from the Max Planck Institute for Demographic Research has very 295 00:31:54,780 --> 00:31:59,160 nicely and helpfully prepared a tutorial on how you can get these credentials. 296 00:31:59,160 --> 00:32:03,840 So I don't have to reveal my Facebook timeline and history here; 297 00:32:03,840 --> 00:32:11,490 I suggest you go to this link if you're interested in getting these credentials to set this up. 298 00:32:11,490 --> 00:32:15,630 But essentially, if you open your Facebook account, and this is perhaps 299 00:32:15,630 --> 00:32:19,350 a legitimate reason to open your Facebook account at this point in the lecture, 300 00:32:19,350 --> 00:32:25,080 if you want to, you know, if you go up to the top, on the right-hand side, 301 00:32:25,080 --> 00:32:35,820 if you click the little arrow, a dropdown menu where it says marketing, or marketing platform, I don't know exactly what it says, 302 00:32:35,820 --> 00:32:40,920 but if any of you have a Facebook account that you want to open, you want to see, on the right-hand side, 303 00:32:40,920 --> 00:32:44,880 there's a part where you can see "advertising on Facebook", is what it says. 304 00:32:44,880 --> 00:32:49,170 And if you click on that link, advertising on Facebook, 305 00:32:49,170 --> 00:33:00,180 this is generally what you get in the end. You make a few more next steps, and then this is what you get. 306 00:33:00,180 --> 00:33:06,330 From that, on the front end, this is what you see, and what this tells you is that you can create a new audience here, right?
307 00:33:06,330 --> 00:33:12,100 And you can say, so yesterday night I created an audience. So at the moment, 308 00:33:12,100 --> 00:33:15,750 there's this Cricket World Cup going on, and I'm very excited about it. 309 00:33:15,750 --> 00:33:18,390 But I didn't watch it yesterday. I was very sad. 310 00:33:18,390 --> 00:33:24,960 And then I was thinking, oh, I wonder if other people in Oxford are interested in the Cricket World Cup? 311 00:33:24,960 --> 00:33:28,920 And then I did a query request yesterday to see, oh, 312 00:33:28,920 --> 00:33:35,790 what is the size of the audience that might be interested in cricket and the Cricket World Cup on Facebook? 313 00:33:35,790 --> 00:33:41,880 How many Facebook users are interested in cricket and the Cricket World Cup in Oxford? 314 00:33:41,880 --> 00:33:45,810 So I just said men, women, all, ages 18+. 315 00:33:45,810 --> 00:33:52,260 And it turns out, according to Facebook, thirty-eight thousand people are interested in cricket on Facebook, which I was kind of surprised by. 316 00:33:52,260 --> 00:34:00,390 I thought it would be a lower number. So, yeah. This is an example of basically making a targeted audience 317 00:34:00,390 --> 00:34:08,820 on Facebook. Now, this is telling you how many users meet certain characteristics: by geography, 318 00:34:08,820 --> 00:34:13,620 by age, by gender, and then actually a range of different other kinds of detailed targeting. 319 00:34:13,620 --> 00:34:18,600 This was just a very casual request, just looking at interests. 320 00:34:18,600 --> 00:34:22,110 But you could also do this by things like parental status. 321 00:34:22,110 --> 00:34:30,060 So in the tutorial that I've prepared to accompany this lecture, which I think is also on the website for SICSS Oxford, 322 00:34:30,060 --> 00:34:35,370 you'll see an Rmd file where I link to some papers that have used this data source for research.
323 00:34:35,370 --> 00:34:45,690 And you can see how they use different behaviours or interest categories to create measures and variables with this. 324 00:34:45,690 --> 00:34:51,510 So this is the front end. This is what it looks like, right? This is what the numbers look like. 325 00:34:51,510 --> 00:34:56,670 But actually, programmatically, we can make a query to this API to retrieve these 326 00:34:56,670 --> 00:35:04,650 ad audience estimates. These ad audience estimates are just measures of how many Facebook users match certain characteristics. 327 00:35:04,650 --> 00:35:09,090 And you can do pretty detailed estimates, or you can do very aggregate estimates. 328 00:35:09,090 --> 00:35:14,370 But it's possible to obtain these programmatically using the Marketing API. 329 00:35:14,370 --> 00:35:17,580 And what these are, essentially, is targeting specifications. 330 00:35:17,580 --> 00:35:24,270 You can specify what categories you want to target and get cell sizes or counts for, and this website 331 00:35:24,270 --> 00:35:28,350 is actually quite well detailed in terms of the information. 332 00:35:28,350 --> 00:35:35,760 It's very well documented, and it tells you how different things are operationalised. 333 00:35:35,760 --> 00:35:39,240 The one drawback, though, in my experience of working with this API, 334 00:35:39,240 --> 00:35:47,280 is that it changes all the time, and you go through many versions very quickly, so that if you have a previous base 335 00:35:47,280 --> 00:35:49,920 URL that you used, a syntax, 336 00:35:49,920 --> 00:35:57,750 things change, and then the query structure changes, and then you have to essentially be mindful of that, or targeting specifications get moved around. 337 00:35:57,750 --> 00:36:04,920 So what used to be classified as an interest before becomes a behaviour, and so then your old code might not work.
338 00:36:04,920 --> 00:36:09,930 If you had done a query, say, with an older version, it might not work with the newer version. 339 00:36:09,930 --> 00:36:15,240 Yeah. How do you deal with this in your Digital Gender Gaps project, which essentially does this every day? 340 00:36:15,240 --> 00:36:25,110 Yeah. So the one way we deal with this is that there's a person who works on the project, and when he notices something is wrong, he fixes it. 341 00:36:25,110 --> 00:36:27,300 But that's a really good point, and it's a really good question. 342 00:36:27,300 --> 00:36:36,180 And that's one of the reasons we actually hired someone for this: it became clear that if you want to do this long term, 343 00:36:36,180 --> 00:36:41,820 if you actually want to do this kind of collection, or you want to have a platform or a dashboard that is actually doing this 344 00:36:41,820 --> 00:36:44,160 regularly, you need to, you know, 345 00:36:44,160 --> 00:36:51,180 you need to have someone who is dedicated to actually monitoring these kinds of changes, because otherwise it would not work. 346 00:36:51,180 --> 00:36:54,300 And even then, even when you have someone working on this, you know, 347 00:36:54,300 --> 00:37:01,470 sometimes you don't realise that the change has occurred, and then you notice two days later that the data were not collected, and why were they not collected? 348 00:37:01,470 --> 00:37:05,460 So yeah, it's a bigger issue. And actually, this has been an issue. 349 00:37:05,460 --> 00:37:10,170 I don't know if others have had this issue, that sometimes, you know, things change and you don't know what changed. 350 00:37:10,170 --> 00:37:15,420 And so, yeah, go ahead. Do you have to put down a deposit before spending any money on Facebook? 351 00:37:15,420 --> 00:37:22,140 Yeah, good question.
So, about actually launching the ads: to actually run an ad, 352 00:37:22,140 --> 00:37:28,650 you need money, but just querying the counts doesn't cost money. 353 00:37:28,650 --> 00:37:30,480 So if you were to run an ad, 354 00:37:30,480 --> 00:37:37,110 if I actually went and ran an ad for these thirty-eight thousand people in Oxford who are interested in cricket, 355 00:37:37,110 --> 00:37:45,910 then yes, I would be paying. But I just asked for the count, and that's free. 356 00:37:45,910 --> 00:37:53,800 Now, all APIs have rate limits, and this is a concept where, if you query too much, too fast, 357 00:37:53,800 --> 00:38:02,230 then you are basically being disrespectful, and your access might be compromised as a result of that. 358 00:38:02,230 --> 00:38:09,940 So what rate limiting means, essentially, is that if you wanted to extract a lot of queries, it would take a lot of time. 359 00:38:09,940 --> 00:38:15,250 But rate limiting also means that, you know, if you do stick within the rules of the game, it's fine. 360 00:38:15,250 --> 00:38:24,100 There are also different levels, tiers, that we found, so we were able to improve our status for Digital Gender Gaps to go from a, 361 00:38:24,100 --> 00:38:29,290 I don't know what the tier is officially called, but we went to a higher tier, 362 00:38:29,290 --> 00:38:33,580 which allowed us to have a shorter gap between queries. 363 00:38:33,580 --> 00:38:37,540 And the reason we were able to do that was because we went through, again, an additional authentication 364 00:38:37,540 --> 00:38:43,960 process, where we explained why we were collecting these data by making a little video about it. 365 00:38:43,960 --> 00:38:52,270 So that was an additional step to reduce the time between queries to five seconds, I think, instead of eight or nine seconds.
366 00:38:52,270 --> 00:38:58,470 So that's a strategy that might allow you to query faster. 367 00:38:58,470 --> 00:39:03,480 So these links are very helpful in telling you about how the API keeps changing. 368 00:39:03,480 --> 00:39:16,690 And this is a problem that we've encountered. So at this point, for version 3.3, which is the version of the API at the moment, 369 00:39:16,690 --> 00:39:20,500 this is the general structure of the URL. 370 00:39:20,500 --> 00:39:27,880 It's graph.facebook.com, slash, the version of the API, slash, "act_" followed by your ad account number, 371 00:39:27,880 --> 00:39:34,750 slash, delivery_estimate. And the delivery estimate is basically you saying what kind of estimate you want: 372 00:39:34,750 --> 00:39:37,750 what kind of customers do you want, anyway. 373 00:39:37,750 --> 00:39:43,310 And then you have to authenticate yourself with a token, and that's the token that I was telling you about. 374 00:39:43,310 --> 00:39:51,700 If you go to the tutorial that Sofia has prepared, you'll be able to see how to get those tokens for yourself. 375 00:39:51,700 --> 00:40:00,280 But again, we already discussed this: it's important to remember rate limiting when working with these APIs. 376 00:40:00,280 --> 00:40:04,690 Is the tutorial online? Yeah, yeah. 377 00:40:04,690 --> 00:40:15,790 So I'll just go to that. But before I go to the tutorial, the HTML file, I just wanted to say that, as a general principle, 378 00:40:15,790 --> 00:40:23,440 this kind of targeted advertising count, of how many users on a platform meet certain characteristics, 379 00:40:23,440 --> 00:40:29,110 this kind of online advertising data, as a general principle, is not just limited to Facebook. 380 00:40:29,110 --> 00:40:37,780 It also exists for Google. It also exists for Twitter, for instance. So this is an example of Google AdWords here.
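[Editor's note: the URL structure described above can be sketched in R as follows. The token and ad account number are placeholders, not real credentials; see Sofia's tutorial for how to obtain real ones. Version v3.3 is the one named in the lecture and has since been superseded.]

```r
# Building the general URL structure of the Marketing API's
# delivery_estimate endpoint: graph.facebook.com / version / act_<id>
fb_version <- "v3.3"            # the API version current at the time
fb_act     <- "act_123456789"   # placeholder ad account number
fb_token   <- "YOUR_TOKEN"      # placeholder access token

base_url <- sprintf("https://graph.facebook.com/%s/%s/delivery_estimate",
                    fb_version, fb_act)
base_url
# "https://graph.facebook.com/v3.3/act_123456789/delivery_estimate"
```

The token is then passed along with the request parameters rather than in the path.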
381 00:40:37,780 --> 00:40:42,700 This is AdWords, trying to start an advertising campaign. 382 00:40:42,700 --> 00:40:49,150 So these are the data available for advertisers. This is me trying to start a new advertising campaign. 383 00:40:49,150 --> 00:40:56,620 And I'm saying create a campaign, and I've chosen gender, age, parental status, household income. 384 00:40:56,620 --> 00:41:05,410 Household income as a targeting criterion is only available for the US and a few other countries, anyway. 385 00:41:05,410 --> 00:41:15,280 So I chose everything. And then you notice here that it says your targeting reaches 10 billion. 386 00:41:15,280 --> 00:41:19,750 Now, I'm a demographer. I know that there are not 10 billion people in the world. 387 00:41:19,750 --> 00:41:24,970 So why is Google claiming there to be 10 billion people? So what is the problem here? 388 00:41:24,970 --> 00:41:29,620 So it turns out that, unlike Facebook, which gives you estimates of users, 389 00:41:29,620 --> 00:41:39,340 what AdWords is giving you is estimates of impressions. And impressions is vaguely defined; it is defined in their documents 390 00:41:39,340 --> 00:41:45,140 as just how many times an ad would be seen by someone. 391 00:41:45,140 --> 00:41:50,920 Right? So how many eyes would it encounter? And actually, in the context of our gender gaps work, 392 00:41:50,920 --> 00:41:54,550 this is interesting, because I haven't included a figure here about this.
393 00:41:54,550 --> 00:41:59,710 But when we look at the sex ratio, the Facebook gender gap, 394 00:41:59,710 --> 00:42:07,060 we find that the gender gap, or gender inequality, as estimated by a ratio of Facebook users, 395 00:42:07,060 --> 00:42:11,830 if we had a measure of digital inequality as measured by the ratio of female to male Facebook users, 396 00:42:11,830 --> 00:42:19,000 is actually much more optimistic about gender equality than the Google AdWords-based measure, 397 00:42:19,000 --> 00:42:22,690 which suggests that there is a greater gap between men and women. 398 00:42:22,690 --> 00:42:28,840 And part of the story there could just be that, you know, if men have more time on their hands for leisure activities, 399 00:42:28,840 --> 00:42:34,480 they might be spending it online, or they might be surfing much more online, relative to women. 400 00:42:34,480 --> 00:42:39,160 Women might have internet access, they might be using it, but they might not be using it as much. 401 00:42:39,160 --> 00:42:42,900 And as a result, the impressions they create are not as many. So, 402 00:42:42,900 --> 00:42:49,960 even though the two measures, the Facebook gender gap and the AdWords gender gap, generally have a very high correlation, for the same country 403 00:42:49,960 --> 00:42:56,380 the Facebook measure tends to be much higher than the Google AdWords gender gap. 404 00:42:56,380 --> 00:42:59,800 That's what we've been finding in our work here. And in general, 405 00:42:59,800 --> 00:43:06,460 we find that when we're trying to estimate internet use gender gaps and validating against survey data, 406 00:43:06,460 --> 00:43:11,860 we find that Facebook does better on its own than AdWords. 407 00:43:11,860 --> 00:43:20,230 But the best models are those that combine Facebook and Google AdWords when we're trying to make a prediction about internet use gender gaps.
408 00:43:20,230 --> 00:43:27,170 Anyway, so what I want to do now is just run through quickly the tutorial on how to use the 409 00:43:27,170 --> 00:43:44,100 API from within R. So this is the Digital Gender Gaps website, and you can see the reports for the different days. 410 00:43:44,100 --> 00:43:50,520 And you can see, on different days, how it's evolved. You can also see how we tried to estimate the mobile phone gender gap, and so on. 411 00:43:50,520 --> 00:43:57,800 But, you know, where is it? OK. 412 00:43:57,800 --> 00:44:08,180 How can I go back down with this? We'll take this here. 413 00:44:08,180 --> 00:44:18,810 OK, so. What I've done in this tutorial is just give you an example of how to get estimates from the 414 00:44:18,810 --> 00:44:27,670 Marketing API using your R console, and you don't really actually need that many packages for it. 415 00:44:27,670 --> 00:44:40,260 You just need httr and jsonlite, because the query is returned in JSON format, and then you will put in your token and your act number. 416 00:44:40,260 --> 00:44:45,680 And as I said, the general structure of the URL that you're making a call to is quite straightforward: 417 00:44:45,680 --> 00:44:55,980 graph.facebook.com, the version, your credentials, and then the key information is going to be this targeting spec, right? 418 00:44:55,980 --> 00:45:01,320 So here is the simplest targeting spec. The targeting spec is specified as a JSON array, 419 00:45:01,320 --> 00:45:08,550 so you have to specify it in these kinds of curly brackets, but I'm dealing with it as a string here. 420 00:45:08,550 --> 00:45:10,740 Although, if you had a more complex array, 421 00:45:10,740 --> 00:45:20,520 you might not want to deal with it as a string, and you might want to work with it as a separate JSON object and then import that into R. 422 00:45:20,520 --> 00:45:24,860 Good question.
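[Editor's note: the string-versus-JSON-object point above can be sketched as follows; the country code GB is taken from the lecture's example, and jsonlite is the package named in the lecture.]

```r
# A simple targeting spec held as a string. jsonlite can confirm that
# the string is valid JSON before you send it in a query, and parse it
# into an R list if you prefer to manipulate it as an object.
library(jsonlite)

targeting_spec <- '{"geo_locations": {"countries": ["GB"]}}'

validate(targeting_spec)   # TRUE: the JSON parses
fromJSON(targeting_spec)   # the same spec as a nested R list
```

For more complex specs, building the list in R and converting it with `toJSON()` is usually less error-prone than editing a long string by hand.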
423 00:45:24,860 --> 00:45:32,180 So, Jack's question is, can you get historical queries? Can you go back in time and see how many users there were in the past? 424 00:45:32,180 --> 00:45:40,400 And actually, that's a big limitation, unlike Twitter, where at least you could theoretically go back at least a few days in the past with the 425 00:45:40,400 --> 00:45:47,210 free access that you get to the Twitter API, and even longer if you had paid access. 426 00:45:47,210 --> 00:45:53,350 Although, I don't know, does the standard REST API let you go back seven days, or how many? 427 00:45:53,350 --> 00:45:59,230 Yeah, seven days or something. But you can get premium access and go back further. 428 00:45:59,230 --> 00:46:07,360 And there are some places that have access to the firehose, I guess, who have access to everything. So this is a big limitation of this data source. 429 00:46:07,360 --> 00:46:09,790 That's actually one of the reasons why, in Digital Gender Gaps, 430 00:46:09,790 --> 00:46:17,890 we're collecting it prospectively, because we're interested in looking at change over time. 431 00:46:17,890 --> 00:46:27,440 [INAUDIBLE question about paying for access.] 432 00:46:27,440 --> 00:46:34,620 And there's also a certain university. But yeah. 433 00:46:34,620 --> 00:46:46,110 Now, is anyone here an active Twitter user, I mean an active Twitter user for research? 434 00:46:46,110 --> 00:46:49,890 I mean, because the standard REST API lets you go back a few days, 435 00:46:49,890 --> 00:46:57,600 but at some universities you have access to what's called the Twitter firehose, which is a very large sample of, I think, all tweets ever. 436 00:46:57,600 --> 00:47:03,660 All tweets, ever. Yeah. And there might be a few other universities that have access to that. 437 00:47:03,660 --> 00:47:10,380 And that means that,
And again, this is coming back to the point I was making in the first part, where one of the reasons why we've 438 00:47:10,380 --> 00:47:17,280 seen so much digital trace data research use Twitter is because Twitter has been actually very open, 439 00:47:17,280 --> 00:47:23,280 unlike other platforms that have been much less open. So I think so. 440 00:47:23,280 --> 00:47:27,270 Does Twitter publish this, like Facebook? 441 00:47:27,270 --> 00:47:31,260 Not this information, no, no. 442 00:47:31,260 --> 00:47:37,080 But that's actually one of the reasons why we're going forward in time: to try and get around some of those limitations. 443 00:47:37,080 --> 00:47:43,000 Of course, Facebook, I'm sure, has it, but it's a different issue. 444 00:47:43,000 --> 00:47:51,460 But yeah. So this is a very simple query, where I've just said all people in GB, so in the UK. 445 00:47:51,460 --> 00:47:58,800 And if you don't specify any other targeting specifications, then essentially the others will be left at their defaults. 446 00:47:58,800 --> 00:48:04,900 So in the documentation, it'll tell you what the defaults are for all of the other specifications. 447 00:48:04,900 --> 00:48:16,120 And then I've just created a list here, where I've said, these are just some attributes that I want to specify in my query to the API. 448 00:48:16,120 --> 00:48:20,570 I'm saying that I want to get a reach estimate. That's kind of my optimisation goal. 449 00:48:20,570 --> 00:48:22,540 I want to just see what's the reach of my ad, 450 00:48:22,540 --> 00:48:31,250 which is the equivalent of, like, how many people would see this ad, or what would be the audience size of this ad? 451 00:48:31,250 --> 00:48:35,000 I've also said, for instance, that the method is GET, 452 00:48:35,000 --> 00:48:40,700 which is the GET verb from the HTTP protocol: I'm just extracting, I'm requesting this estimate.
453 00:48:40,700 --> 00:48:44,030 I don't want to make any changes to anything, so it's just getting it. 454 00:48:44,030 --> 00:48:48,080 And I'm saying this is a targeting specification, and that's my targeting specification: 455 00:48:48,080 --> 00:48:53,660 I just want to know how many Facebook users there are in the UK. 456 00:48:53,660 --> 00:49:05,600 And then I make that request using the httr package, and then I just say, tell me, and I parse this content as text. 457 00:49:05,600 --> 00:49:10,520 Because, as I said, the output is JSON. And then this is what I get from it: 458 00:49:10,520 --> 00:49:21,980 I get a list, and one element in this list is this data frame, and you can see that it gives you these columns. 459 00:49:21,980 --> 00:49:30,170 It gives you estimate_dau, estimate_mau and estimate_ready. estimate_dau refers to daily active users, 460 00:49:30,170 --> 00:49:37,190 estimate_mau is the monthly active users, and estimate_ready just means that this was a successful request. 461 00:49:37,190 --> 00:49:42,260 So, for example, if you were doing some error handling, which you might want to do if you are making a range of requests, 462 00:49:42,260 --> 00:49:46,700 you might want to check whether this estimate_ready is equal to true or not, 463 00:49:46,700 --> 00:49:52,460 if you wanted to just confirm that you were actually getting a plausible response. 464 00:49:52,460 --> 00:49:59,300 This is another issue we encounter a lot: we get a lot of implausible responses. 465 00:49:59,300 --> 00:50:07,400 So, for example, we get a lot of thousands, and we know that, for example, the population of the US doesn't have only a thousand Facebook users; 466 00:50:07,400 --> 00:50:14,390 a thousand is generally the smallest cell count that they're willing to reveal information for, for the monthly active users.
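[Editor's note: the parsing and checking steps just described can be sketched offline. The JSON below is a mock with the same field names as the delivery_estimate reply discussed in the lecture; the numbers are invented for illustration.]

```r
# Parsing a delivery_estimate-style response with jsonlite, then the
# error-handling check suggested in the lecture: is the estimate ready?
library(jsonlite)

mock_json <- '{"data": [{"estimate_dau": 25000000,
                         "estimate_mau": 40000000,
                         "estimate_ready": true}]}'

est <- fromJSON(mock_json)$data   # a one-row data frame

est$estimate_dau     # daily active users matching the spec
est$estimate_mau     # monthly active users matching the spec

# Confirm the response is usable before storing it
stopifnot(isTRUE(est$estimate_ready))
```

With a live request, the same parse would follow `content(r, as = "text")` on the httr response.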
467 00:50:14,390 --> 00:50:20,270 But at the same time, a thousand can also sometimes just be thrown up in error. It doesn't happen that much, 468 00:50:20,270 --> 00:50:28,430 but sometimes it happens that we get unexplained thousands, even though in the same country the previous day it was not a thousand. 469 00:50:28,430 --> 00:50:35,390 So it seems to be some kind of an error. And I've talked with others about it, and they feel that this happens, but it doesn't happen that much. 470 00:50:35,390 --> 00:50:41,720 But it happens enough that, if you're doing a lot of queries, you should probably think about it. 471 00:50:41,720 --> 00:50:50,190 Another thing you might want to do, of course, going back again to HTTP status codes, is check whether the HTTP code is equal to 200. 472 00:50:50,190 --> 00:50:54,830 So was this a successful request or not? 473 00:50:54,830 --> 00:50:58,340 If you were just trying to collect a lot of queries and wanted to 474 00:50:58,340 --> 00:51:04,370 make sure that you had successful requests. So this was, just as I said, 475 00:51:04,370 --> 00:51:10,040 giving you the DAU, the MAU. You might want to make a more complex query. 476 00:51:10,040 --> 00:51:16,580 So here I put in age_min and age_max. 477 00:51:16,580 --> 00:51:26,870 So here I said, I want to query GB and ES, so the UK and Spain. 478 00:51:26,870 --> 00:51:33,290 And only women in GB and Spain, between the ages of 20 and 55. 479 00:51:33,290 --> 00:51:37,880 So that was my query this time. And I also said those whose location type is home. 480 00:51:37,880 --> 00:51:43,370 So, Francesco was mentioning this yesterday: they allow you to specify whether you 481 00:51:43,370 --> 00:51:49,730 say you live in this place and are actually living there, or, alternatively, you might be commuting from somewhere else.
482 00:51:49,730 --> 00:51:53,150 So there are basically different location types that you can also specify. 483 00:51:53,150 --> 00:52:01,040 So these are people who essentially say that they're living in Oxford, and are also being seen as based in Oxford. 484 00:52:01,040 --> 00:52:06,620 So this is kind of the home measure. Then the device platforms: mobile and desktop. 485 00:52:06,620 --> 00:52:15,230 So I put in some more specifications, but the query structure is exactly the same; just the JSON array becomes a little bit longer. 486 00:52:15,230 --> 00:52:25,440 And then I see that I get forty-two million users, and, I don't know if this is updated, 487 00:52:25,440 --> 00:52:30,260 I guess it's only women, so it's fine. I was like, this seems kind of small: 488 00:52:30,260 --> 00:52:36,140 forty-two million for women twenty to fifty-five in GB and Spain. 489 00:52:36,140 --> 00:52:43,010 Another thing to note here is that, by default, targeting specifications combine as an AND query. 490 00:52:43,010 --> 00:52:47,420 So it will tell you GB and Spain together; it will not tell you GB or Spain separately. 491 00:52:47,420 --> 00:52:53,360 So if you wanted to know GB and ES separately, you would have to make two separate requests. 492 00:52:53,360 --> 00:52:58,010 And as a result, that's what I was saying: if you wanted to make a series of requests about different countries, 493 00:52:58,010 --> 00:53:04,490 you would just have to do this in a loop, and make sure that in your loop, to avoid getting rate limited, 494 00:53:04,490 --> 00:53:11,360 you would put in a wait of a certain number of seconds, maybe eight seconds, if you are in the first tier. 495 00:53:11,360 --> 00:53:21,270 So, another aspect of this API which is quite useful when you're working with it is that you have, 496 00:53:21,270 --> 00:53:26,160 so this was the actual GET, so this is actually when you want to extract the estimates.
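[Editor's note: the country-by-country loop with a pause, as described above, might look like the sketch below. The endpoint version, account number, token, gender code and the 8-second wait are assumptions based on the lecture, not working credentials.]

```r
# Looping over countries with separate requests, pausing between
# queries to stay within the rate limit, and keeping only responses
# that pass the checks discussed earlier (HTTP 200, estimate_ready).
library(httr)
library(jsonlite)

countries <- c("GB", "ES")
results   <- list()

for (cc in countries) {
  spec <- sprintf(
    '{"geo_locations": {"countries": ["%s"]},
      "genders": [2], "age_min": 20, "age_max": 55}', cc)  # 2 = women
  r <- GET("https://graph.facebook.com/v3.3/act_123456789/delivery_estimate",
           query = list(optimization_goal = "REACH",
                        targeting_spec   = spec,
                        access_token     = "YOUR_TOKEN"))  # placeholder
  if (status_code(r) == 200) {
    est <- fromJSON(content(r, as = "text"))$data
    if (isTRUE(est$estimate_ready)) results[[cc]] <- est
  }
  Sys.sleep(8)   # pause between queries; shorter gaps need a higher tier
}
```

Each iteration is an independent request, which is why per-country estimates require one call per country.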
497 00:53:26,160 --> 00:53:35,220 If you actually wanted to see what targeting criteria are available, you could use the search API of the Marketing API. 498 00:53:35,220 --> 00:53:44,940 And what I've done here, using a slightly different URL, is just a search. 499 00:53:44,940 --> 00:53:46,920 I've said, give me 500 00:53:46,920 --> 00:53:55,890 the ad targeting categories of class behaviours, and this is my token, and tell me what's available in these behaviours. 501 00:53:55,890 --> 00:54:03,060 And you see, this is just the list of the first 10 behaviours of the 300 that come in this table. 502 00:54:03,060 --> 00:54:10,020 So you can see that you can target frequent travellers, technology early adopters. 503 00:54:10,020 --> 00:54:17,670 Interesting: Facebook access by operating system, Mac OS, Windows XP, anyway. 504 00:54:17,670 --> 00:54:22,830 So device types, these kinds of bits of information, anyway. 505 00:54:22,830 --> 00:54:29,280 So this is just a way that you can start, if you wanted to extract some data from it. 506 00:54:29,280 --> 00:54:32,820 But again, there are many different APIs for many different kinds of things. 507 00:54:32,820 --> 00:54:40,260 This is just an example of one that I've used, and that's being used, for example, in 508 00:54:40,260 --> 00:54:44,400 this Digital Gender Gaps work, right? 509 00:54:44,400 --> 00:54:51,410 This is the data source that we're using. And so we'll come back to this.
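[Editor's note: the search call just described can be sketched as below; the exact parameter names follow the Marketing API's targeting-search endpoint as used at the time of the lecture, and the token is a placeholder.]

```r
# Querying the Marketing API's search endpoint for the available
# "behaviours" targeting categories, rather than audience estimates.
library(httr)
library(jsonlite)

r <- GET("https://graph.facebook.com/v3.3/search",
         query = list(type         = "adTargetingCategory",
                      class        = "behaviors",
                      access_token = "YOUR_TOKEN"))   # placeholder

behaviours <- fromJSON(content(r, as = "text"))$data
head(behaviours$name, 10)   # first ten behaviour categories
```

The same pattern, with a different `class` value, lists interests, demographics and other targeting categories.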
510 00:54:51,410 --> 00:54:57,770 So for the next maybe half an hour, before we go for dinner, 511 00:54:57,770 --> 00:55:05,720 I thought maybe we could work on this — this would be our exercise for the afternoon: familiarise yourself and obtain credentials 512 00:55:05,720 --> 00:55:16,040 to work with either the Facebook Marketing API that I just talked about, or potentially the Twitter API, 513 00:55:16,040 --> 00:55:23,090 which you can interact with directly through an R package. And Chris has a very helpful tutorial on how to use the rtweet package, 514 00:55:23,090 --> 00:55:31,180 which is linked here on the slides — actually, I can check if it's linked on the slides. 515 00:55:31,180 --> 00:55:40,280 Yes, it is linked on the slides. Or you can also just use it in some other way. 516 00:55:40,280 --> 00:55:44,510 I should note that Pablo Barberá, who's going to give a talk here on Monday, 517 00:55:44,510 --> 00:55:52,460 has also said that [INAUDIBLE] talk a bit about his own R packages, because he's developed similar packages for using Twitter in R, 518 00:55:52,460 --> 00:55:56,990 and [INAUDIBLE] give a demo of some of his packages on Monday as well. 519 00:55:56,990 --> 00:56:02,030 This is a different package from his. This one is, I think, by Mike Kearney. 520 00:56:02,030 --> 00:56:05,780 It's also a good package. Or another API of your choosing. 521 00:56:05,780 --> 00:56:11,480 So just for the next half an hour, if you wanted to work on familiarising yourself: 522 00:56:11,480 --> 00:56:18,770 get credentials and then maybe do some extraction, especially for those of you who have never worked with an API before. 523 00:56:18,770 --> 00:56:28,460 Just get familiar with working with one for the next half an hour, because of my plan for tomorrow. I don't know, should we break for that?
524 00:56:28,460 --> 00:56:33,300 Should we make the groups now, or...? 525 00:56:33,300 --> 00:56:39,020 We did in the morning. Yeah, but then I'd be revealing the strategy before the groups are made. 526 00:56:39,020 --> 00:56:46,110 Oh yeah. Well, OK. Anyway, so this is going to be the group exercise that we're going to work on in the morning. 527 00:56:46,110 --> 00:56:51,530 Divide yourselves into groups of four by counting off — which is why they didn't know about this in advance. 528 00:56:51,530 --> 00:56:58,350 So it's kind of random. Although now they might strategize on it anyway. 529 00:56:58,350 --> 00:57:06,300 So I want you to divide into groups of four and then work together to identify a research 530 00:57:06,300 --> 00:57:12,900 question that you believe could be answered using one of these three sources of digital trace data. 531 00:57:12,900 --> 00:57:19,500 I mean, there are many different sources, but I just want you to think about anything that could be done with one of these three. 532 00:57:19,500 --> 00:57:23,820 And think about it in the context of what question you would ask. 533 00:57:23,820 --> 00:57:29,010 But also think about research design, given that, you know, some of them have some strengths and some limitations, 534 00:57:29,010 --> 00:57:36,270 and then try and collect some preliminary data to test the feasibility of being able to answer your question. 535 00:57:36,270 --> 00:57:42,210 And think about how it could potentially be improved by some kind of hybrid design. 536 00:57:42,210 --> 00:57:46,770 So this is what we're going to work on tomorrow. 537 00:57:46,770 --> 00:57:55,600 Let's just sort out the groups at this stage so that there's no strategizing about this later. 538 00:57:55,600 --> 00:58:10,220 Yeah, and then we can also discuss the topic more at dinner. So, Jack, do you want to start us off by saying "one"? Or no one?
539 00:58:10,220 --> 00:58:16,520 This morning was the other day as part of this. How many are we willing to sacrifice? 540 00:58:16,520 --> 00:58:20,660 Do we want to have those people? I mean, what was our total number? 541 00:58:20,660 --> 00:58:33,730 I mean, I don't know — is it going to be one, two, three, four, five, six, seven, eight, nine, ten, eleven? 542 00:58:33,730 --> 00:58:39,780 I'm very confused about who I'm counting, so I will go with five. 543 00:58:39,780 --> 00:58:46,600 OK, that's fine. OK, fine. 544 00:58:46,600 --> 00:58:58,620 OK, so then those who can't be here afterwards — the auditors are not counted. 545 00:58:58,620 --> 00:59:03,790 Let me count again: one. 546 00:59:03,790 --> 00:59:17,050 Two, three, four, five. And then we have some who haven't counted — you count: one, two. 547 00:59:17,050 --> 00:59:21,010 You can't make it? OK. So we have a shortage. 548 00:59:21,010 --> 00:59:27,610 Then we have one group where there are only two, and the rest have more. 549 00:59:27,610 --> 00:59:33,060 Yeah. Are they not? 550 00:59:33,060 --> 00:59:38,200 OK, 551 00:59:38,200 --> 00:59:43,630 or we can just have some groups that are five and 552 00:59:43,630 --> 00:59:49,360 some that are four, because we can't spread them evenly — there seems to be an odd number. 553 00:59:49,360 --> 01:00:02,470 Yeah. OK, so group number two — arbitrary group number two — could one of you, the second person who said "two" — 554 01:00:02,470 --> 01:00:14,520 that's you — will you be in the group with the two who are unclaimed? 555 01:00:14,520 --> 01:00:16,820 I can see people that are in the same boat. 556 01:00:16,820 --> 01:00:24,560 Yeah, but they're split across both sides, so they're fine. Then OK, then we're OK. 557 01:00:24,560 --> 01:00:28,250 So there are some groups that are four, right? That's fine, then.
558 01:00:28,250 --> 01:00:33,800 OK, my problem is less complicated than I thought. OK, good. 559 01:00:33,800 --> 01:00:40,970 Perfect. So that's what we're going to work on tomorrow. This is going to be the question; you might want to interact with your group. 560 01:00:40,970 --> 01:00:46,670 And so there we have Group One, Group Two, Group Three, Group Four and Five. 561 01:00:46,670 --> 01:00:53,540 That's going to be the agenda for tomorrow, but the agenda until we break, until about 5:15 or maybe 5:30, 562 01:00:53,540 --> 01:01:01,510 is this. Yeah. 563 01:01:01,510 --> 01:01:10,140 It's on the website and on this slide — you know, this markdown thing is online. 564 01:01:10,140 --> 01:01:15,630 This one — it's on the website, it's called "tutorial". 565 01:01:15,630 --> 01:01:30,690 Yeah. Remind me, I need to ask about that before you go. 566 01:01:30,690 --> 01:01:36,870 That one script — it's not working, I think. 567 01:01:36,870 --> 01:01:40,650 OK, I'll show you where it is, though. Just click on it. 568 01:01:40,650 --> 01:01:43,800 No, no, this is for the rtweet one, not the one you were on. 569 01:01:43,800 --> 01:01:53,740 Just go to the Summer Institute page — the Princeton page. 570 01:01:53,740 --> 01:02:06,090 And this is the link for it. So it's this one. 571 01:02:06,090 --> 01:02:20,920 So this is the Twitter one. So this is using rtweet. Otherwise — 572 01:02:20,920 --> 01:02:27,890 otherwise, if you prefer using Pablo Barberá's packages, I can show you his. 573 01:02:27,890 --> 01:02:34,780 Here it is. Did you see it? 574 01:02:34,780 --> 01:02:41,090 It's called "Application Programming Interfaces in R". 575 01:02:41,090 --> 01:02:48,213 But they didn't do it in the tutorial; it's just that I can put it on the Slack — maybe I'll do that.