Welcome back from lunch. We wanted to make sure that we thank some of the people who have been supporting us at SICSS Oxford: the Russell Sage Foundation, the Alfred P. Sloan Foundation, the Oxford Van Houten Fund, the Social Sciences Division Teaching Development Awards, Nuffield College, and of course the sociology department here that we're in. That's just a short list of the groups that helped bring this together. So thank you.

Today we're going to talk about computational text analysis. My name is Taylor Brown. I'm a Ph.D. candidate at Duke and a visiting scholar at NYU. I know there are a lot of people in this room who have an interest in computational text analysis (we learnt that from the flash talks), but also a lot of experience. Because of that level of interest and expertise, this is not meant to be just me talking at you. I hope you ask questions and help answer questions. If I don't answer them properly, correct me; if I do something wrong, we can debug my code together. Let's just make sure this is interactive, so feel free to interrupt. And if you don't, I'll probably pause and ask you to.

I don't actually consider myself to be primarily a computational text analyst, but I have used the methods before and can speak to certain things. And I think one thing that's really neat about these methods is that it's pretty easy to go from beginner to the cutting edge in a short amount of time. When you look at the timeline of how computational text analysis has developed, a lot of the progress has been made just in the past couple of years. I didn't know how to code when I started my Ph.D., and I certainly didn't know a lot of these methods. Now I help on a project, the package on text networks we'll see later, that does something that's never been done before, but that also isn't that advanced or complex. So there's a lot of opportunity to contribute, especially in the social sciences.

The last thing I'll mention before we jump in: as I was preparing this, I decided to start a bibliography of foundational and really relevant computational text analysis citations, and I started a Slack channel called "text citations". You can join, and if you have anything to contribute you just put it there and I'll add it to the bibliography. At the end of SICSS we'll print it out, and you'll have what could be the start of a good syllabus or reading list. All the references in this presentation will be there as well.

But starting out:
What is computational text analysis? On the face of it, it's pretty straightforward, and in a lot of ways it is. But let's break it down. First: computational, having to do with computers, or using computers. The point being, amongst other things, that we now have a massive amount of data and not enough time to analyse all of this text ourselves. Computers help us do things much faster than we could on our own, things we could probably only do ourselves if we lived forever. But computers also help us do more complex things that we might have difficulty doing at all.

Text: this again seems straightforward, but the definition I came up with from reading other things is any object that can be read. Most of the time we think of this in terms of written or spoken language, and social and cultural scholars have shown (and I think our intuition agrees) that this sort of language contains within it a structure that reflects things like our groupings in society, morality, hierarchical structures of status, what's important to us, all of these sorts of things. That's one reason we focus on those types of texts. We also simply have a lot of them, because language is our primary mode of communication. But in my own work, for example, I look at artworks as texts: cultural objects that have meaning, where that meaning derives from things like colour, content and texture. I work with vectors that are just numeric, but I'm thinking of them in a very similar way to how we think about texts. So I would encourage you, as we go through this: maybe you're not a linguistic text analyst, but maybe your data could still be analysed with some of these methods. I was thinking this morning that I don't know of any case of someone using a topic model on something that isn't language text, and I don't know what would come out of that. But topic modelling was discovered more or less simultaneously in population genetics, something non-textual, and by David Blei, who is the one who's been cited a bazillion times for latent Dirichlet allocation. So originally it wasn't thought of as applying only to linguistic texts.

And then analysis. Hopefully you know what analysis is (if not, we'd need a whole other lecture), but there are probably a lot of definitions for this as well. I came up with: a systematic examination of the structure or mechanisms of something. So if we put all of this together, computational text analysis would be the systematic, computer-assisted examination of the structure or mechanisms of readable content. And that's great. But it also makes it sound a little bit boring.
In particular, as social scientists, we want to think about what analysis means to us. Hopkins and King, in their article on text analysis, say this: policy-makers or computer scientists may be interested in finding the needle in the haystack, such as a potential terrorist threat or the right webpage to display from a search, but social scientists are more commonly interested in characterising the haystack. So with topic modelling, for instance, we don't necessarily focus on getting the correct classification of an individual document; rather, we're interested in the distribution of documents across the corpus. Keep that in mind: as social scientists doing computational social science, we use these methods differently than computer scientists or policy-makers might.

Turning now to history, and we'll do this really quickly: that same Hopkins and King article notes that the Catholic Church tracked the proportion of non-religious printed texts in the sixteen hundreds, and they mention this as one of the first examples we have of word counting, or some sort of content analysis. My intuition is that there's probably a precursor in other parts of the world, whether the Middle East, Asia or Africa, but this is a Western lens and King and Hopkins were going in that direction. So content analysis started a long time ago. We usually talk about the modern era starting with Lasswell doing keyword counts; there are a lot of quotes from him on the intuition of content analysis of text for studying sociometric measures, measures of social dynamics. Those same keyword-count methods started to be used by social scientists in the nineteen-forties. And then, of course, Turing, in the context of World War Two and into the nineteen-fifties, applied early computing to text, trying to decipher foreign transmissions. As we move along, we get the first textbooks on content analysis, mainframe computers get applied to it, event coding begins, and then we get dictionary-based methods like LIWC for studying texts. That brings us up to the 1990s, when we have the first topic models. Slowly through the nineties these topic models get infrastructure and start to be used, and other methods, like the network methods for text we'll discuss later, start to get a little use as well. Earlier, in around the seventies, we have the first embedding models: not necessarily word embeddings, but embedding models.
And then in 2010, King and Hopkins and others start really bringing topic modelling into social science. It has become kind of the main method that social scientists use as their entree into computational text analysis. And that was just in 2010, which is both a disturbingly long time ago and, it seems in my mind, not that long ago. In 2014, Margaret (Molly) Roberts and her colleagues developed structural topic modelling, which we'll do a tutorial on later. One interesting thing about it is that it really was a text analysis method designed for social scientists, people interested in things like the demographic characteristics behind documents. And then we have today, which I believe is June 19th, and it's all of us doing whatever we're doing with text analysis. Like I said, it's pretty easy to jump to the cutting edge, building off of everything that's come before us.

So: getting the data. We talked about this quite a bit yesterday, so I won't do too much on it, but where do we get the sort of data we use with these computational methods we'll study later? Lately, one of the things that has really pushed the development of text analysis forward is the fact that we have tons of content from the internet, including social media. Here are some examples. As was mentioned, Twitter has been very good at letting us get its data, and so a lot of studies come out of Twitter. Pablo Barberá, who we'll hear from later, has "Birds of the Same Feather Tweet Together", looking at the network structure of Twitter, trying to predict the ideology of political leaders, and then comparing that to what they actually say. Munger has this great title, "Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment", looking, similarly on Twitter, at whether you could use bots to sanction people who were harassing others. We have studies on the effect of wording on how popular a message becomes. Reddit is another really interesting source of data. One study looked at a 2015 ban on hate speech in certain subreddits and asked: did it actually diminish hate speech, or did those people and their hate speech just bleed into other subreddits? Similarly, there are analyses of Facebook, and of Kickstarter: what makes for a successful Kickstarter campaign, based on the language? And we talked about this one at lunch: self-disclosure and perceived trustworthiness on Airbnb.
So, as a host on Airbnb, if I disclose more about myself, is it possible that the people looking for housing will trust me more? That's based on text analysis. And then there's this one by King et al., who use a lot of different social media platforms to look at Chinese censorship.

Outside of social media there are open-ended surveys and historical archives. The study by Bearman and Stovel used historical interviews with former Nazis. This one looked at text from the Qing Dynasty, from 1722 to 1911. I just came across one by Mark Anthony Hoffman, from Columbia (though he's moving somewhere else), who looked at the Bible and how, in the American revival era, different pastors cited different Bible passages in their sermons. There are the Enron emails, and political documents, including the State of the Union in the US, which is a speech given by the president once a year. And, of course, newspapers. Beyond these more specific sources, we've got massive corpora like Google Ngrams, which covers millions of books from 1500 until 2008, maybe more recent by now. English-Corpora.org is an interesting one that I don't think a ton of people know about for some reason, maybe because you have to pay for it and we don't like doing that. They have some great historical corpora, subdivided by type, and some that are updated daily and are quite large. The Manifesto Project has political stances, how would I put it, by over a thousand parties from 1945 until today, in 50 countries on five continents. The Internet Archive is another really interesting one. If you ever want to save a website and the text on it as it is today, but you don't have time to scrape it or don't know how, you can just go to the Internet Archive, paste in the URL, and it will store it; it's literally an archive of the internet. So, for instance, when there are political transitions, as we had in the US a few years ago, people were worried about certain government departments, like the EPA, because these websites sometimes change when there's a change in presidency. People went and archived all of those web pages, so you had a static version of what each one was like before and after the transition. It has just tons of resources for texts.

Any questions, comments, other resources? [Audience comment, partly inaudible, mentioning a data preservation project.] OK, the data preservation project.
Cool. Yeah. So anyway, there's tons out there. And how do we get at it? We talked about this a little yesterday. Open source or an API is obviously ideal, especially if the open-source data, as we discussed, isn't from a hacker who just dropped it there for you, but is actually open source: someone has said you can have this. Or an API, where they give you a structured way of getting it from them. A private agreement: my dissertation data is under a private agreement with a company; those aren't always easy to come by, but they can be quite nice. Purchased: like I said, English-Corpora.org is one you can purchase, and you can purchase mass amounts of Twitter data through the firehose as well. And then, of course, scraped, where, as we said, maybe you don't care, but I would say check the terms of use.

Once you have your data, you need to do some preparation before you can analyse it. So I wanted to take a poll of those of you who do text analysis: how many of you use Python to clean, prepare and analyse your data? OK, maybe five. And how many of you use R? OK, a little bit more, but not a ton more. If you use R, do you use the tidytext packages for cleaning your data? Yeah, for the most part, OK. So at the Princeton SICSS, in their computational text analysis session, Maya and Aden's advisor Chris Bail teaches it, and he has tons of examples with tidytext. As we talked about at lunch, I love the tidyverse, especially for text analysis (anything Hadley Wickham works on seems trustworthy), but I thought I would introduce a different approach. When we do our tutorial we're going to use the quanteda package. Does anyone use that? It's very, very similar, but different. I think it's just nice to know what tools you have at your disposal beyond the tidy packages.

OK, before we get to that: say we now have a corpus of texts that we want to analyse, and we need to pre-process them. Maybe it looks like this. This is actually a piece of scraped data (I have to admit I didn't look at the terms of use, but it's just a small bit), some sort of contemporary art criticism document. And as you can see, there's not just text: there are Unicode non-ASCII characters, there's HTML, and we probably don't want to analyse those.
In the context of R, what you're largely going to use to get those out is regular expressions, via grep commands; grep stands for "globally search a regular expression and print". Over the break we were talking about how we all kind of hate regular expressions. They're cumbersome, and there are a ton of packages in R with them built in, so you don't really have to think about it. But if you want to do text analysis, especially if you're going to do a lot of it, I do encourage you to get familiar with regular expressions. I'll share this later, but there's an app online with crossword puzzles you fill out using regular expressions, to get better at them. They can be a pain, but this is roughly what it looks like in code. You define a variable as whatever your text is, and then, with one of the functions in base R (so you don't have to load a package), gsub, we're saying: find this pattern, the tab character, and replace it with a blank in this variable, text. Out it comes: you haven't removed the HTML, or similarly the Unicode, but you did remove the tabs. That's the intuition of all regular-expression cleaning. Obviously, something like Unicode is a whole system of codes for encoding special characters in text, and you're not going to want to handle each of those as a regular expression yourself; that's where the packages that have already coded this up for you can be useful. But sometimes they just don't work, and then it's really good to know your regular expressions. This is another one, for taking out the HTML: you're substituting anything inside angle brackets, which is how HTML markup is written, with nothing. Here it's wrapped up as a function, so you can just feed your text to that function and it will remove all of the HTML. And here's a cheat sheet with a whole bunch of regular expressions so you can start to learn them; I'll try to find that crossword puzzle in case you're interested.
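To make that concrete, here is a minimal sketch of the gsub calls just described. The example string is made up, and I'm replacing tabs with a space rather than an empty string so the words don't run together:

```r
text <- "Some scraped\ttext with <p>markup</p> in it."

# Base R: swap tab characters for a space
text <- gsub("\t", " ", text)

# Strip anything between angle brackets, i.e. HTML tags
strip_html <- function(x) gsub("<[^>]+>", "", x)
text <- strip_html(text)

# Squash any leftover runs of whitespace
gsub("\\s+", " ", trimws(text))
#> [1] "Some scraped text with markup in it."
```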
After cleaning all of that out, this is ideally what your text looks like before you start to analyse it: just clean text, with no other code and no extra white space. There are a few more pre-processing steps that are optional and, I think, more substantive decisions, and we're going to discuss those. Any questions or comments on that so far? OK.

Going a little further into pre-processing, we have things like stop-word removal. I'm pretty sure everyone in here is familiar with stop words: they're basically just very common words that aren't really substantive if you're looking at the topic or content of your text, things like "at", "the", "and", et cetera. But they might also be corpus-specific. Maybe you're looking at some sort of religious text and there's a certain religious term you don't want to analyse over and over and over again, or specific names you want removed. You can add those to your dictionary of stop words and remove them too. When we do our tutorial a little later, we'll see how to programmatically remove stop words.

Then there's the option of stemming or lemmatising your text. Stemming removes the endings of conjugated verbs and plural nouns, returning only the stem of the word. So "running" would become "run", and the verb "saw", as in "I saw something", would remain "saw". If you lemmatise, you actually get to the base form of the word. The noun "saw", as in sawing a piece of wood, would remain "saw", but the verb "saw" would become "see", because that's its base form; "seeing", "saw" and "sees" would all become "see" in your text. If you think about it, these can be very substantive decisions, and there has been research showing that in certain contexts you get different results if you stem or you don't, or if you remove certain stop words and not others. So you want to think about that, and this comes back to what we were talking about before: in this somewhat wild-west era of computational social science, we don't necessarily have standards of reporting. Whenever I review papers that do computational text analysis, I ask the authors to provide, if not their code, then at least which words they removed (ideally the code), plus robustness checks for what happens if you change some of those parameters around. Ideally nothing would change. But sometimes it does.
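A small sketch of stemming versus lemmatising, using the SnowballC and textstem packages (both assumed installed; note that textstem's default lookup table is not part-of-speech aware, so it cannot keep the noun "saw" distinct the way I just described):

```r
library(SnowballC)  # Porter stemmer
library(textstem)   # lookup-table lemmatiser

words <- c("running", "saw", "seeing", "sees")

# Stemming chops endings: "run" "saw" "see" "see"
wordStem(words, language = "en")

# Lemmatising maps to base forms; the simple lookup sends "saw" to "see"
lemmatize_words(words)
```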
Other options in pre-processing concern tokenisation, where a token is just an individual unit of text. You can treat tokens as unigrams, where each individual word is a token, so something like "New York City" would be separated into "New", "York" and "City". If we then do something like a topic model, or an embedding model like we will later, those are probably likely to show up in the same cluster or topic anyway, because they're effectively the same entity. Or you could tokenise into bigrams and get "New York" and "York City". There are n-gram detection procedures that help you predict which n-grams are real multi-word expressions, based on how commonly they occur together; "York City" would probably get kicked out, because that doesn't happen as much, but "New York" would stay in. And then you have "New York City", which is a trigram. You can go longer and longer if you like.

You can identify parts of speech. This is a common output of part-of-speech tagging, if you want to identify which of your words fall into which part of speech: a singular noun, a plural noun, a verb. Maybe you want to do that so you can then remove all adjectives, or remove all nouns. There was a professor in our department (somebody mentioned moral foundations theory earlier; multiple people did) who was interested in doing topic modelling where you remove the nouns: basically trying to get at moral foundations by clustering only adjectives. Once you have the adjectives attached to their nouns, you get rid of the nouns and look only at the adjectives, to see how the clusters of morality around words are shaped in different corpora.

And then there's identifying named entities, which is a subtask of information extraction that seeks to locate and classify named-entity mentions in unstructured text into predefined categories. I have a little tutorial we can go through using the Google named-entity API. There are others, but it's kind of neat: you send it the text, and it returns the text with the named entities picked out. If there's a Wikipedia page associated with an entity, it gives you the link to that page, and it tells you whether the entity is a person, a place, an institution, so you can start to get at what other entities are spoken about in your texts.
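The tutorial I provide uses the Google Natural Language API for this, but as a locally runnable sketch, the spacyr package (a quanteda companion; it assumes spaCy and an English model are installed underneath) does both part-of-speech tagging and named-entity recognition:

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("Taylor Brown is a PhD candidate at Duke University.",
                      pos = TRUE, entity = TRUE)
parsed                  # one row per token: lemma, POS tag, entity type
entity_extract(parsed)  # just the named entities (PERSON, ORG, GPE, ...)
```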
And so, yeah, at that point we'll move on to the tutorial, pre-processing some of the things we just discussed; I think I have a slide for it. Oh, but right before we do that: when we come back, we'll also talk about analysing the texts. I just wanted to make sure that you all have access. Did you all get access to the materials? I sent the link on Slack; raise your hand if you don't have the materials yet. On Slack there's a link, bit.ly slash SICSS Oxford, and all of the materials, the tutorials and the data should be there. I'll come to you right afterwards.

But yeah, we'll clean the text, and when we come back after the break we're going to analyse the text, so I thought I'd give a precursor to that. In the computational analysis world we've all heard the terms supervised, unsupervised and semi-supervised for the different methods you can use in machine learning and these sorts of things, and I just wanted to make sure we're all on the same page as to where we, and the methods we'll learn, fit within that. With supervised learning, you give your algorithm a set of labelled data, meaning you have multiple cases and you're saying: this is a woman, this is a man, this is a woman, this is a man. The algorithm trains on that set and then helps you predict for a set where you don't have those labels. With unsupervised learning, you don't have any labels to begin with. You don't have "which topic does this word belong to"; you don't have "is this a male or a female candidate, are they at high risk or not". It just, without supervision, trains and starts to predict what those categories might be. And semi-supervised is a combination of the two. Usually it's something like: you have some labelled data, but a ton of unlabelled data. So you take your labelled data and train an algorithm, predict on some section of the unlabelled data, and now those have labels; you bring them back in, retrain, and then predict more. It's a cyclical process.
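That loop is easy to sketch. Here is a schematic, purely illustrative version: hypothetical data frames with a 0/1 "label" column plus numeric predictors, an arbitrary confidence threshold, and plain logistic regression standing in for whatever classifier you would actually use:

```r
self_train <- function(labelled, unlabelled, rounds = 3) {
  fit <- NULL
  for (i in seq_len(rounds)) {
    # Train on everything labelled so far
    fit  <- glm(label ~ ., data = labelled, family = binomial)
    # Predict on the unlabelled pool
    prob <- predict(fit, newdata = unlabelled, type = "response")
    sure <- abs(prob - 0.5) > 0.45          # keep only confident predictions
    if (!any(sure)) break
    pseudo <- unlabelled[sure, , drop = FALSE]
    pseudo$label <- as.integer(prob[sure] > 0.5)
    labelled   <- rbind(labelled, pseudo)            # fold pseudo-labels in
    unlabelled <- unlabelled[!sure, , drop = FALSE]  # retrain on next pass
  }
  fit
}
```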
The things that we're going to learn, topic modelling, word embeddings and network analysis, are largely classified as unsupervised, because a lot of the time we're just feeding in data and it's giving something back out. But in the context of social-scientific research I'd put it a little closer to semi-supervised: not in the sense that we're adding labels and retraining algorithms, but in that we very intentionally inform the algorithm with our own substantive knowledge, which we'll talk about further. As Chris and colleagues note in their introduction to computational social science, there was maybe hope at the beginning (I think it wasn't as widespread as we sometimes like to think) that with tons of data we wouldn't need theory anymore, and we've learnt that that's really not the case. That's true with these methods too. Topic modelling doesn't mean you don't have to read your texts; in fact, it's very important that you're familiar with the texts in the corpora you're working with. So it's kind of "unsupervised-ish"; it's just that the formal definition of semi-supervised is a little different. But yeah, when we come back, those are the three methods we'll work with. Before that, we're going to clean our data. [Brief audience exchange, partly inaudible.] Well, thank you.

[Audience:] I just want to add some pre-processing steps, things I ran into at some point that helped me. First, if you have text in different languages, you can easily use a translation API. It can also help you get rid of spelling mistakes: if you translate to another language and then back to English, it actually fixes them. In general there are pre-processing steps that fix critical spelling errors, which otherwise take a long time; TextBlob, I think, is one. And the last thing is that you can use fuzzy matching if your data has a lot of names of companies and that kind of stuff, which all get mentioned in different ways, with or without "Inc." and so on. Getting those together as a pre-processing step can be helpful too.

Yeah, those are both really great points. I think fuzzy matching is often used when you have things like "Disney" versus "Disney, Inc."; it's the same thing (there's a small sketch of that below). The named-entity stuff can sometimes help there as well, because the variants will have the same URL attached to them or something. Anyone else? Experiences? We could go on forever, I'm sure, with horror stories of data, but anything you've learnt? OK, maybe they'll come out during the tutorial, which is what we're going to switch to now. The script we're going to start with is the HTML file for the quanteda tutorial; I think it's just called "pre-processing". Now we're going to switch to the HDMI; I'm turning myself off. Oh yeah, anyone on the livestream: the link to these materials is in the bit.ly link that you should see on your screen as soon as we turn off the slides.
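Picking up that fuzzy-matching suggestion: a minimal sketch with base R's agrep and the stringdist package (assumed installed); the company names are made up:

```r
company <- c("Disney", "Disney, Inc.", "The Walt Disney Company", "Dysney")

# Approximate (edit-distance) matching against a canonical form
agrep("Disney", company, max.distance = 1)

library(stringdist)
# Jaro-Winkler similarity to the canonical form; values near 1 are
# candidates to merge into one entity
stringsim("disney", tolower(company), method = "jw")
```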
We're back. Just a reminder for anyone on the livestream: there should be a bit.ly link you can follow; if not, it's bit.ly slash SICSS Oxford, and the file we're looking at is the HTML on pre-processing text. As I said, we'll work mostly with the quanteda package, but also some things from the tidyverse. I'll just walk through it, and anyone can ask questions. Obviously, for anyone with experience in R or Python, there are a million ways to do any one thing, so I'm just showing one example; if you know a more efficient way, feel free to interrupt.

First we load the packages and then load the texts, which, as I mentioned, are a set of abstracts from sociology journals between, I think, 2008 and about halfway through 2012. You see here, with the printout, a preview of what the data frame looks like: the source, meaning which journal the article came from; the first author (I didn't extract all of the authors, only the first); the year; and an id so you can track back to which file each abstract came from.

Other ways to read in your data: your data can obviously come in a ton of different formats. I like readtext a lot (there's also the readr package, which is pretty great, but for quanteda this readtext function is pretty great). There's a function just for reading in Twitter data; again, that's representative of how prominent Twitter data is compared to other sources, that it gets its own reader. There's JSON, text files, multiple text files. Say you have (I often have this) a folder on your computer with a ton of text files and you don't want to load each one individually: if you just write your path and then star dot txt, or whatever the format of your texts is, it will read in all of them. (OK, I'll stay over here.) And within quanteda you also have what they call docvars from the file names; you can pre-specify those, and we'll talk a little more about them. In the example on the slide it was reading in State of the Union addresses (again, a speech given by the US president), with the file names saying which president and which year. It reads XML files and CSVs too.
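A minimal sketch of those readtext patterns (the paths and file-name scheme here are hypothetical):

```r
library(readtext)

# All .txt files in a folder, pulling docvars out of file names
# like "obama-2012.txt" -> president = "obama", year = "2012"
sotu <- readtext("data/sotu/*.txt",
                 docvarsfrom = "filenames",
                 docvarnames = c("president", "year"),
                 dvsep = "-")

# A CSV whose text lives in a column called "abstract";
# readtext renames that column to "text" for you
socabs <- readtext("data/sociology_abstracts.csv", text_field = "abstract")
```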
One thing that's really neat about readtext is that when you read in text, you declare this parameter, text_field, saying which column or field in your original file holds the text, and it reads it in and names that column "text". So if you had a bunch of different CSV files where the text column was named slightly differently in each, using this will rename all of them to "text", and you don't have to do it yourself.

This next bit is me; I don't know how to de-duplicate texts with the quanteda package itself (there might be a way). Say that when you were scraping, the same web page had two different URLs and you accidentally captured both, so it's the exact same text but some of the other metadata is different. Just de-duplicating the data frame at large wouldn't work; you only want to de-duplicate on the text. I use dplyr for that. This is basically just taking our sociology data frame, grouping it by the text, and de-duplicating on that. If you want me to explain it further, I'm happy to, but this is one way you could remove duplicated texts whose other metadata differs. It might be risky, though, because maybe that metadata is really where the important difference lies. But let's run that. I don't think there were any duplicates in this particular corpus, but I've definitely had duplicates in contexts where I scraped data.
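One way to do that de-duplication on the text column only, keeping the first row's metadata for each distinct text (a dplyr sketch; "sociology" stands in for the data frame above):

```r
library(dplyr)

sociology <- distinct(sociology, text, .keep_all = TRUE)
```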
Then, in quanteda, similar in some respects to other packages, you create what's called a corpus, which is designed to be, as it says, a library of original documents that have been converted to plain, UTF-8-encoded text and stored along with metadata at the corpus level and at the document level. quanteda has this special document-level metadata called docvars: variables associated with each document. In the case of our sociology abstracts, an instance of a docvar would be which journal an abstract comes from, or what year it was published. The corpus in quanteda, as it says here, is not really designed to be where you conduct your analysis. It's like your original data: when you were first learning Stata or SPSS, it was "don't change the original data, don't touch that", right? It's similar with the corpus. If you want to do multiple different analyses or robustness checks, that's where the original data lives.

So we're going to create a corpus out of our data frame (right now it's just a traditional data frame with the text and so on) and take a look at it. It looks pretty much the same, but the formatting and metadata of a corpus are slightly different and allow for certain quanteda analyses.

If we want to add document-level variables that weren't originally in the imported data (again, a docvar would be something like the journal or the year), we can create one on the fly. In this case, just because it was what I thought of first, we're going to identify two of the prominent American sociology journals, the American Journal of Sociology and the American Sociological Review, which a lot of people would love to have a solo-authored or co-authored paper in. Those are kind of the paradigm cases of a prestigious journal in US sociology, so we can use them as a heuristic for tip-top journal publication, and I'm just going to make a binary coding of that. Here we only see the first five examples, but there's this new variable, AJS_ASR, and none of these first five are from those journals (as you can see, they're in Sociology of Religion or Symbolic Interaction), so they have a value of FALSE. We've just created a new document variable: is this document from AJS or ASR? There are many other ways; as I mentioned, regex would be your friend for creating new docvars. But this is just one example. Any questions? All makes sense?

If we wanted to add a corpus-level variable, one that's not about individual documents but about the corpus as a whole, we'd do that with metacorpus(). I just added the date yesterday, June 18th. That's meta-information about the corpus; you could also add who collected it and who created it. If you plan to make the corpus open source in the future, that can be a really useful place to store its metadata.
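A compact sketch of those corpus steps (quanteda v1 API; the exact column and journal names here are assumptions):

```r
library(quanteda)

corp <- corpus(sociology, text_field = "text")

# New document-level variable: is this abstract from AJS or ASR?
docvars(corp, "AJS_ASR") <- docvars(corp, "source") %in%
  c("American Journal of Sociology", "American Sociological Review")

# Corpus-level metadata: who made it, when, and so on
metacorpus(corp, "created") <- "2019-06-18"
```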
To take a look at an example, here's the text of the second abstract. This one is from, I believe, Sociology of Religion, and this is what the text looks like. Luckily it's pretty clean; like we said, there are no non-ASCII characters or stray Unicode, so we can be happy about that, because that's a luxury.

If we want to summarise the corpus a bit, I'm just summarising here the number of abstracts by year. There we go. Like I said, it cuts off partway through 2012, but it looks like there are approximately fifteen hundred articles for each year. If we want to see which text is the longest: it looks like there are 501 tokens in one abstract, like they went one word over what is often our 500-word abstract limit. If you happen to have more than one corpus and you want to combine them, it's pretty simple: you just plus them together. And if you want to subset your corpus (say we only wanted to look at 2010), this is how we'd do it. Let me move this over so it's not cramping our style.

Then we can begin to explore the text itself. Say, since I study art, I want to see which abstracts mention the word "art", and in what context. Here are some: it tells me which abstract each hit is from and then the context around it. "I argue there is art exertion...": who knows what that one is about. "Average consumers of art and culture...". You can then start to look specifically for things relevant to your own research ("the practise of art as prayer"), or maybe this kind of exploration starts to give you ideas about what sort of topics there might be, if you're going to do topic modelling or something like that. So you can start to explore your data, and there are evidently a lot of hits for "art". You can also inspect the document variables you defined, or the metacorpus variables; those we've already talked about.
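A sketch of those exploration steps, using the same corpus object as above (the "year" docvar comes from our data):

```r
ntoken(corp)[which.max(ntoken(corp))]   # the longest abstract, in tokens

corp_2010 <- corpus_subset(corp, year == 2010)   # subset on a docvar

kwic(corp, "art", window = 4)   # keyword-in-context hits for "art"
```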
Then, when you want to perform your analysis, you're usually going to create some sort of term-document matrix; in quanteda it's a document-feature matrix (dfm), where the documents are the rows and the features, the words, are the columns. (You'll also see weighted versions, like tf-idf matrices, used in topic modelling.) And there are a lot of different ways to tokenise. As I mentioned before, a token in text analysis is your unit of analysis, and that might be a word, a whole sentence, bigrams or n-grams, even just punctuation; it's whatever you want. So I'm going to show, within quanteda, all the different ways you can very easily tokenise. We're starting with three sentences I came up with, and if anybody can identify which sources they're from, I'll buy you a drink or coffee later.

So these are the texts, and the basic tokens() call just splits them into words; it's not looking at bigrams or trigrams or sentences, just words. So "If you're happy in a dream, does that count?" gets split into "if", "you're", "happy", and so on. But there are all these parameters you can set in the tokens() function: remove numbers, remove punctuation, remove symbols, remove separators, remove Twitter things like hashtags and at-signs, remove hyphens, remove URLs. I'm not going to go through them all, but if you run the different calls you'll see how the same sentences produce different tokens. And as we discussed with n-grams, tokens() can also do n-grams; you just specify the ngrams argument. Here, with ngrams one through two, I'm saying give me unigrams and bigrams. If you did two through three, you'd get only bigrams and trigrams; one through three, unigrams, bigrams and trigrams; and so on. So that same sentence (let's look at the shortest one, "If you're happy in a dream, does that count?") also has "if_you're", "you're_happy", "happy_in", "in_a", and so on. You can also tokenise by character: instead of words, I get characters. I don't know why you'd do this (which is why I wrote "why would you ever do this?"), but it will split the text into letters. If you're thinking you'd do it to count the number of characters, there's an easier way; but it's possible. Or you can tokenise by sentence. Each of these examples is only one sentence long... oh no, it's this one, OK: "This is not 'Nam, this is bowling. There are rules." Similarly, if you can tell me where that's from, you get a drink or chocolate or something. It will just split the text into sentences; now sentences are your unit of analysis, they're your tokens. Cool. Super cool.

OK, constructing the document-feature matrix is similarly quite easy in the quanteda package: it's just this dfm() call. The tokens() function didn't do anything unless you told it to, right? It didn't lowercase, it didn't remove punctuation, none of that, unless you specified it. dfm() will do a lot of that on its own. You can override it, but by default, for example, it lowercases, so a word capitalised at the beginning of a sentence won't get counted as a different token; it'll be the same token if it's the same word.
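The tokenisation calls demonstrated above, gathered into one sketch:

```r
txt <- "If you're happy in a dream, does that count?"

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks                              # word tokens
tokens_ngrams(toks, n = 1:2)      # unigrams and bigrams ("if_you're", ...)
tokens(txt, what = "character")   # letters as tokens
tokens(txt, what = "sentence")    # whole sentences as tokens

dfm(toks)                         # document-feature matrix (lowercased)
```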
You can also remove stop words and stem, which is what we talked about earlier; in the tutorial it's really easy. You just say you want to remove English stop words, set stem to TRUE, and remove punctuation. Looking at what sort of stop words are in this English dictionary: it's "i", "me", "my", et cetera. But you can add to it yourself if you want. Like I said, maybe you're analysing text messages, and they're all yours, so your name is in there a million times and you think: I don't want me to be in there. You could add your name as a stop word. (Sorry, guys, it would be very interesting to analyse your text messages; I've never done that.) So this is an example of that: adding the word "will" to the dictionary of stop words to remove.

Then this topfeatures() function (oh, I have to define that first) will show you the top features, the top words occurring in your corpus. In this case, maybe unsurprisingly, "social" is top, then data, study, research, article, using, health, women, results, amongst, religious, also. It's kind of like a sentence. So those are some of the top tokens used in sociology abstracts. You can also make a word cloud; I kind of love-hate word clouds, and this one takes a while to run. It's in your HTML, with some ungodly palette that got selected to go along with it. I shouldn't have clicked on that; you all know what a word cloud looks like. All right, there you go. Ooh, very tropical. [Audience: you had to map word frequency to the font size yourself?] Yeah, the font size of a word is how common that word is in the corpus.

You can also group documents by docvars. Say you want a document-feature matrix by year, to analyse the years separately: it will do that. So here we're able to see (if we normalised the number of tokens across years you could see how terms go up and down; this is just raw counts) that "health" was more popular in 2010 and went down, for whatever reason, in 2011. And the years have nearly the same number of documents, if I remember the histogram from the beginning. So that's one thing you can do.
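Stop-word removal, stemming and grouping, as shown in the tutorial (quanteda v1 dfm() arguments; later versions split these out into dfm_remove(), dfm_wordstem() and dfm_group()):

```r
dfmat <- dfm(corp,
             remove = c(stopwords("english"), "will"),  # plus custom stops
             stem = TRUE,
             remove_punct = TRUE)

topfeatures(dfmat, 10)   # ten most frequent features in the corpus

dfm_by_year <- dfm(corp, groups = "year")   # one row of counts per year
```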
448 00:49:29,510 --> 00:49:36,750 We're not going to... well, maybe we will. Yeah? [Audience question, partly inaudible:] What did you have as the parameters? 449 00:49:36,750 --> 00:49:43,910 Is it more advisable to group the documents when they're collapsed down like that? 450 00:49:43,910 --> 00:49:50,600 In this case, were those the years as the rows and the tokens as the columns? 451 00:49:50,600 --> 00:49:56,510 Well, it's still the same matrix of just the documents and the features, or the tokens. 452 00:49:56,510 --> 00:50:01,130 And then the analysis just does that by year and gives you this table. 453 00:50:01,130 --> 00:50:12,260 [Audience follow-up, partly inaudible:] So is every single piece of information stored in the list, and you get the header just by typing the command? 454 00:50:12,260 --> 00:50:19,280 Yes. And I don't know exactly what the command does underneath; I think it produces that header for you. Yeah, it will. 455 00:50:19,280 --> 00:50:27,350 It produces the header for you there. The matrix is just a matrix, which is a bunch of lists, you know, in the columns. 456 00:50:27,350 --> 00:50:36,560 But separate from that is the corpus, where the metadata resides, and it's just drawing on that to create this table for you. 457 00:50:36,560 --> 00:50:45,260 [Audience question, partly inaudible:] But I'm interested in, for example, president and government together. 458 00:50:45,260 --> 00:50:49,160 Could you group on both? OK. 459 00:50:49,160 --> 00:50:56,480 Yeah. So I'm just wondering whether that's possible. 460 00:50:56,480 --> 00:51:02,100 Yeah. I would probably create a matrix 461 00:51:02,100 --> 00:51:06,390 by sort of one combined category, like president within year. 462 00:51:06,390 --> 00:51:13,530 Yeah, I'm trying to think offhand, because you want all of the different combinations of your two features. 463 00:51:13,530 --> 00:51:15,990 I don't think there's a straightforward way to do it with quanteda, 464 00:51:15,990 --> 00:51:23,310 but there might be; you could look in the documentation. It's definitely doable. 465 00:51:23,310 --> 00:51:29,900 Yeah. Sometimes we have, as the slide says, 466 00:51:29,900 --> 00:51:39,220 some prior intuition about words that are particularly important inside of our texts, and we might want to know how those relate. 467 00:51:39,220 --> 00:51:45,980 This slide shouldn't say "women", that's probably incorrect; I was looking at "men" and "women", but then I switched it to "culture" and "structure", which are two common terms. 468 00:51:45,980 --> 00:51:53,180 We have previously thought of them as either oppositions or ends of a spectrum within sociology. 469 00:51:53,180 --> 00:51:59,120 There's obviously a lot of theory that would argue against that, but just in terms of how we use them in abstracts, 470 00:51:59,120 --> 00:52:07,830 maybe we're interested in how those are used separately. And so we could just get a count of which one's winning. 471 00:52:07,830 --> 00:52:14,940 Do we talk more about culture? Do we talk more about structure? And according to the first five, we talk more about culture. 472 00:52:14,940 --> 00:52:22,410 But culture is structure to some people. So here you can use external dictionaries. 473 00:52:22,410 --> 00:52:27,990 So maybe you want to do the same sort of thing, but you want to look at, like, LIWC dictionary terms. 474 00:52:27,990 --> 00:52:32,100 You have to pay for those, which I didn't want to do last night. 475 00:52:32,100 --> 00:52:38,700 So I just gave you the code for how you would import those terms, and then you could do the same analysis, 476 00:52:38,700 --> 00:52:46,230 looking at the frequency over time.
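A sketch of both of those ideas, counting a handful of focal terms and applying an external dictionary; the glob patterns and the LIWC file name are placeholders, since the LIWC files are paid and live wherever you put your copy.

# Which word is winning: total counts of two stems of interest
colSums(dfm_select(dfmat, pattern = c("cultur*", "structur*")))

# Import LIWC-format dictionary terms and apply them to the matrix
liwc <- dictionary(file = "LIWC2015_English.dic", format = "LIWC")
dfmat_liwc <- dfm_lookup(dfmat, dictionary = liwc)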
477 00:52:46,230 --> 00:52:50,350 You can also look at the similarity of texts by running this code; there's a sketch of the call below, too. I don't know if this will show up, actually. 478 00:52:50,350 --> 00:53:01,510 It's based on the cosine similarity of the term-document matrix; it might take a little bit. 479 00:53:01,510 --> 00:53:10,810 So this is just giving you an example of the cosine similarity of the first document with other documents. 480 00:53:10,810 --> 00:53:19,910 You could then use those as weights in a network analysis or something like that. You can do the same with specific words. 481 00:53:19,910 --> 00:53:31,890 So if you wanted to look at the cosine similarity of "race" with other tokens in your corpus, 482 00:53:31,890 --> 00:53:38,260 it looks like it has a very high similarity with the word "racial", which maybe isn't surprising. 483 00:53:38,260 --> 00:53:43,180 If you go down to the "gender" list, it's kind of intuitive as well. 484 00:53:43,180 --> 00:53:52,400 So with "gender", "women" is the most similar term, but then "men", "equal", "differ", "male". 485 00:53:52,400 --> 00:53:59,120 Remember, we stemmed these, so "equal" could be "equality"; and "legislation". 486 00:53:59,120 --> 00:54:04,790 So yeah, it's the cosine similarity from the document-feature matrix. 487 00:54:04,790 --> 00:54:11,270 Yeah. And then... 488 00:54:11,270 --> 00:54:15,560 Oh, we're not going to do this because we have a whole tutorial on structural topic modelling, 489 00:54:15,560 --> 00:54:20,810 but from the quanteda package you can do topic modelling, and it's pretty straightforward; that's sketched below as well. 490 00:54:20,810 --> 00:54:28,280 So here you're just identifying how many topics you want, and you're saying, I'm going to use latent Dirichlet allocation, LDA. 491 00:54:28,280 --> 00:54:34,520 And then you get the terms and the topic distributions. So if you want, you can start doing topic modelling that way. 492 00:54:34,520 --> 00:54:41,970 Topic models that way. Yeah. OK, it's 3:15. 493 00:54:41,970 --> 00:54:48,870 Should we take a break? Or no, because I have two other scripts on preprocessing that I've provided, 494 00:54:48,870 --> 00:54:53,340 but you can also just play with those at your leisure. One would be a bit complicated for us to go through, 495 00:54:53,340 --> 00:55:02,660 but it's working with that Google API for detecting named entities, and that is, I think, just called 496 00:55:02,660 --> 00:55:10,130 NLP; it's the NLP tutorial. And there's another one on sentiment analysis. At the core SICSS at Princeton, 497 00:55:10,130 --> 00:55:17,240 they do more work on sentiment analysis. It's pretty straightforward, though: to get the sentiment, you just give it a dictionary, 498 00:55:17,240 --> 00:55:21,440 and it looks at the words and tells you what the sentiment of those words is, by and large. 499 00:55:21,440 --> 00:55:25,700 That's the simplest form of sentiment analysis; there's a sketch of that below too. So it's not even a tutorial. 500 00:55:25,700 --> 00:55:30,380 I just provided you a script that you could use, in case 501 00:55:30,380 --> 00:55:37,250 any of you want to do that in your group project and you want a base to build on. But I'm happy to go through the Google named entity thing. 502 00:55:37,250 --> 00:55:44,030 Often there are credentials, it's an API, right? So there could be credentialing problems. Or we can just take a break. 503 00:55:44,030 --> 00:55:49,550 Take a break. OK, I'm going to do that.
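Here is a minimal sketch of the cosine-similarity calls described above, using the textstat_simil() function (now in the companion quanteda.textstats package); the document and feature names are illustrative.

library(quanteda.textstats)

# Cosine similarity of the first document with all of the documents
sim_docs <- textstat_simil(dfmat, dfmat[1, ],
                           margin = "documents", method = "cosine")

# The same idea over features: which tokens are most similar to "race"?
sim_race <- textstat_simil(dfmat, dfmat[, "race"],
                           margin = "features", method = "cosine")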
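The topic-modelling step mentioned above can be sketched by converting the dfm and handing it to the topicmodels package; k = 10 is an arbitrary illustrative choice.

library(topicmodels)

# Convert the quanteda dfm into the format topicmodels expects
dtm <- convert(dfmat, to = "topicmodels")

# Fit a latent Dirichlet allocation model with ten topics
lda <- LDA(dtm, k = 10)

# Top terms per topic and per-document topic distributions
terms(lda, 10)
posterior(lda)$topics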
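And dictionary-based sentiment in its simplest form might look like this sketch, using the Lexicoder Sentiment Dictionary that ships with quanteda rather than whatever dictionary the workshop script used.

# Count positive and negative words per document
dfmat_sent <- dfm_lookup(dfmat, dictionary = data_dictionary_LSD2015)

# Net sentiment per document: positive minus negative counts
net <- as.numeric(dfmat_sent[, "positive"]) -
  as.numeric(dfmat_sent[, "negative"])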
504 00:55:49,550 --> 00:56:00,450 Unless there are questions, right? [Audience question, partly inaudible, about how long it takes these models to run.] Oh, yeah. 505 00:56:00,450 --> 00:56:07,620 So a lot of the text analysis stuff, not necessarily the preprocessing but the models we run, a lot of them take a long time. 506 00:56:07,620 --> 00:56:12,960 So we're not going to run a lot of it live. But it's all in the HTML, and then I'll give you the markdown file later. 507 00:56:12,960 --> 00:56:17,550 If you want it, you have it. You can copy-paste later at your leisure. 508 00:56:17,550 --> 00:56:25,990 OK. Yeah. [Audience question, partly inaudible:] Like you said earlier, they knew the news outlet. 509 00:56:25,990 --> 00:56:31,650 So where would you get text from, say, major 510 00:56:31,650 --> 00:56:37,020 news outlets, given the public's fascination with newspapers and things? 511 00:56:37,020 --> 00:56:43,230 There are a lot. Yeah, and I'm sure plenty of people in this room might know of more. 512 00:56:43,230 --> 00:56:53,490 I know, for instance, that english-corpora.org, which I told you about, has one of the larger corpora; I think it's called COCA. 513 00:56:53,490 --> 00:57:00,780 They subdivide the corpus into, like, magazines, blogs, newspapers. And within newspapers, 514 00:57:00,780 --> 00:57:07,980 there are multiple different newspapers. But the New York Times, for example, is kind of epically good in terms of being able to get text from their API. 515 00:57:07,980 --> 00:57:14,581 I have a New York Times corpus if you want to just toy with it; I think it's their op-eds.