Welcome back from lunch. We wanted to make sure that we thank some of the people who have been supporting us at SICSS Oxford: the Russell Sage Foundation, the Alfred P. Sloan Foundation, the Oxford Van Houten Fund, the Social Sciences Division Teaching Development Awards, Nuffield College, and of course the sociology department here that we're in. That's just a short list of the groups that helped bring this together. So thank you.

Today we're going to talk about computational text analysis. My name is Taylor Brown. I'm a Ph.D. candidate at Duke and a visiting scholar at NYU. I know there are a lot of people in this room who have an interest in computational text analysis (we learnt that from the flash talks), but also a lot of experience. Because of that level of interest and expertise, this is not meant to be just me talking at you. I hope you ask questions and help answer questions. If I don't answer them properly, correct me; if I do something wrong, we can debug my code together. Let's just make sure this is interactive, so feel free to interrupt. And if you don't, I'll probably pause and ask you to.

I don't actually consider myself to be primarily a computational text analyst, but I have used the methods before and can speak to certain things. And I think one thing that's really neat about these methods is that it's pretty easy to go from beginner to the cutting edge in a short amount of time. When you look at the timeline of how computational text analysis has developed, a lot of the progress has been made just in the past couple of years. I didn't know how to code when I started my Ph.D., and I certainly didn't know a lot of these methods. Now I help on a project, the package on text networks we'll see later, that does something that's never been done before, but that also isn't that advanced or complex. So there's a lot of opportunity to contribute, especially in the social sciences.

The last thing I'll mention before we jump in: as I was preparing this, I decided to start a bibliography of foundational and really relevant computational text analysis citations, and I started a Slack channel called "text citations". You can join, and if you have anything to contribute you just put it there and I'll add it to the bibliography. At the end of SICSS we'll print it out, and you'll have what could be the start of a good syllabus or reading list. All the references in this presentation will be there as well.

But starting out:
What is computational text analysis? On the face of it, it's pretty straightforward, and in a lot of ways it is. But let's break it down. First: computational, having to do with computers, or using computers. The point being, amongst other things, that we now have a massive amount of data and not enough time to analyse all of this text ourselves. Computers help us do things much faster than we could on our own, things we could probably only do ourselves if we lived forever. But computers also help us do more complex things that we might have difficulty doing at all.

Text: this again seems straightforward, but the definition I came up with from reading other things is any object that can be read. Most of the time we think of this in terms of written or spoken language, and social and cultural scholars have shown (and I think our intuition agrees) that this sort of language contains within it a structure that reflects things like our groupings in society, morality, hierarchical structures of status, what's important to us, all of these sorts of things. That's one reason we focus on those types of texts. We also simply have a lot of them, because language is our primary mode of communication. But in my own work, for example, I look at artworks as texts: cultural objects that have meaning, where that meaning derives from things like colour, content and texture. I work with vectors that are just numeric, but I'm thinking of them in a very similar way to how we think about texts. So I would encourage you, as we go through this: maybe you're not a linguistic text analyst, but maybe your data could still be analysed with some of these methods. I was thinking this morning that I don't know of any case of someone using a topic model on something that isn't language text, and I don't know what would come out of that. But topic modelling was discovered more or less simultaneously in population genetics, something non-textual, and by David Blei, who is the one who's been cited a bazillion times for latent Dirichlet allocation. So originally it wasn't thought of as applying only to linguistic texts.

And then analysis. Hopefully you know what analysis is (if not, we'd need a whole other lecture), but there are probably a lot of definitions for this as well. I came up with: a systematic examination of the structure or mechanisms of something. So if we put all of this together, computational text analysis would be the systematic, computer-assisted examination of the structure or mechanisms of readable content. And that's great. But it also makes it sound a little bit boring.
In particular, as social scientists, we want to think about what analysis means to us. Hopkins and King, in their article on text analysis, say this: policy-makers or computer scientists may be interested in finding the needle in the haystack, such as a potential terrorist threat or the right webpage to display from a search, but social scientists are more commonly interested in characterising the haystack. So with topic modelling, for instance, we don't necessarily focus on getting the correct classification of an individual document; rather, we're interested in the distribution of documents across the corpus. Keep that in mind: as social scientists doing computational social science, we use these methods differently than computer scientists or policy-makers might.

Turning now to history, and we'll do this really quickly: that same Hopkins and King article notes that the Catholic Church tracked the proportion of non-religious printed texts in the sixteen hundreds, and they mention this as one of the first examples we have of word counting, or some sort of content analysis. My intuition is that there's probably a precursor in other parts of the world, whether the Middle East, Asia or Africa, but this is a Western lens and King and Hopkins were going in that direction. So content analysis started a long time ago. We usually talk about the modern era starting with Lasswell doing keyword counts; there are a lot of quotes from him on the intuition of content analysis of text for studying sociometric measures, measures of social dynamics. Those same keyword-count methods started to be used by social scientists in the nineteen-forties. And then, of course, Turing, in the context of World War Two and into the nineteen-fifties, applied early computing to text, trying to decipher foreign transmissions. As we move along, we get the first textbooks on content analysis, mainframe computers get applied to it, event coding begins, and then we get dictionary-based methods like LIWC for studying texts. That brings us up to the 1990s, when we have the first topic models. Slowly through the nineties these topic models get infrastructure and start to be used, and other methods, like the network methods for text we'll discuss later, start to get a little use as well. Earlier, in around the seventies, we have the first embedding models: not necessarily word embeddings, but embedding models.
And then in 2010, King and Hopkins and others start really bringing topic modelling into social science. It has become kind of the main method that social scientists use as their entree into computational text analysis. And that was just in 2010, which is both a disturbingly long time ago and, it seems in my mind, not that long ago. In 2014, Margaret (Molly) Roberts and her colleagues developed structural topic modelling, which we'll do a tutorial on later. One interesting thing about it is that it really was a text analysis method designed for social scientists, people interested in things like the demographic characteristics behind documents. And then we have today, which I believe is June 19th, and it's all of us doing whatever we're doing with text analysis. Like I said, it's pretty easy to jump to the cutting edge, building off of everything that's come before us.

So: getting the data. We talked about this quite a bit yesterday, so I won't do too much on it, but where do we get the sort of data we use with these computational methods we'll study later? Lately, one of the things that has really pushed the development of text analysis forward is the fact that we have tons of content from the internet, including social media. Here are some examples. As was mentioned, Twitter has been very good at letting us get its data, and so a lot of studies come out of Twitter. Pablo Barberá, who we'll hear from later, has "Birds of the Same Feather Tweet Together", looking at the network structure of Twitter, trying to predict the ideology of political leaders, and then comparing that to what they actually say. Munger has this great title, "Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment", looking, similarly on Twitter, at whether you could use bots to sanction people who were harassing others. We have studies on the effect of wording on how popular a message becomes. Reddit is another really interesting source of data. One study looked at a 2015 ban on hate speech in certain subreddits and asked: did it actually diminish hate speech, or did those people and their hate speech just bleed into other subreddits? Similarly, there are analyses of Facebook, and of Kickstarter: what makes for a successful Kickstarter campaign, based on the language? And we talked about this one at lunch: self-disclosure and perceived trustworthiness on Airbnb.
So, as a host on Airbnb, if I disclose more about myself, is it possible that the people looking for housing will trust me more? That's based on text analysis. And then there's this one by King et al., who use a lot of different social media platforms to look at Chinese censorship.

Outside of social media there are open-ended surveys and historical archives. The study by Bearman and Stovel used historical interviews with former Nazis. This one looked at text from the Qing Dynasty, from 1722 to 1911. I just came across one by Mark Anthony Hoffman, from Columbia (though he's moving somewhere else), who looked at the Bible and how, in the American revival era, different pastors cited different Bible passages in their sermons. There are the Enron emails, and political documents, including the State of the Union in the US, which is a speech given by the president once a year. And, of course, newspapers. Beyond these more specific sources, we've got massive corpora like Google Ngrams, which covers millions of books from 1500 until 2008, maybe more recent by now. English-Corpora.org is an interesting one that I don't think a ton of people know about for some reason, maybe because you have to pay for it and we don't like doing that. They have some great historical corpora, subdivided by type, and some that are updated daily and are quite large. The Manifesto Project has political stances, how would I put it, by over a thousand parties from 1945 until today, in 50 countries on five continents. The Internet Archive is another really interesting one. If you ever want to save a website and the text on it as it is today, but you don't have time to scrape it or don't know how, you can just go to the Internet Archive, paste in the URL, and it will store it; it's literally an archive of the internet. So, for instance, when there are political transitions, as we had in the US a few years ago, people were worried about certain government departments, like the EPA, because these websites sometimes change when there's a change in presidency. People went and archived all of those web pages, so you had a static version of what each one was like before and after the transition. It has just tons of resources for texts.

Any questions, comments, other resources? [Audience comment, partly inaudible, mentioning a data preservation project.] OK, the data preservation project.
Cool. Yeah. So anyway, there's tons out there. And how do we get at it? We talked about this a little yesterday. Open source or an API is obviously ideal, especially if the open-source data, as we discussed, isn't from a hacker who just dropped it there for you, but is actually open source: someone has said you can have this. Or an API, where they give you a structured way of getting it from them. A private agreement: my dissertation data is under a private agreement with a company; those aren't always easy to come by, but they can be quite nice. Purchased: like I said, English-Corpora.org is one you can purchase, and you can purchase mass amounts of Twitter data through the firehose as well. And then, of course, scraped, where, as we said, maybe you don't care, but I would say check the terms of use.

Once you have your data, you need to do some preparation before you can analyse it. So I wanted to take a poll of those of you who do text analysis: how many of you use Python to clean, prepare and analyse your data? OK, maybe five. And how many of you use R? OK, a little bit more, but not a ton more. If you use R, do you use the tidytext packages for cleaning your data? Yeah, for the most part, OK. So at the Princeton SICSS, in their computational text analysis session, Maya and Aden's advisor Chris Bail teaches it, and he has tons of examples with tidytext. As we talked about at lunch, I love the tidyverse, especially for text analysis (anything Hadley Wickham works on seems trustworthy), but I thought I would introduce a different approach. When we do our tutorial we're going to use the quanteda package. Does anyone use that? It's very, very similar, but different. I think it's just nice to know what tools you have at your disposal beyond the tidy packages.

OK, before we get to that: say we now have a corpus of texts that we want to analyse, and we need to pre-process them. Maybe it looks like this. This is actually a piece of scraped data (I have to admit I didn't look at the terms of use, but it's just a small bit), some sort of contemporary art criticism document. And as you can see, there's not just text: there are Unicode non-ASCII characters, there's HTML, and we probably don't want to analyse those.
In the context of R, what you're largely going to use to get those out is regular expressions, via grep commands; grep stands for "globally search a regular expression and print". Over the break we were talking about how we all kind of hate regular expressions. They're cumbersome, and there are a ton of packages in R with them built in, so you don't really have to think about it. But if you want to do text analysis, especially if you're going to do a lot of it, I do encourage you to get familiar with regular expressions. I'll share this later, but there's an app online with crossword puzzles you fill out using regular expressions, to get better at them. They can be a pain, but this is roughly what it looks like in code. You define a variable as whatever your text is, and then, with one of the functions in base R (so you don't have to load a package), gsub, we're saying: find this pattern, the tab character, and replace it with a blank in this variable, text. Out it comes: you haven't removed the HTML, or similarly the Unicode, but you did remove the tabs. That's the intuition of all regular-expression cleaning. Obviously, something like Unicode is a whole system of codes for encoding special characters in text, and you're not going to want to handle each of those as a regular expression yourself; that's where the packages that have already coded this up for you can be useful. But sometimes they just don't work, and then it's really good to know your regular expressions. This is another one, for taking out the HTML: you're substituting anything inside angle brackets, which is how HTML markup is written, with nothing. Here it's wrapped up as a function, so you can just feed your text to that function and it will remove all of the HTML. And here's a cheat sheet with a whole bunch of regular expressions so you can start to learn them; I'll try to find that crossword puzzle in case you're interested.
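To make that concrete, here is a minimal sketch of the gsub calls just described. The example string is made up, and I'm replacing tabs with a space rather than an empty string so the words don't run together:

```r
text <- "Some scraped\ttext with <p>markup</p> in it."

# Base R: swap tab characters for a space
text <- gsub("\t", " ", text)

# Strip anything between angle brackets, i.e. HTML tags
strip_html <- function(x) gsub("<[^>]+>", "", x)
text <- strip_html(text)

# Squash any leftover runs of whitespace
gsub("\\s+", " ", trimws(text))
#> [1] "Some scraped text with markup in it."
```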
After cleaning all of that out, this is ideally what your text looks like before you start to analyse it: just clean text, with no other code and no extra white space. There are a few more pre-processing steps that are optional and, I think, more substantive decisions, and we're going to discuss those. Any questions or comments on that so far? OK.

Going a little further into pre-processing, we have things like stop-word removal. I'm pretty sure everyone in here is familiar with stop words: they're basically just very common words that aren't really substantive if you're looking at the topic or content of your text, things like "at", "the", "and", et cetera. But they might also be corpus-specific. Maybe you're looking at some sort of religious text and there's a certain religious term you don't want to analyse over and over and over again, or specific names you want removed. You can add those to your dictionary of stop words and remove them too. When we do our tutorial a little later, we'll see how to programmatically remove stop words.

Then there's the option of stemming or lemmatising your text. Stemming removes the endings of conjugated verbs and plural nouns, returning only the stem of the word. So "running" would become "run", and the verb "saw", as in "I saw something", would remain "saw". If you lemmatise, you actually get to the base form of the word. The noun "saw", as in sawing a piece of wood, would remain "saw", but the verb "saw" would become "see", because that's its base form; "seeing", "saw" and "sees" would all become "see" in your text. If you think about it, these can be very substantive decisions, and there has been research showing that in certain contexts you get different results if you stem or you don't, or if you remove certain stop words and not others. So you want to think about that, and this comes back to what we were talking about before: in this somewhat wild-west era of computational social science, we don't necessarily have standards of reporting. Whenever I review papers that do computational text analysis, I ask the authors to provide, if not their code, then at least which words they removed (ideally the code), plus robustness checks for what happens if you change some of those parameters around. Ideally nothing would change. But sometimes it does.
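A small sketch of stemming versus lemmatising, using the SnowballC and textstem packages (both assumed installed; note that textstem's default lookup table is not part-of-speech aware, so it cannot keep the noun "saw" distinct the way I just described):

```r
library(SnowballC)  # Porter stemmer
library(textstem)   # lookup-table lemmatiser

words <- c("running", "saw", "seeing", "sees")

# Stemming chops endings: "run" "saw" "see" "see"
wordStem(words, language = "en")

# Lemmatising maps to base forms; the simple lookup sends "saw" to "see"
lemmatize_words(words)
```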
Other options in pre-processing concern tokenisation, where a token is just an individual unit of text. You can treat tokens as unigrams, where each individual word is a token, so something like "New York City" would be separated into "New", "York" and "City". If we then do something like a topic model, or an embedding model like we will later, those are probably likely to show up in the same cluster or topic anyway, because they're effectively the same entity. Or you could tokenise into bigrams and get "New York" and "York City". There are n-gram detection procedures that help you predict which n-grams are real multi-word expressions, based on how commonly they occur together; "York City" would probably get kicked out, because that doesn't happen as much, but "New York" would stay in. And then you have "New York City", which is a trigram. You can go longer and longer if you like.

You can identify parts of speech. This is a common output of part-of-speech tagging, if you want to identify which of your words fall into which part of speech: a singular noun, a plural noun, a verb. Maybe you want to do that so you can then remove all adjectives, or remove all nouns. There was a professor in our department (somebody mentioned moral foundations theory earlier; multiple people did) who was interested in doing topic modelling where you remove the nouns: basically trying to get at moral foundations by clustering only adjectives. Once you have the adjectives attached to their nouns, you get rid of the nouns and look only at the adjectives, to see how the clusters of morality around words are shaped in different corpora.

And then there's identifying named entities, which is a subtask of information extraction that seeks to locate and classify named-entity mentions in unstructured text into predefined categories. I have a little tutorial we can go through using the Google named-entity API. There are others, but it's kind of neat: you send it the text, and it returns the text with the named entities picked out. If there's a Wikipedia page associated with an entity, it gives you the link to that page, and it tells you whether the entity is a person, a place, an institution, so you can start to get at what other entities are spoken about in your texts.
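The tutorial I provide uses the Google Natural Language API for this, but as a locally runnable sketch, the spacyr package (a quanteda companion; it assumes spaCy and an English model are installed underneath) does both part-of-speech tagging and named-entity recognition:

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("Taylor Brown is a PhD candidate at Duke University.",
                      pos = TRUE, entity = TRUE)
parsed                  # one row per token: lemma, POS tag, entity type
entity_extract(parsed)  # just the named entities (PERSON, ORG, GPE, ...)
```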
And so, yeah, at that point we'll move on to the tutorial, pre-processing some of the things we just discussed; I think I have a slide for it. Oh, but right before we do that: when we come back, we'll also talk about analysing the texts. I just wanted to make sure that you all have access. Did you all get access to the materials? I sent the link on Slack; raise your hand if you don't have the materials yet. On Slack there's a link, bit.ly slash SICSS Oxford, and all of the materials, the tutorials and the data should be there. I'll come to you right afterwards.

But yeah, we'll clean the text, and when we come back after the break we're going to analyse the text, so I thought I'd give a precursor to that. In the computational analysis world we've all heard the terms supervised, unsupervised and semi-supervised for the different methods you can use in machine learning and these sorts of things, and I just wanted to make sure we're all on the same page as to where we, and the methods we'll learn, fit within that. With supervised learning, you give your algorithm a set of labelled data, meaning you have multiple cases and you're saying: this is a woman, this is a man, this is a woman, this is a man. The algorithm trains on that set and then helps you predict for a set where you don't have those labels. With unsupervised learning, you don't have any labels to begin with. You don't have "which topic does this word belong to"; you don't have "is this a male or a female candidate, are they at high risk or not". It just, without supervision, trains and starts to predict what those categories might be. And semi-supervised is a combination of the two. Usually it's something like: you have some labelled data, but a ton of unlabelled data. So you take your labelled data and train an algorithm, predict on some section of the unlabelled data, and now those have labels; you bring them back in, retrain, and then predict more. It's a cyclical process.
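That loop is easy to sketch. Here is a schematic, purely illustrative version: hypothetical data frames with a 0/1 "label" column plus numeric predictors, an arbitrary confidence threshold, and plain logistic regression standing in for whatever classifier you would actually use:

```r
self_train <- function(labelled, unlabelled, rounds = 3) {
  fit <- NULL
  for (i in seq_len(rounds)) {
    # Train on everything labelled so far
    fit  <- glm(label ~ ., data = labelled, family = binomial)
    # Predict on the unlabelled pool
    prob <- predict(fit, newdata = unlabelled, type = "response")
    sure <- abs(prob - 0.5) > 0.45          # keep only confident predictions
    if (!any(sure)) break
    pseudo <- unlabelled[sure, , drop = FALSE]
    pseudo$label <- as.integer(prob[sure] > 0.5)
    labelled   <- rbind(labelled, pseudo)            # fold pseudo-labels in
    unlabelled <- unlabelled[!sure, , drop = FALSE]  # retrain on next pass
  }
  fit
}
```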
The things that we're going to learn, topic modelling, word embeddings and network analysis, are largely classified as unsupervised, because a lot of the time we're just feeding in data and it's giving something back out. But in the context of social-scientific research I'd put it a little closer to semi-supervised: not in the sense that we're adding labels and retraining algorithms, but in that we very intentionally inform the algorithm with our own substantive knowledge, which we'll talk about further. As Chris and colleagues note in their introduction to computational social science, there was maybe hope at the beginning (I think it wasn't as widespread as we sometimes like to think) that with tons of data we wouldn't need theory anymore, and we've learnt that that's really not the case. That's true with these methods too. Topic modelling doesn't mean you don't have to read your texts; in fact, it's very important that you're familiar with the texts in the corpora you're working with. So it's kind of "unsupervised-ish"; it's just that the formal definition of semi-supervised is a little different. But yeah, when we come back, those are the three methods we'll work with. Before that, we're going to clean our data. [Brief audience exchange, partly inaudible.] Well, thank you.

[Audience:] I just want to add some pre-processing steps, things I ran into at some point that helped me. First, if you have text in different languages, you can easily use a translation API. It can also help you get rid of spelling mistakes: if you translate to another language and then back to English, it actually fixes them. In general there are pre-processing steps that fix critical spelling errors, which otherwise take a long time; TextBlob, I think, is one. And the last thing is that you can use fuzzy matching if your data has a lot of names of companies and that kind of stuff, which all get mentioned in different ways, with or without "Inc." and so on. Getting those together as a pre-processing step can be helpful too.

Yeah, those are both really great points. I think fuzzy matching is often used when you have things like "Disney" versus "Disney, Inc."; it's the same thing (there's a small sketch of that below). The named-entity stuff can sometimes help there as well, because the variants will have the same URL attached to them or something. Anyone else? Experiences? We could go on forever, I'm sure, with horror stories of data, but anything you've learnt? OK, maybe they'll come out during the tutorial, which is what we're going to switch to now. The script we're going to start with is the HTML file for the quanteda tutorial; I think it's just called "pre-processing". Now we're going to switch to the HDMI; I'm turning myself off. Oh yeah, anyone on the livestream: the link to these materials is in the bit.ly link that you should see on your screen as soon as we turn off the slides.
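Picking up that fuzzy-matching suggestion: a minimal sketch with base R's agrep and the stringdist package (assumed installed); the company names are made up:

```r
company <- c("Disney", "Disney, Inc.", "The Walt Disney Company", "Dysney")

# Approximate (edit-distance) matching against a canonical form
agrep("Disney", company, max.distance = 1)

library(stringdist)
# Jaro-Winkler similarity to the canonical form; values near 1 are
# candidates to merge into one entity
stringsim("disney", tolower(company), method = "jw")
```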
We're back. Just a reminder for anyone on the livestream: there should be a bit.ly link you can follow; if not, it's bit.ly slash SICSS Oxford, and the file we're looking at is the HTML on pre-processing text. As I said, we'll work mostly with the quanteda package, but also some things from the tidyverse. I'll just walk through it, and anyone can ask questions. Obviously, for anyone with experience in R or Python, there are a million ways to do any one thing, so I'm just showing one example; if you know a more efficient way, feel free to interrupt.

First we load the packages and then load the texts, which, as I mentioned, are a set of abstracts from sociology journals between, I think, 2008 and about halfway through 2012. You see here, with the printout, a preview of what the data frame looks like: the source, meaning which journal the article came from; the first author (I didn't extract all of the authors, only the first); the year; and an id so you can track back to which file each abstract came from.

Other ways to read in your data: your data can obviously come in a ton of different formats. I like readtext a lot (there's also the readr package, which is pretty great, but for quanteda this readtext function is pretty great). There's a function just for reading in Twitter data; again, that's representative of how prominent Twitter data is compared to other sources, that it gets its own reader. There's JSON, text files, multiple text files. Say you have (I often have this) a folder on your computer with a ton of text files and you don't want to load each one individually: if you just write your path and then star dot txt, or whatever the format of your texts is, it will read in all of them. (OK, I'll stay over here.) And within quanteda you also have what they call docvars from the file names; you can pre-specify those, and we'll talk a little more about them. In the example on the slide it was reading in State of the Union addresses (again, a speech given by the US president), with the file names saying which president and which year. It reads XML files and CSVs too.
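A minimal sketch of those readtext patterns (the paths and file-name scheme here are hypothetical):

```r
library(readtext)

# All .txt files in a folder, pulling docvars out of file names
# like "obama-2012.txt" -> president = "obama", year = "2012"
sotu <- readtext("data/sotu/*.txt",
                 docvarsfrom = "filenames",
                 docvarnames = c("president", "year"),
                 dvsep = "-")

# A CSV whose text lives in a column called "abstract";
# readtext renames that column to "text" for you
socabs <- readtext("data/sociology_abstracts.csv", text_field = "abstract")
```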
One thing that's really neat about readtext is that when you read in text, you declare this parameter, text_field, saying which column or field in your original file holds the text, and it reads it in and names that column "text". So if you had a bunch of different CSV files where the text column was named slightly differently in each, using this will rename all of them to "text", and you don't have to do it yourself.

This next bit is me; I don't know how to de-duplicate texts with the quanteda package itself (there might be a way). Say that when you were scraping, the same web page had two different URLs and you accidentally captured both, so it's the exact same text but some of the other metadata is different. Just de-duplicating the data frame at large wouldn't work; you only want to de-duplicate on the text. I use dplyr for that. This is basically just taking our sociology data frame, grouping it by the text, and de-duplicating on that. If you want me to explain it further, I'm happy to, but this is one way you could remove duplicated texts whose other metadata differs. It might be risky, though, because maybe that metadata is really where the important difference lies. But let's run that. I don't think there were any duplicates in this particular corpus, but I've definitely had duplicates in contexts where I scraped data.
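One way to do that de-duplication on the text column only, keeping the first row's metadata for each distinct text (a dplyr sketch; "sociology" stands in for the data frame above):

```r
library(dplyr)

sociology <- distinct(sociology, text, .keep_all = TRUE)
```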
Then, in quanteda, similar in some respects to other packages, you create what's called a corpus, which is designed to be, as it says, a library of original documents that have been converted to plain, UTF-8-encoded text and stored along with metadata at the corpus level and at the document level. quanteda has this special document-level metadata called docvars: variables associated with each document. In the case of our sociology abstracts, an instance of a docvar would be which journal an abstract comes from, or what year it was published. The corpus in quanteda, as it says here, is not really designed to be where you conduct your analysis. It's like your original data: when you were first learning Stata or SPSS, it was "don't change the original data, don't touch that", right? It's similar with the corpus. If you want to do multiple different analyses or robustness checks, that's where the original data lives.

So we're going to create a corpus out of our data frame (right now it's just a traditional data frame with the text and so on) and take a look at it. It looks pretty much the same, but the formatting and metadata of a corpus are slightly different and allow for certain quanteda analyses.

If we want to add document-level variables that weren't originally in the imported data (again, a docvar would be something like the journal or the year), we can create one on the fly. In this case, just because it was what I thought of first, we're going to identify two of the prominent American sociology journals, the American Journal of Sociology and the American Sociological Review, which a lot of people would love to have a solo-authored or co-authored paper in. Those are kind of the paradigm cases of a prestigious journal in US sociology, so we can use them as a heuristic for tip-top journal publication, and I'm just going to make a binary coding of that. Here we only see the first five examples, but there's this new variable, AJS_ASR, and none of these first five are from those journals (as you can see, they're in Sociology of Religion or Symbolic Interaction), so they have a value of FALSE. We've just created a new document variable: is this document from AJS or ASR? There are many other ways; as I mentioned, regex would be your friend for creating new docvars. But this is just one example. Any questions? All makes sense?

If we wanted to add a corpus-level variable, one that's not about individual documents but about the corpus as a whole, we'd do that with metacorpus(). I just added the date yesterday, June 18th. That's meta-information about the corpus; you could also add who collected it and who created it. If you plan to make the corpus open source in the future, that can be a really useful place to store its metadata.
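A compact sketch of those corpus steps (quanteda v1 API; the exact column and journal names here are assumptions):

```r
library(quanteda)

corp <- corpus(sociology, text_field = "text")

# New document-level variable: is this abstract from AJS or ASR?
docvars(corp, "AJS_ASR") <- docvars(corp, "source") %in%
  c("American Journal of Sociology", "American Sociological Review")

# Corpus-level metadata: who made it, when, and so on
metacorpus(corp, "created") <- "2019-06-18"
```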
To take a look at an example, here's the text of the second abstract. This one is from, I believe, Sociology of Religion, and this is what the text looks like. Luckily it's pretty clean; like we said, there are no non-ASCII characters or stray Unicode, so we can be happy about that, because that's a luxury.

If we want to summarise the corpus a bit, I'm just summarising here the number of abstracts by year. There we go. Like I said, it cuts off partway through 2012, but it looks like there are approximately fifteen hundred articles for each year. If we want to see which text is the longest: it looks like there are 501 tokens in one abstract, like they went one word over what is often our 500-word abstract limit. If you happen to have more than one corpus and you want to combine them, it's pretty simple: you just plus them together. And if you want to subset your corpus (say we only wanted to look at 2010), this is how we'd do it. Let me move this over so it's not cramping our style.

Then we can begin to explore the text itself. Say, since I study art, I want to see which abstracts mention the word "art", and in what context. Here are some: it tells me which abstract each hit is from and then the context around it. "I argue there is art exertion...": who knows what that one is about. "Average consumers of art and culture...". You can then start to look specifically for things relevant to your own research ("the practise of art as prayer"), or maybe this kind of exploration starts to give you ideas about what sort of topics there might be, if you're going to do topic modelling or something like that. So you can start to explore your data, and there are evidently a lot of hits for "art". You can also inspect the document variables you defined, or the metacorpus variables; those we've already talked about.
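A sketch of those exploration steps, using the same corpus object as above (the "year" docvar comes from our data):

```r
ntoken(corp)[which.max(ntoken(corp))]   # the longest abstract, in tokens

corp_2010 <- corpus_subset(corp, year == 2010)   # subset on a docvar

kwic(corp, "art", window = 4)   # keyword-in-context hits for "art"
```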
Then, when you want to perform your analysis, you're usually going to create some sort of term-document matrix; in quanteda it's a document-feature matrix (dfm), where the documents are the rows and the features, the words, are the columns. (You'll also see weighted versions, like tf-idf matrices, used in topic modelling.) And there are a lot of different ways to tokenise. As I mentioned before, a token in text analysis is your unit of analysis, and that might be a word, a whole sentence, bigrams or n-grams, even just punctuation; it's whatever you want. So I'm going to show, within quanteda, all the different ways you can very easily tokenise. We're starting with three sentences I came up with, and if anybody can identify which sources they're from, I'll buy you a drink or coffee later.

So these are the texts, and the basic tokens() call just splits them into words; it's not looking at bigrams or trigrams or sentences, just words. So "If you're happy in a dream, does that count?" gets split into "if", "you're", "happy", and so on. But there are all these parameters you can set in the tokens() function: remove numbers, remove punctuation, remove symbols, remove separators, remove Twitter things like hashtags and at-signs, remove hyphens, remove URLs. I'm not going to go through them all, but if you run the different calls you'll see how the same sentences produce different tokens. And as we discussed with n-grams, tokens() can also do n-grams; you just specify the ngrams argument. Here, with ngrams one through two, I'm saying give me unigrams and bigrams. If you did two through three, you'd get only bigrams and trigrams; one through three, unigrams, bigrams and trigrams; and so on. So that same sentence (let's look at the shortest one, "If you're happy in a dream, does that count?") also has "if_you're", "you're_happy", "happy_in", "in_a", and so on. You can also tokenise by character: instead of words, I get characters. I don't know why you'd do this (which is why I wrote "why would you ever do this?"), but it will split the text into letters. If you're thinking you'd do it to count the number of characters, there's an easier way; but it's possible. Or you can tokenise by sentence. Each of these examples is only one sentence long... oh no, it's this one, OK: "This is not 'Nam, this is bowling. There are rules." Similarly, if you can tell me where that's from, you get a drink or chocolate or something. It will just split the text into sentences; now sentences are your unit of analysis, they're your tokens. Cool. Super cool.

OK, constructing the document-feature matrix is similarly quite easy in the quanteda package: it's just this dfm() call. The tokens() function didn't do anything unless you told it to, right? It didn't lowercase, it didn't remove punctuation, none of that, unless you specified it. dfm() will do a lot of that on its own. You can override it, but by default, for example, it lowercases, so a word capitalised at the beginning of a sentence won't get counted as a different token; it'll be the same token if it's the same word.
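The tokenisation calls demonstrated above, gathered into one sketch:

```r
txt <- "If you're happy in a dream, does that count?"

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks                              # word tokens
tokens_ngrams(toks, n = 1:2)      # unigrams and bigrams ("if_you're", ...)
tokens(txt, what = "character")   # letters as tokens
tokens(txt, what = "sentence")    # whole sentences as tokens

dfm(toks)                         # document-feature matrix (lowercased)
```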
You can also remove stop words and stem, which is what we talked about earlier; in the tutorial it's really easy. You just say you want to remove English stop words, set stem to TRUE, and remove punctuation. Looking at what sort of stop words are in this English dictionary: it's "i", "me", "my", et cetera. But you can add to it yourself if you want. Like I said, maybe you're analysing text messages, and they're all yours, so your name is in there a million times and you think: I don't want me to be in there. You could add your name as a stop word. (Sorry, guys, it would be very interesting to analyse your text messages; I've never done that.) So this is an example of that: adding the word "will" to the dictionary of stop words to remove.

Then this topfeatures() function (oh, I have to define that first) will show you the top features, the top words occurring in your corpus. In this case, maybe unsurprisingly, "social" is top, then data, study, research, article, using, health, women, results, amongst, religious, also. It's kind of like a sentence. So those are some of the top tokens used in sociology abstracts. You can also make a word cloud; I kind of love-hate word clouds, and this one takes a while to run. It's in your HTML, with some ungodly palette that got selected to go along with it. I shouldn't have clicked on that; you all know what a word cloud looks like. All right, there you go. Ooh, very tropical. [Audience: you had to map word frequency to the font size yourself?] Yeah, the font size of a word is how common that word is in the corpus.

You can also group documents by docvars. Say you want a document-feature matrix by year, to analyse the years separately: it will do that. So here we're able to see (if we normalised the number of tokens across years you could see how terms go up and down; this is just raw counts) that "health" was more popular in 2010 and went down, for whatever reason, in 2011. And the years have nearly the same number of documents, if I remember the histogram from the beginning. So that's one thing you can do.
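Stop-word removal, stemming and grouping, as shown in the tutorial (quanteda v1 dfm() arguments; later versions split these out into dfm_remove(), dfm_wordstem() and dfm_group()):

```r
dfmat <- dfm(corp,
             remove = c(stopwords("english"), "will"),  # plus custom stops
             stem = TRUE,
             remove_punct = TRUE)

topfeatures(dfmat, 10)   # ten most frequent features in the corpus

dfm_by_year <- dfm(corp, groups = "year")   # one row of counts per year
```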
448 00:49:29,510 --> 00:49:36,750 We're not going to... well, maybe we will. Yeah? [Audience question, partly inaudible:] What did you have as the parameters? 449 00:49:36,750 --> 00:49:43,910 Is it more advisable to group the documents when they're collapsed down like that? 450 00:49:43,910 --> 00:49:50,600 In this case, were those the years as the rows and the tokens as the columns? 451 00:49:50,600 --> 00:49:56,510 Well, it's still the same matrix of just the documents and the features, or the tokens. 452 00:49:56,510 --> 00:50:01,130 And then the analysis just does that by year and gives you this table. 453 00:50:01,130 --> 00:50:12,260 [Audience follow-up, partly inaudible:] So is every single piece of information stored in the list, and you get the header just by typing the command? 454 00:50:12,260 --> 00:50:19,280 Yes. And I don't know exactly what the command does underneath; I think it produces that header for you. Yeah, it will. 455 00:50:19,280 --> 00:50:27,350 It produces the header for you there. The matrix is just a matrix, which is a bunch of lists, you know, in the columns. 456 00:50:27,350 --> 00:50:36,560 But separate from that is the corpus, where the metadata resides, and it's just drawing on that to create this table for you. 457 00:50:36,560 --> 00:50:45,260 [Audience question, partly inaudible:] But I'm interested in, for example, president and government together. 458 00:50:45,260 --> 00:50:49,160 Could you group on both? OK. 459 00:50:49,160 --> 00:50:56,480 Yeah. So I'm just wondering whether that's possible. 460 00:50:56,480 --> 00:51:02,100 Yeah. I would probably create a matrix 461 00:51:02,100 --> 00:51:06,390 by sort of one combined category, like president within year. 462 00:51:06,390 --> 00:51:13,530 Yeah, I'm trying to think offhand, because you want all of the different combinations of your two features. 463 00:51:13,530 --> 00:51:15,990 I don't think there's a straightforward way to do it with quanteda, 464 00:51:15,990 --> 00:51:23,310 but there might be; you could look in the documentation. It's definitely doable. 465 00:51:23,310 --> 00:51:29,900 Yeah. Sometimes we have, as the slide says, 466 00:51:29,900 --> 00:51:39,220 some prior intuition about words that are particularly important inside of our texts, and we might want to know how those relate. 467 00:51:39,220 --> 00:51:45,980 This slide shouldn't say "women", that's probably incorrect; I was looking at "men" and "women", but then I switched it to "culture" and "structure", which are two common terms. 468 00:51:45,980 --> 00:51:53,180 We have previously thought of them as either oppositions or ends of a spectrum within sociology. 469 00:51:53,180 --> 00:51:59,120 There's obviously a lot of theory that would argue against that, but just in terms of how we use them in abstracts, 470 00:51:59,120 --> 00:52:07,830 maybe we're interested in how those are used separately. And so we could just get a count of which one's winning. 471 00:52:07,830 --> 00:52:14,940 Do we talk more about culture? Do we talk more about structure? And according to the first five, we talk more about culture. 472 00:52:14,940 --> 00:52:22,410 But culture is structure to some people. So here you can use external dictionaries. 473 00:52:22,410 --> 00:52:27,990 So maybe you want to do the same sort of thing, but you want to look at, like, LIWC dictionary terms. 474 00:52:27,990 --> 00:52:32,100 You have to pay for those, which I didn't want to do last night. 475 00:52:32,100 --> 00:52:38,700 So I just gave you the code for how you would import those terms, and then you could do the same analysis, 476 00:52:38,700 --> 00:52:46,230 looking at the frequency over time.
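A sketch of both of those ideas, counting a handful of focal terms and applying an external dictionary; the glob patterns and the LIWC file name are placeholders, since the LIWC files are paid and live wherever you put your copy.

# Which word is winning: total counts of two stems of interest
colSums(dfm_select(dfmat, pattern = c("cultur*", "structur*")))

# Import LIWC-format dictionary terms and apply them to the matrix
liwc <- dictionary(file = "LIWC2015_English.dic", format = "LIWC")
dfmat_liwc <- dfm_lookup(dfmat, dictionary = liwc)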
477 00:52:46,230 --> 00:52:50,350 You can also look at the similarity of texts by running this code; there's a sketch of the call below, too. I don't know if this will show up, actually. 478 00:52:50,350 --> 00:53:01,510 It's based on the cosine similarity of the term-document matrix; it might take a little bit. 479 00:53:01,510 --> 00:53:10,810 So this is just giving you an example of the cosine similarity of the first document with other documents. 480 00:53:10,810 --> 00:53:19,910 You could then use those as weights in a network analysis or something like that. You can do the same with specific words. 481 00:53:19,910 --> 00:53:31,890 So if you wanted to look at the cosine similarity of "race" with other tokens in your corpus, 482 00:53:31,890 --> 00:53:38,260 it looks like it has a very high similarity with the word "racial", which maybe isn't surprising. 483 00:53:38,260 --> 00:53:43,180 If you go down to the "gender" list, it's kind of intuitive as well. 484 00:53:43,180 --> 00:53:52,400 So with "gender", "women" is the most similar term, but then "men", "equal", "differ", "male". 485 00:53:52,400 --> 00:53:59,120 Remember, we stemmed these, so "equal" could be "equality"; and "legislation". 486 00:53:59,120 --> 00:54:04,790 So yeah, it's the cosine similarity from the document-feature matrix. 487 00:54:04,790 --> 00:54:11,270 Yeah. And then... 488 00:54:11,270 --> 00:54:15,560 Oh, we're not going to do this because we have a whole tutorial on structural topic modelling, 489 00:54:15,560 --> 00:54:20,810 but from the quanteda package you can do topic modelling, and it's pretty straightforward; that's sketched below as well. 490 00:54:20,810 --> 00:54:28,280 So here you're just identifying how many topics you want, and you're saying, I'm going to use latent Dirichlet allocation, LDA. 491 00:54:28,280 --> 00:54:34,520 And then you get the terms and the topic distributions. So if you want, you can start doing topic modelling that way. 492 00:54:34,520 --> 00:54:41,970 Topic models that way. Yeah. OK, it's 3:15. 493 00:54:41,970 --> 00:54:48,870 Should we take a break? Or no, because I have two other scripts on preprocessing that I've provided, 494 00:54:48,870 --> 00:54:53,340 but you can also just play with those at your leisure. One would be a bit complicated for us to go through, 495 00:54:53,340 --> 00:55:02,660 but it's working with that Google API for detecting named entities, and that is, I think, just called 496 00:55:02,660 --> 00:55:10,130 NLP; it's the NLP tutorial. And there's another one on sentiment analysis. At the core SICSS at Princeton, 497 00:55:10,130 --> 00:55:17,240 they do more work on sentiment analysis. It's pretty straightforward, though: to get the sentiment, you just give it a dictionary, 498 00:55:17,240 --> 00:55:21,440 and it looks at the words and tells you what the sentiment of those words is, by and large. 499 00:55:21,440 --> 00:55:25,700 That's the simplest form of sentiment analysis; there's a sketch of that below too. So it's not even a tutorial. 500 00:55:25,700 --> 00:55:30,380 I just provided you a script that you could use, in case 501 00:55:30,380 --> 00:55:37,250 any of you want to do that in your group project and you want a base to build on. But I'm happy to go through the Google named entity thing. 502 00:55:37,250 --> 00:55:44,030 Often there are credentials, it's an API, right? So there could be credentialing problems. Or we can just take a break. 503 00:55:44,030 --> 00:55:49,550 Take a break. OK, I'm going to do that.
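Here is a minimal sketch of the cosine-similarity calls described above, using the textstat_simil() function (now in the companion quanteda.textstats package); the document and feature names are illustrative.

library(quanteda.textstats)

# Cosine similarity of the first document with all of the documents
sim_docs <- textstat_simil(dfmat, dfmat[1, ],
                           margin = "documents", method = "cosine")

# The same idea over features: which tokens are most similar to "race"?
sim_race <- textstat_simil(dfmat, dfmat[, "race"],
                           margin = "features", method = "cosine")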
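The topic-modelling step mentioned above can be sketched by converting the dfm and handing it to the topicmodels package; k = 10 is an arbitrary illustrative choice.

library(topicmodels)

# Convert the quanteda dfm into the format topicmodels expects
dtm <- convert(dfmat, to = "topicmodels")

# Fit a latent Dirichlet allocation model with ten topics
lda <- LDA(dtm, k = 10)

# Top terms per topic and per-document topic distributions
terms(lda, 10)
posterior(lda)$topics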
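And dictionary-based sentiment in its simplest form might look like this sketch, using the Lexicoder Sentiment Dictionary that ships with quanteda rather than whatever dictionary the workshop script used.

# Count positive and negative words per document
dfmat_sent <- dfm_lookup(dfmat, dictionary = data_dictionary_LSD2015)

# Net sentiment per document: positive minus negative counts
net <- as.numeric(dfmat_sent[, "positive"]) -
  as.numeric(dfmat_sent[, "negative"])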
504 00:55:49,550 --> 00:56:00,450 Unless there are questions, right? [Audience question, partly inaudible, about how long it takes these models to run.] Oh, yeah. 505 00:56:00,450 --> 00:56:07,620 So a lot of the text analysis stuff, not necessarily the preprocessing but the models we run, a lot of them take a long time. 506 00:56:07,620 --> 00:56:12,960 So we're not going to run a lot of it live. But it's all in the HTML, and then I'll give you the markdown file later. 507 00:56:12,960 --> 00:56:17,550 If you want it, you have it. You can copy-paste later at your leisure. 508 00:56:17,550 --> 00:56:25,990 OK. Yeah. [Audience question, partly inaudible:] Like you said earlier, they knew the news outlet. 509 00:56:25,990 --> 00:56:31,650 So where would you get text from, say, major 510 00:56:31,650 --> 00:56:37,020 news outlets, given the public's fascination with newspapers and things? 511 00:56:37,020 --> 00:56:43,230 There are a lot. Yeah, and I'm sure plenty of people in this room might know of more. 512 00:56:43,230 --> 00:56:53,490 I know, for instance, that english-corpora.org, which I told you about, has one of the larger corpora; I think it's called COCA. 513 00:56:53,490 --> 00:57:00,780 They subdivide the corpus into, like, magazines, blogs, newspapers. And within newspapers, 514 00:57:00,780 --> 00:57:07,980 there are multiple different newspapers. But the New York Times, for example, is kind of epically good in terms of being able to get text from their API. 515 00:57:07,980 --> 00:57:14,581 I have a New York Times corpus if you want to just toy with it; I think it's their op-eds.