So welcome back from the snack break. We are going to be talking about three methods of large-scale text analysis. There are others — there are dictionary-based methods, there are methods I've never even heard of — but these are three that are pretty prominent. The first is topic modelling, which you've probably all at least heard of, if not done; I'm sure a lot of you have run at least a basic topic model. Word embedding models are a little less common thus far in the social sciences, although I know a lot of people are working with them now, and they're really pretty cool and quite different from topic modelling. And then there are network-based approaches to text analysis, which are pretty long-standing but also have a ton of room to develop and for people to contribute.

So in our last session we talked about where we get this data. Then we talked about how we clean and pre-process it for analysis, and now we're going to analyse it. Before we do that, I'll just give us a point of reference. If I gave you a massive stack of documents that you'd never seen before, and asked you to find some way to read them and tell me what they're about, what the general themes are, what would be your process? Maybe some of you have done this — qualitative coding, right? How would you go about it?

[Audience response, partly inaudible: read through some of it and try to pull out themes.]

Yeah. I think that's probably what a lot of people would have said. What he said was basically: you take maybe a sub-sample of the corpus and read it, pick out some themes, and then, in a rather iterative process, you'd maybe expand or collapse the number of themes as you start to read and apply them to the entire corpus. Anything else, or is that roughly what people thought they would do? Yeah. So that's a pretty traditional way, and it's still a very valuable way of doing text analysis. But I wanted to give it as a reference point for these other methods that we use in computational text analysis.

So, first, topic models. Just to ask: how many of you have used topic modelling in your research, even just playing around with it? Only a few? Cool — well, then this will be really fun. How many of you have heard of a topic model? A lot more. Yeah. So what is a topic model? OK — where was the idea first proposed?
Well, actually, it was developed in population genetics in 2000 and then independently developed by Blei and colleagues — this is what I was mentioning before — in the context of text analysis in 2003. So again, it doesn't just have to be applied to text, which is interesting. The distinguishing feature of topic models is that they are an automated procedure for coding the content of texts, including very large corpora, into a set of meaningful categories, or topics. This is done algorithmically, with minimal human intervention, and it is therefore a more inductive way of drawing out topics than the hand coding we discussed before.

How does it work? The really simple answer is that it relies on the assumption that language and meaning are relational. So even though topic modelling is what we call a bag-of-words method — meaning it doesn't really care about the rules of language, syntax, narrative order, parts of speech and so on — it does still assume that there are content clusters that share and create meta-meanings, or topics. The algorithm is a probabilistic model, and the most widely used one, which a lot of us have heard of, is LDA, or latent Dirichlet allocation, which we'll talk quite a bit about.

That's an introduction, but the core assumption is that within a corpus there is a distribution of different topics, and then within each topic there is a distribution of words. So maybe in one corpus of political texts there's a topic about health care and one about elections and whatnot. And if you went into the health care topic, every word in the corpus would have a probability of occurring within that topic. So you have a distribution of topics across the corpus and within each individual document, and then within each topic you have a probability distribution over the words. Does that make sense?

And instead of, as we talked about with qualitative coding, starting by reading the text, getting a sense of what the topics might be, adding to them or collapsing them — in topic modelling you pre-specify how many topics you want the algorithm to uncover. How you predict, or choose, the correct number of topics is a long-standing area of research, and we could go on forever about it. There are some methods — we'll see, when we look at structural topic modelling, what they've developed to help you distinguish what a good number of topics is. A lot of the time the answer for social scientists has just been: know your corpus, and ask whether these topics make sense.
You might hope that, OK, if I have five topics and I see that marriage and family are mixed together in one topic and I want those separated, then I'll make it six topics and see if that topic breaks apart. Those of you who have done this know that's not the way it works — the whole thing can totally change, and you'll get a topic you've never seen before while the marriage-and-family topic stays the same. So there is a bit of art to some of it, you might say, and I think that's all the more reason we need good reporting standards about what we did and how we made our decisions. But the idea behind the algorithms of topic modelling is that you want to reverse-engineer whatever it was that the speaker or author of the text was trying to convey in terms of ideas.

[Audience question, partly inaudible: do you define the topics yourself, or does the model generate them?]

No, the model generates them, and we'll talk about how. I'm sorry, I didn't explain this image at all, but basically, if you have this text, you can see the different topics coloured across the different words. You see "foundation", "million", "support" and "public" all in green, whereas "New York Philharmonic", "performing", "opera" and "music" are all in red — that's maybe the arts topic. And so you can see how, in this document, there's a distribution of these different topics. These look pretty equal in distribution, but there might be one topic with very little representation in the document. And then, for each topic, each of these words has a probability of occurring in that topic. So there's a distribution of words within each topic and a different distribution of topics across the documents. And basically, what you do with topic modelling, which is what we'll discuss now, is try to get it to uncover those distributions.

So the objective is: given a corpus of text and a specified number of topics, find the parameters that likely generated it — recreate, in a very basic form, the intent of the speaker in terms of what they wanted to talk about. The primary input is the text and the number of topics you want it to uncover. And then the process is basically this: the algorithm has been told how many topics there are, and it has a vocabulary of words because you provided the text. It will take a topic —
It doesn't have any information about that topic yet, but it knows that there's some probability of that topic appearing in a particular document across the corpus. So it selects a topic, and then, for that topic, it selects a word and puts that word into the bag of words that is eventually supposed to grow into the whole document. Essentially, the algorithm is trying to recreate the corpus that you gave it by iteratively doing this — drawing a topic, drawing a word within that topic, and putting it in the bag — until it recreates something like the corpus you have. And then it says: OK, you gave me this number of topics; this is the best I could do to recreate the corpus you gave me, given that number of topics, by changing the probability of each word occurring in each topic.

So the output you get is a word distribution for every topic. If we go back — if I then wanted to look just at this arts topic, the red one — it wouldn't give you the label, but you would call it the arts topic; we'll see this later. You could look at that topic and say: give me the word probabilities within it. Maybe "art" is super high, so it has a high probability of occurring whenever art is being discussed, whereas a word like "procedural" might have a low probability in an arts topic — but it would still be there. Every word in the corpus has a probability within every topic; it's just that some have a very, very low probability within a given topic and some have a higher one. Does that make sense? It's hard to wrap your mind around — it's been hard for me, over the years, to continually wrap my mind around it. But ultimately, as I said, you get a distribution of words for each topic and a distribution of topics over the corpus. So you know not only how likely a certain word is to pop up if a certain topic is being mentioned, but also how likely that topic is to be mentioned in the corpus, or within an individual document.
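To make those two outputs concrete, here is a minimal sketch in R of fitting a plain LDA model and pulling out both distributions. It is an illustration rather than the tutorial's own code: it assumes a quanteda document-feature matrix called my_dfm from the earlier pre-processing session, and it uses the topicmodels package rather than the stm package we turn to later.

    library(quanteda)
    library(topicmodels)

    # assumed to exist from the pre-processing session: a document-feature matrix
    dtm <- convert(my_dfm, to = "topicmodels")

    # fit LDA with a pre-specified number of topics
    lda_fit <- LDA(dtm, k = 20, method = "Gibbs", control = list(seed = 123))

    terms(lda_fit, 10)         # ten highest-probability words in each topic
    post <- posterior(lda_fit)
    post$terms[1, 1:10]        # word probabilities within topic 1 (one row per topic)
    post$topics[1, ]           # topic proportions for document 1 (one row per document)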
What are topic models not? I think this is a really important point. It's a long quote from Harold Lasswell, whom I mentioned earlier as a forerunner of content analysis in the modern age. I know it's long, and I hate when people read long quotes, but we're going to do it anyway. The content analyst should begin with traditional modes of research, and the person who wishes to use content analysis for a study of the propaganda of some political party, for example, should steep themselves in that propaganda before they begin to count. They should read it to detect characteristic mechanisms and devices. They should study the vocabulary and format. They should know the party organisation and personnel. From this knowledge they can organise their hypotheses and predictions. At that point, in a conventional study, they can start writing; at this point, in a content analysis, they are instead ready to set up their categories, pre-test them and start counting.

That was written in the context of a much more qualitative type of content analysis, but in my opinion it absolutely applies to topic modelling. In my first course in text data, with Neal Caren at UNC, my final project looked at a corpus of religious texts from a religion I was particularly familiar with. I was looking at the topic model over time, and there were anomalies in it that, had I not been familiar with that religion from a family background, I would not have understood — why those anomalies were there, and why one topic totally shifted in a particular year. It wasn't because the congregations suddenly changed, but because the leadership of that church totally shifted in that year, and the culture of that religion is very much driven by the personality and interests of whoever is the leader at the time. I knew who the leaders were, and one was really strict, and so it changed. But that's the sort of thing where you really want to understand your texts.

And in a more recent context, applied to topic modelling, we have this: seen in this light, it's useful to think of topic models not as providing an automatic text analysis programme, but rather as providing a lens that allows researchers working on a problem to view a relevant textual corpus in a different light and at a different scale. We can just look at way more text, and we can do it in a slightly different way than we would with qualitative coding. That's true of all the methods we'll use. It's just that reiteration that it's not as if you don't have to read the text anymore.

In the tutorial we're going to look at structural topic models — I'm not going to read this slide, because I've read enough of them. It's basically a topic model, but you get to add in variables, like demographic variables.
Maybe it's whether whoever said this was a Republican or a Democrat. Were they a man or a woman? Was it the 18th century or the 19th century? You can have continuous variables: what date was it said, how long is the text it came from, that sort of thing. And you use those to see how a particular topic changes depending on who said it or in what context. So again, this is a type of topic model designed for social researchers, which is pretty cool, I think, and it was developed around 2014. You can read more about it there. And we'll use the R package this time — I thought there was more on that slide; I guess not.

So, I was going to take a break and then do the structural topic model, but I kind of feel like I could just talk about word embeddings immediately and then do both of those tutorials back to back. Does that sound OK? Yeah. OK, so we're going to contrast topic modelling with word embedding models. A lot fewer people have used word embedding models, but I genuinely believe you will hear about them increasingly in the context of social scientific research. Part of the reason we're not just following the Princeton livestream is that there's so much to text analysis, right? They do an excellent job covering sentiment analysis and that sort of thing, but they don't cover word embeddings. So now the livestreamers can get even more than they would have otherwise — and you as well, because you can totally go watch the Princeton livestream later on if you want.

OK, so we just discussed topic modelling, where there are distributions of topics over your corpus and a distribution of words within every topic. Word embedding models are completely different. Well, first off, what is a word embedding model? A method of text analysis that results in a matrix of word vectors, in which words used in similar contexts are closer together in vector space than words used in different contexts. Contrast that with the output of a topic model, which was this distribution of topics over the corpus and distribution of words within each topic. You're getting at something very different here, something that has a lot more to do with the relationships between words within your corpus, not necessarily the topics in your corpus. Although, just out of curiosity, I did do something in the tutorial to see if we could also get something like topics out of a word embedding. I'll skip ahead real quick and come back.
This is kind of what your output could look like. It wouldn't really be in just two-dimensional space, and you can't see this, but these are all different words, and the ones that are similar are close together in this vector space, while the ones that are different are far apart. When we go back — can someone remind me when I'm streaming from my computer — we can go to, or you can go yourself to, projector.tensorflow.org, and you can see this in three-dimensional space, which is really more what the vectors are creating, because it's not a two-dimensional distance. But I'm probably getting ahead of myself.

So this is what we're trying to create: ultimately we want words that are similar to be close together and words that are different to be far apart. Early versions of this used latent semantic analysis — I'm not going to go into that; it would be too much if most of us haven't even heard of these methods yet. How does it do this? Basically, it uses a shallow neural network to produce these relationships — and we'll talk about what that means in this context. These ideas have been around for a long time, but as with a lot of things, progress in computing power and in neural nets has allowed word embeddings to go from doing a semi-good job to doing a very good job, and to help with things like machine translation or, you know, Siri on your phone. But we haven't used them much in social scientific research — in some respects because they're a little intimidating, though they're really not once you step into them, and also just because most of us haven't had the training. So this is hopefully a good introduction.

So how do these word embedding models actually work? There are two different approaches: a continuous bag-of-words approach and a skip-gram approach. Stop me if I'm getting ahead or if something doesn't make sense. OK, let's say we have a given text. The continuous bag-of-words approach would slowly read in that text, and for each word it would try to predict that word given the context the word appears in. "Context" here is defined, or operationalised, very explicitly by you as the researcher: in the same way that with topic modelling you had to say how many topics there are, here
you have to say what the context is. There's research suggesting that approximately eight words on either side of whatever word you're trying to predict is usually a good notion of context. That's a really long sentence — 17 words — but that's what people have found to work well. So you define context as eight words on either side of your target word, and given all of those context words — those 16 words — tell me which word is missing. The other way is skip-gram: predict the context given a word. You're given a word; tell me what other words might occur around it. This is probably not intuitive quite yet, but we're going to describe it.

So, how many of you have heard of word2vec? More of you — so word2vec is a word embedding. When I saw this I suddenly wondered whether that might be what people know: word2vec is an implementation of a word embedding model, so you're more familiar than you realised.

How does it do it? We'll take the skip-gram example, because that one's a little more common and it's what we'll be doing later. You take a word or concept in your training corpus — call it the target — and the words that lie close to it — the context — which, again, you define as, say, eight words around it, or in this case I think I used five. So here's an example from Catch-22. It's a quote; we'll just focus on the last sentence: Captain Black, who had aspired to the position himself, maintained that Major Major really was Henry Fonda but was too [blank] to admit it. Say we define Major Major as our target — that's the one word we're trying to predict. Where will Major Major's name turn up in the text? Well, it might turn up whenever we see these other words, or at least some of them, in the text. So, Henry Fonda: maybe repeatedly throughout Catch-22, Major Major is mentioned in context with Henry Fonda. As you read the text slowly and keep coming across Major Major in that context, you'll slowly bring those two closer together. That's the intuition of it.
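As a rough illustration of what training one of these looks like in practice, here is a sketch using the word2vec package in R. The object abstract_text is assumed to be a character vector of documents; the settings simply mirror the discussion above (skip-gram, a window of eight words on either side) rather than tuned recommendations.

    library(word2vec)

    set.seed(123)
    model <- word2vec(x = tolower(abstract_text),
                      type   = "skip-gram",   # or "cbow" for continuous bag of words
                      dim    = 100,           # length of each word vector
                      window = 8,             # context: 8 words either side of the target
                      iter   = 10)

    # nearest neighbours of a word in the trained vector space
    predict(model, newdata = "gender", type = "nearest", top_n = 10)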
More specifically, there are these five steps — and this is a really basic account; you can obviously read forever on the computational and mathematical science behind it. You take your target word, and each word is represented by a vector, a bunch of numbers. At the very beginning these can just be random. But your aim is that, by the end of your analysis, the vectors of words that occur together are close together in vector space and those that never occur together are far apart. You do this every time you come across a word — I'm just going to skip these slides; you can read them — but each time you come across the word, you take its context and bring all of those context words closer together in vector space, and then you take a random sample from the rest of the vector space and push those vectors further away. As you continue reading the text, you're slowly bringing the contexts together and pushing words away whenever you don't see them in that context. Does that make sense? It's pretty neat, right?

[Audience question, partly inaudible: isn't that computationally a lot to do?]

Yeah — I mean, these two that I mentioned, continuous bag of words and skip-gram, are just two approaches. There were others that preceded them that dealt with the entire corpus at once, and the neural-net training is what allowed you to make the context much smaller, which evidently improved things quite a bit.

[Audience question: why push words away at all, rather than just bringing the context words closer?]

I don't know — I haven't thought about that, and I'm sure the computer scientists have; maybe they've just tested, with face validity, which one works better. But it does seem like if you push things away you get more distinct clustering at the end, if you're thinking in network terms. So, does that make sense? This is really just supposed to be an introduction, so that maybe when you come across word2vec in the future you generally know what it's doing.
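For reference, this "pull the context closer, push a random sample away" step corresponds to the standard skip-gram-with-negative-sampling objective from the word2vec papers; the notation below is my summary rather than anything on the slides. For a target word w, an observed context word c, and a set N of randomly sampled "negative" words, the model adjusts the vectors to increase

    \log \sigma(v_c \cdot v_w) + \sum_{n \in N} \log \sigma(-v_n \cdot v_w)

where \sigma is the logistic function: the first term pulls the true context word's vector towards the target, and the sum pushes the sampled negatives away.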
[Audience: could you go through that one more time?] Yeah, definitely. OK, let's work through it on the fly; we'll use this quote as the example. As I slowly read in this quote, say I get to Major Major. What the algorithm will do is take the vector for Major Major — I mentioned that it's the dot product of the vectors that essentially decides what counts as closer and further away — and take the vectors for all of these other words around it and bring them closer together, and then take a random sample from all the other words and push those further away in space. Then it moves on to "really", the next word, and does the same thing. So then, twice, Major Major was brought close to "really": once when Major Major was the target word, and once when "really" was the target word. Slowly, as you do that, you get these distinct communities of words that occur in context with one another. Because if Major Major never appears in the context of, say, "discussing", then by the end there would never have been a point at which those two were brought close together, and they'll end up far apart in vector space.

Yeah — these are just the vectors in space. And you define the number of dimensions you want, and this link will give it to you in three dimensions, which is a little better. Any other questions?

I think it's really important, when we're using a lot of these methods, to ask where they might break. Topic modelling can break in a variety of different ways. Word embedding, for one, does particularly well with large corpora, because it needs to see things repeatedly. If you only see a word once, it only gets to do that vector adjustment — that context creation — once, and maybe that was an aberrant or biased use of the word, and suddenly you think two words are related when they're not. It also does a little better with topically consistent, or just linguistically consistent, usage of words. Say you had two communities who use "moral" in very, very different ways: you would want to know that, and maybe do those analyses separately. But, as we'll see, there is a cool thing about the results of these models that lets you do a bit of vector algebra — taking out one vector and seeing how the vector space changes. We'll see that when we do the tutorial.

Here's a classic example of why this is such an interesting method: ultimately you can do this sort of vector algebra where, say, you take the vector for "king" — we know all the vectors — subtract the vector for "man" and add the vector for "woman". If the algorithm were really robust, it would give you the vector for "queen". And you can do these — we'll play with them in the code. We're playing again with the sociology abstracts, so you can do things like: if I take "race" and subtract "inequality", maybe it gives me "election", right? Not definitely.
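Here is a minimal sketch of that vector algebra, continuing with the word2vec package from the earlier sketch. It assumes the classic analogy words actually appear in the model's vocabulary, which they may well not in a corpus of sociology abstracts; the point is just the mechanics.

    # the embedding matrix: one row per word, one column per dimension
    emb <- as.matrix(model)

    # king - man + woman should land near queen, if the embedding is well trained
    target <- emb["king", , drop = FALSE] -
              emb["man",  , drop = FALSE] +
              emb["woman", , drop = FALSE]

    word2vec_similarity(target, emb, top_n = 5, type = "cosine")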
But maybe you can work up hypotheses that way. And there's some cool research I know of, where social scientists are trying to look at cultural vectors across time in this way — training embedding models at different points in time and seeing how the space changes, maybe looking at a profession and seeing whether it moves closer to "high status" or lower later on. You know, if some of our theories about gender in the labour force are right — that women entering a profession makes it less prestigious — you could possibly uncover that with word embedding models. When we're talking about lawyers: as women entered that profession, did "lawyer" become a less prestigious thing? That sort of question.

How does it differ from a topic model? What was I going to say — I had one other thing to say about this... oh, another really important thing. Well, actually, that fits with this. How does it differ from a topic model? Ben Schmidt has some great tutorials on word embeddings online, and I learnt a lot from them — you've probably seen some of his intuitions pop up in here. He says this, in trying to describe the difference: whereas a topic model aims to reduce words down to some core meanings, so that you can see what each individual document in the library is really about — get rid of the fluff, I just want to know the themes, the topics; effectively it's about getting rid of words so I can understand documents more clearly — word embedding models do nearly the opposite. They try to ignore information about individual documents so that you can better understand the relationships between words.

I mostly agree with this, but you can get some of that from topic models too — they have distributions of topics across the corpus as well, so you do have some sense of the relationships between words in different contexts. So there's a little more to it. One difference is that in topic models you sort words into a predetermined number of topics, and you don't aim to create these continuous vector-space relationships between the words. I say you don't aim to, because you could create something like that out of a topic model by looking at the probabilities of words within topics — words could sit farther apart or closer together based on how probable they are to occur in the same topic — but that's not really what it tries to do; that would just be you being innovative. And topic models don't do well at representing the relationships between words, or how words mediate and moderate one another's meaning.
Again, that's the kind of thing you get from the matrix algebra you're able to do with embeddings, which is a strength of the embedding model. Any questions? Oh, that's the one last thing I was going to say: topic models rely on co-occurrence, right? Two words will end up in the same topic if they co-occur in the text. That doesn't have to happen in word embedding models. Say we have a mixed US and British corpus, and "pupil" and "student" are used interchangeably but never appear together — there's no co-occurrence of "student" and "pupil" precisely because they're synonyms. In word embedding models, because they appear in the same contexts — both "pupil" and "student" appear around "classroom" and "grades" or "exams" or whatever — you would still have them close together in vector space, which is kind of neat.

[Audience question, partly inaudible: there are some very good pre-trained embedding models you can just download and use alongside your own, aren't there?]

Yeah, that's a good one. You can just download pre-trained embeddings and use them — they've been trained on massive corpora. You might not want to say something specific about your own corpus with them, but you could validate your corpus against them. One thing worth mentioning, though, is that whether that validation makes sense depends on whether those embeddings are a reasonable ground truth for the world you're looking at. Validating against all of Wikipedia might not mean your model is valid, because your corpus is different — maybe you're looking only at some Aboriginal community, or something else that doesn't map onto the general, modernised Western world. But yes, I think those pre-trained models are really useful for looking at culture or language in general.

[Audience comment, partly inaudible: at a firm I worked for, we wanted to look at the sentiment of people talking about things happening in their lives. People used "offshoring" and "outsourcing" to mean the same thing, but the two words never actually occurred together, so I used word embeddings to show that they were so close to one another that I could treat them as interchangeable and swap one in for the other.]
Yeah — so you can use it that way, to make your data nicer and larger by treating words as exchangeable. Similarly, one of the things I did with mine: I'm interested in gender inequality in creative professions, and there are obviously a lot of different artists and creatives mentioned in the corpora I work with. I had demographic information on a large number of artists, and I went in and replaced all the women's names with the token "woman-word" and all the men's names with the token "man-word", so in effect they became the same word. It's slightly different, but that way there were enough occurrences to get a sense of how men and women are discussed differently, which I wouldn't have had for any individual artist. Another way I could have gone about it is, similarly, to train the embedding model and see that all of these artists cluster together and all of those artists cluster together, and then use them as synonyms for female and male if I wanted. There's a lot you can do with these methods, and very little of it has been done so far in social science.

[Audience question, partly inaudible: antonyms like "good" and "bad" come up when people talk about similar things, so don't they end up close together in vector space? How have people handled that — taking "good" out and looking at "bad" in relation to the other vectors, then putting "good" back in and taking "bad" out? Is that how you handle it?]

Yeah, generally that's the sort of thing people do. It's interesting, because my intuition would be: I just want to see what people talk about when they're talking about bad movies, say, and figure it out from there.

[Audience: what about people who use double negations?]

Yeah, you have to do something about that. With a lot of these bag-of-words methods that ignore syntax, this becomes a huge problem: the meaning is built into the rules of language that we're supposedly ignoring a lot of the time. Similarly, a professor of mine in the sociology of culture — he was very sceptical of these large computational methods in general, particularly for text — once insisted we ask: what becomes text? Not everything we do in social life becomes any form of meaningful text.
And what is left out? In the context of social media studies that question becomes even bigger: not only what becomes language, but what gets onto Twitter at all? What do we never say on Twitter — which might be highly personal things, or long-winded, intricate things, because we simply can't.

[Audience example, partly inaudible: if you ask people in the UK "How are you doing?", they'll say "Not bad", whereas in the US they'll say "Good" — so if you're not careful you won't read "not bad" as something positive.]

Yeah. OK. There is the third method, network analysis, but I think we should pause and do the tutorials on topic modelling and word embedding, and if we have time we can cover network analysis. Basically, the network analysis material just gives an introduction to network analysis, with some examples from text, but largely it lets you brainstorm how you could create new variables, new metrics, by thinking about text analysis in the context of networks. A lot of that is in the materials folder, called "text nets" or "network analysis materials" or something like that, and you can probably go through it on your own. I just want to make sure we get to some of the tutorials before we get exhausted. So let's do that.

OK. We have an hour to go through three very large methods; we're probably only going to get through two, but again, if anyone wants to go through text nets later, I'm happy to. Let's start with structural topic modelling, developed by Molly Roberts and her colleagues in 2014. They've got this really, I think, exemplary R package that helps you from start to end in using their method, visualising what you're doing along the way, and thinking through the important decisions you're going to have to make. The package covers all of these things: ingesting, manipulating and pre-processing the data; estimating the models; calculating covariate effects; and then visualising those and helping you make decisions.

This is very much the tutorial phase, so stop me and ask questions and everything. Again, the only reason I'm not having you run it live is that some of the models would have us sitting here for forty-five minutes waiting, and if we used all the data, even longer. But you load the data — so we load the... maybe I should bring up the HTML.
We can go through that, and I'll skip over to my somewhat uglier code when we need it. It's very small — there we go. OK, let's do that. So you load the packages and you load the data — we're loading our sociology abstracts. Then you clean the data and describe it a little. In this corpus there were some texts that were just empty, so you want to make sure you don't have those, and, like we did before, you might want to remove duplicate texts. Then we can take a look at things like the number of abstracts by year — basically getting a sense of our data, as we would at the beginning of any type of analysis.

Again, the key to structural topic models is that you can have these other variables. In this case I'm going to look at how being published in a certain journal changes the language in an abstract, or the topics studied in that journal. So that's one of the variables you can add to a structural topic model. These are all the journals the abstracts come from and their prevalence in the corpus — again, just getting to know your data before you do anything with it. Here, as before, I'm creating a binary variable for whether or not an abstract appeared in the American Journal of Sociology or the American Sociological Review: if it occurred in either of those it gets a one, and if it didn't it gets a zero. And this is what our data looks like, which you probably already know: we've got our different variables and then this new one that I've created.

The stm package does a lot of the pre-processing that we did with the quanteda package on its own — a lot of these topic modelling packages do. Again, there are always a million different ways you could do something, and I wouldn't be surprised if the stm package is relying on tidytext or quanteda functions behind the scenes to do this cleaning, but it will do it itself with the commands we'll look at. So textProcessor builds the corpus — we talked about a corpus object in the quanteda package, a data object, and stm has the same thing — and you build it by calling textProcessor, feeding it your data, identifying which variable or column the text is in, and telling it where to find the rest of the metadata, which in our case is in the same data frame. In other cases you might have the texts with an ID and the metadata in a different, relational-database sort of structure.
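In code, that step looks roughly like the following. It's a sketch, assuming a data frame called abstracts with the abstract text in a column called text and the metadata (journal, year, the AJS/ASR indicator) in the other columns; the object and column names are mine, not necessarily the ones in the tutorial files.

    library(stm)

    # build the stm corpus object: documents plus their metadata
    processed <- textProcessor(documents = abstracts$text,
                               metadata  = abstracts)

    str(processed, max.level = 1)   # documents, vocab, meta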
Then you pre-process your texts, and stm has some cool stuff here. There are a lot of words you may not want to analyse. There are stop words, of course, which you could just remove, or you could reason that if something occurs in, say, 90 per cent of the texts, it's probably not particularly interesting and you just want to take it out. So here what you're doing is setting a sequence of lower thresholds, from one to two hundred in steps of ten, meaning: show me how many words would be removed if I dropped words that occurred fewer than once — I mean, not at all; that doesn't even make sense — then fewer than ten times, fewer than twenty, fewer than thirty, fewer than forty, and so on up to two hundred. And you get these curves: this is how many documents you'd remove — which is none (that was scary) — how many words you'd remove, and how many tokens you'd remove. Remember the distinction: if "Taylor" was said four times, that's four tokens, but only one unique word type — all of those occurrences collapse into one.

So the plot is showing me that the curve starts to flatten at a lower threshold of about seventy-five; beyond that you're not really removing many more words as you raise the lower threshold. Does that make sense? You can also set an upper threshold, so you can say: get rid of any words that occur more than this many times, because they're probably not useful, or fewer than this many times, because they're probably rare and obscure and not very useful. So that's another way, besides specifying actual stop words, of removing words you don't want to analyse, and the stm package plots these graphs for you, which is really nice for efficiency. I set my lower threshold to 50 and my upper threshold to about 50 per cent of the documents — saying, if a word occurs in more than 50 per cent of the abstracts, get rid of it. That was a late-last-night heuristic; it wasn't really that informed, but you would hopefully be more informed as you went about it.
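The threshold exploration and the actual trimming look something like the following, reusing the processed object from the sketch above; expressing the 50 per cent upper threshold as a document count is one way (my assumption) of doing what was just described.

    # how many documents, words and tokens would be dropped at different lower thresholds
    plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 10))

    # keep words appearing in at least 50 abstracts and in no more than ~50% of them
    out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                         lower.thresh = 50,
                         upper.thresh = floor(0.5 * length(processed$documents)))

    head(out$words.removed)   # the vocabulary that was dropped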
Then you can look at which words were removed when you did that. At first I freaked out, because I thought, wait, why are "divorce" and "ethnicity" and all of these being removed? But actually, what's being removed are things like "family--divorce" and "ethnic--established" with a double hyphen, those sorts of things. The actual lemmas or stems of "family" and "divorce" are still in the corpus; it's only the versions where they're stuck together that get dropped. I could have fixed this by doing a better job of removing punctuation — double punctuation — but I didn't. This is just to show you that once you remove your words, it saves which ones were removed, so you could report them if you were being very good about reporting in your journal's appendix.

OK, so after you've done that, you have this output object, and within it you have the documents — maybe some documents got removed; with your thresholds, certain documents may have been so short that none of their words survived, so they got kicked out, and now you have a new set of documents. The vocab is the actual words that are left, excluding the ones you dropped, and the meta is the metadata associated with each document — again, the journal or the year it was published. Does that all make sense?

Then you move on to estimating the model. Like we said, with topic models you have to select K: how many topics you want to derive from your corpus. That's not always a straightforward decision, and one way is to run the model a million times and look at the differences — it can take a really long time, but you should probably do it. stm provides this great function, searchK, in which you supply a vector of possible numbers of topics and it gives you diagnostics on what happens when you change that number — something that's not available in a lot of other topic modelling packages, so it's kind of cool. It doesn't solve the problem for you at all; you still have trade-offs, as we'll see — do you want semantic coherence, or do you want to optimise some other measure, in picking your number of topics? But it gives you help in making that decision, which other packages tend not to. I don't run it here because it takes a long time, but if you want to run this code on your own, you can search over K and look at the diagnostics that then help you make the decision.
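A sketch of that search, with an arbitrary set of candidate K values and the same covariates used below (the variable name ajs_asr for the AJS/ASR indicator is my placeholder):

    # compare diagnostics across several candidate numbers of topics (slow!)
    k_search <- searchK(out$documents, out$vocab,
                        K = c(10, 15, 20, 25, 30),
                        prevalence = ~ ajs_asr + s(year),
                        data = out$meta)

    plot(k_search)   # held-out likelihood, residuals, semantic coherence by K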
454 00:47:06,630 --> 00:47:15,920 you have less semantic coherence, meaning that the words don't really cluster together that well, when you have 455 00:47:15,920 --> 00:47:20,630 too many or too few topics, and then as you move it up, they start to fit together better. 456 00:47:20,630 --> 00:47:26,810 So then when you look at the topic words, maybe they have more face validity and you're like, oh yeah, that's a very clean topic. 457 00:47:26,810 --> 00:47:34,940 But there are other diagnostics that you could run that might give you something else. 458 00:47:34,940 --> 00:47:43,430 So anyway, there are other ways to go about checking how you want to define K. 459 00:47:43,430 --> 00:47:49,760 But in this case, I just picked 20. 460 00:47:49,760 --> 00:47:53,820 I specified that year is going to be one of my covariates, 461 00:47:53,820 --> 00:47:58,580 and one thing I haven't mentioned yet: those covariates that you define can change the model in two ways. 462 00:47:58,580 --> 00:48:10,400 One, they can change how prevalent a topic is. So maybe liberals in the US talk much more about gun control. 463 00:48:10,400 --> 00:48:13,340 And so the prevalence of the topic of gun control 464 00:48:13,340 --> 00:48:22,520 goes up if the speaker is a liberal. Whereas say you had a topic for, like, the 465 00:48:22,520 --> 00:48:32,650 right to bear arms; I bet those two would actually get put together, but if you did have a separate one, it probably comes up more with conservatives. 466 00:48:32,650 --> 00:48:40,750 There's a little bit in here about some of the really interesting problems of the posterior being non-convex with topic models, 467 00:48:40,750 --> 00:48:42,610 and I don't really want to go too much into that. 468 00:48:42,610 --> 00:48:49,990 But basically what it's saying is that wherever you start in searching for the optimum will change what your outcome is, 469 00:48:49,990 --> 00:48:53,620 basically, what your results are. 470 00:48:53,620 --> 00:49:01,690 And so one thing that stm does pretty well is let you set a number of runs, where it will fit the model again and again, 471 00:49:01,690 --> 00:49:06,570 as you see in these outputs, kind of starting at a different point each time. 472 00:49:06,570 --> 00:49:14,820 So one thing that's really good about that is you know that your particular run didn't kind of create what you see. 473 00:49:14,820 --> 00:49:19,570 But this is what the output of that is. You'll see these are kind of the topics that start to emerge. 474 00:49:19,570 --> 00:49:21,480 I said, give me 20 topics, right? 475 00:49:21,480 --> 00:49:32,640 Here's one on immigration, economy, market; social network structures; health, relationships and risk; politics, state, movement. 476 00:49:32,640 --> 00:49:34,470 You start to have a bit of face validity, right? 477 00:49:34,470 --> 00:49:41,970 Neighbourhood, population, area; women, gender, men, white, black, maybe stratification; education and school. 478 00:49:41,970 --> 00:49:43,800 So these are some of the topics you start to get out, 479 00:49:43,800 --> 00:49:53,670 but it's running it over and over again, and you'll see in each iteration they'll change a little bit.
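A sketch of this estimation step under the same assumptions as above: a 20-topic model with prevalence covariates, and, to address the sensitivity to starting values just described, refitting from several starting points with selectModel.

    # Fit a 20-topic structural topic model, letting topic prevalence vary
    # by journal group and smoothly by year
    fit <- stm(documents = out$documents, vocab = out$vocab, K = 20,
               prevalence = ~ ajs_asr + s(year),
               data = out$meta, max.em.its = 75, seed = 1234)

    # Or: because the posterior is non-convex, fit the model from several
    # different starting points and keep the runs for comparison
    runs <- selectModel(out$documents, out$vocab, K = 20,
                        prevalence = ~ ajs_asr + s(year),
                        data = out$meta, runs = 10, seed = 1234)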
480 00:49:53,670 --> 00:49:58,800 You'll want to run a lot more iterations than I do in the script, I think I mentioned that, 481 00:49:58,800 --> 00:50:03,210 and you'll want to read the documentation if you want to actually do stm, but it just takes a long time. 482 00:50:03,210 --> 00:50:09,420 So I set it pretty low. Ultimately, what you'll get, though, and I said only two, so we only did that twice, 483 00:50:09,420 --> 00:50:12,240 is that if you did it like 50 times or something, 484 00:50:12,240 --> 00:50:21,570 you would end up with this graph looking at the balance of exclusivity versus semantic coherence within your topics. 485 00:50:21,570 --> 00:50:27,810 Optimising on exclusivity would say, I want words to be pretty darn exclusive to an individual topic, because if you remember, 486 00:50:27,810 --> 00:50:31,650 each topic is a distribution over all the words in the corpus. 487 00:50:31,650 --> 00:50:39,660 And if you want words to have a very high probability in only one or very few topics, you're probably going to emphasise exclusivity. 488 00:50:39,660 --> 00:50:40,750 And you might have a model, 489 00:50:40,750 --> 00:50:48,060 this only ran two models and they were very similar, but you might end up with one of your runs that's like way over here. 490 00:50:48,060 --> 00:50:52,590 And if you wanted to emphasise exclusivity, that might help you make your decision. 491 00:50:52,590 --> 00:51:01,400 Semantic coherence is just how well the words all fit together. Maybe they're not exclusive to one topic, but they all go well together 492 00:51:01,400 --> 00:51:06,420 in one topic. Yeah? 493 00:51:06,420 --> 00:51:15,150 Well, it's computed from the probabilities of the words. You could get into the function and look at exactly how it's doing that, 494 00:51:15,150 --> 00:51:19,380 and you could probably work out pretty well how it's doing it from the code. 495 00:51:19,380 --> 00:51:29,220 But yeah, it's just looking at, when you put it into the different topics, what are the probabilities of the words within that topic? 496 00:51:29,220 --> 00:51:35,850 How much do they all have a high probability of being within that topic, versus 497 00:51:35,850 --> 00:51:43,260 the balance of that probability against their occurrences within all the other topics? There is a formal definition; I might have, 498 00:51:43,260 --> 00:51:48,330 I used to have it in here. 499 00:51:48,330 --> 00:51:56,070 Yeah, no, it just says semanticCoherence provides a semantic coherence measure for all topics within each model. 500 00:51:56,070 --> 00:52:05,280 But their documentation has the actual code and the details of how they compute the semantic coherence. 501 00:52:05,280 --> 00:52:11,520 Exclusivity is another one that you can look at, and sparsity. Anyway. 502 00:52:11,520 --> 00:52:14,880 You can check the residuals. We're going to kind of skim past this. 503 00:52:14,880 --> 00:52:21,780 All of this is documented more in there. They've got a ton of documentation. 504 00:52:21,780 --> 00:52:31,710 Then you can look within a particular topic. It looks like topics 17 and 14 are super high on both exclusivity and semantic coherence, 505 00:52:31,710 --> 00:52:39,530 whereas topic 11 is not that great on either. And topic 12 has good exclusivity, but not a lot of semantic coherence. 506 00:52:39,530 --> 00:52:47,450 And you could go back to those topics and look at why, you know, and maybe change things according to that.
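This comparison can be reproduced roughly as follows, building on the sketch objects above (which run to keep, and the number of topics, are assumptions):

    # Compare the runs from selectModel on exclusivity vs semantic coherence
    plotModels(runs)
    fit <- runs$runout[[1]]   # keep one of the runs

    # Per-topic diagnostics for the chosen model
    coh  <- semanticCoherence(fit, out$documents)
    excl <- exclusivity(fit)
    plot(coh, excl, xlab = "Semantic coherence", ylab = "Exclusivity")
    text(coh, excl, labels = 1:20, pos = 3)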
507 00:52:47,450 --> 00:52:58,340 Let's see, what was topic 14? So this one: mother, father, child, children, household, parent, divorce. That was good on both measures. 508 00:52:58,340 --> 00:53:06,110 That's a pretty clean topic, whereas topic 11 wasn't so great: skills, teach, decision, implement, programme, career. 509 00:53:06,110 --> 00:53:12,560 Maybe that's because it's mixing the labour market with education, and a lot of it is about education. 510 00:53:12,560 --> 00:53:15,350 How does that affect the labour market, right? 511 00:53:15,350 --> 00:53:25,450 Maybe that's why it doesn't have quite as much exclusivity, or because words about the labour market occur throughout a lot of other topics as well. 512 00:53:25,450 --> 00:53:34,320 This is just the probability or the prevalence of different topics throughout your documents: the expected topic proportions. 513 00:53:34,320 --> 00:53:39,480 And here you're comparing the word probabilities across two topics, topic 1 and topic 20. 514 00:53:39,480 --> 00:53:43,500 Topic 1, sorry, it's cut off, but it's like immigrant, market and labour force, 515 00:53:43,500 --> 00:53:48,270 and as we were saying earlier, the font size is bigger because the word has a higher probability, and it's further 516 00:53:48,270 --> 00:53:53,190 over to that side of the spectrum because it's more associated with that topic. 517 00:53:53,190 --> 00:53:58,830 Whereas it looks like topic 20 is more about sociological theory. 518 00:53:58,830 --> 00:54:00,810 Theory is very prominent in that one, 519 00:54:00,810 --> 00:54:07,770 but things that are more toward the middle are like development, maybe because you develop a theory, 520 00:54:07,770 --> 00:54:13,320 but also you have immigration, because international development programmes or something like that, 521 00:54:13,320 --> 00:54:20,430 or internal migration and development programmes, often coincide. So you can start comparing, see. 522 00:54:20,430 --> 00:54:25,380 I just did that with topics 1 and 20; with that parameter, you can start looking at it. 523 00:54:25,380 --> 00:54:32,960 This is the histogram of the distribution of topics. Anyway, the point is that not often, as most of you probably know, in R packages 524 00:54:32,960 --> 00:54:39,290 are you given this much structure for doing analysis, where they're helping you run diagnostics 525 00:54:39,290 --> 00:54:47,240 and you don't have to create the visuals yourself; they're doing it. So just huge props to the people who created this package. 526 00:54:47,240 --> 00:54:50,690 This is findThoughts, which is kind of a fun one. 527 00:54:50,690 --> 00:54:58,370 If you just wanted to say, I want to pick a topic and I want a representative quote from that topic to put in my paper, 528 00:54:58,370 --> 00:54:59,540 this is a way to do it. 529 00:54:59,540 --> 00:55:09,770 You can see here I just take characters ten to two hundred and fifty of the abstract, 530 00:55:09,770 --> 00:55:17,240 and then that's all that I print out. Obviously, I don't know why I started at 10, because it cuts the first word halfway through. 531 00:55:17,240 --> 00:55:21,950 Don't do that. All right. 532 00:55:21,950 --> 00:55:26,950 Does that make sense so far? Is anyone copying and pasting and running it? 533 00:55:26,950 --> 00:55:33,040 OK. Do that later; hopefully it works, I ran it a few times and it was working.
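For inspecting topics and pulling representative quotes, the stm helpers referred to here look roughly like this; out$meta$abstract is an assumed column holding the raw abstract text.

    # Top words per topic, under several weighting schemes
    labelTopics(fit, n = 7)

    # Expected topic proportions across the corpus
    plot(fit, type = "summary")

    # Contrast the vocabularies of two topics, e.g. topics 1 and 20
    plot(fit, type = "perspectives", topics = c(1, 20))

    # Pull the abstracts most associated with topic 14, trimmed for display
    shortdoc <- substr(out$meta$abstract, 1, 250)
    findThoughts(fit, texts = shortdoc, topics = 14, n = 2)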
534 00:55:33,040 --> 00:55:35,550 This is the point where we can turn to the metadata. 535 00:55:35,550 --> 00:55:42,370 So far we've just created the topics and we haven't looked at how those are affected by covariates. 536 00:55:42,370 --> 00:55:46,930 So one covariate we have that is continuous is year: what year was the abstract published, right? 537 00:55:46,930 --> 00:55:51,940 And we don't have a lot of range here; I think it was like 2008 to 2012. But the other is one we created. 538 00:55:51,940 --> 00:55:58,450 That's binary: was it published in AJS or ASR, or was it published in something else? Which, again, is 539 00:55:58,450 --> 00:56:08,860 just kind of a crude binary, but it's what I came up with last night. So in this case, if the covariate is binary, with that AJS/ASR variable, 540 00:56:08,860 --> 00:56:15,160 maybe we just want to see: is a topic more likely to be discussed in one of those two journals or in everything else? 541 00:56:15,160 --> 00:56:19,450 There's not a ton of variation, but that's basically what this is doing. 542 00:56:19,450 --> 00:56:30,760 So over here this topic is more likely to be mentioned, it has a higher probability, in non-AJS/ASR articles, and over here it's more AJS/ASR articles. 543 00:56:30,760 --> 00:56:33,130 Unfortunately, I haven't relabelled the topics yet, 544 00:56:33,130 --> 00:56:41,200 so what you could have done is go in and label topic one according to what you think it is about, maybe marriage and family or whatever. 545 00:56:41,200 --> 00:56:52,690 And then you could create this plot and see if it has face validity, whether your binary is liberal versus conservative or whatever it is. 546 00:56:52,690 --> 00:57:00,070 I was going to try to go with a sports analogy, but I don't know sports. Anyway, any sort of binary, maybe something more extreme, right? 547 00:57:00,070 --> 00:57:06,130 Or there's a certain topic, maybe you have like doctors' offices versus first dates or something; 548 00:57:06,130 --> 00:57:12,870 there are some things that are very much more likely to be mentioned in the doctor's office. 549 00:57:12,870 --> 00:57:21,540 Then if your covariate is continuous, as with year, you can just look at the expected proportion of a particular topic over time. 550 00:57:21,540 --> 00:57:26,940 So this was looking at topic number seven over time. And this is just the base visual. 551 00:57:26,940 --> 00:57:32,040 You can make the visuals much more interesting and layer on all the topics and how they change over time. 552 00:57:32,040 --> 00:57:39,780 But it's really neat that they provide this at all, and it looks like whatever topic 7 is has increased over time. Again, 553 00:57:39,780 --> 00:57:47,410 I don't imagine it was a huge increase, relatively. But let's look at what topic seven was: crime, violence and control. 554 00:57:47,410 --> 00:57:55,570 Evidently, that topic has been increasing over time, in the very short timespan we're looking at. 555 00:57:55,570 --> 00:58:02,050 So that was looking at the proportion of the topic over time throughout documents. The other thing you can do, as we mentioned with these covariates, 556 00:58:02,050 --> 00:58:08,110 is look at how the covariate changes the word proportions within the topic, 557 00:58:08,110 --> 00:58:15,680 so it makes sense to know not just how much the topic is discussed, but what is internal to the topic.
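The covariate effects just described are estimated with estimateEffect; a sketch with the same assumed covariate names, and with the group labels in cov.value1 and cov.value2 standing in for however the binary is actually coded.

    # Regress the topic proportions on the covariates
    prep <- estimateEffect(1:20 ~ ajs_asr + s(year), fit,
                           meta = out$meta, uncertainty = "Global")

    # Binary covariate: difference in topic prevalence between the two groups
    plot(prep, covariate = "ajs_asr", topics = 1:20, model = fit,
         method = "difference", cov.value1 = "AJS_ASR", cov.value2 = "Other",
         xlab = "More in other journals ... More in AJS/ASR")

    # Continuous covariate: expected proportion of topic 7 over time
    plot(prep, covariate = "year", topics = 7, model = fit,
         method = "continuous")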
558 00:58:15,680 --> 00:58:24,250 Relying again on the example of guns in the US: maybe there's only one topic for guns, but does the proportion of words, 559 00:58:24,250 --> 00:58:32,620 the probability of words in that topic, change depending on whether a pro-gun-rights or an anti-gun-rights person is speaking? 560 00:58:32,620 --> 00:58:38,640 So that's what you're doing here. You're estimating the stm with a little function where we're saying 561 00:58:38,640 --> 00:58:43,870 that the content should change, 562 00:58:43,870 --> 00:58:49,600 sorry about that, according to the AJS/ASR variable; it goes here, this is where I did it, right. 563 00:58:49,600 --> 00:58:53,470 So we have the prevalence changing according to both of these variables, 564 00:58:53,470 --> 00:59:01,570 but we want to know how the content changes depending on whether it's in AJS/ASR or in the other journals. 565 00:59:01,570 --> 00:59:10,640 Does that make sense? So here we're estimating both content and prevalence. 566 00:59:10,640 --> 00:59:14,430 OK. We're not going to get to word embeddings. 567 00:59:14,430 --> 00:59:21,030 I'm happy to stay longer and get to word embeddings; I probably bit off more than I could chew, but I thought you'd be interested in all of it. 568 00:59:21,030 --> 00:59:31,200 So this is how the content of one of the topics, sorry, it got cut off, changed 569 00:59:31,200 --> 00:59:34,680 if it's in AJS/ASR or if it's in another journal. 570 00:59:34,680 --> 00:59:40,230 So if it's in AJS/ASR, it looks like you're talking more about inequality and status and these sorts of things. 571 00:59:40,230 --> 00:59:44,640 Whereas if it's outside, there's more gender and women, 572 00:59:44,640 --> 00:59:48,300 which actually to me has a bit of face validity given the sociology literature. 573 00:59:48,300 --> 00:59:53,880 A lot of times there's a lot more gender theory outside of sociology, like in gender studies, 574 00:59:53,880 --> 00:59:58,440 which is amazing but goes much deeper into the gender-theoretical side of things. 575 00:59:58,440 --> 01:00:06,780 Whereas often within sociology, gender is mentioned in a kind of binary way, and it's used as just comparing men and women. 576 01:00:06,780 --> 01:00:11,310 It's not about gender theory as much as it is gender as a covariate for other outcomes. 577 01:00:11,310 --> 01:00:20,150 So this kind of makes sense, and it might be interesting to others, but I only plotted that one for now. 578 01:00:20,150 --> 01:00:30,420 Yeah. So what's important is that this is just for one topic, 579 01:00:30,420 --> 01:00:37,920 comparing when that topic is occurring within AJS/ASR versus all the others. 580 01:00:37,920 --> 01:00:47,280 So AJS/ASR versus all the others, and the size of the font is the probability of the word occurring in the topic, and you just kind of see. 581 01:00:47,280 --> 01:00:53,700 So that doesn't mean that gender as a whole is covered more in AJS and ASR, right? 582 01:00:53,700 --> 01:00:58,470 It means that within this topic, the cluster of, say, gender and family words 583 01:00:58,470 --> 01:01:05,730 has more to do with inequality, for example, in AJS and ASR than it does in the other journals.
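Estimating both prevalence and content, as described here, means adding a content formula to the model; a sketch under the same assumptions as before.

    # Let word use within topics (content) vary by journal group, in addition
    # to topic prevalence varying by journal group and year
    fit_content <- stm(documents = out$documents, vocab = out$vocab, K = 20,
                       prevalence = ~ ajs_asr + s(year),
                       content = ~ ajs_asr,
                       data = out$meta, max.em.its = 75, seed = 1234)

    # Compare how one topic's vocabulary differs across the two groups
    plot(fit_content, type = "perspectives", topics = 7)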
584 01:01:05,730 --> 01:01:11,370 Yeah, I'd have to look at the code, but I believe the size of the font is the probability of the word within the topic, 585 01:01:11,370 --> 01:01:19,890 whereas the location on the spectrum is just how prevalent it is when this group or this corpus is speaking versus when this one is speaking. 586 01:01:19,890 --> 01:01:24,420 So these are for both AJS/ASR and non-AJS/ASR. 587 01:01:24,420 --> 01:01:32,250 All these words are obviously in the topic, but this side is more common outside, and on this side, within AJS/ASR, the topic 588 01:01:32,250 --> 01:01:37,740 is more about inequality and things like that. Um. 589 01:01:37,740 --> 01:01:46,230 We'll skip over this, but it's just creating interactions, so you can do interaction effects, and then stm has some cool visuals. 590 01:01:46,230 --> 01:01:53,310 This is just looking at how the topics are connected in a network according to the probability of their words. 591 01:01:53,310 --> 01:01:58,140 And then they've got these great companion packages, stmCorrViz and stmBrowser, that create HTML visualisations 592 01:01:58,140 --> 01:02:03,120 where you can click things on and off and see the prevalence of topics move around and things. 593 01:02:03,120 --> 01:02:07,160 I didn't do any of that, but definitely play around with that if you do. 594 01:02:07,160 --> 01:02:17,600 And this is just how you would export the data to whatever format; I think I did RData or CSV. 595 01:02:17,600 --> 01:02:23,570 Yeah. Questions? I can hop right into word embeddings with the seven minutes we have left. 596 01:02:23,570 --> 01:02:30,080 Or we could do that sometime next week when we're doing group projects. 597 01:02:30,080 --> 01:02:36,160 What are you guys feeling? Now, raise your hand if you want me to do the word embeddings now. 598 01:02:36,160 --> 01:02:41,290 Raise your hand if you want a tiny break before I do it, or if you just want me to keep going. That works for me. 599 01:02:41,290 --> 01:02:48,200 OK. Any questions on stm, though, before we move on? 600 01:02:48,200 --> 01:02:53,970 So, word embeddings. 601 01:02:53,970 --> 01:03:03,080 Oh, yeah. It's just the covariates. 602 01:03:03,080 --> 01:03:09,890 Yeah. Plain latent Dirichlet allocation doesn't deal with covariates, like how does the topic change depending on who's speaking it? 603 01:03:09,890 --> 01:03:16,880 It doesn't do any of that. STM is built hugely on the foundations of LDA. 604 01:03:16,880 --> 01:03:24,650 In fact, it's running LDA, but it allows the covariates to change how the probabilities are estimated. 605 01:03:24,650 --> 01:03:30,980 And again, yeah, their documentation is some of the best that I've seen, and they've got some really great papers. 606 01:03:30,980 --> 01:03:35,600 One of the ones that I reference, where they use it to look at interviews, 607 01:03:35,600 --> 01:03:43,170 or I think it's like short-answer surveys, depending on the demographics of the person who answered the survey. 608 01:03:43,170 --> 01:03:50,470 OK. OK, word embeddings. I apologise, I do these in Python. 609 01:03:50,470 --> 01:03:57,040 I don't apologise because I know that a lot of you do Python, but just because it's kind of annoying to switch syntax; 610 01:03:57,040 --> 01:04:01,480 it's just because I started in Python and I learnt word embeddings at an abnormally early time. 611 01:04:01,480 --> 01:04:05,110 Yeah. Oh, it's not a question. It's a stretch.
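For the topic network and export step mentioned here, a minimal sketch (file names are arbitrary; theta is the matrix of per-document topic proportions stored in the fitted model):

    # Correlations between topics, plotted as a network
    corr <- topicCorr(fit)
    plot(corr)

    # Export: save the fitted model, or write the per-document topic
    # proportions alongside the metadata as a CSV
    saveRDS(fit, "stm_fit.rds")
    write.csv(cbind(out$meta, fit$theta), "topic_proportions.csv",
              row.names = FALSE)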
612 01:04:05,110 --> 01:04:15,670 OK, so if you have any questions as an R user about the syntax of what I do here, 613 01:04:15,670 --> 01:04:20,380 I'm happy to just give you a translation; like, import is basically like library. 614 01:04:20,380 --> 01:04:26,590 You're just importing your packages here. A lot of it is very similar and hopefully will make sense to you. 615 01:04:26,590 --> 01:04:33,880 But I also provide, on the last slide, I think it's the last slide, 616 01:04:33,880 --> 01:04:42,310 a bunch of citations, and I'm pretty sure the last citation is the package my friend uses to do word embeddings in R. 617 01:04:42,310 --> 01:04:47,650 So you could read that documentation and just do word embeddings in R; I just use Python. 618 01:04:47,650 --> 01:04:53,830 Does that make sense? That's where to find that resource; just ask me if you need it later. 619 01:04:53,830 --> 01:05:03,790 We start by just loading the packages. And then this is actually where I created the sociology abstracts CSV that you used in the other tutorials. 620 01:05:03,790 --> 01:05:09,070 It's originally a Web of Science document that comes as a text file, and it's not like a tab-delimited thing. 621 01:05:09,070 --> 01:05:16,390 So I'm reading that in here line by line and selecting only the variables of interest, and then saving it as a CSV for those other tutorials. 622 01:05:16,390 --> 01:05:24,430 That's all that this is, but it might be useful if you end up wanting to analyse Web of Science abstracts. 623 01:05:24,430 --> 01:05:30,310 One option before you do word embeddings is to lowercase or stem. 624 01:05:30,310 --> 01:05:35,110 So again, you might want to turn running into run. 625 01:05:35,110 --> 01:05:44,680 I didn't do any of that, but it's definitely possible to do in your analysis with Python or R, as we've seen. 626 01:05:44,680 --> 01:05:47,380 Then I define two functions. 627 01:05:47,380 --> 01:05:57,430 The first converts each document that you read in into a list of words and lowercases them. 628 01:05:57,430 --> 01:06:08,440 And the second splits the text into sentences and returns a list of sentences, where each sentence is a list of words. 629 01:06:08,440 --> 01:06:18,200 Does that make sense? Yeah, so these are just functions, like the curly-bracket functions in R, but in Python syntax. 630 01:06:18,200 --> 01:06:29,010 Then you make the list of texts to loop through, and I created an empty list to put each sentence into. 631 01:06:29,010 --> 01:06:36,000 And I apply those functions. So ultimately we have, 632 01:06:36,000 --> 01:06:44,160 in that corpus of abstracts, one hundred and eight thousand four hundred and twenty-two sentences. 633 01:06:44,160 --> 01:06:48,210 And an example of one of those sentences is 'during globalisation modern period'; 634 01:06:48,210 --> 01:06:51,780 we removed some of the stop words and stuff, that's why it doesn't quite read as a sentence. Not all of them, though, 635 01:06:51,780 --> 01:06:59,710 evidently. Maybe I'll do that later on. Let's look at another one. 636 01:06:59,710 --> 01:07:08,650 'In contrast many national church like the Orthodox': some of the stuff is already removed, but this is what your input is. 637 01:07:08,650 --> 01:07:13,660 It's a list of sentences where each sentence is a list of words.
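The preprocessing script here is in Python; for R users, a very rough base-R equivalent of lowercasing, splitting into sentences and splitting sentences into words could look like this (a sketch only; df$abstract is an assumed column name).

    # Lowercase the abstracts
    abstracts <- tolower(df$abstract)

    # Split each abstract into sentences on ., ! or ?
    sentences <- unlist(strsplit(abstracts, "(?<=[.!?])\\s+", perl = TRUE))

    # Split each sentence into a vector of words, dropping punctuation
    sentence_words <- strsplit(gsub("[[:punct:]]+", " ", sentences), "\\s+")

    length(sentences)     # how many sentences the corpus yields
    sentence_words[[1]]   # the first sentence as a vector of words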
638 01:07:13,660 --> 01:07:26,010 OK, and then we run the word2vec model; this is just importing the package for that from gensim. 639 01:07:26,010 --> 01:07:32,530 I was trying to get rid of these progress bar things, but it wasn't a priority. 640 01:07:32,530 --> 01:07:37,860 So in the implementation of your word embedding model, as always, 641 01:07:37,860 --> 01:07:49,710 you have parameters to set: the number of features or dimensions, like we were talking about before, that you want to create; the minimum word count, 642 01:07:49,710 --> 01:07:55,830 which is, again, similar to the stm thresholds, how often a word has to be seen. 643 01:07:55,830 --> 01:07:58,560 The number of workers is about the processing of it, 644 01:07:58,560 --> 01:08:05,730 doing parallelisation so that it goes quicker, which depends on the computational power that you have in your computer. Context: 645 01:08:05,730 --> 01:08:10,920 OK, so remember, context was, if you have a target word, the number of words around it that you want to consider. 646 01:08:10,920 --> 01:08:17,910 A context of somewhere around eight is again what they recommended for the English language, what some studies have suggested. 647 01:08:17,910 --> 01:08:25,290 I chose six. And then downsampling is a process that you can use, in machine learning and elsewhere, to just 648 01:08:25,290 --> 01:08:33,940 make sure that very frequent words, in this case, aren't 649 01:08:33,940 --> 01:08:40,510 taking too much of the effect, like causing too much of the effect. 650 01:08:40,510 --> 01:08:44,980 And then you run it. This is how you run it: you give it your list of sentences, 651 01:08:44,980 --> 01:08:51,460 which are each a list of words, and you're just defining the parameters based on these things above. 652 01:08:51,460 --> 01:08:57,000 And then it runs, and it shows the progress as it goes along. 653 01:08:57,000 --> 01:09:03,470 And if you don't plan to add more data and keep training the model, then you just want to save it. 654 01:09:03,470 --> 01:09:09,060 I saved it. Now we can look. So we've created the model; it did what we talked about, 655 01:09:09,060 --> 01:09:14,150 I'm happy to go over it again, but it did the process we were talking about, 656 01:09:14,150 --> 01:09:18,740 the sampling, and created our vector space, and now we want to look at it. 657 01:09:18,740 --> 01:09:24,620 This is the vocabulary; here I'm only looking at the first twenty-five words. 658 01:09:24,620 --> 01:09:30,350 The vocabulary is every word that is in this vector space. 659 01:09:30,350 --> 01:09:37,150 You can check if a word is in your vocabulary. So I wanted to know, is analysis in my vocabulary? 660 01:09:37,150 --> 01:09:43,570 Is Oxford in my vocabulary? No, but maybe university is. 661 01:09:43,570 --> 01:09:48,340 Yes. So you can kind of explore that. Then you can see one of your vectors. 662 01:09:48,340 --> 01:09:53,670 I want to see the vector for philosophy. OK, oh, that's not it, 663 01:09:53,670 --> 01:10:03,560 I'm over here. Sorry, I've just got battery life issues. 664 01:10:03,560 --> 01:10:07,490 Like I said, each word is a vector, and a vector is just a bunch of numbers, right? 665 01:10:07,490 --> 01:10:15,470 Defining where it is in the vector space. So you can actually look at those vectors, extract those vectors if you wanted to.
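The training itself happens in Python with gensim here; R users could try the CRAN word2vec package the speaker alludes to on the citation slide. The call below is an assumption about that package's interface (check its documentation), mirroring the parameters being discussed: 100 dimensions, a window of 6, a minimum word count, downsampling and parallel workers.

    library(word2vec)

    # Train a skip-gram word2vec model on the corpus
    # (`sentences` is assumed to be a character vector, one sentence each)
    model <- word2vec(x = sentences, type = "skip-gram",
                      dim = 100, window = 6, min_count = 10,
                      sample = 0.001, threads = 4)

    # The embedding matrix: one row per vocabulary word
    emb <- as.matrix(model)
    "analysis" %in% rownames(emb)   # is a word in the vocabulary?
    emb["philosophy", ]             # the vector for a single word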
666 01:10:15,470 --> 01:10:25,150 And that's what that is: looking at the philosophy vector. That was only trained on unigrams. 667 01:10:25,150 --> 01:10:37,790 You might also want to do bigrams. Yes, a question about the vectors? 668 01:10:37,790 --> 01:10:46,440 How do you mean, like the individual dimensions? It's just, there'll be as many dimensions as you defined; that's how many numbers there will be in each vector. 669 01:10:46,440 --> 01:10:50,440 Do you mean whether it's just defining its relation to the other words in that space? 670 01:10:50,440 --> 01:10:57,140 I don't know. 671 01:10:57,140 --> 01:11:03,020 I mean, if you take two vectors, in terms of the relations, 672 01:11:03,020 --> 01:11:07,070 I don't think so. I would have to look at that. 673 01:11:07,070 --> 01:11:15,560 I know that, in terms of the space, the cosine similarity of two vectors is the metric of distance that is generally used, 674 01:11:15,560 --> 01:11:20,310 but the vector dimensions themselves, 675 01:11:20,310 --> 01:11:25,680 like whether you could take the first dimension and it would be like a topic or something, 676 01:11:25,680 --> 01:11:32,700 I don't know. I don't think so, but it's an interesting question. 677 01:11:32,700 --> 01:11:40,530 So we could train it on bigrams. I think I just ran past what we were doing, but we'll take a look. 678 01:11:40,530 --> 01:11:46,440 We could do it on bigrams. OK, so we have financial crisis and religious beliefs. 679 01:11:46,440 --> 01:11:50,820 This gets to the question earlier: we don't want all bigrams, 680 01:11:50,820 --> 01:11:59,340 we want bigrams that are common, and therefore it doesn't include arbitrary word pairs that aren't an actual phrase. 681 01:11:59,340 --> 01:12:04,580 Why does it keep doing that? But gender mainstreaming, or religious beliefs: 682 01:12:04,580 --> 01:12:06,690 these are pairs of words that tend to co-occur. 683 01:12:06,690 --> 01:12:11,970 There's something in the model that decides that, and you could get under the hood and change it if you wanted to. 684 01:12:11,970 --> 01:12:19,500 But it's nice that it's there. Then this is what I was talking about with the kind of algebra that you can do on the vector space. 685 01:12:19,500 --> 01:12:26,370 So in this, we're taking the model, the vectors, and then we say, I want to know which vectors are most similar to the vector for race. 686 01:12:26,370 --> 01:12:30,090 Give me the top 15. That's what that's doing. 687 01:12:30,090 --> 01:12:36,840 And in this context, we have ethnicity, class, gender, race and ethnicity, sexuality, gender ideology, et cetera, et cetera. 688 01:12:36,840 --> 01:12:41,540 And you're getting the cosine similarity of the two vectors there. 689 01:12:41,540 --> 01:12:49,400 Then you can say, give me what's similar to race, but when we remove stratification. This is what I was talking about earlier. 690 01:12:49,400 --> 01:12:55,040 It doesn't actually produce what I expected; I thought maybe it would produce stuff on politics. Maybe in a political science abstract setting, 691 01:12:55,040 --> 01:12:59,480 that's what it would give you, and you'd get like election and governor and things like that.
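The nearest-neighbour and vector-arithmetic queries shown here with gensim have rough R counterparts; again a sketch, assuming the word2vec package's predict() behaves as below, with the arithmetic done by hand on the embedding matrix.

    # Fifteen nearest neighbours of "race" by cosine similarity
    predict(model, newdata = "race", type = "nearest", top_n = 15)

    # "race" minus "stratification": subtract the raw vectors, then rank all
    # words by cosine similarity to the resulting vector
    v <- emb["race", ] - emb["stratification", ]
    cos <- as.vector(emb %*% v) / (sqrt(rowSums(emb^2)) * sqrt(sum(v^2)))
    names(cos) <- rownames(emb)
    head(sort(cos, decreasing = TRUE), 15)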
692 01:12:59,480 --> 01:13:04,910 But that's basically what that's doing: it's saying take the vector for race, 693 01:13:04,910 --> 01:13:12,010 but remove the vector for stratification, and then tell me what the meaning of that vector is now. 694 01:13:12,010 --> 01:13:17,300 And you could compare that to what it is with the vector for stratification included. 695 01:13:17,300 --> 01:13:20,010 And you can just see, you could take two vectors and combine them. 696 01:13:20,010 --> 01:13:24,950 I thought this was interesting this morning when I was testing the code: I was like, what about ethnography with results? 697 01:13:24,950 --> 01:13:30,590 Because usually in an ethnography abstract you don't talk about results, whereas a regression paper would. 698 01:13:30,590 --> 01:13:37,880 And so you see that you've got a negative correlation between ethnography and results, but a positive one between results and regression. 699 01:13:37,880 --> 01:13:43,190 The last thing that I did in this code is not something you usually do with word embedding models, 700 01:13:43,190 --> 01:13:47,030 but I was trying to think, how could you get a word embedding model 701 01:13:47,030 --> 01:13:57,230 to do something similar to a topic model? So what I did is I took all of those vectors and did a cluster analysis on the vector space to see 702 01:13:57,230 --> 01:14:00,590 what you would get, and this is what I thought of doing for the group 703 01:14:00,590 --> 01:14:04,070 exercise tomorrow: to take the same corpus and have some of you do a topic 704 01:14:04,070 --> 01:14:08,210 model and others do word embeddings, and then kind of discuss what you could say, 705 01:14:08,210 --> 01:14:13,820 what you couldn't, what you found, and how that changed with the exact same corpus. 706 01:14:13,820 --> 01:14:18,890 And so I thought I'd give a little bit of code in case I decide to do that exercise tomorrow. 707 01:14:18,890 --> 01:14:24,020 I selected a K of 50. Maybe I should have done 20 to be similar to the stm, 708 01:14:24,020 --> 01:14:30,590 but this is doing a cluster analysis on the vector space, and these are the clusters that you get, 709 01:14:30,590 --> 01:14:33,500 some of which are just two words that are not that useful, 710 01:14:33,500 --> 01:14:43,010 but others actually make a decent bit of sense. One that I found kind of fun, and it would have been even more fun in the context of the Princeton folks, 711 01:14:43,010 --> 01:14:47,940 because they do stuff with Fragile Families, is this: 712 01:14:47,940 --> 01:14:53,730 there's one cluster that's basically all Fragile Families. But if you look at these, some of them make sense, some of them don't. 713 01:14:53,730 --> 01:15:01,170 Maybe you need a larger K, maybe you need a smaller K. 714 01:15:01,170 --> 01:15:06,840 Yeah, this is kind of a cool thing that you could do with the vector space afterwards if you wanted to. 715 01:15:06,840 --> 01:15:15,460 And I wanted to provide some code for you to do it; there's a rough sketch below as well. Yeah. 716 01:15:15,460 --> 01:15:25,940 Does that make sense? I didn't go into the nitty-gritty of it; I'm happy to answer more questions. 717 01:15:25,940 --> 01:15:31,540 But I think it's been a long day. OK. 718 01:15:31,540 --> 01:15:36,405 OK, thank you. Yeah.
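The topic-model-like clustering described at the end is just k-means on the embedding matrix; a minimal sketch, continuing from the objects above.

    # Cluster the word vectors into 50 groups, loosely analogous to topics
    set.seed(1234)
    km <- kmeans(emb, centers = 50, iter.max = 25, nstart = 5)

    # Look at which words fall into each cluster
    clusters <- split(rownames(emb), km$cluster)
    clusters[1:5]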