So welcome back from the snack break. We are going to be talking about three methods of large-scale text analysis. There are others — there are dictionary-based methods, there are methods I've never even heard of — but these are three that are pretty prominent. The first is topic modelling, which you've probably all at least heard of, if not done; I'm sure a lot of you have run at least a basic topic model. Word embedding models are a little less common thus far in the social sciences, although I know a lot of people are working with them now, and they're really pretty cool and quite different from topic modelling. And then there are network-based approaches to text analysis, which are pretty long-standing but also have a ton of room to develop and for people to contribute.

So in our last session we talked about where we get this data. Then we talked about how we clean and pre-process it for analysis, and now we're going to analyse it. Before we do that, I'll just give us a point of reference. If I gave you a massive stack of documents that you'd never seen before, and asked you to find some way to read them and tell me what they're about, what the general themes are, what would be your process? Maybe some of you have done this — qualitative coding, right? How would you go about it?

[Audience response, partly inaudible: read through some of it and try to pull out themes.]

Yeah. I think that's probably what a lot of people would have said. What he said was basically: you take maybe a sub-sample of the corpus and read it, pick out some themes, and then, in a rather iterative process, you'd maybe expand or collapse the number of themes as you start to read and apply them to the entire corpus. Anything else, or is that roughly what people thought they would do? Yeah. So that's a pretty traditional way, and it's still a very valuable way of doing text analysis. But I wanted to give it as a reference point for these other methods that we use in computational text analysis.

So, first, topic models. Just to ask: how many of you have used topic modelling in your research, even just playing around with it? Only a few? Cool — well, then this will be really fun. How many of you have heard of a topic model? A lot more. Yeah. So what is a topic model? OK — where was the idea first proposed?
Well, actually, it was developed in population genetics in 2000 and then independently developed by Blei and colleagues — this is what I was mentioning before — in the context of text analysis in 2003. So again, it doesn't just have to be applied to text, which is interesting. The distinguishing feature of topic models is that they are an automated procedure for coding the content of texts, including very large corpora, into a set of meaningful categories, or topics. This is done algorithmically, with minimal human intervention, and it is therefore a more inductive way of drawing out topics than the hand coding we discussed before.

How does it work? The really simple answer is that it relies on the assumption that language and meaning are relational. So even though topic modelling is what we call a bag-of-words method — meaning it doesn't really care about the rules of language, syntax, narrative order, parts of speech and so on — it does still assume that there are content clusters that share and create meta-meanings, or topics. The algorithm is a probabilistic model, and the most widely used one, which a lot of us have heard of, is LDA, or latent Dirichlet allocation, which we'll talk quite a bit about.

That's an introduction, but the core assumption is that within a corpus there is a distribution of different topics, and then within each topic there is a distribution of words. So maybe in one corpus of political texts there's a topic about health care and one about elections and whatnot. And if you went into the health care topic, every word in the corpus would have a probability of occurring within that topic. So you have a distribution of topics across the corpus and within each individual document, and then within each topic you have a probability distribution over the words. Does that make sense?

And instead of, as we talked about with qualitative coding, starting by reading the text, getting a sense of what the topics might be, adding to them or collapsing them — in topic modelling you pre-specify how many topics you want the algorithm to uncover. How you predict, or choose, the correct number of topics is a long-standing area of research, and we could go on forever about it. There are some methods — we'll see, when we look at structural topic modelling, what they've developed to help you distinguish what a good number of topics is. A lot of the time the answer for social scientists has just been: know your corpus, and ask whether these topics make sense.
You might hope that, OK, if I have five topics and I see that marriage and family are mixed together in one topic and I want those separated, then I'll make it six topics and see if that topic breaks apart. Those of you who have done this know that's not the way it works — the whole thing can totally change, and you'll get a topic you've never seen before while the marriage-and-family topic stays the same. So there is a bit of art to some of it, you might say, and I think that's all the more reason we need good reporting standards about what we did and how we made our decisions. But the idea behind the algorithms of topic modelling is that you want to reverse-engineer whatever it was that the speaker or author of the text was trying to convey in terms of ideas.

[Audience question, partly inaudible: do you define the topics yourself, or does the model generate them?]

No, the model generates them, and we'll talk about how. I'm sorry, I didn't explain this image at all, but basically, if you have this text, you can see the different topics coloured across the different words. You see "foundation", "million", "support" and "public" all in green, whereas "New York Philharmonic", "performing", "opera" and "music" are all in red — that's maybe the arts topic. And so you can see how, in this document, there's a distribution of these different topics. These look pretty equal in distribution, but there might be one topic with very little representation in the document. And then, for each topic, each of these words has a probability of occurring in that topic. So there's a distribution of words within each topic and a different distribution of topics across the documents. And basically, what you do with topic modelling, which is what we'll discuss now, is try to get it to uncover those distributions.

So the objective is: given a corpus of text and a specified number of topics, find the parameters that likely generated it — recreate, in a very basic form, the intent of the speaker in terms of what they wanted to talk about. The primary input is the text and the number of topics you want it to uncover. And then the process is basically this: the algorithm has been told how many topics there are, and it has a vocabulary of words because you provided the text. It will take a topic —
It doesn't have any information about that topic yet, but it knows that there's some probability of that topic appearing in a particular document across the corpus. So it selects a topic, and then, for that topic, it selects a word and puts that word into the bag of words that is eventually supposed to grow into the whole document. Essentially, the algorithm is trying to recreate the corpus that you gave it by iteratively doing this — drawing a topic, drawing a word within that topic, and putting it in the bag — until it recreates something like the corpus you have. And then it says: OK, you gave me this number of topics; this is the best I could do to recreate the corpus you gave me, given that number of topics, by changing the probability of each word occurring in each topic.

So the output you get is a word distribution for every topic. If we go back — if I then wanted to look just at this arts topic, the red one — it wouldn't give you the label, but you would call it the arts topic; we'll see this later. You could look at that topic and say: give me the word probabilities within it. Maybe "art" is super high, so it has a high probability of occurring whenever art is being discussed, whereas a word like "procedural" might have a low probability in an arts topic — but it would still be there. Every word in the corpus has a probability within every topic; it's just that some have a very, very low probability within a given topic and some have a higher one. Does that make sense? It's hard to wrap your mind around — it's been hard for me, over the years, to continually wrap my mind around it. But ultimately, as I said, you get a distribution of words for each topic and a distribution of topics over the corpus. So you know not only how likely a certain word is to pop up if a certain topic is being mentioned, but also how likely that topic is to be mentioned in the corpus, or within an individual document.
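To make those two outputs concrete, here is a minimal sketch in R of fitting a plain LDA model and pulling out both distributions. It is an illustration rather than the tutorial's own code: it assumes a quanteda document-feature matrix called my_dfm from the earlier pre-processing session, and it uses the topicmodels package rather than the stm package we turn to later.

    library(quanteda)
    library(topicmodels)

    # assumed to exist from the pre-processing session: a document-feature matrix
    dtm <- convert(my_dfm, to = "topicmodels")

    # fit LDA with a pre-specified number of topics
    lda_fit <- LDA(dtm, k = 20, method = "Gibbs", control = list(seed = 123))

    terms(lda_fit, 10)         # ten highest-probability words in each topic
    post <- posterior(lda_fit)
    post$terms[1, 1:10]        # word probabilities within topic 1 (one row per topic)
    post$topics[1, ]           # topic proportions for document 1 (one row per document)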
What are topic models not? I think this is a really important point. It's a long quote from Harold Lasswell, whom I mentioned earlier as a forerunner of content analysis in the modern age. I know it's long, and I hate when people read long quotes, but we're going to do it anyway. The content analyst should begin with traditional modes of research, and the person who wishes to use content analysis for a study of the propaganda of some political party, for example, should steep themselves in that propaganda before they begin to count. They should read it to detect characteristic mechanisms and devices. They should study the vocabulary and format. They should know the party organisation and personnel. From this knowledge they can organise their hypotheses and predictions. At that point, in a conventional study, they can start writing; at this point, in a content analysis, they are instead ready to set up their categories, pre-test them and start counting.

That was written in the context of a much more qualitative type of content analysis, but in my opinion it absolutely applies to topic modelling. In my first course in text data, with Neal Caren at UNC, my final project looked at a corpus of religious texts from a religion I was particularly familiar with. I was looking at the topic model over time, and there were anomalies in it that, had I not been familiar with that religion from a family background, I would not have understood — why those anomalies were there, and why one topic totally shifted in a particular year. It wasn't because the congregations suddenly changed, but because the leadership of that church totally shifted in that year, and the culture of that religion is very much driven by the personality and interests of whoever is the leader at the time. I knew who the leaders were, and one was really strict, and so it changed. But that's the sort of thing where you really want to understand your texts.

And in a more recent context, applied to topic modelling, we have this: seen in this light, it's useful to think of topic models not as providing an automatic text analysis programme, but rather as providing a lens that allows researchers working on a problem to view a relevant textual corpus in a different light and at a different scale. We can just look at way more text, and we can do it in a slightly different way than we would with qualitative coding. That's true of all the methods we'll use. It's just that reiteration that it's not as if you don't have to read the text anymore.

In the tutorial we're going to look at structural topic models — I'm not going to read this slide, because I've read enough of them. It's basically a topic model, but you get to add in variables, like demographic variables.
Maybe it's whether whoever said this was a Republican or a Democrat. Were they a man or a woman? Was it the 18th century or the 19th century? You can have continuous variables: what date was it said, how long is the text it came from, that sort of thing. And you use those to see how a particular topic changes depending on who said it or in what context. So again, this is a type of topic model designed for social researchers, which is pretty cool, I think, and it was developed around 2014. You can read more about it there. And we'll use the R package this time — I thought there was more on that slide; I guess not.

So, I was going to take a break and then do the structural topic model, but I kind of feel like I could just talk about word embeddings immediately and then do both of those tutorials back to back. Does that sound OK? Yeah. OK, so we're going to contrast topic modelling with word embedding models. A lot fewer people have used word embedding models, but I genuinely believe you will hear about them increasingly in the context of social scientific research. Part of the reason we're not just following the Princeton livestream is that there's so much to text analysis, right? They do an excellent job covering sentiment analysis and that sort of thing, but they don't cover word embeddings. So now the livestreamers can get even more than they would have otherwise — and you as well, because you can totally go watch the Princeton livestream later on if you want.

OK, so we just discussed topic modelling, where there are distributions of topics over your corpus and a distribution of words within every topic. Word embedding models are completely different. Well, first off, what is a word embedding model? A method of text analysis that results in a matrix of word vectors, in which words used in similar contexts are closer together in vector space than words used in different contexts. Contrast that with the output of a topic model, which was this distribution of topics over the corpus and distribution of words within each topic. You're getting at something very different here, something that has a lot more to do with the relationships between words within your corpus, not necessarily the topics in your corpus. Although, just out of curiosity, I did do something in the tutorial to see if we could also get something like topics out of a word embedding. I'll skip ahead real quick and come back.
This is kind of what your output could look like. It wouldn't really be in just two-dimensional space, and you can't see this, but these are all different words, and the ones that are similar are close together in this vector space, while the ones that are different are far apart. When we go back — can someone remind me when I'm streaming from my computer — we can go to, or you can go yourself to, projector.tensorflow.org, and you can see this in three-dimensional space, which is really more what the vectors are creating, because it's not a two-dimensional distance. But I'm probably getting ahead of myself.

So this is what we're trying to create: ultimately we want words that are similar to be close together and words that are different to be far apart. Early versions of this used latent semantic analysis — I'm not going to go into that; it would be too much if most of us haven't even heard of these methods yet. How does it do this? Basically, it uses a shallow neural network to produce these relationships — and we'll talk about what that means in this context. These ideas have been around for a long time, but as with a lot of things, progress in computing power and in neural nets has allowed word embeddings to go from doing a semi-good job to doing a very good job, and to help with things like machine translation or, you know, Siri on your phone. But we haven't used them much in social scientific research — in some respects because they're a little intimidating, though they're really not once you step into them, and also just because most of us haven't had the training. So this is hopefully a good introduction.

So how do these word embedding models actually work? There are two different approaches: a continuous bag-of-words approach and a skip-gram approach. Stop me if I'm getting ahead or if something doesn't make sense. OK, let's say we have a given text. The continuous bag-of-words approach would slowly read in that text, and for each word it would try to predict that word given the context the word appears in. "Context" here is defined, or operationalised, very explicitly by you as the researcher: in the same way that with topic modelling you had to say how many topics there are, here
you have to say what the context is. There's research suggesting that approximately eight words on either side of whatever word you're trying to predict is usually a good notion of context. That's a really long sentence — 17 words — but that's what people have found to work well. So you define context as eight words on either side of your target word, and given all of those context words — those 16 words — tell me which word is missing. The other way is skip-gram: predict the context given a word. You're given a word; tell me what other words might occur around it. This is probably not intuitive quite yet, but we're going to describe it.

So, how many of you have heard of word2vec? More of you — so word2vec is a word embedding. When I saw this I suddenly wondered whether that might be what people know: word2vec is an implementation of a word embedding model, so you're more familiar than you realised.

How does it do it? We'll take the skip-gram example, because that one's a little more common and it's what we'll be doing later. You take a word or concept in your training corpus — call it the target — and the words that lie close to it — the context — which, again, you define as, say, eight words around it, or in this case I think I used five. So here's an example from Catch-22. It's a quote; we'll just focus on the last sentence: Captain Black, who had aspired to the position himself, maintained that Major Major really was Henry Fonda but was too [blank] to admit it. Say we define Major Major as our target — that's the one word we're trying to predict. Where will Major Major's name turn up in the text? Well, it might turn up whenever we see these other words, or at least some of them, in the text. So, Henry Fonda: maybe repeatedly throughout Catch-22, Major Major is mentioned in context with Henry Fonda. As you read the text slowly and keep coming across Major Major in that context, you'll slowly bring those two closer together. That's the intuition of it.
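As a rough illustration of what training one of these looks like in practice, here is a sketch using the word2vec package in R. The object abstract_text is assumed to be a character vector of documents; the settings simply mirror the discussion above (skip-gram, a window of eight words on either side) rather than tuned recommendations.

    library(word2vec)

    set.seed(123)
    model <- word2vec(x = tolower(abstract_text),
                      type   = "skip-gram",   # or "cbow" for continuous bag of words
                      dim    = 100,           # length of each word vector
                      window = 8,             # context: 8 words either side of the target
                      iter   = 10)

    # nearest neighbours of a word in the trained vector space
    predict(model, newdata = "gender", type = "nearest", top_n = 10)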
More specifically, there are these five steps — and this is a really basic account; you can obviously read forever on the computational and mathematical science behind it. You take your target word, and each word is represented by a vector, a bunch of numbers. At the very beginning these can just be random. But your aim is that, by the end of your analysis, the vectors of words that occur together are close together in vector space and those that never occur together are far apart. You do this every time you come across a word — I'm just going to skip these slides; you can read them — but each time you come across the word, you take its context and bring all of those context words closer together in vector space, and then you take a random sample from the rest of the vector space and push those vectors further away. As you continue reading the text, you're slowly bringing the contexts together and pushing words away whenever you don't see them in that context. Does that make sense? It's pretty neat, right?

[Audience question, partly inaudible: isn't that computationally a lot to do?]

Yeah — I mean, these two that I mentioned, continuous bag of words and skip-gram, are just two approaches. There were others that preceded them that dealt with the entire corpus at once, and the neural-net training is what allowed you to make the context much smaller, which evidently improved things quite a bit.

[Audience question: why push words away at all, rather than just bringing the context words closer?]

I don't know — I haven't thought about that, and I'm sure the computer scientists have; maybe they've just tested, with face validity, which one works better. But it does seem like if you push things away you get more distinct clustering at the end, if you're thinking in network terms. So, does that make sense? This is really just supposed to be an introduction, so that maybe when you come across word2vec in the future you generally know what it's doing.
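For reference, this "pull the context closer, push a random sample away" step corresponds to the standard skip-gram-with-negative-sampling objective from the word2vec papers; the notation below is my summary rather than anything on the slides. For a target word w, an observed context word c, and a set N of randomly sampled "negative" words, the model adjusts the vectors to increase

    \log \sigma(v_c \cdot v_w) + \sum_{n \in N} \log \sigma(-v_n \cdot v_w)

where \sigma is the logistic function: the first term pulls the true context word's vector towards the target, and the sum pushes the sampled negatives away.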
[Audience: could you go through that one more time?] Yeah, definitely. OK, let's work through it on the fly; we'll use this quote as the example. As I slowly read in this quote, say I get to Major Major. What the algorithm will do is take the vector for Major Major — I mentioned that it's the dot product of the vectors that essentially decides what counts as closer and further away — and take the vectors for all of these other words around it and bring them closer together, and then take a random sample from all the other words and push those further away in space. Then it moves on to "really", the next word, and does the same thing. So then, twice, Major Major was brought close to "really": once when Major Major was the target word, and once when "really" was the target word. Slowly, as you do that, you get these distinct communities of words that occur in context with one another. Because if Major Major never appears in the context of, say, "discussing", then by the end there would never have been a point at which those two were brought close together, and they'll end up far apart in vector space.

Yeah — these are just the vectors in space. And you define the number of dimensions you want, and this link will give it to you in three dimensions, which is a little better. Any other questions?

I think it's really important, when we're using a lot of these methods, to ask where they might break. Topic modelling can break in a variety of different ways. Word embedding, for one, does particularly well with large corpora, because it needs to see things repeatedly. If you only see a word once, it only gets to do that vector adjustment — that context creation — once, and maybe that was an aberrant or biased use of the word, and suddenly you think two words are related when they're not. It also does a little better with topically consistent, or just linguistically consistent, usage of words. Say you had two communities who use "moral" in very, very different ways: you would want to know that, and maybe do those analyses separately. But, as we'll see, there is a cool thing about the results of these models that lets you do a bit of vector algebra — taking out one vector and seeing how the vector space changes. We'll see that when we do the tutorial.

Here's a classic example of why this is such an interesting method: ultimately you can do this sort of vector algebra where, say, you take the vector for "king" — we know all the vectors — subtract the vector for "man" and add the vector for "woman". If the algorithm were really robust, it would give you the vector for "queen". And you can do these — we'll play with them in the code. We're playing again with the sociology abstracts, so you can do things like: if I take "race" and subtract "inequality", maybe it gives me "election", right? Not definitely.
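Here is a minimal sketch of that vector algebra, continuing with the word2vec package from the earlier sketch. It assumes the classic analogy words actually appear in the model's vocabulary, which they may well not in a corpus of sociology abstracts; the point is just the mechanics.

    # the embedding matrix: one row per word, one column per dimension
    emb <- as.matrix(model)

    # king - man + woman should land near queen, if the embedding is well trained
    target <- emb["king", , drop = FALSE] -
              emb["man",  , drop = FALSE] +
              emb["woman", , drop = FALSE]

    word2vec_similarity(target, emb, top_n = 5, type = "cosine")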
But maybe you can work up hypotheses that way. And there's some cool research I know of, where social scientists are trying to look at cultural vectors across time in this way — training embedding models at different points in time and seeing how the space changes, maybe looking at a profession and seeing whether it moves closer to "high status" or lower later on. You know, if some of our theories about gender in the labour force are right — that women entering a profession makes it less prestigious — you could possibly uncover that with word embedding models. When we're talking about lawyers: as women entered that profession, did "lawyer" become a less prestigious thing? That sort of question.

How does it differ from a topic model? What was I going to say — I had one other thing to say about this... oh, another really important thing. Well, actually, that fits with this. How does it differ from a topic model? Ben Schmidt has some great tutorials on word embeddings online, and I learnt a lot from them — you've probably seen some of his intuitions pop up in here. He says this, in trying to describe the difference: whereas a topic model aims to reduce words down to some core meanings, so that you can see what each individual document in the library is really about — get rid of the fluff, I just want to know the themes, the topics; effectively it's about getting rid of words so I can understand documents more clearly — word embedding models do nearly the opposite. They try to ignore information about individual documents so that you can better understand the relationships between words.

I mostly agree with this, but you can get some of that from topic models too — they have distributions of topics across the corpus as well, so you do have some sense of the relationships between words in different contexts. So there's a little more to it. One difference is that in topic models you sort words into a predetermined number of topics, and you don't aim to create these continuous vector-space relationships between the words. I say you don't aim to, because you could create something like that out of a topic model by looking at the probabilities of words within topics — words could sit farther apart or closer together based on how probable they are to occur in the same topic — but that's not really what it tries to do; that would just be you being innovative. And topic models don't do well at representing the relationships between words, or how words mediate and moderate one another's meaning.
Again, that's the kind of thing you get from the matrix algebra you're able to do with embeddings, which is a strength of the embedding model. Any questions? Oh, that's the one last thing I was going to say: topic models rely on co-occurrence, right? Two words will end up in the same topic if they co-occur in the text. That doesn't have to happen in word embedding models. Say we have a mixed US and British corpus, and "pupil" and "student" are used interchangeably but never appear together — there's no co-occurrence of "student" and "pupil" precisely because they're synonyms. In word embedding models, because they appear in the same contexts — both "pupil" and "student" appear around "classroom" and "grades" or "exams" or whatever — you would still have them close together in vector space, which is kind of neat.

[Audience question, partly inaudible: there are some very good pre-trained embedding models you can just download and use alongside your own, aren't there?]

Yeah, that's a good one. You can just download pre-trained embeddings and use them — they've been trained on massive corpora. You might not want to say something specific about your own corpus with them, but you could validate your corpus against them. One thing worth mentioning, though, is that whether that validation makes sense depends on whether those embeddings are a reasonable ground truth for the world you're looking at. Validating against all of Wikipedia might not mean your model is valid, because your corpus is different — maybe you're looking only at some Aboriginal community, or something else that doesn't map onto the general, modernised Western world. But yes, I think those pre-trained models are really useful for looking at culture or language in general.

[Audience comment, partly inaudible: at a firm I worked for, we wanted to look at the sentiment of people talking about things happening in their lives. People used "offshoring" and "outsourcing" to mean the same thing, but the two words never actually occurred together, so I used word embeddings to show that they were so close to one another that I could treat them as interchangeable and swap one in for the other.]
Yeah — so you can use it that way, to make your data nicer and larger by treating words as exchangeable. Similarly, one of the things I did with mine: I'm interested in gender inequality in creative professions, and there are obviously a lot of different artists and creatives mentioned in the corpora I work with. I had demographic information on a large number of artists, and I went in and replaced all the women's names with the token "woman-word" and all the men's names with the token "man-word", so in effect they became the same word. It's slightly different, but that way there were enough occurrences to get a sense of how men and women are discussed differently, which I wouldn't have had for any individual artist. Another way I could have gone about it is, similarly, to train the embedding model and see that all of these artists cluster together and all of those artists cluster together, and then use them as synonyms for female and male if I wanted. There's a lot you can do with these methods, and very little of it has been done so far in social science.

[Audience question, partly inaudible: antonyms like "good" and "bad" come up when people talk about similar things, so don't they end up close together in vector space? How have people handled that — taking "good" out and looking at "bad" in relation to the other vectors, then putting "good" back in and taking "bad" out? Is that how you handle it?]

Yeah, generally that's the sort of thing people do. It's interesting, because my intuition would be: I just want to see what people talk about when they're talking about bad movies, say, and figure it out from there.

[Audience: what about people who use double negations?]

Yeah, you have to do something about that. With a lot of these bag-of-words methods that ignore syntax, this becomes a huge problem: the meaning is built into the rules of language that we're supposedly ignoring a lot of the time. Similarly, a professor of mine in the sociology of culture — he was very sceptical of these large computational methods in general, particularly for text — once insisted we ask: what becomes text? Not everything we do in social life becomes any form of meaningful text.
And what is left out? In the context of social media studies that question becomes even bigger: not only what becomes language, but what gets onto Twitter at all? What do we never say on Twitter — which might be highly personal things, or long-winded, intricate things, because we simply can't.

[Audience example, partly inaudible: if you ask people in the UK "How are you doing?", they'll say "Not bad", whereas in the US they'll say "Good" — so if you're not careful you won't read "not bad" as something positive.]

Yeah. OK. There is the third method, network analysis, but I think we should pause and do the tutorials on topic modelling and word embedding, and if we have time we can cover network analysis. Basically, the network analysis material just gives an introduction to network analysis, with some examples from text, but largely it lets you brainstorm how you could create new variables, new metrics, by thinking about text analysis in the context of networks. A lot of that is in the materials folder, called "text nets" or "network analysis materials" or something like that, and you can probably go through it on your own. I just want to make sure we get to some of the tutorials before we get exhausted. So let's do that.

OK. We have an hour to go through three very large methods; we're probably only going to get through two, but again, if anyone wants to go through text nets later, I'm happy to. Let's start with structural topic modelling, developed by Molly Roberts and her colleagues in 2014. They've got this really, I think, exemplary R package that helps you from start to end in using their method, visualising what you're doing along the way, and thinking through the important decisions you're going to have to make. The package covers all of these things: ingesting, manipulating and pre-processing the data; estimating the models; calculating covariate effects; and then visualising those and helping you make decisions.

This is very much the tutorial phase, so stop me and ask questions and everything. Again, the only reason I'm not having you run it live is that some of the models would have us sitting here for forty-five minutes waiting, and if we used all the data, even longer. But you load the data — so we load the... maybe I should bring up the HTML.
We can go through that, and I'll skip over to my somewhat uglier code when we need it. It's very small — there we go. OK, let's do that. So you load the packages and you load the data — we're loading our sociology abstracts. Then you clean the data and describe it a little. In this corpus there were some texts that were just empty, so you want to make sure you don't have those, and, like we did before, you might want to remove duplicate texts. Then we can take a look at things like the number of abstracts by year — basically getting a sense of our data, as we would at the beginning of any type of analysis.

Again, the key to structural topic models is that you can have these other variables. In this case I'm going to look at how being published in a certain journal changes the language in an abstract, or the topics studied in that journal. So that's one of the variables you can add to a structural topic model. These are all the journals the abstracts come from and their prevalence in the corpus — again, just getting to know your data before you do anything with it. Here, as before, I'm creating a binary variable for whether or not an abstract appeared in the American Journal of Sociology or the American Sociological Review: if it occurred in either of those it gets a one, and if it didn't it gets a zero. And this is what our data looks like, which you probably already know: we've got our different variables and then this new one that I've created.

The stm package does a lot of the pre-processing that we did with the quanteda package on its own — a lot of these topic modelling packages do. Again, there are always a million different ways you could do something, and I wouldn't be surprised if the stm package is relying on tidytext or quanteda functions behind the scenes to do this cleaning, but it will do it itself with the commands we'll look at. So textProcessor builds the corpus — we talked about a corpus object in the quanteda package, a data object, and stm has the same thing — and you build it by calling textProcessor, feeding it your data, identifying which variable or column the text is in, and telling it where to find the rest of the metadata, which in our case is in the same data frame. In other cases you might have the texts with an ID and the metadata in a different, relational-database sort of structure.
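In code, that step looks roughly like the following. It's a sketch, assuming a data frame called abstracts with the abstract text in a column called text and the metadata (journal, year, the AJS/ASR indicator) in the other columns; the object and column names are mine, not necessarily the ones in the tutorial files.

    library(stm)

    # build the stm corpus object: documents plus their metadata
    processed <- textProcessor(documents = abstracts$text,
                               metadata  = abstracts)

    str(processed, max.level = 1)   # documents, vocab, meta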
Then you pre-process your texts, and stm has some cool stuff here. There are a lot of words you may not want to analyse. There are stop words, of course, which you could just remove, or you could reason that if something occurs in, say, 90 per cent of the texts, it's probably not particularly interesting and you just want to take it out. So here what you're doing is setting a sequence of lower thresholds, from one to two hundred in steps of ten, meaning: show me how many words would be removed if I dropped words that occurred fewer than once — I mean, not at all; that doesn't even make sense — then fewer than ten times, fewer than twenty, fewer than thirty, fewer than forty, and so on up to two hundred. And you get these curves: this is how many documents you'd remove — which is none (that was scary) — how many words you'd remove, and how many tokens you'd remove. Remember the distinction: if "Taylor" was said four times, that's four tokens, but only one unique word type — all of those occurrences collapse into one.

So the plot is showing me that the curve starts to flatten at a lower threshold of about seventy-five; beyond that you're not really removing many more words as you raise the lower threshold. Does that make sense? You can also set an upper threshold, so you can say: get rid of any words that occur more than this many times, because they're probably not useful, or fewer than this many times, because they're probably rare and obscure and not very useful. So that's another way, besides specifying actual stop words, of removing words you don't want to analyse, and the stm package plots these graphs for you, which is really nice for efficiency. I set my lower threshold to 50 and my upper threshold to about 50 per cent of the documents — saying, if a word occurs in more than 50 per cent of the abstracts, get rid of it. That was a late-last-night heuristic; it wasn't really that informed, but you would hopefully be more informed as you went about it.
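The threshold exploration and the actual trimming look something like the following, reusing the processed object from the sketch above; expressing the 50 per cent upper threshold as a document count is one way (my assumption) of doing what was just described.

    # how many documents, words and tokens would be dropped at different lower thresholds
    plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 10))

    # keep words appearing in at least 50 abstracts and in no more than ~50% of them
    out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                         lower.thresh = 50,
                         upper.thresh = floor(0.5 * length(processed$documents)))

    head(out$words.removed)   # the vocabulary that was dropped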
Then you can look at which words were removed when you did that. At first I freaked out, because I thought, wait, why are "divorce" and "ethnicity" and all of these being removed? But actually, what's being removed are things like "family--divorce" and "ethnic--established" with a double hyphen, those sorts of things. The actual lemmas or stems of "family" and "divorce" are still in the corpus; it's only the versions where they're stuck together that get dropped. I could have fixed this by doing a better job of removing punctuation — double punctuation — but I didn't. This is just to show you that once you remove your words, it saves which ones were removed, so you could report them if you were being very good about reporting in your journal's appendix.

OK, so after you've done that, you have this output object, and within it you have the documents — maybe some documents got removed; with your thresholds, certain documents may have been so short that none of their words survived, so they got kicked out, and now you have a new set of documents. The vocab is the actual words that are left, excluding the ones you dropped, and the meta is the metadata associated with each document — again, the journal or the year it was published. Does that all make sense?

Then you move on to estimating the model. Like we said, with topic models you have to select K: how many topics you want to derive from your corpus. That's not always a straightforward decision, and one way is to run the model a million times and look at the differences — it can take a really long time, but you should probably do it. stm provides this great function, searchK, in which you supply a vector of possible numbers of topics and it gives you diagnostics on what happens when you change that number — something that's not available in a lot of other topic modelling packages, so it's kind of cool. It doesn't solve the problem for you at all; you still have trade-offs, as we'll see — do you want semantic coherence, or do you want to optimise some other measure, in picking your number of topics? But it gives you help in making that decision, which other packages tend not to. I don't run it here because it takes a long time, but if you want to run this code on your own, you can search over K and look at the diagnostics that then help you make the decision.
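A sketch of that search, with an arbitrary set of candidate K values and the same covariates used below (the variable name ajs_asr for the AJS/ASR indicator is my placeholder):

    # compare diagnostics across several candidate numbers of topics (slow!)
    k_search <- searchK(out$documents, out$vocab,
                        K = c(10, 15, 20, 25, 30),
                        prevalence = ~ ajs_asr + s(year),
                        data = out$meta)

    plot(k_search)   # held-out likelihood, residuals, semantic coherence by K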
454 00:47:06,630 --> 00:47:15,920 you have less semantic coherence, meaning that the words don't really cluster together that well, when you have 455 00:47:15,920 --> 00:47:20,630 too many or too few topics, and then as you move it up, they start to fit together better. 456 00:47:20,630 --> 00:47:26,810 So then when you look at the topic words, maybe they have more face validity and you're like, oh yeah, that's a very clean topic. 457 00:47:26,810 --> 00:47:34,940 But there are other diagnostics that you could run that might give you something else. 458 00:47:34,940 --> 00:47:43,430 So anyway, there are other ways to go about checking how you want to define K. 459 00:47:43,430 --> 00:47:49,760 But in this case, I just picked 20. 460 00:47:49,760 --> 00:47:53,820 I specified that year is going to be one of my covariates, 461 00:47:53,820 --> 00:47:58,580 and one thing I haven't mentioned yet: those covariates that you define can change the model in two ways. 462 00:47:58,580 --> 00:48:10,400 One, they can change how prevalent a topic is. So maybe liberals in the US talk much more about gun control. 463 00:48:10,400 --> 00:48:13,340 And so the prevalence of the topic of gun control 464 00:48:13,340 --> 00:48:22,520 goes up if the speaker is a liberal. Whereas say you had a topic for, like, the 465 00:48:22,520 --> 00:48:32,650 right to bear arms; I bet those two would actually get put together, but if you did have a separate one, it probably comes up more with conservatives. 466 00:48:32,650 --> 00:48:40,750 There's a little bit in here about some of the really interesting problems of the posterior being non-convex with topic models, 467 00:48:40,750 --> 00:48:42,610 and I don't really want to go too much into that. 468 00:48:42,610 --> 00:48:49,990 But basically what it's saying is that wherever you start in searching for the optimum will change what your outcome is, 469 00:48:49,990 --> 00:48:53,620 basically, what your results are. 470 00:48:53,620 --> 00:49:01,690 And so one thing that stm does pretty well is let you set a number of runs, where it will fit the model again and again, 471 00:49:01,690 --> 00:49:06,570 as you see in these outputs, kind of starting at a different point each time. 472 00:49:06,570 --> 00:49:14,820 So one thing that's really good about that is you know that your particular run didn't kind of create what you see. 473 00:49:14,820 --> 00:49:19,570 But this is what the output of that is. You'll see these are kind of the topics that start to emerge. 474 00:49:19,570 --> 00:49:21,480 I said, give me 20 topics, right? 475 00:49:21,480 --> 00:49:32,640 Here's one on immigration, economy, market; social network structures; health, relationships and risk; politics, state, movement. 476 00:49:32,640 --> 00:49:34,470 You start to have a bit of face validity, right? 477 00:49:34,470 --> 00:49:41,970 Neighbourhood, population, area; women, gender, men, white, black, maybe stratification; education and school. 478 00:49:41,970 --> 00:49:43,800 So these are some of the topics you start to get out, 479 00:49:43,800 --> 00:49:53,670 but it's running it over and over again, and you'll see in each iteration they'll change a little bit.
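A sketch of this estimation step under the same assumptions as above: a 20-topic model with prevalence covariates, and, to address the sensitivity to starting values just described, refitting from several starting points with selectModel.

    # Fit a 20-topic structural topic model, letting topic prevalence vary
    # by journal group and smoothly by year
    fit <- stm(documents = out$documents, vocab = out$vocab, K = 20,
               prevalence = ~ ajs_asr + s(year),
               data = out$meta, max.em.its = 75, seed = 1234)

    # Or: because the posterior is non-convex, fit the model from several
    # different starting points and keep the runs for comparison
    runs <- selectModel(out$documents, out$vocab, K = 20,
                        prevalence = ~ ajs_asr + s(year),
                        data = out$meta, runs = 10, seed = 1234)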
480 00:49:53,670 --> 00:49:58,800 You'll want to run a lot more iterations than I do in the script, I think I mentioned that, 481 00:49:58,800 --> 00:50:03,210 and you'll want to read the documentation if you want to actually do stm, but it just takes a long time. 482 00:50:03,210 --> 00:50:09,420 So I set it pretty low. Ultimately, what you'll get, though, and I said only two, so we only did that twice, 483 00:50:09,420 --> 00:50:12,240 is that if you did it like 50 times or something, 484 00:50:12,240 --> 00:50:21,570 you would end up with this graph looking at the balance of exclusivity versus semantic coherence within your topics. 485 00:50:21,570 --> 00:50:27,810 Optimising on exclusivity would say, I want words to be pretty darn exclusive to an individual topic, because if you remember, 486 00:50:27,810 --> 00:50:31,650 each topic is a distribution over all the words in the corpus. 487 00:50:31,650 --> 00:50:39,660 And if you want words to have a very high probability in only one or very few topics, you're probably going to emphasise exclusivity. 488 00:50:39,660 --> 00:50:40,750 And you might have a model, 489 00:50:40,750 --> 00:50:48,060 this only ran two models and they were very similar, but you might end up with one of your runs that's like way over here. 490 00:50:48,060 --> 00:50:52,590 And if you wanted to emphasise exclusivity, that might help you make your decision. 491 00:50:52,590 --> 00:51:01,400 Semantic coherence is just how well the words all fit together. Maybe they're not exclusive to one topic, but they all go well together 492 00:51:01,400 --> 00:51:06,420 in one topic. Yeah? 493 00:51:06,420 --> 00:51:15,150 Well, it's computed from the probabilities of the words. You could get into the function and look at exactly how it's doing that, 494 00:51:15,150 --> 00:51:19,380 and you could probably work out pretty well how it's doing it from the code. 495 00:51:19,380 --> 00:51:29,220 But yeah, it's just looking at, when you put it into the different topics, what are the probabilities of the words within that topic? 496 00:51:29,220 --> 00:51:35,850 How much do they all have a high probability of being within that topic, versus 497 00:51:35,850 --> 00:51:43,260 the balance of that probability against their occurrences within all the other topics? There is a formal definition; I might have, 498 00:51:43,260 --> 00:51:48,330 I used to have it in here. 499 00:51:48,330 --> 00:51:56,070 Yeah, no, it just says semanticCoherence provides a semantic coherence measure for all topics within each model. 500 00:51:56,070 --> 00:52:05,280 But their documentation has the actual code and the details of how they compute the semantic coherence. 501 00:52:05,280 --> 00:52:11,520 Exclusivity is another one that you can look at, and sparsity. Anyway. 502 00:52:11,520 --> 00:52:14,880 You can check the residuals. We're going to kind of skim past this. 503 00:52:14,880 --> 00:52:21,780 All of this is documented more in there. They've got a ton of documentation. 504 00:52:21,780 --> 00:52:31,710 Then you can look within a particular topic. It looks like topics 17 and 14 are super high on both exclusivity and semantic coherence, 505 00:52:31,710 --> 00:52:39,530 whereas topic 11 is not that great on either. And topic 12 has good exclusivity, but not a lot of semantic coherence. 506 00:52:39,530 --> 00:52:47,450 And you could go back to those topics and look at why, you know, and maybe change things according to that.
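This comparison can be reproduced roughly as follows, building on the sketch objects above (which run to keep, and the number of topics, are assumptions):

    # Compare the runs from selectModel on exclusivity vs semantic coherence
    plotModels(runs)
    fit <- runs$runout[[1]]   # keep one of the runs

    # Per-topic diagnostics for the chosen model
    coh  <- semanticCoherence(fit, out$documents)
    excl <- exclusivity(fit)
    plot(coh, excl, xlab = "Semantic coherence", ylab = "Exclusivity")
    text(coh, excl, labels = 1:20, pos = 3)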
507 00:52:47,450 --> 00:52:58,340 Let's see, what was topic 14? So this one: mother, father, child, children, household, parent, divorce. That was good on both measures. 508 00:52:58,340 --> 00:53:06,110 That's a pretty clean topic, whereas topic 11 wasn't so great: skills, teach, decision, implement, programme, career. 509 00:53:06,110 --> 00:53:12,560 Maybe that's because it's mixing the labour market with education, and a lot of it is about education. 510 00:53:12,560 --> 00:53:15,350 How does that affect the labour market, right? 511 00:53:15,350 --> 00:53:25,450 Maybe that's why it doesn't have quite as much exclusivity, or because words about the labour market occur throughout a lot of other topics as well. 512 00:53:25,450 --> 00:53:34,320 This is just the probability or the prevalence of different topics throughout your documents: the expected topic proportions. 513 00:53:34,320 --> 00:53:39,480 And here you're comparing the word probabilities across two topics, topic 1 and topic 20. 514 00:53:39,480 --> 00:53:43,500 Topic 1, sorry, it's cut off, but it's like immigrant, market and labour force, 515 00:53:43,500 --> 00:53:48,270 and as we were saying earlier, the font size is bigger because the word has a higher probability, and it's further 516 00:53:48,270 --> 00:53:53,190 over to that side of the spectrum because it's more associated with that topic. 517 00:53:53,190 --> 00:53:58,830 Whereas it looks like topic 20 is more about sociological theory. 518 00:53:58,830 --> 00:54:00,810 Theory is very prominent in that one, 519 00:54:00,810 --> 00:54:07,770 but things that are more toward the middle are like development, maybe because you develop a theory, 520 00:54:07,770 --> 00:54:13,320 but also you have immigration, because international development programmes or something like that, 521 00:54:13,320 --> 00:54:20,430 or internal migration and development programmes, often coincide. So you can start comparing, see. 522 00:54:20,430 --> 00:54:25,380 I just did that with topics 1 and 20; with that parameter, you can start looking at it. 523 00:54:25,380 --> 00:54:32,960 This is the histogram of the distribution of topics. Anyway, the point is that not often, as most of you probably know, in R packages 524 00:54:32,960 --> 00:54:39,290 are you given this much structure for doing analysis, where they're helping you run diagnostics 525 00:54:39,290 --> 00:54:47,240 and you don't have to create the visuals yourself; they're doing it. So just huge props to the people who created this package. 526 00:54:47,240 --> 00:54:50,690 This is findThoughts, which is kind of a fun one. 527 00:54:50,690 --> 00:54:58,370 If you just wanted to say, I want to pick a topic and I want a representative quote from that topic to put in my paper, 528 00:54:58,370 --> 00:54:59,540 this is a way to do it. 529 00:54:59,540 --> 00:55:09,770 You can see here I just take characters ten to two hundred and fifty of the abstract, 530 00:55:09,770 --> 00:55:17,240 and then that's all that I print out. Obviously, I don't know why I started at 10, because it cuts the first word halfway through. 531 00:55:17,240 --> 00:55:21,950 Don't do that. All right. 532 00:55:21,950 --> 00:55:26,950 Does that make sense so far? Is anyone copying and pasting and running it? 533 00:55:26,950 --> 00:55:33,040 OK. Do that later; hopefully it works, I ran it a few times and it was working.
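For inspecting topics and pulling representative quotes, the stm helpers referred to here look roughly like this; out$meta$abstract is an assumed column holding the raw abstract text.

    # Top words per topic, under several weighting schemes
    labelTopics(fit, n = 7)

    # Expected topic proportions across the corpus
    plot(fit, type = "summary")

    # Contrast the vocabularies of two topics, e.g. topics 1 and 20
    plot(fit, type = "perspectives", topics = c(1, 20))

    # Pull the abstracts most associated with topic 14, trimmed for display
    shortdoc <- substr(out$meta$abstract, 1, 250)
    findThoughts(fit, texts = shortdoc, topics = 14, n = 2)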
534 00:55:33,040 --> 00:55:35,550 This is the point where we can turn to the metadata. 535 00:55:35,550 --> 00:55:42,370 So far we've just created the topics and we haven't looked at how those are affected by covariates. 536 00:55:42,370 --> 00:55:46,930 So one covariate we have that is continuous is year: what year was the abstract published, right? 537 00:55:46,930 --> 00:55:51,940 And we don't have a lot of range here; I think it was like 2008 to 2012. But the other is one we created. 538 00:55:51,940 --> 00:55:58,450 That's binary: was it published in AJS or ASR, or was it published in something else? Which, again, is 539 00:55:58,450 --> 00:56:08,860 just kind of a crude binary, but it's what I came up with last night. So in this case, if the covariate is binary, with that AJS/ASR variable, 540 00:56:08,860 --> 00:56:15,160 maybe we just want to see: is a topic more likely to be discussed in one of those two journals or in everything else? 541 00:56:15,160 --> 00:56:19,450 There's not a ton of variation, but that's basically what this is doing. 542 00:56:19,450 --> 00:56:30,760 So over here this topic is more likely to be mentioned, it has a higher probability, in non-AJS/ASR articles, and over here it's more AJS/ASR articles. 543 00:56:30,760 --> 00:56:33,130 Unfortunately, I haven't relabelled the topics yet, 544 00:56:33,130 --> 00:56:41,200 so what you could have done is go in and label topic one according to what you think it is about, maybe marriage and family or whatever. 545 00:56:41,200 --> 00:56:52,690 And then you could create this plot and see if it has face validity, whether your binary is liberal versus conservative or whatever it is. 546 00:56:52,690 --> 00:57:00,070 I was going to try to go with a sports analogy, but I don't know sports. Anyway, any sort of binary, maybe something more extreme, right? 547 00:57:00,070 --> 00:57:06,130 Or there's a certain topic, maybe you have like doctors' offices versus first dates or something; 548 00:57:06,130 --> 00:57:12,870 there are some things that are very much more likely to be mentioned in the doctor's office. 549 00:57:12,870 --> 00:57:21,540 Then if your covariate is continuous, as with year, you can just look at the expected proportion of a particular topic over time. 550 00:57:21,540 --> 00:57:26,940 So this was looking at topic number seven over time. And this is just the base visual. 551 00:57:26,940 --> 00:57:32,040 You can make the visuals much more interesting and layer on all the topics and how they change over time. 552 00:57:32,040 --> 00:57:39,780 But it's really neat that they provide this at all, and it looks like whatever topic 7 is has increased over time. Again, 553 00:57:39,780 --> 00:57:47,410 I don't imagine it was a huge increase, relatively. But let's look at what topic seven was: crime, violence and control. 554 00:57:47,410 --> 00:57:55,570 Evidently, that topic has been increasing over time, in the very short timespan we're looking at. 555 00:57:55,570 --> 00:58:02,050 So that was looking at the proportion of the topic over time throughout documents. The other thing you can do, as we mentioned with these covariates, 556 00:58:02,050 --> 00:58:08,110 is look at how the covariate changes the word proportions within the topic, 557 00:58:08,110 --> 00:58:15,680 so it makes sense to know not just how much the topic is discussed, but what is internal to the topic.
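The covariate effects just described are estimated with estimateEffect; a sketch with the same assumed covariate names, and with the group labels in cov.value1 and cov.value2 standing in for however the binary is actually coded.

    # Regress the topic proportions on the covariates
    prep <- estimateEffect(1:20 ~ ajs_asr + s(year), fit,
                           meta = out$meta, uncertainty = "Global")

    # Binary covariate: difference in topic prevalence between the two groups
    plot(prep, covariate = "ajs_asr", topics = 1:20, model = fit,
         method = "difference", cov.value1 = "AJS_ASR", cov.value2 = "Other",
         xlab = "More in other journals ... More in AJS/ASR")

    # Continuous covariate: expected proportion of topic 7 over time
    plot(prep, covariate = "year", topics = 7, model = fit,
         method = "continuous")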
558 00:58:15,680 --> 00:58:24,250 Relying again on the example of guns in the US: maybe there's only one topic for guns, but does the proportion of words, 559 00:58:24,250 --> 00:58:32,620 the probability of words in that topic, change depending on whether a pro-gun-rights or an anti-gun-rights person is speaking? 560 00:58:32,620 --> 00:58:38,640 So that's what you're doing here. You're estimating the stm with a little function where we're saying 561 00:58:38,640 --> 00:58:43,870 that the content should change, 562 00:58:43,870 --> 00:58:49,600 sorry about that, according to the AJS/ASR variable; it goes here, this is where I did it, right. 563 00:58:49,600 --> 00:58:53,470 So we have the prevalence changing according to both of these variables, 564 00:58:53,470 --> 00:59:01,570 but we want to know how the content changes depending on whether it's in AJS/ASR or in the other journals. 565 00:59:01,570 --> 00:59:10,640 Does that make sense? So here we're estimating both content and prevalence. 566 00:59:10,640 --> 00:59:14,430 OK. We're not going to get to word embeddings. 567 00:59:14,430 --> 00:59:21,030 I'm happy to stay longer and get to word embeddings; I probably bit off more than I could chew, but I thought you'd be interested in all of it. 568 00:59:21,030 --> 00:59:31,200 So this is how the content of one of the topics, sorry, it got cut off, changed 569 00:59:31,200 --> 00:59:34,680 if it's in AJS/ASR or if it's in another journal. 570 00:59:34,680 --> 00:59:40,230 So if it's in AJS/ASR, it looks like you're talking more about inequality and status and these sorts of things. 571 00:59:40,230 --> 00:59:44,640 Whereas if it's outside, there's more gender and women, 572 00:59:44,640 --> 00:59:48,300 which actually to me has a bit of face validity given the sociology literature. 573 00:59:48,300 --> 00:59:53,880 A lot of times there's a lot more gender theory outside of sociology, like in gender studies, 574 00:59:53,880 --> 00:59:58,440 which is amazing but goes much deeper into the gender-theoretical side of things. 575 00:59:58,440 --> 01:00:06,780 Whereas often within sociology, gender is mentioned in a kind of binary way, and it's used as just comparing men and women. 576 01:00:06,780 --> 01:00:11,310 It's not about gender theory as much as it is gender as a covariate for other outcomes. 577 01:00:11,310 --> 01:00:20,150 So this kind of makes sense, and it might be interesting to others, but I only plotted that one for now. 578 01:00:20,150 --> 01:00:30,420 Yeah. So what's important is that this is just for one topic, 579 01:00:30,420 --> 01:00:37,920 comparing when that topic is occurring within AJS/ASR versus all the others. 580 01:00:37,920 --> 01:00:47,280 So AJS/ASR versus all the others, and the size of the font is the probability of the word occurring in the topic, and you just kind of see. 581 01:00:47,280 --> 01:00:53,700 So that doesn't mean that gender as a whole is covered more in AJS and ASR, right? 582 01:00:53,700 --> 01:00:58,470 It means that within this topic, the cluster of, say, gender and family words 583 01:00:58,470 --> 01:01:05,730 has more to do with inequality, for example, in AJS and ASR than it does in the other journals.
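Estimating both prevalence and content, as described here, means adding a content formula to the model; a sketch under the same assumptions as before.

    # Let word use within topics (content) vary by journal group, in addition
    # to topic prevalence varying by journal group and year
    fit_content <- stm(documents = out$documents, vocab = out$vocab, K = 20,
                       prevalence = ~ ajs_asr + s(year),
                       content = ~ ajs_asr,
                       data = out$meta, max.em.its = 75, seed = 1234)

    # Compare how one topic's vocabulary differs across the two groups
    plot(fit_content, type = "perspectives", topics = 7)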
584 01:01:05,730 --> 01:01:11,370 Yeah, I'd have to look at the code, but I believe the size of the font is the probability of the word within the topic, 585 01:01:11,370 --> 01:01:19,890 whereas the location on the spectrum is just how prevalent it is when this group or this corpus is speaking versus when this one is speaking. 586 01:01:19,890 --> 01:01:24,420 So these are for both AJS/ASR and non-AJS/ASR. 587 01:01:24,420 --> 01:01:32,250 All these words are obviously in the topic, but this side is more common outside, and on this side, within AJS/ASR, the topic 588 01:01:32,250 --> 01:01:37,740 is more about inequality and things like that. Um. 589 01:01:37,740 --> 01:01:46,230 We'll skip over this, but it's just creating interactions, so you can do interaction effects, and then stm has some cool visuals. 590 01:01:46,230 --> 01:01:53,310 This is just looking at how the topics are connected in a network according to the probability of their words. 591 01:01:53,310 --> 01:01:58,140 And then they've got these great companion packages, stmCorrViz and stmBrowser, that create HTML visualisations 592 01:01:58,140 --> 01:02:03,120 where you can click things on and off and see the prevalence of topics move around and things. 593 01:02:03,120 --> 01:02:07,160 I didn't do any of that, but definitely play around with that if you do. 594 01:02:07,160 --> 01:02:17,600 And this is just how you would export the data to whatever format; I think I did RData or CSV. 595 01:02:17,600 --> 01:02:23,570 Yeah. Questions? I can hop right into word embeddings with the seven minutes we have left. 596 01:02:23,570 --> 01:02:30,080 Or we could do that sometime next week when we're doing group projects. 597 01:02:30,080 --> 01:02:36,160 What are you guys feeling? Now, raise your hand if you want me to do the word embeddings now. 598 01:02:36,160 --> 01:02:41,290 Raise your hand if you want a tiny break before I do it, or if you just want me to keep going. That works for me. 599 01:02:41,290 --> 01:02:48,200 OK. Any questions on stm, though, before we move on? 600 01:02:48,200 --> 01:02:53,970 So, word embeddings. 601 01:02:53,970 --> 01:03:03,080 Oh, yeah. It's just the covariates. 602 01:03:03,080 --> 01:03:09,890 Yeah. Plain latent Dirichlet allocation doesn't deal with covariates, like how does the topic change depending on who's speaking it? 603 01:03:09,890 --> 01:03:16,880 It doesn't do any of that. STM is built hugely on the foundations of LDA. 604 01:03:16,880 --> 01:03:24,650 In fact, it's running LDA, but it allows the covariates to change how the probabilities are estimated. 605 01:03:24,650 --> 01:03:30,980 And again, yeah, their documentation is some of the best that I've seen, and they've got some really great papers. 606 01:03:30,980 --> 01:03:35,600 One of the ones that I reference, where they use it to look at interviews, 607 01:03:35,600 --> 01:03:43,170 or I think it's like short-answer surveys, depending on the demographics of the person who answered the survey. 608 01:03:43,170 --> 01:03:50,470 OK. OK, word embeddings. I apologise, I do these in Python. 609 01:03:50,470 --> 01:03:57,040 I don't apologise because I know that a lot of you do Python, but just because it's kind of annoying to switch syntax; 610 01:03:57,040 --> 01:04:01,480 it's just because I started in Python and I learnt word embeddings at an abnormally early time. 611 01:04:01,480 --> 01:04:05,110 Yeah. Oh, it's not a question. It's a stretch.
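For the topic network and export step mentioned here, a minimal sketch (file names are arbitrary; theta is the matrix of per-document topic proportions stored in the fitted model):

    # Correlations between topics, plotted as a network
    corr <- topicCorr(fit)
    plot(corr)

    # Export: save the fitted model, or write the per-document topic
    # proportions alongside the metadata as a CSV
    saveRDS(fit, "stm_fit.rds")
    write.csv(cbind(out$meta, fit$theta), "topic_proportions.csv",
              row.names = FALSE)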
612 01:04:05,110 --> 01:04:15,670 OK, so if you have any questions as an R user about the syntax of what I do here, 613 01:04:15,670 --> 01:04:20,380 I'm happy to just give you a translation; like, import is basically like library. 614 01:04:20,380 --> 01:04:26,590 You're just importing your packages here. A lot of it is very similar and hopefully will make sense to you. 615 01:04:26,590 --> 01:04:33,880 But I also provide, on the last slide, I think it's the last slide, 616 01:04:33,880 --> 01:04:42,310 a bunch of citations, and I'm pretty sure the last citation is the package my friend uses to do word embeddings in R. 617 01:04:42,310 --> 01:04:47,650 So you could read that documentation and just do word embeddings in R; I just use Python. 618 01:04:47,650 --> 01:04:53,830 Does that make sense? That's where to find that resource; just ask me if you need it later. 619 01:04:53,830 --> 01:05:03,790 We start by just loading the packages. And then this is actually where I created the sociology abstracts CSV that you used in the other tutorials. 620 01:05:03,790 --> 01:05:09,070 It's originally a Web of Science document that comes as a text file, and it's not like a tab-delimited thing. 621 01:05:09,070 --> 01:05:16,390 So I'm reading that in here line by line and selecting only the variables of interest, and then saving it as a CSV for those other tutorials. 622 01:05:16,390 --> 01:05:24,430 That's all that this is, but it might be useful if you end up wanting to analyse Web of Science abstracts. 623 01:05:24,430 --> 01:05:30,310 One option before you do word embeddings is to lowercase or stem. 624 01:05:30,310 --> 01:05:35,110 So again, you might want to turn running into run. 625 01:05:35,110 --> 01:05:44,680 I didn't do any of that, but it's definitely possible to do in your analysis with Python or R, as we've seen. 626 01:05:44,680 --> 01:05:47,380 Then I define two functions. 627 01:05:47,380 --> 01:05:57,430 The first converts each document that you read in into a list of words and lowercases them. 628 01:05:57,430 --> 01:06:08,440 And the second splits the text into sentences and returns a list of sentences, where each sentence is a list of words. 629 01:06:08,440 --> 01:06:18,200 Does that make sense? Yeah, so these are just functions, like the curly-bracket functions in R, but in Python syntax. 630 01:06:18,200 --> 01:06:29,010 Then you make the list of texts to loop through, and I created an empty list to put each sentence into. 631 01:06:29,010 --> 01:06:36,000 And I apply those functions. So ultimately we have, 632 01:06:36,000 --> 01:06:44,160 in that corpus of abstracts, one hundred and eight thousand four hundred and twenty-two sentences. 633 01:06:44,160 --> 01:06:48,210 And an example of one of those sentences is 'during globalisation modern period'; 634 01:06:48,210 --> 01:06:51,780 we removed some of the stop words and stuff, that's why it doesn't quite read as a sentence. Not all of them, though, 635 01:06:51,780 --> 01:06:59,710 evidently. Maybe I'll do that later on. Let's look at another one. 636 01:06:59,710 --> 01:07:08,650 'In contrast many national church like the Orthodox': some of the stuff is already removed, but this is what your input is. 637 01:07:08,650 --> 01:07:13,660 It's a list of sentences where each sentence is a list of words.
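The preprocessing script here is in Python; for R users, a very rough base-R equivalent of lowercasing, splitting into sentences and splitting sentences into words could look like this (a sketch only; df$abstract is an assumed column name).

    # Lowercase the abstracts
    abstracts <- tolower(df$abstract)

    # Split each abstract into sentences on ., ! or ?
    sentences <- unlist(strsplit(abstracts, "(?<=[.!?])\\s+", perl = TRUE))

    # Split each sentence into a vector of words, dropping punctuation
    sentence_words <- strsplit(gsub("[[:punct:]]+", " ", sentences), "\\s+")

    length(sentences)     # how many sentences the corpus yields
    sentence_words[[1]]   # the first sentence as a vector of words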
638 01:07:13,660 --> 01:07:26,010 OK, and then we run the word2vec model; this is just importing the package for that from gensim. 639 01:07:26,010 --> 01:07:32,530 I was trying to get rid of these progress bar things, but it wasn't a priority. 640 01:07:32,530 --> 01:07:37,860 So in the implementation of your word embedding model, as always, 641 01:07:37,860 --> 01:07:49,710 you have parameters to set: the number of features or dimensions, like we were talking about before, that you want to create; the minimum word count, 642 01:07:49,710 --> 01:07:55,830 which is, again, similar to the stm thresholds, how often a word has to be seen. 643 01:07:55,830 --> 01:07:58,560 The number of workers is about the processing of it, 644 01:07:58,560 --> 01:08:05,730 doing parallelisation so that it goes quicker, which depends on the computational power that you have in your computer. Context: 645 01:08:05,730 --> 01:08:10,920 OK, so remember, context was, if you have a target word, the number of words around it that you want to consider. 646 01:08:10,920 --> 01:08:17,910 A context of somewhere around eight is again what they recommended for the English language, what some studies have suggested. 647 01:08:17,910 --> 01:08:25,290 I chose six. And then downsampling is a process that you can use, in machine learning and elsewhere, to just 648 01:08:25,290 --> 01:08:33,940 make sure that very frequent words, in this case, aren't 649 01:08:33,940 --> 01:08:40,510 taking too much of the effect, like causing too much of the effect. 650 01:08:40,510 --> 01:08:44,980 And then you run it. This is how you run it: you give it your list of sentences, 651 01:08:44,980 --> 01:08:51,460 which are each a list of words, and you're just defining the parameters based on these things above. 652 01:08:51,460 --> 01:08:57,000 And then it runs, and it shows the progress as it goes along. 653 01:08:57,000 --> 01:09:03,470 And if you don't plan to add more data and keep training the model, then you just want to save it. 654 01:09:03,470 --> 01:09:09,060 I saved it. Now we can look. So we've created the model; it did what we talked about, 655 01:09:09,060 --> 01:09:14,150 I'm happy to go over it again, but it did the process we were talking about, 656 01:09:14,150 --> 01:09:18,740 the sampling, and created our vector space, and now we want to look at it. 657 01:09:18,740 --> 01:09:24,620 This is the vocabulary; here I'm only looking at the first twenty-five words. 658 01:09:24,620 --> 01:09:30,350 The vocabulary is every word that is in this vector space. 659 01:09:30,350 --> 01:09:37,150 You can check if a word is in your vocabulary. So I wanted to know, is analysis in my vocabulary? 660 01:09:37,150 --> 01:09:43,570 Is Oxford in my vocabulary? No, but maybe university is. 661 01:09:43,570 --> 01:09:48,340 Yes. So you can kind of explore that. Then you can see one of your vectors. 662 01:09:48,340 --> 01:09:53,670 I want to see the vector for philosophy. OK, oh, that's not it, 663 01:09:53,670 --> 01:10:03,560 I'm over here. Sorry, I've just got battery life issues. 664 01:10:03,560 --> 01:10:07,490 Like I said, each word is a vector, and a vector is just a bunch of numbers, right? 665 01:10:07,490 --> 01:10:15,470 Defining where it is in the vector space. So you can actually look at those vectors, extract those vectors if you wanted to.
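The training itself happens in Python with gensim here; R users could try the CRAN word2vec package the speaker alludes to on the citation slide. The call below is an assumption about that package's interface (check its documentation), mirroring the parameters being discussed: 100 dimensions, a window of 6, a minimum word count, downsampling and parallel workers.

    library(word2vec)

    # Train a skip-gram word2vec model on the corpus
    # (`sentences` is assumed to be a character vector, one sentence each)
    model <- word2vec(x = sentences, type = "skip-gram",
                      dim = 100, window = 6, min_count = 10,
                      sample = 0.001, threads = 4)

    # The embedding matrix: one row per vocabulary word
    emb <- as.matrix(model)
    "analysis" %in% rownames(emb)   # is a word in the vocabulary?
    emb["philosophy", ]             # the vector for a single word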
666 01:10:15,470 --> 01:10:25,150 And that's what that is: looking at the philosophy vector. That was only trained on unigrams. 667 01:10:25,150 --> 01:10:37,790 You might also want to do bigrams. Yes, a question about the vectors? 668 01:10:37,790 --> 01:10:46,440 How do you mean, like the individual dimensions? It's just, there'll be as many dimensions as you defined; that's how many numbers there will be in each vector. 669 01:10:46,440 --> 01:10:50,440 Do you mean whether it's just defining its relation to the other words in that space? 670 01:10:50,440 --> 01:10:57,140 I don't know. 671 01:10:57,140 --> 01:11:03,020 I mean, if you take two vectors, in terms of the relations, 672 01:11:03,020 --> 01:11:07,070 I don't think so. I would have to look at that. 673 01:11:07,070 --> 01:11:15,560 I know that, in terms of the space, the cosine similarity of two vectors is the metric of distance that is generally used, 674 01:11:15,560 --> 01:11:20,310 but the vector dimensions themselves, 675 01:11:20,310 --> 01:11:25,680 like whether you could take the first dimension and it would be like a topic or something, 676 01:11:25,680 --> 01:11:32,700 I don't know. I don't think so, but it's an interesting question. 677 01:11:32,700 --> 01:11:40,530 So we could train it on bigrams. I think I just ran past what we were doing, but we'll take a look. 678 01:11:40,530 --> 01:11:46,440 We could do it on bigrams. OK, so we have financial crisis and religious beliefs. 679 01:11:46,440 --> 01:11:50,820 This gets to the question earlier: we don't want all bigrams, 680 01:11:50,820 --> 01:11:59,340 we want bigrams that are common, and therefore it doesn't include arbitrary word pairs that aren't an actual phrase. 681 01:11:59,340 --> 01:12:04,580 Why does it keep doing that? But gender mainstreaming, or religious beliefs: 682 01:12:04,580 --> 01:12:06,690 these are pairs of words that tend to co-occur. 683 01:12:06,690 --> 01:12:11,970 There's something in the model that decides that, and you could get under the hood and change it if you wanted to. 684 01:12:11,970 --> 01:12:19,500 But it's nice that it's there. Then this is what I was talking about with the kind of algebra that you can do on the vector space. 685 01:12:19,500 --> 01:12:26,370 So in this, we're taking the model, the vectors, and then we say, I want to know which vectors are most similar to the vector for race. 686 01:12:26,370 --> 01:12:30,090 Give me the top 15. That's what that's doing. 687 01:12:30,090 --> 01:12:36,840 And in this context, we have ethnicity, class, gender, race and ethnicity, sexuality, gender ideology, et cetera, et cetera. 688 01:12:36,840 --> 01:12:41,540 And you're getting the cosine similarity of the two vectors there. 689 01:12:41,540 --> 01:12:49,400 Then you can say, give me what's similar to race, but when we remove stratification. This is what I was talking about earlier. 690 01:12:49,400 --> 01:12:55,040 It doesn't actually produce what I expected; I thought maybe it would produce stuff on politics. Maybe in a political science abstract setting, 691 01:12:55,040 --> 01:12:59,480 that's what it would give you, and you'd get like election and governor and things like that.
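The nearest-neighbour and vector-arithmetic queries shown here with gensim have rough R counterparts; again a sketch, assuming the word2vec package's predict() behaves as below, with the arithmetic done by hand on the embedding matrix.

    # Fifteen nearest neighbours of "race" by cosine similarity
    predict(model, newdata = "race", type = "nearest", top_n = 15)

    # "race" minus "stratification": subtract the raw vectors, then rank all
    # words by cosine similarity to the resulting vector
    v <- emb["race", ] - emb["stratification", ]
    cos <- as.vector(emb %*% v) / (sqrt(rowSums(emb^2)) * sqrt(sum(v^2)))
    names(cos) <- rownames(emb)
    head(sort(cos, decreasing = TRUE), 15)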
692 01:12:59,480 --> 01:13:04,910 But that's basically what that's doing: it's saying take the vector for race, 693 01:13:04,910 --> 01:13:12,010 but remove the vector for stratification, and then tell me what the meaning of that vector is now. 694 01:13:12,010 --> 01:13:17,300 And you could compare that to what it is with the vector for stratification included. 695 01:13:17,300 --> 01:13:20,010 And you can just see, you could take two vectors and combine them. 696 01:13:20,010 --> 01:13:24,950 I thought this was interesting this morning when I was testing the code: I was like, what about ethnography with results? 697 01:13:24,950 --> 01:13:30,590 Because usually in an ethnography abstract you don't talk about results, whereas a regression paper would. 698 01:13:30,590 --> 01:13:37,880 And so you see that you've got a negative correlation between ethnography and results, but a positive one between results and regression. 699 01:13:37,880 --> 01:13:43,190 The last thing that I did in this code is not something you usually do with word embedding models, 700 01:13:43,190 --> 01:13:47,030 but I was trying to think, how could you get a word embedding model 701 01:13:47,030 --> 01:13:57,230 to do something similar to a topic model? So what I did is I took all of those vectors and did a cluster analysis on the vector space to see 702 01:13:57,230 --> 01:14:00,590 what you would get, and this is what I thought of doing for the group 703 01:14:00,590 --> 01:14:04,070 exercise tomorrow: to take the same corpus and have some of you do a topic 704 01:14:04,070 --> 01:14:08,210 model and others do word embeddings, and then kind of discuss what you could say, 705 01:14:08,210 --> 01:14:13,820 what you couldn't, what you found, and how that changed with the exact same corpus. 706 01:14:13,820 --> 01:14:18,890 And so I thought I'd give a little bit of code in case I decide to do that exercise tomorrow. 707 01:14:18,890 --> 01:14:24,020 I selected a K of 50. Maybe I should have done 20 to be similar to the stm, 708 01:14:24,020 --> 01:14:30,590 but this is doing a cluster analysis on the vector space, and these are the clusters that you get, 709 01:14:30,590 --> 01:14:33,500 some of which are just two words that are not that useful, 710 01:14:33,500 --> 01:14:43,010 but others actually make a decent bit of sense. One that I found kind of fun, and it would have been even more fun in the context of the Princeton folks, 711 01:14:43,010 --> 01:14:47,940 because they do stuff with Fragile Families, is this: 712 01:14:47,940 --> 01:14:53,730 there's one cluster that's basically all Fragile Families. But if you look at these, some of them make sense, some of them don't. 713 01:14:53,730 --> 01:15:01,170 Maybe you need a larger K, maybe you need a smaller K. 714 01:15:01,170 --> 01:15:06,840 Yeah, this is kind of a cool thing that you could do with the vector space afterwards if you wanted to. 715 01:15:06,840 --> 01:15:15,460 And I wanted to provide some code for you to do it; there's a rough sketch below as well. Yeah. 716 01:15:15,460 --> 01:15:25,940 Does that make sense? I didn't go into the nitty-gritty of it; I'm happy to answer more questions. 717 01:15:25,940 --> 01:15:31,540 But I think it's been a long day. OK. 718 01:15:31,540 --> 01:15:36,405 OK, thank you. Yeah.
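The topic-model-like clustering described at the end is just k-means on the embedding matrix; a minimal sketch, continuing from the objects above.

    # Cluster the word vectors into 50 groups, loosely analogous to topics
    set.seed(1234)
    km <- kmeans(emb, centers = 50, iter.max = 25, nstart = 5)

    # Look at which words fall into each cluster
    clusters <- split(rownames(emb), km$cluster)
    clusters[1:5]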