All right. So we're happy to have as today's colloquium speaker Paul Ginsparg. Many of you may have heard of him; his fame and fortune is as the founder of the arXiv. Paul got his undergraduate degree at Harvard. Although this may have changed with Zuckerberg, at least it used to be that his class had the highest average income of any graduating class anywhere. Anywhere on the planet. In the solar system. In the solar system? Yeah. Although Zuckerberg... No, no, no, it's not even close: it's Gates and Obama. Oh, that's right. Okay, so he then went and got his Ph.D. at Cornell with Ken Wilson. And the moral to the undergraduates: he wrote a paper there with Ken Wilson, and I just looked up the citations. It had, in its first five years of existence, only ten citations; for the next ten years after that, from 1987 to '97, it had exactly zero citations; and now today it has 917, now known for Ginsparg-Wilson lattice fermions, so named. So the moral is: write a good paper, and sooner or later people discover it. Paul was on the faculty at Harvard before he went to Los Alamos and started the arXiv, which is related to what he's talking about today. And about 15 years ago now he moved himself and the arXiv from Los Alamos to Cornell, where he still is, and he's still on the advisory board of the arXiv, though away from the day-to-day running. Paul.

Thank you. I'm sorry: I was told by the audio-video person that I would not be able to hear myself, but that you could all hear me. So I don't know quite what the physics of that is, but I assume he was correct. I don't have a fixed agenda here, and I'll tell you right from the outset, I've heard it all before, so feel free to ask questions during the talk; I know that it's sometimes intimidating in a lecture hall this large. I will also apologise in advance that, due to some oddities in my travel today, including a wake-up call at three in the morning, which wasn't supposed to happen until an hour later, which meant the one I had been anxious about for the past month wasn't going to happen, so of course I never got back to sleep, I did not have a chance to... this is a very creative way of apologising for not having organised my slides in advance. So this is just to remind you, and I was told not to go outside the boundaries of the blackboard, of what the website looks like. This is actually a screengrab probably from over ten years ago, but it has the great virtue that I never have to update the screengrab, to a first approximation.
And if you think about this, I usually describe it in terms of the glass half full and the glass half empty. The glass half full is: oh my goodness, somebody had the foresight to engineer a site that didn't need to change over two decades. The glass half empty, of course, is: my God, does this site ever need to be rewritten. But, you know, there are features of it; I'm exaggerating, and we've threatened a number of things. What I want to concentrate on here, as I had in the opening slide, are some of the physics aspects of it. I don't want to spend a lot of time on the sociology; I'll throw in a little of that, but I want to think about it more in terms of data analysis, and a lot of it will be the statistical physics, as it were, of text mining. But just to continue the introductory part briefly, so you get a feeling for the volume that we're handling: this thing was, as Paul mentioned, conceived shortly after I arrived at Los Alamos in 1991, and it was designed for a group of my close personal friends who at the time were working on matrix models of two-dimensional surfaces. It was designed, back in August of 1991, as a way for about 300 of these people to share articles among themselves. I remember the design was originally for 100 articles a year, and you can see we're now in the 100,000-a-year range. I just did this screen grab at the airport this morning, the 30th of October: we're up close to 1.1 million right now, having gone through, and I'll say a little more about that in a moment, the 1 million mark last December. So this is just some of the statistics; I don't want to dwell on this, I want to get to more of the technical aspects in a moment. But just foreshadowing slightly where I'm going with the statistical regularities: last December, or rather last November, we knew we were going to go through the 1 million submission point. One other comment I forgot to make about this graph: this is, of course, the submissions per month. It was, I think, Paul Fenway who may have actually made the observation that there was this strange linear trend for about a decade and a half. And since this is the number of submissions per month, it was difficult to model that; it's some kind of confluence of different communities, each going through exponential growth, one community coming in after another, adding up to this roughly linear pattern. And then you can see something starts happening about five years ago: this growth starts accelerating.
We have a number of, and I'll also get to this later in the talk, societal factors: more people interested in open access and all the rest, which start causing a snowball effect. So what I was going to say here is that Cornell is the sponsor; it is housed in the Cornell University Library, and they do all of the systems administration and all the rest, and I get to just dabble around. Cornell, for the most part, likes to get a little publicity, and so when we saw in early November that we were going to be going through the 1 million mark, I did this very simple, the simplest imaginable, way of extrapolating, which was to take the previous year, 2013, and shift it by one day, because there's a strong weekday regularity: Mondays are the biggest, you can see, because they include the weekend. So, because 365 is 1 mod 7, you shift it by one day, you add 10%, and you get an almost exact day-by-day representation of what's going to happen from one year to the next, including the holidays. And so when I did this in early November, I predicted that we were going to go through 1 million submissions some time on the evening of December 25th. And sure enough, you can see: that night I went out with my family to a movie, the latest Night at the Museum movie, it was okay, not great, and when I came back a couple of hours later we had gone through the 1 million mark, and so it all worked. It was actually useful to know; it was a little awkward having projected two months in advance that it was going to come in this Christmas-New Year period. But nonetheless we did get some very nice press coverage, including from the British Nature and The Scientist, and you can see Italian and Dutch periodicals covering it. So it's a fun little thing, but it's also getting some global visibility.
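A minimal sketch of the year-shift extrapolation just described: reuse the previous year's daily submission counts, aligned one day ahead so the weekday pattern matches (365 is 1 mod 7), scale by an assumed 10% growth, and accumulate until the milestone is crossed. The file name, CSV layout, and starting totals are hypothetical.

```python
# Sketch of the year-shift extrapolation described above (illustrative only).
# Assumes a hypothetical CSV with one row per day of the previous year, already
# aligned one calendar day ahead so that weekdays line up: date,submission_count.
import csv
from datetime import date, timedelta

def predict_crossing(prev_year_csv, start_total, milestone, start_date, growth=1.10):
    """Project daily submissions by reusing last year's counts scaled by `growth`."""
    with open(prev_year_csv) as f:
        counts = [int(row["submission_count"]) for row in csv.DictReader(f)]
    total = start_total
    day = start_date
    for c in counts:
        total += growth * c          # add the assumed 10% year-over-year growth
        if total >= milestone:
            return day, total        # first day the cumulative total crosses the milestone
        day += timedelta(days=1)
    return None, total

# Usage (hypothetical numbers): starting from ~985,000 submissions on 1 Nov 2014,
# when does the cumulative total cross 1,000,000?
# print(predict_crossing("daily_2013.csv", 985_000, 1_000_000, date(2014, 11, 1)))
```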
Another quick side slide. I wonder if I have a laser pointer; that was one thing I didn't ask about, sorry. Okay. The slide on the left is just the original slide, now broken down by subject area, just to give you an indication of how the different subject areas have been growing. As is frequently the case with these projectors, the colours look very unfamiliar to me, but you can see, early on in the early nineties, and this is a point that I usually make here, that the red, which is astrophysics, is growing very slowly compared to high energy physics. And Michael Turner, one of my colleagues, memorably made the comment that this system is all well and good for high energy physicists, who are only interested in developments of the last ten nanoseconds or so, but it will never catch on in a serious field like astrophysics. And you can see what happens. There's supposed to be one right here... great. You can see that by ten years later, astrophysics and condensed matter had both grown to be larger than high energy physics. This is exactly the same data; all I've done is scale the vertical so you can see the percentages. And one could have said something similar about computer science, that's the gold one over here: it seemed to be going nowhere for over a decade, and then there were some external developments. The one point I would make about data like this is that there's something of a reaction effect. As we've seen, in high energy physics there was a long history, a long pre-existing tradition, of trading preprints via these very organised preprint mailing lists; we had our personal lists that we sent out. And so it was fairly natural just to port this to the electronic realm, and we saturated those disciplines very early: we got 100% participation from high energy physics almost immediately, within a few years. And now all of the growth is from communities that didn't have these pre-existing preprint cultures. Here is just another overview slide. It's hard for any of us, frankly, to remember what it was like before the Web existed. This thing originally was just an email transponder, with FTP added on a year later, and the web really didn't start becoming popular until later. Tim Berners-Lee, of course, had configured it; he was doing it on NeXTSTEP, and I was doing it also on NeXTSTEP, so we were in touch in the early nineties. NeXTSTEP? Well, everybody knows what NeXTSTEP is, because it's covered heavily in the latest Steve Jobs movie. But if you don't know, it's this. We know that because Mac OS X still has the same bugs in the Mail app that I reported to them in 1991; it's the same source code. And so it really started taking off with Mosaic and some of the developments in the early nineties. But this, as a service, was not the first website in the United States, SPIRES was the first, but it was one of the first, and the first where we sort of pioneered things that are now taken for granted: you go to the abstract page, and that's the hub for all of these things that you go to.
We were serving compressed PostScript, and very quickly later on PDF, and those have, for better or for worse, become the standards. I say it foreshadowed Web 2.0 semi-facetiously: this is the kind of site where, by definition of Web 2.0, you put up some skeletal structure and you rely on your users to provide all of the content. Surprises to me along the way: well, all of Google, which I argued couldn't be done; Wikipedia made no sense to me; Twitter I still don't understand; Facebook, I signed on for a few weeks in 2007 and never returned. So I'm not really involved in the modern era of small mobile devices. Another surprise to me is that we're still using TeX. If you think about it, of all of the technologies that were around in the 1980s, what are we still using? And the problem with TeX is that Knuth was so brilliant, we can consider TeX some sort of encyclopaedic demolition of the problem of print on paper. But it was never intended as a network transmission format, and it's really holding us back in a variety of ways. You have this TeX-produced PDF, and, when I get into talking about these documents as computable objects, you'd like to ask a document: where are your authors, what are your related resources, and all of that. And with PDF you have to sort of stand it on its head and use conditional random fields and very sophisticated machine learning to get answers to these very simple questions, by inferring things from the page markup. So that's a problem. It's surprising we're still using it; I don't think we'll be using it a century from now, but who knows? Another surprise to me, and this is an entirely different talk that I won't be giving here today, is the fact that scholarly publishing as a whole still remains in transition. For whatever reason, reporters frequently ask me what I think about this or that development, and I always preface it by saying: well, whatever I think is not very compelling, because in 1995 I said that the current state, of having free, open access to pre-publication materials as given by the arXiv, for example, alongside this parallel feed through the journals which libraries pay for, is clearly some kind of metastable state, and it can't possibly persist for another five years. And 20 years on, we're still in that metastable state. I believe, of course, that I was correct in the description, but I'm willing to concede I got the time constant wrong.
So we're still waiting to find out what will be the optimal configuration for our publishing. There are a lot of things that we do right now which are still carryovers, vestigial carryovers, from the way that we did things previously. One last comment I'll make here, having grown up in the 1960s watching television: it surprises me that the financial model for the network has become, and I never would have predicted it, the same thing as in the 1960s. Advertisers paid for your eyeballs, which they got because you would watch television and they would slip these commercials in. And that, of course, is how most of these fantastic resources like Google support themselves, by pushing advertisements in front of our eyes. I never would have expected that to be the financial model for crucial network infrastructure. And is it stable? I don't know; we'll see how long that works. So I'm not going to say much about this slide, but I already intimated this: what I want to do here is to shift, after the first 20 minutes of talking, to some details of what we can do with the text. We've got all of these objects, and what can we infer about them? What are the features that we can use to make an even more useful resource, better engage the public, or facilitate research? That's going to be the goal of the remainder of what I talk about. And in being involved in administering a site like this, we get to questions like: what do we consider science? What do we consider as legitimately ingestible into the site? There's always this trade-off: we don't want to annoy the professional researchers, but we don't want to be involved in creating the same infrastructure that already exists, which, as we know from the commercial journals, is very expensive to administer. And then, and this is just an outline for the remainder, there's the question of what sort of text mining we can do: we have this gigantic text stream, and we can infer information genealogies from it. So the main tool that I'm going to use in the first few stages of this is something that I teach at Cornell. I teach courses both in information science and in the physics department: in the spring I'll be teaching the intro to data science course, and in the fall I'll be teaching the quantum computing course. The intro to data science course is essentially all about, for the undergraduates, big data and all of that.
And it's all about this unbelievably simple methodology, this one formula, Bayes' rule: the probability of some property given some features is the probability of the features given the property, times the probability of the property, divided by the probability of the features. So this is just the posterior probability in terms of the conditional probability, and it is used ubiquitously on the network. I can't pretend to recreate the talk that I heard from Peter Norvig, who was then head of technology at Google, where he talks about how you can do spell correction just with this; in that case the objects here just play different roles. This would be the probability of the correct spelling given what you typed, and this is your language model: you just iterate over a large amount of text to calculate 1- to 3-gram probability distributions, and these are just normalisation factors. They use the same thing in voice recognition, where instead this is the probability of what you meant to say given the phonemes that you actually emitted, and then instead of a language model this is your auditory model, the probability that you emitted something given that you meant to say something else; and this plays a significant role in all of the voice recognition schemes, Siri and the rest, that are out and around right now. And the take-home from this slide, which I also emphasise to the students, is that our lesson over the past almost two decades now, certainly the lesson of Google, is that as you get to more and more data, the simple algorithms not only work well, they actually work better, due to problems of overfitting and all the rest with the more complicated heuristics. And so a simple algorithm like this can ultimately easily outperform them, as long as you have enough data. The intuition for this is that when you've got a small amount of data, you have to work very hard to tease the signal out of the noise, whereas you benefit from the one over square root of N as N gets very large, and so with large amounts of data it's very, very simple to see what you want.
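As a concrete illustration of the Bayes'-rule recipe just described, here is a minimal naive Bayes text classifier sketch. The two tiny training "categories" and their documents are invented for illustration; a real run would use full arXiv abstracts.

```python
# Minimal naive Bayes sketch of the Bayes'-rule classification idea described above.
# The toy categories and documents are invented for illustration only.
import math
from collections import Counter

train = {
    "astro-ph": ["galaxy redshift survey", "stellar luminosity function galaxy"],
    "hep-ex":   ["detector trigger luminosity collider", "collider cross section detector"],
}

# Word counts per category, shared vocabulary, and P(category) priors.
word_counts = {c: Counter(w for doc in docs for w in doc.split()) for c, docs in train.items()}
vocab = {w for counts in word_counts.values() for w in counts}
priors = {c: len(docs) / sum(len(d) for d in train.values()) for c, docs in train.items()}

def classify(text):
    scores = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # log P(category) + sum over words of log P(word | category), with add-one smoothing
        score = math.log(priors[c])
        for w in text.split():
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("detector luminosity measurement"))   # expected: hep-ex
```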
So as an example of this, what I'm showing you here is, roughly speaking, the language model of the arXiv. We have along the vertical and the horizontal the roughly 142 subject areas, and they are ordered here alphabetically. I can't read them myself; on the next slide there's going to be a blow-up of one area. But what I'm plotting here is what in physics we would call the cross entropy; in computer science it's called the Kullback-Leibler divergence. If you have a probability distribution p_i, where i indexes the objects, the p_i sum to one and are all between zero and one, and another distribution q_i, then the quantity is the sum over i of p_i log(p_i / q_i). It gives you something that's akin to a distance between two probability distributions; it's not a real distance, for people who use these, because it doesn't satisfy a triangle inequality. And what it's really telling you, if you consider this blue, for example, which is coded to mean low signal, is essentially the likelihood of drawing the documents of the vertical column from something on the horizontal. Well, actually, I'm sorry, I got that wrong; it's asymmetric, clearly, and this is the probability of drawing the horizontal from things on the vertical. And I'm reminded of that because I can see that this is hep-ex, and hep-ex, high energy physics experiment, has a very specific vocabulary that isn't used anywhere else, so it's very difficult to draw high energy physics experiment documents from anywhere else; they're very easily distinguished because of the vocabulary involving the experimental installations and all that. Whereas these verticals are just categories that are very small; it's difficult to draw anything from them because there's just too little data. And what you can see here are the incredible regularities: alphabetically, these are the six astrophysics categories, and they have much more in common with themselves than they have with anything else, that's the hot colours versus the cold colours. Similarly, this is condensed matter, and so you can see these systematic regularities in the use of the technical vocabulary. And just to expand on this a little, as I mentioned, it allows us to see... oh, where is it? Over here, the high energy physics categories, the four of them; it just came into focus for a moment there. Great. You see hep-ex right there. And if you're paying careful attention, you can see here that hep-th, high energy physics theory, has much more in common with math than it does with high energy physics experiment. So there are other things here that are in accord with our experience.
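A minimal sketch of the cross-entropy / Kullback-Leibler computation just described, comparing the word distributions of two categories. The word counts below are invented; a real run would use the full text of each arXiv subject area.

```python
# Sketch of the Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i)
# between the word distributions of two categories. Toy counts, for illustration only.
import math
from collections import Counter

def kl_divergence(counts_p, counts_q, vocab):
    """KL divergence between two smoothed unigram distributions over a shared vocabulary."""
    tot_p = sum(counts_p.values()) + len(vocab)
    tot_q = sum(counts_q.values()) + len(vocab)
    d = 0.0
    for w in vocab:
        p = (counts_p[w] + 1) / tot_p      # add-one smoothing so q is never zero
        q = (counts_q[w] + 1) / tot_q
        d += p * math.log(p / q)
    return d

astro = Counter({"galaxy": 40, "redshift": 25, "the": 300})
hepex = Counter({"detector": 50, "luminosity": 30, "the": 280})
vocab = set(astro) | set(hepex)

# Note the asymmetry: D(astro || hepex) != D(hepex || astro).
print(kl_divergence(astro, hepex, vocab), kl_divergence(hepex, astro, vocab))
```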
Now, why am I mentioning this? Well, I originally wrote the naive Bayes software as a classifier just to check something very specific. Let me pose this as a question to the audience. We would have these very outlandish mischaracterisations: somebody would submit something that was clearly general relativity to high energy physics experiment. So why would somebody do that? I'll give you a hint: there's a reason why I looked up there to pick this example. Yes, thank you. So you have this dropdown menu, where you put your mouse down and you pull down this list of things, and in that last moment before you lift it up there is the smallest statistical fluctuation in where you were. And this happens to me all the time, I'm not knocking these people: where you were quite certain you had it on gr-qc, you see it come up as hep-ex. And more problematic than that, though, is that the submitter never notices it. So we would see that kind of thing all the time, and I was just intellectually curious: could I spot that automatically and just automatically correct those things? The answer was yes. Now, the surprise here, really getting into the realm that I was trying to foreshadow at the outset, is that I would very frequently see things that were spat out as unclassifiable, a very clear set of objects, and every time you looked at them you would say: oh, this is just nonsense. And so here I had just been trying to solve this rather mundane problem of the menu flicker, and instead I had unintentionally created this holy grail, a crackpot filter. And it really works. I mean, it's gotten much more sophisticated, but when I looked at this slide I was reminded that this was every bit as surprising to me as it is to you at this point. And we still use this for screening; it's using much more now, it's become much more sophisticated. Let me just say a word about why in the world something like this could possibly work, because that was very surprising to me. And the reason is: what do we do? We, as professional researchers, have gone through heavy-duty undergraduate training or graduate training, we apprentice as postdocs and all of the rest, over the course of more than a decade. During that time we are learning to write articles, to use language in a very specific way, and the way we use the language is reflected, in ways we're not even aware of, in the statistical use of prepositions and the statistical use of the technical content words. And the simple inference from this, from the fact that it works so well with such a small false positive rate, is that it's very difficult to emulate this if you haven't gone through that kind of training.
If you're just the person on the street and you're trying to write an article, and you say lots of stuff about black holes and how Einstein is wrong, it just doesn't get much traction with this kind of software. And what we have here, I just went backwards: this was for an article I wrote, actually with Harry Collins, a sociologist of science who is very interested in fringe physics. You know, if I look at the ones that actually appeared in gen-ph... how did we invent that? Okay, it's rare that I get to give a talk about this. So gen-ph is the category that we use when, well, I describe it as trying not to mention Brian Josephson, so I won't. There are people who had a physics training but are, you know, sort of not quite entirely there anymore, as opposed to things that we just remove entirely, things that are obviously coming from people outside the physics community. And gen-ph was modelled after the American Physical Society, which had this contentious issue of whether all members are allowed to present at the annual conferences. In order not to get involved in a lot of administrative infrastructure, they just made it available to everybody, but they quarantined certain things in a section that you would go to only if you wanted a certain form of entertainment. And we do the same thing, courtesy of Paul Fennelly, who worked on this at Los Alamos for two years. And then there are the removals and the gen-ph articles, the ones that appeared there: the classifier correctly flagged 90% of them, 90% of the removals were flagged by the software, all confirmed by the human moderators who go over these things. And you might worry: you've got this software, and if it's running amok and we've got 100,000 submissions, then even a 1% false positive rate means looking at a thousand things that would otherwise have been dropped. But actually the false positive rate is extremely small, less than 0.2%, when you look, over an entire year's worth of data, at the things that made it into the categories. And when I was doing this I looked at those, because I wanted to see if we could improve it, if there were any features we should add in, and the answer was that actually, by my reckoning, most of those were not really false positives, meaning they were all, you know, really edgy. Okay. And where do they go?
This will be interesting. For the removals it's gr-qc, poor gr-qc: the gr-qc moderator, a friend of mine, Matt Visser in New Zealand, has been doing this for years, and I cannot express the appreciation the community owes him for having weeded these things out, and that I owe him for having given me this unbelievably fantastic training data for a machine-learning classifier. One final thing that I threw into this slide, specifically because it's so much fun to mention to a UK audience, and I'll try to come back to this at the end, but I'm going to be a little short on time. Late last January I received an email message from a Nature reporter, whom I had corresponded with about other things in the past, who said that there's this UK research assessment exercise that's about to come out, and I'm presuming all of you are much more familiar with this than I am. He just wanted information; he said: we'd really like to do some analysis of it, but it's 8,000 PDFs and they're not even text; is there any way of downloading 8,000 PDFs in the first place, and can you extract text from them and do any analysis on them? And if you've been following what I've been saying, this is the kind of thing that we do without batting an eyelash before breakfast every day. More seriously, it's exactly the kind of thing I give as an exercise to students in my class, so my students would actually be heartened to know that I was able to do this as fast as I claimed they should be able to do their weekly problem sets. I volunteered: okay, I'll pull those down; looking at it, this will be a five or ten minute exercise. So what I did at his request, and the results were very illuminating, and should be even more illuminating to all of you, since you have to participate in this particular exercise, was to do a linear regression and find the words that were most correlated with high scores. I'm told that you write something that purports to demonstrate the impact of the research that's done in these various institutions, and it's separated into 36 different subject areas: physics, chemistry, humanities, and all the rest. They are then rated according to either how much impact you've had or how well you've described the impact it has had. And then you find the words: the red here is hot, those are the words, near the top in these 36 subject areas, that were most correlated with getting high scores for impact, and the ones at the bottom are the ones anti-correlated. You can't read them, but I can. The ones that are up there... and by the way, this, of course, as we always say, is correlation, not causation. The words that are most negatively correlated are 'university', 'academics', 'studies', 'research'; 'piece' is absolutely at the bottom. So, as I say, you're not going to get away with just throwing in words like 'government', 'policy', 'committee', 'report' and all the rest that turn out to be correlated with high impact and automatically get high scores. But if I were you, I'd give it a shot.
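A rough sketch of the kind of analysis just described: extract text from the impact statements, build per-document word-frequency features, and rank words by how strongly they track the panel scores (the speaker describes a linear regression; a per-word correlation is the simplest stand-in). The scores.csv file and its columns are assumptions, and the PDFs are assumed to have already been converted to plain text, e.g. with pdftotext.

```python
# Sketch: rank words by correlation of their per-document frequency with impact scores.
# Assumes the PDFs were already converted to .txt and that scores.csv (hypothetical)
# has columns: filename,score.
import csv, re
import numpy as np
from collections import Counter

def word_freqs(path):
    words = re.findall(r"[a-z]+", open(path, encoding="utf8", errors="ignore").read().lower())
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

rows = list(csv.DictReader(open("scores.csv")))                 # filename,score (assumed layout)
docs = [word_freqs(r["filename"]) for r in rows]
scores = np.array([float(r["score"]) for r in rows])

vocab = sorted({w for d in docs for w in d})
X = np.array([[d.get(w, 0.0) for w in vocab] for d in docs])    # documents x words

# Pearson correlation of each word's frequency column with the score vector.
Xc = X - X.mean(axis=0)
sc = scores - scores.mean()
corr = (Xc * sc[:, None]).sum(axis=0) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(sc) + 1e-12)

ranked = sorted(zip(corr, vocab), reverse=True)
print("most positively correlated:", ranked[:10])
print("most negatively correlated:", ranked[-10:])
```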
You know, this was actually delayed; this will be at two or three minutes after the half hour, and I'll get back to why it was delayed, it's an interesting story. I won't spend much time on this next one; I just want to show, since I was talking about information genealogy, that we have all this fascinating data. This is the entire history of the word 'qubit'. It's just a fun example. The top is the number of uses: this one is per article, and this one is the total with multiplicity, scaled by a factor of ten so I could put it on the same graph. So if this is a factor of three, then when the word is mentioned in an article it's mentioned roughly 30 times in that article. And the word 'qubit', we know who coined it and when. It started in quantum computing as a mnemonic for what was formerly a 'two-state quantum system', which is a few more syllables than 'qubit', and it's come to be very popular. And when I mentioned this to a friend, Ramon, in Aspen, actually, we were talking about the fact that I could go to these talks at Aspen and, basically, in high energy physics they were the same as 20 years ago when I was more actively involved. And he said: no, they're completely different; they never used to use the word 'portal'. And sure enough, I went back and found that, for some reason, it used to be 'hidden sector' for supersymmetry, and then people switched over. It's one of these framing devices: instead of describing it as stuff that you can't see, it's, you know, a door into this other realm. Okay. So, moving on: when I described, for example, this ability to analyse articles, I was glossing over a couple of things. We actually use many more features, and to introduce that I wanted to give what is an almost silly example, or one that would be silly if it weren't so tragic: SCIgen. So how many people here have encountered SCIgen? Remarkably, not at all... one person.
You wrote it? No? Okay. So the story of this goes back just about ten years. Some computer science graduate students at MIT were annoyed that they were getting these invitations to conferences which they were convinced were faux conferences, run solely for the benefit of the organisers, to make money on the conference registration fees, and they didn't believe that the peer review that was purported to be done was actually undertaken. So, being computer science graduate students, they decided to write a program to generate nonsense articles, and what they used is what's known as a probabilistic context-free grammar. You can think of it as a template: it's got an introduction, and then each sentence, each paragraph, comes with these stochastically extendable templates. In the US we had this game we played when young called Mad Libs, where it says, you know, pick a noun, pick a verb; so it's just picking nouns and verbs from a random lookup list, and you get stuff which is very entertaining. And since they were not only computer science graduate students, they were computer science graduate students at MIT, they also generated the text, generated random figures, generated random references, generated random tables of data. So the idea was that if you looked at it from 30 feet away, it looked exactly like an article, indistinguishable; but as soon as you started reading it, you could instantly discern that there was something wrong. Now... oops. The site is still up. So just to give you an example of what we're talking about: you can go here to this site, these are the three authors, and this is the URL; you can plug in whatever names you choose, and I'll let you see the result. Okay. So somewhere in here they confirm the development of vacuum tubes and all the rest. Now, I assert with complete confidence that this article is better than any article those three authors have ever written together... except the one I wrote with you.
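A minimal sketch of the stochastic-template (Mad Libs / probabilistic context-free grammar) idea just described; the tiny grammar here is invented and far cruder than SCIgen's.

```python
# Toy stochastic-template text generator in the spirit of a probabilistic
# context-free grammar: each symbol expands to a randomly chosen alternative.
# The grammar below is invented for illustration; SCIgen's is far larger.
import random

grammar = {
    "SENTENCE": ["We {verb} that the {noun} is {adj}.",
                 "Our {noun} {verb}s the {adj} {noun}."],
    "noun": ["framework", "methodology", "vacuum tube", "algorithm"],
    "verb": ["demonstrate", "confirm", "refute"],
    "adj":  ["robust", "scalable", "provably optimal"],
}

def expand(template):
    # Replace each {symbol} with a random choice until no symbols remain.
    while "{" in template:
        start = template.index("{")
        end = template.index("}", start)
        symbol = template[start + 1:end]
        template = template[:start] + random.choice(grammar[symbol]) + template[end + 1:]
    return template

def paragraph(n_sentences=4):
    return " ".join(expand(random.choice(grammar["SENTENCE"])) for _ in range(n_sentences))

print(paragraph())
```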
So: I knew about SCIgen when it happened, but a few years ago I first became aware of a fellow, a computer scientist in Grenoble by the name of Labbé, because he did something relatively inspired. He decided to game Google Scholar, which automatically computes the h-index. So he took the software written by these MIT graduate students and modified it, specifically to create 100 articles, each one of which had 100 references. And those 100 references consisted of one reference to the mainstream literature and 99 references to the other 99 articles, and he posted them, waiting for them to be ingested by Google Scholar. Now, here's another problem for the audience, one that tends to generate surprising difficulty: what is the value of the h-index in this circumstance? Everybody talks about the h-index, but nobody has 100 articles each one of which refers to the other 99. 99? Yes, of course it is, and he attributes the difference to experimental error: Google Scholar comes along, and its methodology is to look to see if there's at least one reference to the mainstream literature, and then it ingests all of them; he has no idea why it came out to only 94, it should have been 99. But instantly this 'Ike Antkare', a name which probably means something if you're French, became the computer scientist with the highest h-index; he called him one of the great stars in the scientific firmament. And what happened about two years ago, which attracted my attention to this, was that the same Labbé later found that there were SCIgen-generated articles in publisher databases like Elsevier and Springer. And this boggles the mind, of course, in two different ways. If you recall my description, SCIgen was intentionally supposed to be obviously nonsense as soon as you looked at it; that's why I put that example up. So that meant there were these big multinational publishers who were publishing this stuff, claiming to do peer review and not actually doing it. That's the first surprise; maybe that's not so surprising. Perhaps more surprising is: why are people using SCIgen articles to pad their CVs? Because these are real people. They're not, you know, native English speakers; maybe that's part of the explanation. And the only explanation I've ever been able to come up with is that it still evades plagiarism detectors.
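For the h-index arithmetic in the stunt just described: the h-index is the largest h such that h of the papers have at least h citations each. A quick sketch, with the citation counts constructed to mirror the 100-articles-citing-each-other setup.

```python
# h-index: the largest h such that h papers have at least h citations each.
def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# 100 mutually citing articles: each is cited by the other 99.
print(h_index([99] * 100))   # -> 99
```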
Okay, so now let me go back; you'll see what I'm talking about in a moment. And this is a partial advertisement for IPython notebooks. I happened to have an IPython notebook hanging around from a year earlier, because I had given a talk at a 60th birthday, this one happened to be for John Preskill, and because it was a 60th birthday talk I wanted to do something really creative, something specific for the occasion. Shortly before, there was a podcast on language that I was listening to on my iPod as I was raking leaves or something, and it inspired me to see if I could reproduce this on arXiv data. What they talked about was the burning question of who wrote the 15th book of Oz. The Oz books were a series of books written in the early part of the 20th century, very popular; they were the Harry Potter of their time, apparently, coming out every year. The first books were written by L. Frank Baum; the last 18 were written by the children's author Ruth Plumly Thompson. And the question was: how do you do the stylometrics to determine the authorship of the intermediate book, the one that came out after he died but that had the two of them supposedly co-authoring it? Was it some cynical ploy on the part of the publisher to introduce the new author to the audience, or was it really based on his notes and some of his writing? And if it was such a cynical ploy, as mentioned, it certainly worked, because they got another 18 books out of it after the original author died. Now, the methodology they used was rather difficult in that case: they had to find all of the books, get copies of them, do OCR, carefully reread them and check them, and then they divided them up into blocks of 5,000 words and picked out just the so-called function words, the structural words, the most frequent words, things like 'the', 'of', 'and', 'in'. These words are very regular in our usage: they follow what's known as Zipf's law, which is the archetypal power law, with exponent minus one. If you plot the frequency of usage of these words against their rank, and you do it on a log-log plot, to first approximation it comes out to be a straight line; here I'm plotting it semi-log, so it's not a straight line. And then the deviations from that minus-one power law of frequency versus rank give you the styles of different people.
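A small sketch of the block-and-function-word bookkeeping just described: split a text into 5,000-word blocks, compute the relative frequency of a fixed list of function words in each block, and, optionally, look at the overall rank-frequency curve that Zipf's law refers to. The function-word list here is a tiny illustrative subset.

```python
# Split a text into 5,000-word blocks and compute function-word frequencies per block,
# the raw material for the stylometric analysis described above.
import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "in", "to", "a", "is", "that", "for", "with"]  # tiny subset

def blocks(text, size=5000):
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]

def function_word_profile(block):
    counts = Counter(block)
    n = len(block) or 1
    return [counts[w] / n for w in FUNCTION_WORDS]   # relative frequency of each function word

def rank_frequency(text, top=50):
    """Rank-frequency pairs; under Zipf's law, frequency is roughly proportional to 1/rank."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [(rank, c) for rank, (_, c) in enumerate(counts.most_common(top), start=1)]

# Usage sketch (hypothetical file): 
# profiles = [function_word_profile(b) for b in blocks(open("book.txt").read())]
```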
So I decided to make this easy, and instead, for me, the burning question was: who will turn out to have written the famous paper that Preskill wrote with Kitaev on topological entanglement entropy? And I knew this was going to work, because I already knew that if you look at these articles, the frequency of usage of the word 'the' goes from as little as two or three percent of the words in the text to as high as 15%. And you would think surely you would notice something odd about articles that have a small percentage of 'the's compared to ones with a large percentage, but it's all just stylistic, and we're so used to sucking out the content and ignoring the style. Somebody here can probably suggest what's going on. What accounts for that large variation? Sorry? One of the authors is Russian. Yes, absolutely; there was a clue there. I'm told that in some of these Eastern European languages, like Russian, there just is no word for the definite article, so when they speak we just get used to it, whereas it turns out that in Japanese there are 'the's all over the place, for some reason, preceding everything. And these things are unbelievably systematic. After I did this, a student and I went into the arXiv data and selected out the single-authored articles, identified as best we could the native language of those people, and there's just this incredible clustering. It's cognitively fascinating: in the post-World War Two era we forced everybody in the world to write their articles in English, so we have this incredible database of articles written by people of various native languages, and you can identify the native language, via the methodologies I'm about to show, with great precision. The native Russian speakers, native Japanese, native French, native German speakers all have these systematic deviations, and within each group it's all the same deviation: it's based on whatever was imprinted on you from the first language that you learned. So what you see here, it's hard to see, is words like 'if' and 'will', in the green category, being used much more by one of them, and Preskill using words like 'has' and 'have'. And the methodology for analysing this, to make it easy, is probably familiar to a lot of you: singular value decomposition. It's a generalisation of the usual eigenvalue decomposition, but instead of a symmetric matrix you've got an arbitrary rectangular matrix, the word-frequency matrix, with the rows labelled by the words and the columns labelled by the groups, the entries being the frequencies. Oh, I didn't realise all along I've had one... no. And, roughly speaking, actually more than roughly speaking, you just take an arbitrary rectangular matrix, you multiply it by its transpose, you get a symmetric square matrix, and you do the eigenvalue decomposition of that.
And the so-called singular values are just the square roots of those eigenvalues, by convention taken with the positive sign. This sort of thing is very closely related to, and in very many cases the same as, principal components analysis, and when the data is text it's used in something called latent semantic analysis; it's used very frequently on stock data, genomic data, Apple iTunes Genius. Apple iTunes Genius really is just doing a singular value decomposition, where here the rows of the rectangular matrix are the users, and instead of frequencies of words it's the number of minutes you've listened to songs. Or again, something that I teach in the class, the Netflix challenge, where they had 500,000 users, 17,000 movies, a big rectangular matrix of the users' movie ratings, a very sparse matrix. But it turns out that, once again, this kind of matrix, even though it looks enormous, is now the sort of thing you can just handle, in Python for example; it doesn't have to be Python, but it has all of the linear algebra routines pre-compiled, so it's computationally efficient, and doing spectral analysis on matrices of this size is something you can now do on a laptop like this one. So these things are not difficult. I just threw in this slide because I teach the same thing in the quantum computing course I mentioned: instead of singular value decomposition it's called the Schmidt decomposition for physicists. This is the example where you take a physical system, you separate it into two halves, and then you find the new basis in which to expand it out in terms of the bases of the two separate sides. But it's the same thing. In the case of Kitaev and Preskill, those little systematic deviations can all be subsumed into one direction in this singular value space, the one that discriminates between the two of them, and you can see that Preskill's use of prepositions, over the 5,000-word blocks, is systematically distinguishable from Kitaev's. And that's what I was expecting. And since I usually forget, I'll just tell you that in the case of L. Frank Baum and Ruth Plumly Thompson, these were again the 5,000-word blocks. And by the way, when I was listening to this on my iPod, of course, what did they say? They talked about structural words and... oh my God, okay... they talked about structural words and, uh, this magical way of getting two-dimensional plots, and I instantly realised they were just talking about principal components analysis on the function-word distributions. And so in this case, here's the L. Frank Baum cluster, here's the Plumly Thompson cluster, and the book in question comes out right in the middle of the Thompson ones. So: the cynical ploy.
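A minimal sketch of the singular value decomposition step just described, applied to a small block-by-word frequency matrix; the numbers are invented, and a real analysis would use the function-word profiles of the 5,000-word blocks.

```python
# SVD / principal components sketch: project block-by-word frequency vectors
# onto the leading singular directions to get the two-dimensional style plots
# described above. The matrix entries here are invented for illustration.
import numpy as np

# Rows: text blocks (e.g. 5,000-word chunks per author); columns: function-word frequencies.
X = np.array([
    [0.062, 0.031, 0.010, 0.004],   # hypothetical author A, block 1
    [0.060, 0.029, 0.011, 0.005],   # author A, block 2
    [0.021, 0.030, 0.018, 0.012],   # hypothetical author B, block 1
    [0.023, 0.028, 0.017, 0.011],   # author B, block 2
])

Xc = X - X.mean(axis=0)             # centre the columns, as in PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

coords = U[:, :2] * s[:2]           # each block projected onto the top two components
print("singular values:", s)
print("2-D coordinates per block:\n", coords)
# Blocks from the same author should land near each other in this 2-D projection;
# a disputed block can then be placed by seeing which cluster it falls into.
```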
472 00:54:10,450 --> 00:54:12,429 And so in this case, here's the L. 473 00:54:12,430 --> 00:54:18,219 Frank Baum cluster, here's the Ruth Plumly Thompson cluster, and the book in question comes out right in the middle of the Thompson family. 474 00:54:18,220 --> 00:54:28,600 So this is highly unfortunate: since I've only got a little time left, I'm going to have to try to pick and choose. 475 00:54:31,150 --> 00:54:36,010 You know, the point I was going to get to: I wrote this up, so this also became part of the arXiv screening. 476 00:54:36,880 --> 00:54:39,160 These are ordinary arXiv articles. 477 00:54:39,160 --> 00:54:47,320 And here I've got three principal components, because there's also a Mathgen, which distorts the word distribution in its own way. 478 00:54:48,640 --> 00:54:56,890 And here's the one that's distorted from the standard SCIgen because of all the references. 479 00:54:56,890 --> 00:54:59,560 That also, you know, distorts the distribution. 480 00:54:59,560 --> 00:55:09,969 And as soon as I did that, I had the Python notebook from this silly — not silly, but instructive — pedagogical talk that I gave. 481 00:55:09,970 --> 00:55:20,950 And that also became an arXiv screen, and it also feeds into, you know, the French scientist's analysis. 482 00:55:22,270 --> 00:55:35,580 One last comment about this. I threw in those slides because it was recently announced — I mean, again, very surprisingly — that 483 00:55:36,050 --> 00:55:42,410 some publishers had entered into a sort of collaboration with Labbé on automated tools 484 00:55:42,410 --> 00:55:49,460 for finding these articles that were generated by software, which was supposed to be instantly obvious to any human reader. 485 00:55:49,850 --> 00:55:55,159 And so, you know, my comment when they asked me about this was: oh, this is wonderful. 486 00:55:55,160 --> 00:56:03,799 We now have a way of automatically detecting articles that were intentional nonsense generated by computers. 487 00:56:03,800 --> 00:56:07,360 But what about all of the unintentional nonsense generated by humans? 488 00:56:13,280 --> 00:56:18,650 Now, one of the things I was going to talk about — well, I could ask for more time. 489 00:56:23,280 --> 00:56:30,150 He said I should stop when you stop laughing at the jokes, but I know I can't go on forever. 490 00:56:30,780 --> 00:56:40,050 But the main thing that I wanted to do, which was quite fun — 491 00:56:40,110 --> 00:56:50,610 well, one of two other main things — was to describe to you some recent work on semantic word embeddings, which has also fed into this in a very important way. 492 00:56:51,330 --> 00:57:03,690 What I propose to do — I'm just reorganising this in my mind — is to spend another five minutes. 493 00:57:07,360 --> 00:57:11,560 Give him five and he takes... Okay. So, you know, 494 00:57:12,220 --> 00:57:22,950 we've all done these analogy tests — they used to be on the SAT — and one of the ideas of the semantic word embeddings, 495 00:57:22,960 --> 00:57:31,120 which has also proven very powerful for the arXiv analysis, is to convert this to a question of vector addition. 496 00:57:31,120 --> 00:57:35,050 So if I ask you: Paris minus France plus Italy, what would you say? 497 00:57:36,130 --> 00:57:47,680 Rome. Rome — that's the right answer. And this was software written by others, which we then took and developed further. 498 00:57:47,680 --> 00:57:48,669 It works remarkably well.
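The vector-addition trick can be sketched with a toy embedding (the vectors below are invented for illustration; the real analysis uses embeddings trained on arXiv full text): if the country-to-capital offset is roughly a shared direction, a nearest-neighbour search on paris - france + italy lands on rome.

```python
# A toy sketch of analogy-by-vector-addition; the embedding is fabricated
# so that "capital of" is a shared direction plus noise.
import numpy as np

rng = np.random.default_rng(42)
N = 200                                   # same dimensionality as in the talk

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

words = ["france", "italy", "germany", "electron", "proton"]
emb = {w: rng.standard_normal(N) for w in words}

capital_dir = rng.standard_normal(N)      # hypothetical country -> capital direction
emb["paris"] = emb["france"] + capital_dir + 0.1 * rng.standard_normal(N)
emb["rome"] = emb["italy"] + capital_dir + 0.1 * rng.standard_normal(N)

# paris - france + italy should be nearest (by cosine) to rome.
query = emb["paris"] - emb["france"] + emb["italy"]
candidates = [w for w in emb if w not in ("paris", "france", "italy")]
print(max(candidates, key=lambda w: cosine(query, emb[w])))   # -> rome
```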
499 00:57:48,670 --> 00:57:57,640 In fact, I had a student who worked on this go through the code line by line to make sure it wasn't doing any surreptitious network lookup. 500 00:57:58,960 --> 00:58:07,780 But I apologise: I've got a good description of how all of the mathematics works, 501 00:58:07,780 --> 00:58:10,870 but I'll have to defer that to posted slides or something like that. 502 00:58:11,140 --> 00:58:17,470 What it does is assign a vector in a 200-dimensional space to every one of the words, 503 00:58:17,830 --> 00:58:25,450 in such a way that it takes advantage of the words that occur in the surrounding context. 504 00:58:25,450 --> 00:58:34,090 And it turns out to be an incredibly powerful way of characterising all of the words, and as you can see it feeds into all of these methodologies. 505 00:58:34,100 --> 00:58:41,799 So what I'm showing here are just the cosines of the vectors — the angles between them — and it automatically, 506 00:58:41,800 --> 00:58:51,040 in this magical way, groups things together, and also, you know, sort of understands hedge words, descriptive words. 507 00:58:51,430 --> 00:58:57,700 We use these things so systematically that it's instantly able to group them in this space. 508 00:58:57,910 --> 00:59:03,160 Question: the cosine between which two things? 509 00:59:03,160 --> 00:59:06,970 Oh, sorry — thank you. 510 00:59:07,450 --> 00:59:12,810 All of these are with the one at the top; that's why the one at the top is always one. 511 00:59:12,820 --> 00:59:16,330 And these are the nearest neighbours to that word. 512 00:59:16,840 --> 00:59:21,819 And in a high-dimensional space, of course, cosines fall off generically. 513 00:59:21,820 --> 00:59:26,370 Random vectors are nearly orthogonal, because there's so much more room to move around. 514 00:59:26,380 --> 00:59:30,490 Technically, they fall off as one over the square root of N. 515 00:59:31,510 --> 00:59:40,030 So when you get numbers like this, they're anything but random vectors, and you see all the variants of the word are collected together. 516 00:59:40,030 --> 00:59:47,590 So I'm going to show — well, these are actual examples we pulled from the data. 517 00:59:47,620 --> 00:59:51,970 We ran it on the arXiv. Anybody get this one? 518 00:59:54,630 --> 00:59:58,170 Thank you. Though, you know, it's not always exact. 519 00:59:58,350 --> 01:00:02,250 How about this one? No? 520 01:00:02,780 --> 01:00:05,840 This one's a little harder. 521 01:00:13,500 --> 01:00:17,700 Yep. It just takes a while, but it comes out — you know, there it is, right at the top. 522 01:00:22,150 --> 01:00:27,130 We got it. So what's actually happening there? 523 01:00:28,330 --> 01:00:35,500 You know, just as in the case of — here's France, here's Paris, here's Italy, here's Rome — 524 01:00:35,500 --> 01:00:40,060 the direction from France to Paris has to be the same as the direction from Italy to Rome. 525 01:00:40,330 --> 01:00:45,160 So when you take the difference and translate, you get to the other one. 526 01:00:45,490 --> 01:00:52,090 And the reason you get the answer is because of this magical way: when you do this two-dimensional projection, you find 527 01:00:52,090 --> 01:01:00,070 that the particles sit in one part of this 200-dimensional vector space, and their antiparticles have exactly the same disposition.
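The one-over-root-N point is easy to check numerically; here is a small sketch (random vectors standing in for real 200-dimensional word embeddings) contrasting typical random cosines with the cosine to a genuinely nearby vector.

```python
# A quick numerical check of the claim above: cosines between random
# high-dimensional vectors are O(1/sqrt(N)), so cosines like the ones on
# the nearest-neighbour slides are far from accidental.
import numpy as np

rng = np.random.default_rng(0)
N = 200

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rand_cosines = [abs(cosine(rng.standard_normal(N), rng.standard_normal(N)))
                for _ in range(2000)]
print("typical random |cosine|:", round(np.mean(rand_cosines), 3),
      " 1/sqrt(N):", round(1 / np.sqrt(N), 3))

# A vector plus a small perturbation -- a stand-in for a genuine neighbour.
w = rng.standard_normal(N)
print("neighbour cosine:", round(cosine(w, w + 0.3 * rng.standard_normal(N)), 3))
```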
528 01:01:00,220 --> 01:01:07,840 And that is, roughly speaking, because the words surrounding them are identical, in this very systematic way. 529 01:01:12,370 --> 01:01:17,800 So maybe — well, I'm going to do one other thing. I had this demo up. 530 01:01:18,220 --> 01:01:22,750 I worked so hard to get this demo together that you're forced to look at it. 531 01:01:27,820 --> 01:01:33,820 But you can play with it more — it's online. 532 01:01:35,230 --> 01:01:42,700 There's the URL. This is a so-called t-SNE plot of the words from, you know, 533 01:01:42,700 --> 01:01:48,940 three months of articles, and they're colour-coded according to their mutual information with various categories. 534 01:01:49,390 --> 01:01:54,770 And the graduate student who did this was very inspired. 535 01:01:54,790 --> 01:02:03,820 He just plugged it into the Google Maps API, so you can navigate this two-dimensional projection the same way you do with Google Maps. 536 01:02:04,330 --> 01:02:12,100 He was able to do this in less than half an hour, because his officemate had used Google Maps to do a beer pong visualisation. 537 01:02:15,910 --> 01:02:25,510 And again, you can see the way these words cluster together, and you can zoom in, you can do searches and look at various things. 538 01:02:28,030 --> 01:02:33,120 Somebody pick a word. You did what? 539 01:02:33,960 --> 01:02:37,650 What'd you say? Look — well, there's motility. 540 01:02:37,660 --> 01:02:47,280 I mean, it'll just go in and you'll find motility. Ah, but, you know — 541 01:02:47,720 --> 01:02:57,110 well, I'll do something else. Here, if you look, it tends to group together all of the universities here. 542 01:02:57,260 --> 01:03:02,720 If you go up to the highest level — well, let me pick this strange area over here. 543 01:03:03,500 --> 01:03:07,160 So this looks like a disconnected area. Why is this a disconnected area? 544 01:03:07,490 --> 01:03:15,069 You zoom in and you find that, you know, at the upper left here, it's mainly blue. 545 01:03:15,070 --> 01:03:22,120 We try to screen for the languages, but obviously a bunch get through, and, you know, it knows to collect all of those together. 546 01:03:22,500 --> 01:03:32,020 There's another intriguing one: if you look at these outliers in this space, what's going on over here? 547 01:03:35,260 --> 01:03:49,960 You know, it's picked out all the figure captions. But then, always amusing, is the island of Japan down here, 548 01:03:51,810 --> 01:03:58,390 down in the lower left, which really does turn out to be an island. 549 01:03:58,570 --> 01:04:01,840 So what's going on there? I mean, how does it group that together? 550 01:04:01,840 --> 01:04:05,020 It does — it's absolutely amazing — and it doesn't do any semantic lookup. 551 01:04:05,410 --> 01:04:10,090 All that means is that people with Japanese surnames tend to co-author more often than not. 552 01:04:10,990 --> 01:04:14,260 That's also true of people with French first names. 553 01:04:14,500 --> 01:04:19,569 It's also true of women with Italian first names. 554 01:04:19,570 --> 01:04:24,580 I could go in there — I won't right now, because I'm already well over time. 555 01:04:30,230 --> 01:04:36,790 You know, you'll find the Danielas somewhere in there, and all of the rest, all together.
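For reference, a two-dimensional map like the one in the demo can be produced from word vectors with off-the-shelf t-SNE; the sketch below uses scikit-learn and random stand-in vectors (my own tooling choice, not necessarily what the demo actually ran).

```python
# A minimal sketch of producing a navigable 2-D layout from word vectors
# with t-SNE (random vectors stand in for the real arXiv embeddings).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
words = [f"word{i}" for i in range(500)]           # placeholder vocabulary
vectors = rng.standard_normal((len(words), 200))   # stand-in for trained embeddings

xy = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(vectors)

# Each word now has an (x, y) position; the demo served such a layout as
# tiles through the Google Maps API, but a plain scatter plot works too.
for w, (x, y) in list(zip(words, xy))[:5]:
    print(w, round(float(x), 2), round(float(y), 2))
```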
556 01:04:36,800 --> 01:04:44,600 So let me just, in my remaining minus two minutes — I did have a plea here for why these analogies matter. 557 01:04:44,600 --> 01:04:48,890 I'm going to go very quickly through the slides so that you'll get a flavour of what you're missing. 558 01:04:50,030 --> 01:04:56,540 It was James Clerk Maxwell — this was my apology for talking about analogies to a physics audience. 559 01:04:57,470 --> 01:05:08,000 I found this absolutely fabulous quote from Maxwell from the mid-19th century, who says: now, as in a pun two truths lie hid under one expression, 560 01:05:08,000 --> 01:05:11,600 so in an analogy one truth is discovered under two expressions. 561 01:05:11,600 --> 01:05:20,600 Every question concerning analogies is therefore the reciprocal of a question concerning puns, and the solutions can be transposed by reciprocation. 562 01:05:20,600 --> 01:05:29,630 I had never known about this duality between puns and analogies, but since Maxwell wrote about it, I thought it worth including. 563 01:05:29,900 --> 01:05:35,210 So this is all the mathematics behind it. It's great mathematics. 564 01:05:35,810 --> 01:05:44,209 And I guess I'll leave some of it for the questions. 565 01:05:44,210 --> 01:05:48,100 I wanted to leave you with one final thing. 566 01:05:48,110 --> 01:05:56,599 It's very easy to describe. These are a bunch of slides about the overlap detection. 567 01:05:56,600 --> 01:05:58,909 We run it on all of these things as they come in. 568 01:05:58,910 --> 01:06:05,930 Remarkably, you can take the roughly 600 new submissions a day and compare them against the million-document database. 569 01:06:05,930 --> 01:06:08,150 There are these fantastic algorithms for doing this. 570 01:06:08,570 --> 01:06:16,670 And one of the surprises to me — I thought I'd seen it all — was to realise that people plagiarise acknowledgements in theses. 571 01:06:23,060 --> 01:06:29,060 And so you can see — I put the arXiv numbers here — look at this one: 572 01:06:29,060 --> 01:06:34,940 it looks like the head of the Department of Mathematics changed from BCD Dos to Ajay Kumar. 573 01:06:34,940 --> 01:06:40,400 But, you know, there was an even better one. 574 01:06:42,380 --> 01:06:46,520 There are lots of these, so I pick and choose the ones I like. 575 01:06:46,520 --> 01:06:51,379 This one is: 'I cannot describe how indebted I am to my wonderful girlfriend, Amanda, 576 01:06:51,380 --> 01:06:58,230 whose love and encouragement will always motivate me.' And then the other one: 'I cannot describe...' — 577 01:06:58,240 --> 01:07:04,430 the same, except now it's 'my wonderful wife, Renate,' who put up with 'peculiar working hours, erratic behaviour toward the end.' 578 01:07:04,430 --> 01:07:13,639 Good. So this is where I wanted to end up. I wanted to end on — you know, usually when I give a talk like this to students — 579 01:07:13,640 --> 01:07:19,220 how important it is, if you're going to do this, to at least remember to change 'girlfriend Amanda' to 'wife Renate'. 580 01:07:21,380 --> 01:07:26,120 You run into worse problems than just plagiarism. So I'll stop there, and apologise for what I did.
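The talk doesn't spell out which overlap-detection algorithm is actually used; as a rough illustration of how this kind of comparison can scale, here is a sketch of one standard approach, hashed word shingles (the seven-word window and the example sentences are my own choices, not the arXiv pipeline).

```python
# A minimal sketch of overlap detection via hashed word 7-gram "shingles";
# indexing shingle hashes is what lets each new submission be compared
# against a very large corpus without pairwise full-text comparison.
import hashlib

def shingles(text, n=7):
    """Set of hashes of overlapping n-word windows."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(len(words) - n + 1)
    }

def overlap(doc_a, doc_b, n=7):
    """Fraction of doc_a's shingles that also appear in doc_b."""
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    return len(a & b) / max(len(a), 1)

old = ("i cannot describe how indebted i am to my wonderful girlfriend amanda "
       "whose love and encouragement will always motivate me")
new = ("i cannot describe how indebted i am to my wonderful wife renate "
       "whose love and encouragement will always motivate me")
print(round(overlap(new, old), 2))   # substantial overlap despite the edited names
```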