So I'll be talking about the replication crisis in psychology. And just to be fair to psychology, I'm not singling it out here. I could just as well be talking about a replication crisis in science generally, or medicine or biology; cancer research is having some problems now. I'll be speaking about psychology because this is the area that I'm most familiar with and closer to the facts on the ground, and so I feel most comfortable talking about these examples.

The other thing I want to say about the word crisis here is that some people think this term is inappropriate, that it's dramatic. Others think that it's a fair description of what's going on. So there's serious and substantive disagreement about whether indeed there is a crisis going on and, if there is, what the nature of it is. So I want to be clear about that. In terms of history and philosophy of science, I don't want you to get your hopes up too much in terms of history: I'm not going back much further than 1970, and the philosophy of science is going to be sort of 101 introductory stuff. So for those of you who have any familiarity at all with philosophy of science, some of these ideas will be comfortable already.

So I'll start with this. Replication is sometimes talked about in the context of falsification, which is an idea that goes back most notably to Karl Popper. This is from his 1934 treatise, The Logic of Scientific Discovery. I've already violated my 1970s rule there, but I won't do that again.

Here's a naive view of what's going on with scientific discovery, and this is something that Popper was criticising in his work. You might think that you come up with a theory, or maybe a hypothesis, in your mind, and you think that it's going to predict some kind of observation, something you should be able to observe. And you try to design an experiment such that the theory will lead logically to an observation. And then, if you get the observation that you predicted, you might think, well, that now confirms or at least counts in favour of my theory. That's sort of a naive view of what's going on in science.

The problem with this is that any number of other theories might also just as well have led to the same observation. And so finding evidence that's consistent with the theory is incredibly weak support for the theory, unless you can rule out all of these competing alternative theories that might, as I say, just as well have predicted the same observation.

So this is related to the fallacy in logic called affirming the consequent, which goes like this: if P then Q; Q; therefore P.
You can't draw that conclusion, because A, B, C or D might just as well have entailed Q. So getting Q doesn't give you P for free; you have to rule out all the alternatives.

But this points to a valid argument in logic, which is the modus tollens argument. The way that this goes is: if P then Q; not Q; therefore not P. The idea is: if P then Q; I didn't find Q; therefore it can't have been P in the first place.

So this inspired Popper's notion of falsification. The idea here is: I have a theory, I predict some sort of very specific observation, and if I don't get the observation I predicted, then that's supposed to be seen as falsifying, or at least counting against, the theory. Again, this is a highly simplistic view, but that's the general idea.

So Popper's notion was that instead of trying to come up with evidence that seems to support or confirm our theories, we should constantly be trying to come up with hypotheses that predict very specific outcomes, such that if we don't get the outcome that we expected, we should be willing to forfeit our theory. If you constantly have theories that can't be falsified by any possible observation, rather than being a strength of the theory, he thought that was a weakness. And so we should be subjecting our theories to what he calls critical tests, trying to falsify them. And if we can't falsify them, we don't say that they've been proven true; we just say they haven't yet been falsified. And, begrudgingly, over time, if we've subjected them to many critical tests, we say: maybe, maybe we have something here; we haven't been able to falsify it, even though we tried our best to do so.
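Just to keep the two argument forms straight before moving on, here they are restated in standard notation; this adds nothing beyond what I've just said:

\[
  \text{Affirming the consequent (invalid):}\quad P \rightarrow Q,\ \ Q,\ \ \therefore\ P
\]
\[
  \text{Modus tollens (valid):}\quad P \rightarrow Q,\ \ \neg Q,\ \ \therefore\ \neg P
\]

The first form fails because any number of rival theories could also entail Q; the second is deductively valid as it stands.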
Now, as critics of Popper have noted in all the intervening decades and in many different ways, even this does not get us off the hook; this is still a problematic view. So again, the idea is: I don't get the observation that I expected, and on Popper's view that's at least supposed to count against my theory. But it might not be that straightforward. First of all, there might be something wrong with the observation. Maybe my research assistant wrote down the wrong number or something like that, so I might not be sure that the observation is really the correct observation. So the observation might be what's wrong rather than the theory. It might be that there are various weaknesses in the experiment itself; maybe I didn't set it up in such a way that it really would give me the closest test of the theory in terms of the observation. And there might be various problems between the theory and the experiment as well.

These are often referred to as auxiliary assumptions. Basically, these are the logical links between each step going from the theory, through the experiment, to the observation, and you'd have to show that all of those were true in order for the observation not turning out as predicted to count against the theory. Just to show what I mean by this: it could be that, if the observation doesn't hold, the theory is not true. But it could be that the observation is wrong; it could be that something's wrong with the experiment; it could be that something's wrong in the link between the theory and the experiment. And until I rule out all of those alternative explanations, I actually haven't touched the theory at all. I certainly can't say that it's not true unless I've ruled out all of the possible links between the observation and the theory.

So this creates something of a pickle for talk of failed replications in the current debate. You will often find researchers saying, well, we failed to replicate the initial finding; this counts against that finding. But again, unless you've accounted for all the possible auxiliary hypotheses, simply getting a negative result, a result that doesn't appear to be the same as the first one that was reported, isn't going to be enough to show that you've somehow falsified the original finding, much less the original theory.
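To put the structure of that worry in symbols (just a sketch, with T for the theory, A_1 through A_n for the auxiliary assumptions, and O for the predicted observation), what the experiment really tests is the whole conjunction:

\[
  (T \wedge A_1 \wedge A_2 \wedge \cdots \wedge A_n) \rightarrow O, \qquad \neg O, \qquad \therefore\ \neg\,(T \wedge A_1 \wedge A_2 \wedge \cdots \wedge A_n).
\]

All modus tollens licenses is that at least one conjunct is false. It could be the theory, but it could just as well be a faulty observation or a weakness in the experimental setup; nothing in the logic singles out the theory, or the original finding, as the culprit.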
So the opening shots of the current replication crisis were fired around 2012. This is a blog post by Ed Yong, a very distinguished science writer, in the United States I believe, or maybe he's here; he used to work for Discover and writes for The Atlantic. And this post came out and caused a lot of controversy. You can see the title here: Primed by Expectations: Why a Classic Psychology Experiment Isn't What It Seemed. He was writing up this paper by Doyen and colleagues, called Behavioral Priming: It's All in the Mind, but Whose Mind? What was significant about this paper, also published in 2012, is that the researchers ran an experiment, didn't get evidence supportive of the original published findings that they were trying to replicate, and then, instead of just burying the finding in their file drawer, which is what typically happens, they went through the trouble of writing it up, and the further trouble of submitting it to a journal, and they actually got it published in a fairly well respected journal.

Now, this used to never happen. Anthony Greenwald wrote a paper back in the seventies talking about the unintended consequences of what he calls the prejudice against the null. This just means that for the longest time, journals had a strong disinclination to publish any kind of negative or, as it were, failed experiments. Failed replications in some sense happened all the time and continue to happen; it's just that you would never go through the trouble of writing them up, much less submitting them, because you knew you wouldn't get them published. So this was a significant moment. The only time you would hear about failed replications would be at the bar after a conference, when you're talking to your colleagues and you say, yeah, I tried to run that experiment and I couldn't get it to work; oh yeah, I tried to run that experiment, I couldn't get it to work either. So you had this informal knowledge floating around in scientific communities, but because all of these failed replications are never published, they just get buried, and what you end up having in the published literature is just a fraction of the attempts at running those experiments. And this creates systematic problems and also, by the way, increases the likelihood that the published finding is just a false alarm.

So this is the original paper that Doyen and colleagues were trying to replicate. I'm going to call it the elderly walking time study, just as shorthand, because the title is a bit long. I just want to check in the room: do people know about this study, or have heard of this study? If you have, just raise your hand. And if you haven't... that's about half, okay. So it's worth explaining; I'll go through it in some detail.

Basically, this was a very important finding in the field of social psychology. It's been cited 3,633 times, which is an astronomical number for studies of this kind. It's been written up in introductory textbooks and so on. So for researchers to claim that they weren't able to replicate this finding was, as you can imagine, sort of a big deal.

This is basically how the experiment worked. I'll say a little bit about the theory first. So there's this idea that simple cues in the environment activate stereotypes in our mind. If you see someone walking down the street, you pick out very basic aspects of what you can see, and all these stereotypical traits just become automatically activated. The researchers here were trying to test the further idea that once a stereotype is activated in your mind, you yourself tend to behave in accordance with that stereotype, basically as a social lubricant.
It helps you behave in accordance with the way that you think the people around you basically are, so that you're not consciously having to try to figure out how to behave in some situation. So I'm going to be talking about the elderly stereotype. Just imagine that I'm in a room full of elderly people. Well, I'm going to walk a little slower, maybe speak a little more quietly, and so on, just instantaneously, because my mind has a stereotype about what elderly people are like, and it's going to encourage me to sort of instantaneously act a little bit more like that, so that I find it easier to engage with my environment.

So the clever idea for how they were going to test this was that they gave what's called a linguistic priming task. Basically, they gave people a series of sentences that were in a scrambled order and you had to unscramble them, and it was just framed as a linguistic puzzle. But in a subset of these sentences were words that were meant to activate the elderly stereotype. So the words are things like wrinkle, grey, bingo; Florida, I think, was in there, this being conducted in the US. And the idea is that not so many of these words should be included that you notice any sort of connection here, but your mind is figuring out that there's something going on. It's meant to activate the whole stereotype, which means that even traits that you didn't include in your priming materials should become activated in your mind, because they're associated in the culture.

So the trick here is that one of the traits they did not include in the priming materials was one having to do with moving slowly through space, and they wanted to see whether that would be activated and whether it's true that it influences your behaviour: participants who were primed with the elderly stereotype should themselves move more slowly through space compared to those who were primed in a control condition. That was the prediction. So they tried to activate the stereotype of the elderly. And then the very interesting thing they had done here was that an experimenter was sitting in the hallway with a stopwatch behind a newspaper, just timing participants as they left the study and went to the elevator. This person was blind to condition and just recorded how long it took each of them to walk down the hallway and get to the elevator, and then you can go back and look at which condition they were in. And the big finding was that participants who had been primed with the elderly stereotype did indeed walk more
slowly down the hall than participants who were primed in the control condition.

Now here's the replication attempt by the Belgian researchers. They administered the same priming task; the idea was that they were going to activate the elderly stereotype. But they weren't very happy with this measurement device here. You can see it's a human with a stopwatch, and humans with stopwatches are potentially prone to errors; they might be influenced by their expectations in some way. So they actually made a change. This is a replication study, but it's not exactly the same, and in a way that is important, as I'll highlight in a moment. What they did is they replaced the students with stopwatches with infrared sensors. These infrared sensors are, strictly speaking, different; they made a change. But on any reasonable understanding of what's going on here, this is an improvement to the study design, because they replaced a measure that's prone to human idiosyncrasy with a measure that's going to give an accurate reading in every condition.

So the asterisk I'm raising here is that when you're talking about an exact replication or a direct replication, it doesn't necessarily mean you use the exact same materials. It's actually okay, I would argue, to make certain changes so long as they make for better tests of the theory on any reasonable view. And that was true in this case.

So, unfortunately, they weren't able to find anything looking like the original result. And to drive the point home, they actually went back to the students with stopwatches and ran a second study where they told half of the experimenters what the hypothesis was, and the other half what it wasn't; in other words, they were priming the experimenters with expectations. And here they were able to replicate the finding in the group of students with stopwatches who were aware in some way of the study hypothesis. And so the implication was that perhaps, in the original famous study, the experimenters weren't properly blinded, and somehow their expectations of what was going on were influencing the timing measures.

So the way this was written up by Ed Yong referred to what's called the clever Hans effect. And again, I'll just ask, so I don't go into too much detail here: does anyone not know what the clever Hans effect is? Okay, some people don't. So for your benefit: in the early 1900s there was a show horse that supposedly could perform mathematical operations, and his handler would bring him around, and he could do addition and subtraction and so on.
And everyone was amazed at the capacities of this horse. A psychologist at the time was brought along to try to figure out what was going on here: did the horse really understand arithmetic? And it turned out, based on a series of careful experiments, that the horse was just responding to very subtle cues on the part of his handler, cues that maybe even the handler didn't know he was giving. So when the horse's hoof is hovering over the right answer, maybe the handler perks up a little bit or something, and the horse notices that and, in effect, gives that answer. Now, that's still an amazing capacity, if the horse can read subtle shifts in the psychological state of its handler, but it's not the result that was being claimed.

What's important about the clever Hans effect is that this is experimental methods 101 in social psychology. Particularly when you're dealing with these priming experiments, where you're doing very subtle manipulations of the environment to try to alter human thinking or behaviour, you have to be absolutely certain that the experimenters, everyone who's interacting with subjects, have no idea what the experimental design is, lest they behave like the horse's handler and subtly give some sort of cue that brings about what was hypothesised, but not by the mechanism that was hypothesised. So essentially the claim, in more or less direct terms, was that this very distinguished researcher who'd come out with the original finding was guilty of making a Psychology 101, very basic design error. Now, I'm putting an asterisk here because I'm going to come back to this story. This is not how it ends, but this is how it begins.

Something about the timeline here that I think is important to point out: here in November 2011, the year prior, there was this story, this essay, published in The New York Times about a fraud case. You might have heard of Diederik Stapel, a very prominent researcher who had published all these findings, and it turned out not just that he'd done sloppy research in some way; he was simply making up the findings. He would open up an Excel spreadsheet, type in some numbers, pretend he'd run an experiment, and then publish the results. And he made this whole influential career doing explicitly fraudulent work. And so he was caught and found out; there was this big exposé in The New York Times. And so I think people were thinking: fraud is what's going on in psychology now.
And this is difficult, because if you have ordinary research behaviours that sometimes lead to a result that can't be replicated, I think a lot of members of the lay public immediately say, oh, it must have been fraud going on there, rather than seeing it as just a normal part of the scientific process. So that's something that was in the background here.

Another paper that came out around this time was a study by Leslie John and colleagues, where they administered a survey to more than 2,000 practising psychologists. They made sure it was carefully anonymised, so that the psychologists would be willing, in a sense, to admit to behaviours that they might not otherwise want to admit to if they knew they were going to be identified. And these psychologists admitted to a whole range of what are now called questionable research practices, which are normal aspects of everyday, ordinary lab practice that sit in the grey area between fraud and perfect methodology. And when you kind of add them up, they lead to a higher likelihood of having false alarms published in the literature. So they would admit to things like stopping and checking their data every so many participants to see if there was a statistically significant finding; if there wasn't, they would collect some more participants and check again, and then, as soon as the p value drops below 0.05, they say, yes, we got a result, and they would publish it. This was standard practice until a couple of years ago. Nobody was really talking about it, except for a few methodology journals saying that maybe this isn't what you should be doing. So a lot of psychologists didn't know you weren't supposed to be doing this.
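To give a feel for how much that particular practice, peeking at the data and adding participants until p dips below .05, can inflate the rate of false alarms, here is a small simulation sketch. It is not taken from the John et al. paper; the batch sizes, the cap on the sample, and the number of simulated studies are arbitrary assumptions for illustration. Both groups are drawn from the same population, so any "significant" result is, by construction, a false positive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_study(start_n=20, step=10, max_n=100, alpha=0.05):
    """Run one study with optional stopping: test after every batch of
    participants and stop as soon as p < alpha or the sample cap is hit."""
    a = list(rng.normal(size=start_n))   # group A, true effect is zero
    b = list(rng.normal(size=start_n))   # group B, same population
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            return True                  # declared "significant": a false alarm
        if len(a) >= max_n:
            return False                 # gave up without a significant result
        a.extend(rng.normal(size=step))  # collect a few more and peek again
        b.extend(rng.normal(size=step))

n_sims = 2000
false_alarms = sum(optional_stopping_study() for _ in range(n_sims))
print(f"False-alarm rate with optional stopping: {false_alarms / n_sims:.1%}")
# A single fixed-sample test at alpha = .05 would produce false alarms about 5%
# of the time; repeated peeking pushes the rate well above that nominal level.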
Another study that came out around this time was this one by Daryl Bem. Bem is a very distinguished psychological researcher who has a side interest in what's called parapsychology. He thinks that these sorts of phenomena, which most mainstream psychologists think are not true and couldn't possibly be true, could indeed exist. And he published this study in the top journal in the field purporting to show evidence of what's called precognition: the idea that some future event can actually influence your current decision making before it's happened. So Bem presented this as evidence of precognition, whereas others in the field said: he's used all the standard methodologies, he's used all the standard statistical procedures, and he's come up with a result that just cannot be true, unless we want to throw out assumptions from physics and basic assumptions about the way the world works. So maybe something's wrong with our methodology, right? If he's using the best methods in the field and coming up with a ridiculous result, something's probably wrong with our methods.

This paper was also published around this time: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. And these researchers famously concluded that it is unacceptably easy to publish, quote, statistically significant evidence consistent with any hypothesis.

Now, there's a whole other debate going on right now about what sort of statistical procedures we should be using in psychology and other fields. The standard now is to use something called null hypothesis significance testing, which rests on that famous p value: when p is less than .05, or sometimes less than .01, researchers rejoice and submit their finding to a journal. For 60 or 70 years, statisticians have been shouting at psychologists that you must not use this procedure, that it's an invalid inference procedure, that it only gives you helpful information under extremely rare conditions that almost never hold when you're running a psychology experiment. And yet, because psychology professors learned these techniques in grad school and teach them to their students, there's a sort of institutional inertia whereby null hypothesis significance testing just carries on and on and on, despite the fact that it's very likely to lead to, as I say, findings that won't be replicated later.

So in response to this, the editor of one journal, Basic and Applied Social Psychology, a colleague of mine, David Trafimow, actually took over the editorship of the journal and banned p values. He said: you actually can't use this procedure any more. And a lot of psychologists said, what the [INAUDIBLE] do we do? We don't know anything else to do. So this created a whole furore in the field.
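For anyone who wants the one-line version of what that contested quantity is: the p value is a probability computed on the assumption that the null hypothesis is true, roughly

\[
  p \;=\; \Pr\big(\text{data at least as extreme as those observed} \;\big|\; H_0 \big),
\]

and it is not the probability that the null hypothesis is true given the data, which is one reason a single small p value, on its own, says much less than it is usually taken to say.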
Just to give a personal turn here: I had difficulty replicating a major finding in the field myself. This was when I was an undergraduate student, back in 2009 or 2008 or something like that. The study I was trying to replicate, which I just assumed was true, was published in Science magazine, and at that point it had already been cited hundreds of times; now it's about 800 times. So this is a very influential finding. The idea here is part of the so-called embodied cognition literature. The thought is that if you're made to feel guilty or morally impure in some way, you'll actually be tempted to physically cleanse your body; the thought is that moral purity and physical purity are somehow connected in the mind. There are evolutionary explanations for why this might be so: initially we had disgust towards pathogens, and then that was sort of co-opted by later evolutionary processes so that we feel disgust toward immoral acts as well. And so the finding here was that if you induce a feeling of guilt in participants, then they're more likely to want to wash their hands, more likely to select cleansing items on a shopping list compared to other items. That was the finding.

Well, the first time I ran this, I had twice as many participants as the original study reported, and I wasn't able to find anything. So I wrote this up with my colleagues. I thought I would try not to commit it to the file drawer, but rather to say: listen, I tried to find this effect and I couldn't find it. We sent it to the journal, and what was interesting was that the reviewers said, well, listen, you know that the effect size that was published in the initial report was probably inflated. Initial estimates of effect sizes are almost always bigger than the effect really is, once you get more findings into the literature. So they said we should assume that it's much less than that, and we eventually had to run with five times as many participants in order to satisfy this reviewer. And still, even though we were using the exact same methods and materials, and we had contacted the original researchers, who very graciously walked us through the design, with no sort of attempt to hide what was going on, we still just couldn't find any evidence of this effect using standard procedures. So we wrote that up and published it.

So this whole string of events was, I have to admit, very disheartening. The area of psychology that I was training in at the time was this priming psychology, which I thought was very interesting and exciting, the idea that these subtle interventions in the environment could be influencing thoughts and feelings and behaviour, and here it looked like this sort of thing was just crumbling before my very eyes. So rather than continuing on much further as an experimental psychologist, I decided to take a detour and study the history and philosophy of science at Cambridge. And it was there that I wrote a sort of thesis on this replication issue that was eventually edited and published as this paper here, with David Trafimow; he's the one I mentioned earlier who edits Basic and Applied Social Psychology. And while I was at Cambridge, I met this visiting scholar named Stuart Firestein. He was there on some sort of fancy grant to write a book; he's based normally at Columbia.
And we got to talking about this replication issue and this crisis, and he said to me, well, what crisis? And I'd just written this whole paper saying, oh, there's this big old crisis going on, have you been paying attention? So we'd get into these arguments over whisky late at night. And often, when you have one person saying there's a crisis and another person saying there's not a crisis, what's happening is that you just have two different senses of crisis in mind. And that's what I think was going on between Stuart and me.

So I'll tease apart two senses of crisis here. The first is a crisis of confidence. This is basically: there's something going on where people have lost their confidence. It's a sort of descriptive claim; it's not saying whether they're justified in having a loss of confidence, it's just descriptive. The other is a crisis of process. The way I'm summarising this is: a crisis of confidence means people are freaking out, and a crisis of process means the process of science is broken; there's actually a crisis in the sense that we're doing this wrong and we need to make some serious changes.

So, in terms of the first sort of crisis, the people-are-freaking-out sense, what's going on? Descriptively, is this true? Well, here's a paper published around 2012; this is a special issue on the topic. They ask: is there currently a crisis of confidence in psychological science, reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is. So this is some evidence that there's a crisis of confidence in the sense I've just described.

Now, when I was at Cambridge studying the history of science, one thing you learn to be very cautious about is any time you hear the word unprecedented. This is another paper from that same journal, where Roger Giner-Sorolla says: well, crises have been declared regularly since at least the time of Wilhelm Wundt, one of the founding fathers of psychology. In fact, to go back to the 1970s, as I promised, here is one of a number of papers published in that decade: The Crisis of Confidence in Social Psychology. You can see it's published there, October 1975. So people were bringing up these same problems: our sample sizes are too small; what about the problem with the file drawer, the prejudice against the null, and all that (that was the 1970s paper by Anthony Greenwald that I mentioned). And everybody kind of acknowledged that there were problems with the standard methodology, and then they would just carry on doing it as before.
I think what's different about this time is the Internet: this blog post that I mentioned, and the response from the researcher, and the further articles that were written up. It was all out in the open. The dirty laundry of psychology was suddenly being aired in a very public way, in The New York Times and so on. And I think the reason this has taken hold and persisted in the public consciousness is that researchers are now aware that they really have to clean up their act.

There's also a fascinating public communication of science issue going on here as well, because I think the public generally thinks that if a paper was published in a scientific journal, that just means it's a fact. And then, if somebody says, well, we had a hard time replicating it, they say, well, science is just garbage, I can't trust anything, you know? And there's information coming out about health research or whatever. The public needs to know that there should be a certain amount of failure to replicate, because that means science is being productive and interesting; it's trying stuff out that might not work or might not hold up. If every result replicated, that would mean we'd be doing very boring, slow, piecemeal, bricklaying science. So that's something I think we should talk about.

Now, here's a survey that came out in Nature just a little while ago. This is, again, the descriptive claim: is there a reproducibility crisis? Something like 52% say there's a significant crisis, 38% think there's a slight crisis, and some small minority think there isn't. This is across scientific fields. So I think that gives some support for the first sense of crisis: there is indeed a crisis of confidence. But is there a crisis of process? Is it the case that science is in some sense broken?

Well, here's a post from Stephen Porter Gorg, a researcher who says psychology is broken. And what he's referring to here is a painstaking, years-long effort to reproduce 100 studies published in three leading psychology journals, which found that more than half of the findings did not hold up when retested. What he's referring to is this paper that was published in Science, called Estimating the Reproducibility of Psychological Science, by the Open Science Collaboration. This was published a couple of years ago, made a huge splash, and caused a lot of people to think that psychology, anyway, is broken.

So let me bring to the surface what the underlying reasoning is here. In order to claim that psychology is broken on the basis of these findings, you need to think something like this:
If a field is not broken, most of its results should replicate when independent labs rerun the experiments. Well, most of psychology's results did not replicate when independent labs reran the experiments. So psychology is broken. That's what I think is going on here under the surface.

Now, I've just indicated that I'm not sure we should be very confident about this first claim, that most results should replicate when independent labs rerun the experiments. You have to have some prior sense of what you think the correct rate of replication should be. It shouldn't be 100%, because that would mean we're not advancing at all. It shouldn't be 0%; that would be pretty disturbing. Maybe it should only be 50%. So until you know what you think the appropriate percentage is, when you're dealing with trade-offs between exciting new research and the sort of bricklaying research that confirms previous results, unless you know what the correct ratio is between those things, you don't have any grounds for saying that 50% is too big or too small.

But I want to focus on the second claim, and just see whether we can conclude, on the basis of this Open Science Collaboration paper, that it is true that most of psychology's results did not replicate when independent labs reran the experiments.

Here's the paper. They looked at 100 studies from three major journals. They ran each of those studies one more time, with independent labs who would consult with the original researchers and try to come to an agreement about how the replication should be run. And what they found was that the mean effect size went down by about half in the follow-up studies, and the p values, the infamous p values, were mostly bigger and mostly over that .05 threshold; they were, you know, .10 or .20 or something like that, which, on some sort of naive view of what counts as a replication, was often taken to be a failure to replicate.

Well, we have to actually dig into this a little bit to understand what we can learn from this famous paper that came out a couple of years ago. What counts as a replication? First of all, you have to question whether the materials were exactly the same as the ones that were used in the original study, and whether the number of participants was the right number of participants. I'll just give one example here. I went and looked through those 100 studies, which have all the original materials up online, and in one of them they used about half as many participants as were used in the original report.
That was based on the naive expectation that the published effect size was accurate, which, as I learned from my Macbeth effect paper, is not true: you have to have more participants than were reported in the original study. [Audience question] What's that? I only looked at that one and a handful of others, so I can't say whether the other studies usually had a much higher power. That's what I thought as well, which is why I was so startled when I saw this particular one. The reason I looked at it was that a former psychology professor of mine had run the study and it didn't seem to turn out, and I went back and thought, what's going on here? So this particular example was a replication of a study by Paul Bloom and a colleague, and they used half as many participants. I'd have to go through and look at all the others to see whether that's a more general trend.

Also, we have to ask ourselves: what's the effect that we're looking for? What counts as a replication? Would it be the exact same effect size, or the effect size within a certain bound? Are we looking for the same p value, or just any p value less than .05? Unless you have some further sense of what you mean by replication, then depending on how the results turn out, you're not sure whether it counts as a failure to replicate. My hunch is that what we want to know is not so much whether the p value or the effect size is the same, but whether there is a finding here of interest that's comparable to the original finding that was reported. Is there something theoretically or practically interesting that really does exist, based on our best assessments of the evidence available to us?

Now, here's why I think we shouldn't be very surprised about the results of this new study. This is what we should expect given PB, which stands for publication bias, the prejudice against the null, when you're only publishing a fraction of studies. Let's imagine you take a study and you run it under ideal conditions 100 times, the same study over and over and over again. Well, you're going to get a distribution of p values; it's not going to be the same p value every time you run that test, and your effect size is going to vary as well over the course of those 100 studies. Given that you typically only publish the studies that work, you know that the published findings are going to be on the high end of this distribution: the effect sizes are going to be the ones that are a little bit bigger, and the p values the ones that are smaller. And so this means that, almost by definition, the second time you run this study you're going to get a bigger p value, because of regression to the mean, and you're going to get a smaller effect size. In other words, you don't even have to run the experiment to know that the second iteration of any published study is very likely going to give you a smaller effect size and a larger p value. So this is the problem with repeating a study one time. The p value, by the way, means almost nothing on one iteration of any study; it only begins to accrue some inferential value over multiple iterations. And so, as I say, when you run a study one more time, the informational value you get from that is almost zero; you learn almost nothing from it. And this is exactly what you should expect: the mean effect size should go down and the p values should go up.
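Here is that argument as a small simulation sketch; the true effect size, the per-group sample size, and the number of simulated studies are arbitrary assumptions, not figures from any of the studies I've mentioned. The point is just to show that if you only "publish" the runs that cross p < .05, the published effect sizes are inflated relative to the truth, so a rerun should be expected to come out smaller.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
TRUE_D, N_PER_GROUP, N_STUDIES = 0.3, 30, 5000  # assumed values for illustration

def run_study():
    """One two-group study; returns the observed Cohen's d and the p value."""
    treat = rng.normal(TRUE_D, 1.0, N_PER_GROUP)
    ctrl = rng.normal(0.0, 1.0, N_PER_GROUP)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    return (treat.mean() - ctrl.mean()) / pooled_sd, stats.ttest_ind(treat, ctrl).pvalue

results = [run_study() for _ in range(N_STUDIES)]
published = [d for d, p in results if p < 0.05]  # what survives the file drawer

print(f"true effect size:                 {TRUE_D:.2f}")
print(f"mean observed d over all runs:    {np.mean([d for d, _ in results]):.2f}")
print(f"mean observed d, published only:  {np.mean(published):.2f}  (inflated by selection)")
# Rerun any one of the "published" studies and, on average, its effect size
# regresses back toward the true value and its p value grows accordingly.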
So now I want to talk about this: how do we know we're doing an adequate replication? I mentioned the smaller sample size, which, as far as I know, only applies in that one case. AA here stands for auxiliary assumptions. I want to go back to the study by Bargh and colleagues and the replication attempt by Doyen and colleagues. So I told you that they gave the same priming task, that they tried to bring to mind the elderly stereotype, and that they replaced the stopwatches with infrared sensors. But I didn't tell you that they actually didn't do an exact replication here: they made a change in terms of the priming materials as well. Specifically, they translated all the materials into French, because they were carrying out the study in Belgium and the participants would understand French rather than English.

Why is this important? Well, here's a still unpublished study; I saw a previous draft of it circulating around in the grey literature, and I don't know what its fate is. Michael Ramscar is a very senior linguist who wanted to look at this case of the difficulty of replicating the original finding. The thing we have to pay attention to here is that this is a linguistic priming effect: you're calling to mind a stereotype based on the connotations that are associated with certain words in English. So if you just naively translate these words into French, without showing that the same properties apply in French as apply in English, you may very well not be doing an adequate replication of the original study. One difference between English and French is that in English adjectives come before nouns, whereas in French they mostly come after.
And so if you're using adjectives to prime a stereotype, it's important which order all of these things come in, in the language. So Michael Ramscar and his colleagues did a corpus analysis of French and English, and they found, both generally and with respect to the specific words used in the priming study, that when it comes to their experience of encountering the adjectives in the prime set in contexts where they could have served as primes, we can expect that the subjects in the original Bargh study would have had something on the order of six times more experience. In other words, the suggestion here is that the prime was likely to be six times stronger among English-speaking participants than among French-speaking participants. So the failure to find an effect could very well be due to the fact that the prime wasn't strong enough to begin with; in other words, it wasn't a faithful replication of the original finding.

There are many more examples of this that could be raised, but what I'd like to point out is that replication is hard. You often don't know if you're violating one of those auxiliary assumptions. You might think: what's the big deal about just translating the materials into French? I have to do it in French because my participants speak French. Michael had to do this complicated linguistic corpus analysis to show that actually it wasn't an adequate replication of the original finding. And that happened because Michael Ramscar is a very good linguist and had the capacity to do that. But in how many other replication attempts are these subtle things not being adequately dealt with? It's hard to say. So a failed replication doesn't necessarily mean that the original finding was not real.

And therefore I think that replication initiatives like the Open Science Collaboration one are unlikely to provide the most direct and compelling evidence that science is broken, if indeed it is. That said, I think we have more direct evidence that something is wrong with the way we're conducting science currently. We already have more direct evidence of a crisis of process, or at least of problems with the process, and that comes from that earlier work by John and colleagues and others, who showed that psychologists openly admit to engaging in research practices that we know are very likely to lead to the publication of Type I errors, false alarms.
And so, since we already have evidence of those problems more directly, from the admissions of psychologists themselves, then if we want to have an intervention, it might be that the best way to do it is not to have these massive armies of replicators running around trying to redo every study that was ever published. You have resource constraints on how to actually do that, and you have questions about which studies are worth replicating or not. It might very well be that, in terms of intervention, we need to focus on the problems upstream that we know are very likely to lead to false alarms, rather than trying to count the number of false alarms, if you see what I'm saying.

So, as I indicated already, here's the more direct evidence. Psychologists are admitting to these questionable research practices: p-hacking; HARKing, which stands for hypothesising after the results are known. That's where you run what is essentially an exploratory study: you're not sure what you're going to get, you don't really have a strong hypothesis, you run some statistics, and a couple of them pop out with p less than .05, and then you come up with some hypothesis after the fact. You go: gee, what would predict that? I don't know, how about this? And then you write up the paper presenting what are exploratory statistics as though they were confirmatory statistics, and that's a problem.

There's also quite a lot of work that's been done on the reliability of peer review, which is just not reliable. You can test peer review by, for example, embedding a bunch of errors in a manuscript, sending it out to a bunch of peer reviewers, and seeing how many of them notice the errors; very few of them seem to notice. There are also just problems with sloppy peer review and cronyism; there's a lot of politics in peer review. One thing you learn if you ever work as an associate editor for a journal is that you get a manuscript and, if you have a stance on it, you can sink it or float it depending on who you send it to for review; you know that this reviewer is going to give it a crappy review and that that one is going to give it a good review. So it's not as if there's this beautiful, objective process by which the most dispassionate and qualified reviewers are handling every paper. So, peer review: if we're going to count on this as a quality control mechanism, we should be extremely concerned that it's not a quality control mechanism up to the job.

Publication bias we talked about; this is the issue of the failure to publish negative results. Again, if we have 20 labs running essentially the same experiment and one of them gets it to work, the highest likelihood is that that's a false alarm. And if we never hear about all the other studies, we don't have any way to know how much confidence to place in that published finding.
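A quick back-of-the-envelope version of that point, assuming for illustration that the effect being chased does not exist at all and that each of the 20 labs tests it at the conventional .05 level:

\[
  \Pr(\text{at least one lab gets } p < .05) \;=\; 1 - (1 - 0.05)^{20} \;=\; 1 - 0.95^{20} \;\approx\; 0.64 .
\]

So if only that one "successful" result gets written up and the other nineteen stay in the file drawer, the literature records a success that is quite likely just a false alarm.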
And I want to share a quick story about this issue of the need to publish negative results. I was writing some papers on this topic and I was searching around in the literature; I think I typed in something like "the need for reporting negative results" to come up with some examples I might cite. And I found this paper that goes back to 1927, in the Journal of the American Medical Association. So this is something that's been written about for ages. I have to tell you what this researcher said; I'm going to read it mostly in full, because I think it's very clever. It says: To the editor. One of the things we practitioners sometimes neglect is the reporting of failures. In the Journal, Doctor So-and-so reported the treatment of six consecutive cases of warts with a certain injection. I venture to guess that as a result of this publication, not less than 100 physicians, perhaps several hundred, injected this substance into their patients. Supposing that 99% get negative results, what happens? Each of them gives up the method as a failure and does not say anything more about it, and the treatment remains on record as an undisputed success. Maybe the 1% who meet with success will communicate with Dr. Sutton, so that by and by he will have quite an impressive series of cases which seem to support the original findings. To practice what I am preaching, let me now report that on November 30th I injected the substance into the left buttock of C.B.M., a girl aged 18, who was on that day complaining of 24 warts. At the present date there are 28 warts, and evidence of regressive changes in the original 24 has not been seen.

So now I have the big reveal, which is that the author of this is John Rosenberg, who is my paternal grandfather. He died in 1949. I never met him; I never knew him. He died when my dad was seven years old, so my dad didn't really know him either. And here it was that I found this publication in which, all those years ago, he was writing about this exact same issue that I happened to be writing about now. So that was kind of a fun personal note I wanted to end on, and I'll just say thank you and leave it there.