So I'll be talking about the replication crisis in psychology. And just to be fair to psychology, I'm not singling it out here. I could just as well be talking about a replication crisis in science generally, or medicine or biology; cancer research is having some problems now. I'll be speaking about psychology because this is the area that I'm most familiar with and closer to the facts on the ground, and so I feel most comfortable talking about these examples.

The other thing I want to say about the word crisis here is that some people think this term is inappropriate, that it's dramatic. Others think that it's a fair description of what's going on. So there's serious and substantive disagreement about whether indeed there is a crisis going on and, if there is, what the nature of it is. So I want to be clear about that. In terms of history and philosophy of science, I don't want you to get your hopes up too much in terms of history: I'm not going back much further than 1970, and the philosophy of science is going to be sort of 101 introductory stuff. So for those of you who have any familiarity at all with philosophy of science, some of these ideas will be comfortable already.

So I'll start with this. Replication is sometimes talked about in the context of falsification, which is an idea that goes back most notably to Karl Popper. This is from his 1934 treatise, The Logic of Scientific Discovery. I've already violated my 1970s rule there, but I won't do that again.

Here's a naive view of what's going on with scientific discovery, and this is something that Popper was criticising in his work. You might think that you come up with a theory, or maybe a hypothesis, in your mind, and you think that it's going to predict some kind of observation, something you should be able to observe. And you try to design an experiment such that the theory will lead logically to an observation. And then, if you get the observation that you predicted, you might think, well, that now confirms or at least counts in favour of my theory. That's sort of a naive view of what's going on in science.

The problem with this is that any number of other theories might also just as well have led to the same observation. And so finding evidence that's consistent with the theory is incredibly weak support for the theory, unless you can rule out all of these competing alternative theories that might, as I say, just as well have predicted the same observation.

So this is related to the fallacy in logic called affirming the consequent, which goes like this: if P then Q; Q; therefore P.
You can't draw that conclusion, because A, B, C or D might just as well have entailed Q. So getting Q doesn't give you P for free; you have to rule out all the alternatives.

But this points to a valid argument in logic, which is the modus tollens argument. The way that this goes is: if P then Q; not Q; therefore not P. The idea is: if P then Q; I didn't find Q; therefore it can't have been P in the first place.

So this inspired Popper's notion of falsification. The idea here is: I have a theory, I predict some sort of very specific observation, and if I don't get the observation I predicted, then that's supposed to be seen as falsifying, or at least counting against, the theory. Again, this is a highly simplistic view, but that's the general idea.

So Popper's notion was that instead of trying to come up with evidence that seems to support or confirm our theories, we should constantly be trying to come up with hypotheses that predict very specific outcomes, such that if we don't get the outcome that we expected, we should be willing to forfeit our theory. If you constantly have theories that can't be falsified by any possible observation, rather than being a strength of the theory, he thought that was a weakness. And so we should be subjecting our theories to what he calls critical tests, trying to falsify them. And if we can't falsify them, we don't say that they've been proven true; we just say they haven't yet been falsified. And, begrudgingly, over time, if we've subjected them to many critical tests, we say: maybe, maybe we have something here; we haven't been able to falsify it, even though we tried our best to do so.
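Just to keep the two argument forms straight before moving on, here they are restated in standard notation; this adds nothing beyond what I've just said:

\[
  \text{Affirming the consequent (invalid):}\quad P \rightarrow Q,\ \ Q,\ \ \therefore\ P
\]
\[
  \text{Modus tollens (valid):}\quad P \rightarrow Q,\ \ \neg Q,\ \ \therefore\ \neg P
\]

The first form fails because any number of rival theories could also entail Q; the second is deductively valid as it stands.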
Now, as critics of Popper have noted in all the intervening decades and in many different ways, even this does not get us off the hook; this is still a problematic view. So again, the idea is: I don't get the observation that I expected, and on Popper's view that's at least supposed to count against my theory. But it might not be that straightforward. First of all, there might be something wrong with the observation. Maybe my research assistant wrote down the wrong number or something like that, so I might not be sure that the observation is really the correct observation. So the observation might be what's wrong rather than the theory. It might be that there are various weaknesses in the experiment itself; maybe I didn't set it up in such a way that it really would give me the closest test of the theory in terms of the observation. And there might be various problems between the theory and the experiment as well.

These are often referred to as auxiliary assumptions. Basically, these are the logical links between each step going from the theory, through the experiment, to the observation, and you'd have to show that all of those were true in order for the observation not turning out as predicted to count against the theory. Just to show what I mean by this: it could be that, if the observation doesn't hold, the theory is not true. But it could be that the observation is wrong; it could be that something's wrong with the experiment; it could be that something's wrong in the link between the theory and the experiment. And until I rule out all of those alternative explanations, I actually haven't touched the theory at all. I certainly can't say that it's not true unless I've ruled out all of the possible links between the observation and the theory.

So this creates something of a pickle for talk of failed replications in the current debate. You will often find researchers saying, well, we failed to replicate the initial finding; this counts against that finding. But again, unless you've accounted for all the possible auxiliary hypotheses, simply getting a negative result, a result that doesn't appear to be the same as the first one that was reported, isn't going to be enough to show that you've somehow falsified the original finding, much less the original theory.
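To put the structure of that worry in symbols (just a sketch, with T for the theory, A_1 through A_n for the auxiliary assumptions, and O for the predicted observation), what the experiment really tests is the whole conjunction:

\[
  (T \wedge A_1 \wedge A_2 \wedge \cdots \wedge A_n) \rightarrow O, \qquad \neg O, \qquad \therefore\ \neg\,(T \wedge A_1 \wedge A_2 \wedge \cdots \wedge A_n).
\]

All modus tollens licenses is that at least one conjunct is false. It could be the theory, but it could just as well be a faulty observation or a weakness in the experimental setup; nothing in the logic singles out the theory, or the original finding, as the culprit.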
So the opening shots of the current replication crisis were fired around 2012. This is a blog post by Ed Yong, a very distinguished science writer, in the United States I believe, or maybe he's here; he used to work for Discover and writes for The Atlantic. And this post came out and caused a lot of controversy. You can see the title here: Primed by Expectations: Why a Classic Psychology Experiment Isn't What It Seemed. He was writing up this paper by Doyen and colleagues, called Behavioral Priming: It's All in the Mind, but Whose Mind? What was significant about this paper, also published in 2012, is that the researchers ran an experiment, didn't get evidence supportive of the original published findings that they were trying to replicate, and then, instead of just burying the finding in their file drawer, which is what typically happens, they went through the trouble of writing it up, and the further trouble of submitting it to a journal, and they actually got it published in a fairly well respected journal.

Now, this used to never happen. Anthony Greenwald wrote a paper back in the seventies talking about the unintended consequences of what he calls the prejudice against the null. This just means that for the longest time, journals had a strong disinclination to publish any kind of negative or, as it were, failed experiments. Failed replications in some sense happened all the time and continue to happen; it's just that you would never go through the trouble of writing them up, much less submitting them, because you knew you wouldn't get them published. So this was a significant moment. The only time you would hear about failed replications would be at the bar after a conference, when you're talking to your colleagues and you say, yeah, I tried to run that experiment and I couldn't get it to work; oh yeah, I tried to run that experiment, I couldn't get it to work either. So you had this informal knowledge floating around in scientific communities, but because all of these failed replications are never published, they just get buried, and what you end up having in the published literature is just a fraction of the attempts at running those experiments. And this creates systematic problems and also, by the way, increases the likelihood that the published finding is just a false alarm.

So this is the original paper that Doyen and colleagues were trying to replicate. I'm going to call it the elderly walking time study, just as shorthand, because the title is a bit long. I just want to check in the room: do people know about this study, or have heard of this study? If you have, just raise your hand. And if you haven't... that's about half, okay. So it's worth explaining; I'll go through it in some detail.

Basically, this was a very important finding in the field of social psychology. It's been cited 3,633 times, which is an astronomical number for studies of this kind. It's been written up in introductory textbooks and so on. So for researchers to claim that they weren't able to replicate this finding was, as you can imagine, sort of a big deal.

This is basically how the experiment worked. I'll say a little bit about the theory first. So there's this idea that simple cues in the environment activate stereotypes in our mind. If you see someone walking down the street, you pick out very basic aspects of what you can see, and all these stereotypical traits just become automatically activated. The researchers here were trying to test the further idea that once a stereotype is activated in your mind, you yourself tend to behave in accordance with that stereotype, basically as a social lubricant.
It helps you behave in accordance with the way that you think the people around you basically are, so that you're not consciously having to try to figure out how to behave in some situation. So I'm going to be talking about the elderly stereotype. Just imagine that I'm in a room full of elderly people. Well, I'm going to walk a little slower, maybe speak a little more quietly, and so on, just instantaneously, because my mind has a stereotype about what elderly people are like, and it's going to encourage me to sort of instantaneously act a little bit more like that, so that I find it easier to engage with my environment.

So the clever idea for how they were going to test this was that they gave what's called a linguistic priming task. Basically, they gave people a series of sentences that were in a scrambled order and you had to unscramble them, and it was just framed as a linguistic puzzle. But in a subset of these sentences were words that were meant to activate the elderly stereotype. So the words are things like wrinkle, grey, bingo; Florida, I think, was in there, this being conducted in the US. And the idea is that not so many of these words should be included that you notice any sort of connection here, but your mind is figuring out that there's something going on. It's meant to activate the whole stereotype, which means that even traits that you didn't include in your priming materials should become activated in your mind, because they're associated in the culture.

So the trick here is that one of the traits they did not include in the priming materials was one having to do with moving slowly through space, and they wanted to see whether that would be activated and whether it's true that it influences your behaviour: participants who were primed with the elderly stereotype should themselves move more slowly through space compared to those who were primed in a control condition. That was the prediction. So they tried to activate the stereotype of the elderly. And then the very interesting thing they had done here was that an experimenter was sitting in the hallway with a stopwatch behind a newspaper, just timing participants as they left the study and went to the elevator. This person was blind to condition and just recorded how long it took each of them to walk down the hallway and get to the elevator, and then you can go back and look at which condition they were in. And the big finding was that participants who had been primed with the elderly stereotype did indeed walk more
slowly down the hall than participants who were primed in the control condition.

Now here's the replication attempt by the Belgian researchers. They administered the same priming task; the idea was that they were going to activate the elderly stereotype. But they weren't very happy with this measurement device here. You can see it's a human with a stopwatch, and humans with stopwatches are potentially prone to errors; they might be influenced by their expectations in some way. So they actually made a change. This is a replication study, but it's not exactly the same, and in a way that is important, as I'll highlight in a moment. What they did is they replaced the students with stopwatches with infrared sensors. These infrared sensors are, strictly speaking, different; they made a change. But on any reasonable understanding of what's going on here, this is an improvement to the study design, because they replaced a measure that's prone to human idiosyncrasy with a measure that's going to give an accurate reading in every condition.

So the asterisk I'm raising here is that when you're talking about an exact replication or a direct replication, it doesn't necessarily mean you use the exact same materials. It's actually okay, I would argue, to make certain changes so long as they make for better tests of the theory on any reasonable view. And that was true in this case.

So, unfortunately, they weren't able to find anything looking like the original result. And to drive the point home, they actually went back to the students with stopwatches and ran a second study where they told half of the experimenters what the hypothesis was, and the other half what it wasn't; in other words, they were priming the experimenters with expectations. And here they were able to replicate the finding in the group of students with stopwatches who were aware in some way of the study hypothesis. And so the implication was that perhaps, in the original famous study, the experimenters weren't properly blinded, and somehow their expectations of what was going on were influencing the timing measures.

So the way this was written up by Ed Yong referred to what's called the clever Hans effect. And again, I'll just ask, so I don't go into too much detail here: does anyone not know what the clever Hans effect is? Okay, some people don't. So for your benefit: in the early 1900s there was a show horse that supposedly could perform mathematical operations, and his handler would bring him around, and he could do addition and subtraction and so on.
And everyone was amazed at the capacities of this horse. A psychologist at the time was brought along to try to figure out what was going on here: did the horse really understand arithmetic? And it turned out, based on a series of careful experiments, that the horse was just responding to very subtle cues on the part of his handler, cues that maybe even the handler didn't know he was giving. So when the horse's hoof is hovering over the right answer, maybe the handler perks up a little bit or something, and the horse notices that and, in effect, gives that answer. Now, that's still an amazing capacity, if the horse can read subtle shifts in the psychological state of its handler, but it's not the result that was being claimed.

What's important about the clever Hans effect is that this is experimental methods 101 in social psychology. Particularly when you're dealing with these priming experiments, where you're doing very subtle manipulations of the environment to try to alter human thinking or behaviour, you have to be absolutely certain that the experimenters, everyone who's interacting with subjects, have no idea what the experimental design is, lest they behave like the horse's handler and subtly give some sort of cue that brings about what was hypothesised, but not by the mechanism that was hypothesised. So essentially the claim, in more or less direct terms, was that this very distinguished researcher who'd come out with the original finding was guilty of making a Psychology 101, very basic design error. Now, I'm putting an asterisk here because I'm going to come back to this story. This is not how it ends, but this is how it begins.

Something about the timeline here that I think is important to point out: here in November 2011, the year prior, there was this story, this essay, published in The New York Times about a fraud case. You might have heard of Diederik Stapel, a very prominent researcher who had published all these findings, and it turned out not just that he'd done sloppy research in some way; he was simply making up the findings. He would open up an Excel spreadsheet, type in some numbers, pretend he'd run an experiment, and then publish the results. And he made this whole influential career doing explicitly fraudulent work. And so he was caught and found out; there was this big exposé in The New York Times. And so I think people were thinking: fraud is what's going on in psychology now.
And this is difficult, because if you have ordinary research behaviours that sometimes lead to a result that can't be replicated, I think a lot of members of the lay public immediately say, oh, it must have been fraud going on there, rather than seeing it as just a normal part of the scientific process. So that's something that was in the background here.

Another paper that came out around this time was a study by Leslie John and colleagues, where they administered a survey to more than 2,000 practising psychologists. They made sure it was carefully anonymised, so that the psychologists would be willing, in a sense, to admit to behaviours that they might not otherwise want to admit to if they knew they were going to be identified. And these psychologists admitted to a whole range of what are now called questionable research practices, which are normal aspects of everyday, ordinary lab practice that sit in the grey area between fraud and perfect methodology. And when you kind of add them up, they lead to a higher likelihood of having false alarms published in the literature. So they would admit to things like stopping and checking their data every so many participants to see if there was a statistically significant finding; if there wasn't, they would collect some more participants and check again, and then, as soon as the p value drops below 0.05, they say, yes, we got a result, and they would publish it. This was standard practice until a couple of years ago. Nobody was really talking about it, except for a few methodology journals saying that maybe this isn't what you should be doing. So a lot of psychologists didn't know you weren't supposed to be doing this.
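To give a feel for how much that particular practice, peeking at the data and adding participants until p dips below .05, can inflate the rate of false alarms, here is a small simulation sketch. It is not taken from the John et al. paper; the batch sizes, the cap on the sample, and the number of simulated studies are arbitrary assumptions for illustration. Both groups are drawn from the same population, so any "significant" result is, by construction, a false positive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_study(start_n=20, step=10, max_n=100, alpha=0.05):
    """Run one study with optional stopping: test after every batch of
    participants and stop as soon as p < alpha or the sample cap is hit."""
    a = list(rng.normal(size=start_n))   # group A, true effect is zero
    b = list(rng.normal(size=start_n))   # group B, same population
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            return True                  # declared "significant": a false alarm
        if len(a) >= max_n:
            return False                 # gave up without a significant result
        a.extend(rng.normal(size=step))  # collect a few more and peek again
        b.extend(rng.normal(size=step))

n_sims = 2000
false_alarms = sum(optional_stopping_study() for _ in range(n_sims))
print(f"False-alarm rate with optional stopping: {false_alarms / n_sims:.1%}")
# A single fixed-sample test at alpha = .05 would produce false alarms about 5%
# of the time; repeated peeking pushes the rate well above that nominal level.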
Another study that came out around this time was this one by Daryl Bem. Bem is a very distinguished psychological researcher who has a side interest in what's called parapsychology. He thinks that these sorts of phenomena, which most mainstream psychologists think are not true and couldn't possibly be true, could indeed exist. And he published this study in the top journal in the field purporting to show evidence of what's called precognition: the idea that some future event can actually influence your current decision making before it's happened. So Bem presented this as evidence of precognition, whereas others in the field said: he's used all the standard methodologies, he's used all the standard statistical procedures, and he's come up with a result that just cannot be true, unless we want to throw out assumptions from physics and basic assumptions about the way the world works. So maybe something's wrong with our methodology, right? If he's using the best methods in the field and coming up with a ridiculous result, something's probably wrong with our methods.

This paper was also published around this time: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. And these researchers famously concluded that it is unacceptably easy to publish, quote, statistically significant evidence consistent with any hypothesis.

Now, there's a whole other debate going on right now about what sort of statistical procedures we should be using in psychology and other fields. The standard now is to use something called null hypothesis significance testing, which rests on that famous p value: when p is less than .05, or sometimes less than .01, researchers rejoice and submit their finding to a journal. For 60 or 70 years, statisticians have been shouting at psychologists that you must not use this procedure, that it's an invalid inference procedure, that it only gives you helpful information under extremely rare conditions that almost never hold when you're running a psychology experiment. And yet, because psychology professors learned these techniques in grad school and teach them to their students, there's a sort of institutional inertia whereby null hypothesis significance testing just carries on and on and on, despite the fact that it's very likely to lead to, as I say, findings that won't be replicated later.

So in response to this, the editor of one journal, Basic and Applied Social Psychology, a colleague of mine, David Trafimow, actually took over the editorship of the journal and banned p values. He said: you actually can't use this procedure any more. And a lot of psychologists said, what the [INAUDIBLE] do we do? We don't know anything else to do. So this created a whole furore in the field.
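For anyone who wants the one-line version of what that contested quantity is: the p value is a probability computed on the assumption that the null hypothesis is true, roughly

\[
  p \;=\; \Pr\big(\text{data at least as extreme as those observed} \;\big|\; H_0 \big),
\]

and it is not the probability that the null hypothesis is true given the data, which is one reason a single small p value, on its own, says much less than it is usually taken to say.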
Just to give a personal turn here: I had difficulty replicating a major finding in the field myself. This was when I was an undergraduate student, back in 2009 or 2008 or something like that. The study I was trying to replicate, which I just assumed was true, was published in Science magazine, and at that point it had already been cited hundreds of times; now it's about 800 times. So this is a very influential finding. The idea here is part of the so-called embodied cognition literature. The thought is that if you're made to feel guilty or morally impure in some way, you'll actually be tempted to physically cleanse your body; the thought is that moral purity and physical purity are somehow connected in the mind. There are evolutionary explanations for why this might be so: initially we had disgust towards pathogens, and then that was sort of co-opted by later evolutionary processes so that we feel disgust toward immoral acts as well. And so the finding here was that if you induce a feeling of guilt in participants, then they're more likely to want to wash their hands, more likely to select cleansing items on a shopping list compared to other items. That was the finding.

Well, the first time I ran this, I had twice as many participants as the original study reported, and I wasn't able to find anything. So I wrote this up with my colleagues. I thought I would try not to commit it to the file drawer, but rather to say: listen, I tried to find this effect and I couldn't find it. We sent it to the journal, and what was interesting was that the reviewers said, well, listen, you know that the effect size that was published in the initial report was probably inflated. Initial estimates of effect sizes are almost always bigger than the effect really is, once you get more findings into the literature. So they said we should assume that it's much less than that, and we eventually had to run with five times as many participants in order to satisfy this reviewer. And still, even though we were using the exact same methods and materials, and we had contacted the original researchers, who very graciously walked us through the design, with no sort of attempt to hide what was going on, we still just couldn't find any evidence of this effect using standard procedures. So we wrote that up and published it.

So this whole string of events was, I have to admit, very disheartening. The area of psychology that I was training in at the time was this priming psychology, which I thought was very interesting and exciting, the idea that these subtle interventions in the environment could be influencing thoughts and feelings and behaviour, and here it looked like this sort of thing was just crumbling before my very eyes. So rather than continuing on much further as an experimental psychologist, I decided to take a detour and study the history and philosophy of science at Cambridge. And it was there that I wrote a sort of thesis on this replication issue that was eventually edited and published as this paper here, with David Trafimow; he's the one I mentioned earlier who edits Basic and Applied Social Psychology. And while I was at Cambridge, I met this visiting scholar named Stuart Firestein. He was there on some sort of fancy grant to write a book; he's based normally at Columbia.
And we got to talking about this replication issue and this crisis, and he said to me, well, what crisis? And I'd just written this whole paper saying, oh, there's this big old crisis going on, have you been paying attention? So we'd get into these arguments over whisky late at night. And often, when you have one person saying there's a crisis and another person saying there's not a crisis, what's happening is that you just have two different senses of crisis in mind. And that's what I think was going on between Stuart and me.

So I'll tease apart two senses of crisis here. The first is a crisis of confidence. This is basically: there's something going on where people have lost their confidence. It's a sort of descriptive claim; it's not saying whether they're justified in having a loss of confidence, it's just descriptive. The other is a crisis of process. The way I'm summarising this is: a crisis of confidence means people are freaking out, and a crisis of process means the process of science is broken; there's actually a crisis in the sense that we're doing this wrong and we need to make some serious changes.

So, in terms of the first sort of crisis, the people-are-freaking-out sense, what's going on? Descriptively, is this true? Well, here's a paper published around 2012; this is a special issue on the topic. They ask: is there currently a crisis of confidence in psychological science, reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is. So this is some evidence that there's a crisis of confidence in the sense I've just described.

Now, when I was at Cambridge studying the history of science, one thing you learn to be very cautious about is any time you hear the word unprecedented. This is another paper from that same journal, where Roger Giner-Sorolla says: well, crises have been declared regularly since at least the time of Wilhelm Wundt, one of the founding fathers of psychology. In fact, to go back to the 1970s, as I promised, here is one of a number of papers published in that decade: The Crisis of Confidence in Social Psychology. You can see it's published there, October 1975. So people were bringing up these same problems: our sample sizes are too small; what about the problem with the file drawer, the prejudice against the null, and all that (that was the 1970s paper by Anthony Greenwald that I mentioned). And everybody kind of acknowledged that there were problems with the standard methodology, and then they would just carry on doing it as before.
I think what's different about this time is the Internet: this blog post that I mentioned, and the response from the researcher, and the further articles that were written up. It was all out in the open. The dirty laundry of psychology was suddenly being aired in a very public way, in The New York Times and so on. And I think the reason this has taken hold and persisted in the public consciousness is that researchers are now aware that they really have to clean up their act.

There's also a fascinating public communication of science issue going on here as well, because I think the public generally thinks that if a paper was published in a scientific journal, that just means it's a fact. And then, if somebody says, well, we had a hard time replicating it, they say, well, science is just garbage, I can't trust anything, you know? And there's information coming out about health research or whatever. The public needs to know that there should be a certain amount of failure to replicate, because that means science is being productive and interesting; it's trying stuff out that might not work or might not hold up. If every result replicated, that would mean we'd be doing very boring, slow, piecemeal, bricklaying science. So that's something I think we should talk about.

Now, here's a survey that came out in Nature just a little while ago. This is, again, the descriptive claim: is there a reproducibility crisis? Something like 52% say there's a significant crisis, 38% think there's a slight crisis, and some small minority think there isn't. This is across scientific fields. So I think that gives some support for the first sense of crisis: there is indeed a crisis of confidence. But is there a crisis of process? Is it the case that science is in some sense broken?

Well, here's a post from Stephen Porter Gorg, a researcher who says psychology is broken. And what he's referring to here is a painstaking, years-long effort to reproduce 100 studies published in three leading psychology journals, which found that more than half of the findings did not hold up when retested. What he's referring to is this paper that was published in Science, called Estimating the Reproducibility of Psychological Science, by the Open Science Collaboration. This was published a couple of years ago, made a huge splash, and caused a lot of people to think that psychology, anyway, is broken.

So let me bring to the surface what the underlying reasoning is here. In order to claim that psychology is broken on the basis of these findings, you need to think something like this:
If a field is not broken, most of its results should replicate when independent labs rerun the experiments. Well, most of psychology's results did not replicate when independent labs reran the experiments. So psychology is broken. That's what I think is going on here under the surface.

Now, I've just indicated that I'm not sure we should be very confident about this first claim, that most results should replicate when independent labs rerun the experiments. You have to have some prior sense of what you think the correct rate of replication should be. It shouldn't be 100%, because that would mean we're not advancing at all. It shouldn't be 0%; that would be pretty disturbing. Maybe it should only be 50%. So until you know what you think the appropriate percentage is, when you're dealing with trade-offs between exciting new research and the sort of bricklaying research that confirms previous results, unless you know what the correct ratio is between those things, you don't have any grounds for saying that 50% is too big or too small.

But I want to focus on the second claim, and just see whether we can conclude, on the basis of this Open Science Collaboration paper, that it is true that most of psychology's results did not replicate when independent labs reran the experiments.

Here's the paper. They looked at 100 studies from three major journals. They ran each of those studies one more time, with independent labs who would consult with the original researchers and try to come to an agreement about how the replication should be run. And what they found was that the mean effect size went down by about half in the follow-up studies, and the p values, the infamous p values, were mostly bigger and mostly over that .05 threshold; they were, you know, .10 or .20 or something like that, which, on some sort of naive view of what counts as a replication, was often taken to be a failure to replicate.

Well, we have to actually dig into this a little bit to understand what we can learn from this famous paper that came out a couple of years ago. What counts as a replication? First of all, you have to question whether the materials were exactly the same as the ones that were used in the original study, and whether the number of participants was the right number of participants. I'll just give one example here. I went and looked through those 100 studies, which have all the original materials up online, and in one of them they used about half as many participants as were used in the original report.
That was based on the naive expectation that the published effect size was accurate, which, as I learned from my Macbeth effect paper, is not true: you have to have more participants than were reported in the original study. [Audience question] What's that? I only looked at that one and a handful of others, so I can't say whether the other studies usually had a much higher power. That's what I thought as well, which is why I was so startled when I saw this particular one. The reason I looked at it was that a former psychology professor of mine had run the study and it didn't seem to turn out, and I went back and thought, what's going on here? So this particular example was a replication of a study by Paul Bloom and a colleague, and they used half as many participants. I'd have to go through and look at all the others to see whether that's a more general trend.

Also, we have to ask ourselves: what's the effect that we're looking for? What counts as a replication? Would it be the exact same effect size, or the effect size within a certain bound? Are we looking for the same p value, or just any p value less than .05? Unless you have some further sense of what you mean by replication, then depending on how the results turn out, you're not sure whether it counts as a failure to replicate. My hunch is that what we want to know is not so much whether the p value or the effect size is the same, but whether there is a finding here of interest that's comparable to the original finding that was reported. Is there something theoretically or practically interesting that really does exist, based on our best assessments of the evidence available to us?

Now, here's why I think we shouldn't be very surprised about the results of this new study. This is what we should expect given PB, which stands for publication bias, the prejudice against the null, when you're only publishing a fraction of studies. Let's imagine you take a study and you run it under ideal conditions 100 times, the same study over and over and over again. Well, you're going to get a distribution of p values; it's not going to be the same p value every time you run that test, and your effect size is going to vary as well over the course of those 100 studies. Given that you typically only publish the studies that work, you know that the published findings are going to be on the high end of this distribution: the effect sizes are going to be the ones that are a little bit bigger, and the p values the ones that are smaller. And so this means that, almost by definition, the second time you run this study you're going to get a bigger p value, because of regression to the mean, and you're going to get a smaller effect size. In other words, you don't even have to run the experiment to know that the second iteration of any published study is very likely going to give you a smaller effect size and a larger p value. So this is the problem with repeating a study one time. The p value, by the way, means almost nothing on one iteration of any study; it only begins to accrue some inferential value over multiple iterations. And so, as I say, when you run a study one more time, the informational value you get from that is almost zero; you learn almost nothing from it. And this is exactly what you should expect: the mean effect size should go down and the p values should go up.
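Here is that argument as a small simulation sketch; the true effect size, the per-group sample size, and the number of simulated studies are arbitrary assumptions, not figures from any of the studies I've mentioned. The point is just to show that if you only "publish" the runs that cross p < .05, the published effect sizes are inflated relative to the truth, so a rerun should be expected to come out smaller.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
TRUE_D, N_PER_GROUP, N_STUDIES = 0.3, 30, 5000  # assumed values for illustration

def run_study():
    """One two-group study; returns the observed Cohen's d and the p value."""
    treat = rng.normal(TRUE_D, 1.0, N_PER_GROUP)
    ctrl = rng.normal(0.0, 1.0, N_PER_GROUP)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    return (treat.mean() - ctrl.mean()) / pooled_sd, stats.ttest_ind(treat, ctrl).pvalue

results = [run_study() for _ in range(N_STUDIES)]
published = [d for d, p in results if p < 0.05]  # what survives the file drawer

print(f"true effect size:                 {TRUE_D:.2f}")
print(f"mean observed d over all runs:    {np.mean([d for d, _ in results]):.2f}")
print(f"mean observed d, published only:  {np.mean(published):.2f}  (inflated by selection)")
# Rerun any one of the "published" studies and, on average, its effect size
# regresses back toward the true value and its p value grows accordingly.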
So now I want to talk about this: how do we know we're doing an adequate replication? I mentioned the smaller sample size, which, as far as I know, only applies in that one case. AA here stands for auxiliary assumptions. I want to go back to the study by Bargh and colleagues and the replication attempt by Doyen and colleagues. So I told you that they gave the same priming task, that they tried to bring to mind the elderly stereotype, and that they replaced the stopwatches with infrared sensors. But I didn't tell you that they actually didn't do an exact replication here: they made a change in terms of the priming materials as well. Specifically, they translated all the materials into French, because they were carrying out the study in Belgium and the participants would understand French rather than English.

Why is this important? Well, here's a still unpublished study; I saw a previous draft of it circulating around in the grey literature, and I don't know what its fate is. Michael Ramscar is a very senior linguist who wanted to look at this case of the difficulty of replicating the original finding. The thing we have to pay attention to here is that this is a linguistic priming effect: you're calling to mind a stereotype based on the connotations that are associated with certain words in English. So if you just naively translate these words into French, without showing that the same properties apply in French as apply in English, you may very well not be doing an adequate replication of the original study. One difference between English and French is that in English adjectives come before nouns, whereas in French they mostly come after.
And so if you're using adjectives to prime a stereotype, it's important which order all of these things come in, in the language. So Michael Ramscar and his colleagues did a corpus analysis of French and English, and they found, both generally and with respect to the specific words used in the priming study, that when it comes to their experience of encountering the adjectives in the prime set in contexts where they could have served as primes, we can expect that the subjects in the original Bargh study would have had something on the order of six times more experience. In other words, the suggestion here is that the prime was likely to be six times stronger among English-speaking participants than among French-speaking participants. So the failure to find an effect could very well be due to the fact that the prime wasn't strong enough to begin with; in other words, it wasn't a faithful replication of the original finding.

There are many more examples of this that could be raised, but what I'd like to point out is that replication is hard. You often don't know if you're violating one of those auxiliary assumptions. You might think: what's the big deal about just translating the materials into French? I have to do it in French because my participants speak French. Michael had to do this complicated linguistic corpus analysis to show that actually it wasn't an adequate replication of the original finding. And that happened because Michael Ramscar is a very good linguist and had the capacity to do that. But in how many other replication attempts are these subtle things not being adequately dealt with? It's hard to say. So a failed replication doesn't necessarily mean that the original finding was not real.

And therefore I think that replication initiatives like the Open Science Collaboration one are unlikely to provide the most direct and compelling evidence that science is broken, if indeed it is. That said, I think we have more direct evidence that something is wrong with the way we're conducting science currently. We already have more direct evidence of a crisis of process, or at least of problems with the process, and that comes from that earlier work by John and colleagues and others, who showed that psychologists openly admit to engaging in research practices that we know are very likely to lead to the publication of Type I errors, false alarms.
And so, since we already have evidence of those problems more directly, from the admissions of psychologists themselves, then if we want to have an intervention, it might be that the best way to do it is not to have these massive armies of replicators running around trying to redo every study that was ever published. You have resource constraints on how to actually do that, and you have questions about which studies are worth replicating or not. It might very well be that, in terms of intervention, we need to focus on the problems upstream that we know are very likely to lead to false alarms, rather than trying to count the number of false alarms, if you see what I'm saying.

So, as I indicated already, here's the more direct evidence. Psychologists are admitting to these questionable research practices: p-hacking; HARKing, which stands for hypothesising after the results are known. That's where you run what is essentially an exploratory study: you're not sure what you're going to get, you don't really have a strong hypothesis, you run some statistics, and a couple of them pop out with p less than .05, and then you come up with some hypothesis after the fact. You go: gee, what would predict that? I don't know, how about this? And then you write up the paper presenting what are exploratory statistics as though they were confirmatory statistics, and that's a problem.

There's also quite a lot of work that's been done on the reliability of peer review, which is just not reliable. You can test peer review by, for example, embedding a bunch of errors in a manuscript, sending it out to a bunch of peer reviewers, and seeing how many of them notice the errors; very few of them seem to notice. There are also just problems with sloppy peer review and cronyism; there's a lot of politics in peer review. One thing you learn if you ever work as an associate editor for a journal is that you get a manuscript and, if you have a stance on it, you can sink it or float it depending on who you send it to for review; you know that this reviewer is going to give it a crappy review and that that one is going to give it a good review. So it's not as if there's this beautiful, objective process by which the most dispassionate and qualified reviewers are handling every paper. So, peer review: if we're going to count on this as a quality control mechanism, we should be extremely concerned that it's not a quality control mechanism up to the job.

Publication bias we talked about; this is the issue of the failure to publish negative results. Again, if we have 20 labs running essentially the same experiment and one of them gets it to work, the highest likelihood is that that's a false alarm. And if we never hear about all the other studies, we don't have any way to know how much confidence to place in that published finding.
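A quick back-of-the-envelope version of that point, assuming for illustration that the effect being chased does not exist at all and that each of the 20 labs tests it at the conventional .05 level:

\[
  \Pr(\text{at least one lab gets } p < .05) \;=\; 1 - (1 - 0.05)^{20} \;=\; 1 - 0.95^{20} \;\approx\; 0.64 .
\]

So if only that one "successful" result gets written up and the other nineteen stay in the file drawer, the literature records a success that is quite likely just a false alarm.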
And I want to share a quick story about this issue of the need to publish negative results. I was writing some papers on this topic and I was searching around in the literature; I think I typed in something like "the need for reporting negative results" to come up with some examples I might cite. And I found this paper that goes back to 1927, in the Journal of the American Medical Association. So this is something that's been written about for ages. I have to tell you what this researcher said; I'm going to read it mostly in full, because I think it's very clever. It says: To the editor. One of the things we practitioners sometimes neglect is the reporting of failures. In the Journal, Doctor So-and-so reported the treatment of six consecutive cases of warts with a certain injection. I venture to guess that as a result of this publication, not less than 100 physicians, perhaps several hundred, injected this substance into their patients. Supposing that 99% get negative results, what happens? Each of them gives up the method as a failure and does not say anything more about it, and the treatment remains on record as an undisputed success. Maybe the 1% who meet with success will communicate with Dr. Sutton, so that by and by he will have quite an impressive series of cases which seem to support the original findings. To practice what I am preaching, let me now report that on November 30th I injected the substance into the left buttock of C.B.M., a girl aged 18, who was on that day complaining of 24 warts. At the present date there are 28 warts, and evidence of regressive changes in the original 24 has not been seen.

So now I have the big reveal, which is that the author of this is John Rosenberg, who is my paternal grandfather. He died in 1949. I never met him; I never knew him. He died when my dad was seven years old, so my dad didn't really know him either. And here it was that I found this publication in which, all those years ago, he was writing about this exact same issue that I happened to be writing about now. So that was kind of a fun personal note I wanted to end on, and I'll just say thank you and leave it there.