I teach my university’s Graduate Social Psychology course, and I start off the semester (as I assume many other professors who teach this course do) by talking about research methods in social psychology. Over the past several years, as the problems with reproducibility in science have become more and more central to the discussions going on in the field, my introductory lectures have gradually become more dismal. I’ve come to think that it’s important to teach students that most research findings are likely false, that there is very likely a high degree of publication bias in many areas of research, and that some of our most cherished ideas about how the mind works might be completely wrong.
In general, I think it’s hard to teach students what we have learned about the low reproducibility of many of the findings in social science without leaving them with a feeling of anomie, so this year, I decided to teach them how to do p-curve analyses so that they would at least have a tool that would help them to make up their own minds about particular areas of research. But I didn’t just teach them from the podium: I sent them away to form small groups of two to four students who would work together to conceptualize and conduct p-curve analysis projects of their own.
I had them follow the simple rules that are specified in the p-curve user’s guide, which can be obtained here, and I provided a few additional ideas that I thought would be helpful in a one-page rubric. I encouraged them to make sure they were sampling from the available population of studies in a representative way. Many of the groups cut down their workload by consulting recent meta-analyses to select the studies to include. Others used Google Scholar or Medline. They were all instructed to follow the p-curve manual chapter-and-verse, and to write a little paper in which they summarized their findings. The students told me that they were able to produce their p-curve analyses (and the short papers that I asked them to write up) in 15-20 person-hours or less. I cannot recommend this exercise highly enough. The students seemed to find it very empowering.
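For readers unfamiliar with the technique, the core intuition behind p-curve can be sketched in a few lines. This is only a simplified illustration of the binomial-test component described in the p-curve user's guide, not the full analysis in the official app (which also uses a continuous test based on Stouffer's method); the function names here are mine:

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): exact one-sided binomial tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def simple_p_curve(p_values, alpha=0.05):
    """Crude right-skew test. Among significant p-values, a true effect
    piles them up near zero; under a null effect they are uniform, so
    only half should fall below alpha/2. Returns the split and a
    one-sided binomial p-value for "more than half are small"."""
    sig = [p for p in p_values if p < alpha]
    low = sum(1 for p in sig if p < alpha / 2)  # the "small" half of the curve
    return low, len(sig), binom_sf(low, len(sig))

# Illustrative (made-up) p-values clustered near zero, i.e. right-skewed:
ps = [0.001, 0.003, 0.004, 0.01, 0.012, 0.02, 0.031, 0.049]
low, n, pval = simple_p_curve(ps)
print(low, n, round(pval, 3))  # 6 of 8 significant p-values fall below .025
```

With only eight tests the binomial p-value here is not itself significant, which previews a point raised in the comments below: p-curve needs a fair number of studies before it can say much either way.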
This past week, all ten groups of students presented the results of their analyses, and their findings were surprisingly (actually, puzzlingly) rosy: All ten of the analyses revealed that the literatures under consideration possessed evidentiary value. Ten out of ten. None of them showed evidence for intense p-hacking. On the basis of their conclusions (coupled with the conclusions that previous meta-analysts had made about the size of the effects in question), it does seem to me that there really is license to believe a few things about human behavior:
(1) Time-outs really do reduce undesirable behavior in children (parents with young kids take notice);
(2) Expressed Emotion (EE) during interactions between people with schizophrenia and their family members really does predict whether the patient will relapse in the subsequent 9-12 months (based on a p-curve analysis of a sample of the papers reviewed here);
(3) The amount of psychological distress that people with cancer experience is correlated with the amount of psychological distress that their caregivers manifest (based on a p-curve analysis of a sample of the papers reviewed here);
(4) Men really do report more distress when they imagine their partners’ committing sexual infidelity than women do (based on a p-curve analysis of a sample of the papers reviewed here; caveats remain about what this finding actually means, of course…)
I have to say that this was a very cheering exercise for my students as well as for me. But frankly, I wasn’t expecting all ten of the p-curve analyses to provide such rosy results, and I’m quite sure the students weren’t either. Ten non-p-hacked literatures out of ten? What are we supposed to make of that? Here are some ideas that my students and I came up with:
(1) Some of the literatures my students reviewed involved correlations between measured variables (for example, emotional states or personality traits) rather than experiments in which an independent variable was manipulated. They were, in a word, personality studies rather than “social psychology experiments.” The major personality journals (Journal of Personality, Journal of Research in Personality, and the “personality” section of JPSP) tend to publish studies with conspicuously higher statistical power than do the major journals that publish social psychology-type experiments (e.g., Psychological Science, JESP, and the two “experimental” sections of JPSP). One implication of this fact, as Chris Fraley and Simine Vazire recently pointed out, is that the experiment-friendly journals are likely, ceteris paribus, to have higher false positive rates than the personality-oriented journals.
(2) Some of the literatures my students reviewed were not particularly “sexy” or “faddish”–at least not to my eye (Biologists refer to the large animals that get the general public excited about conservation and ecology as the “charismatic megafauna.” Perhaps we could begin talking about “charismatic” research topics rather than “sexy” or “faddish” ones? It might be perceived as slightly less derogatory…). Perhaps studies on less charismatic topics generate less temptation among researchers to capitalize on undisclosed researcher degrees of freedom? Just idle speculation…
(3) The students went into the exercise without any a priori prejudice against the research areas they chose. They wanted to know whether the literatures they focused on were p-hacked because they cared about the research topics and wanted to base their own research upon what had come before–not because they had read something seemingly fishy on a given topic that gave them impetus to do a full p-curve analysis. I wonder if this subjective component to the exercise of conducting a p-curve analysis is going to end up being really significant as this technique becomes more popular.
If you teach a graduate course in psychology and you’re into research methods, I cannot recommend this exercise highly enough. My students loved it, they found it extremely empowering, and it was the perfect positive ending to the course. If you have used a similar exercise in any of your courses, I’d love to hear about what your students found.
By the way, Sunday will be the 1-year anniversary of the Social Science Evolving Blog. I have appreciated your interest. And if I don’t get anything up here before the end of 2014, happy holidays.
A great post, thanks for sharing!
One idea worth mentioning is that although a left-skewed p-curve is evidence of p-hacking and publication bias, a right-skewed p-curve is not necessarily evidence for a true finding. It would be intriguing to see what percentage of the non-p-hacked literature would actually replicate in a sufficiently powered direct replication. For example, suppose we measure 10 predictor and 10 outcome variables and test all 100 pairings for a significant correlation at the p < 0.01 level: without correcting for multiple comparisons, the chance of finding at least one significant result is over 60% (1 − 0.99^100 ≈ 0.63).
Just a thought…
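The arithmetic in the comment above is easy to check directly, assuming all tests are independent and every null is true (the function name is illustrative):

```python
def familywise_rate(n_tests, alpha=0.01):
    """Chance of at least one false positive across independent tests,
    each run at the given alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

print(round(familywise_rate(100), 3))  # 100 tests at .01: 1 - 0.99**100 ≈ 0.634
print(round(familywise_rate(45), 3))   # 45 pairwise correlations among 10 variables ≈ 0.364
```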
If you p curve ESP studies you will *also* find that they have evidentiary value. So yes, when the p curve is flat this indicates a serious problem; but the fact that it is properly skewed does not prove that the effect is real (some QRPs lead to p curves that look OK). Nevertheless, these are interesting analyses!
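The baseline against which that skew is judged can be made concrete with a short simulation: when the null is true, p-values are uniformly distributed, so the significant ones form a flat curve. A minimal sketch using a z-test on simulated null data (all names and parameters here are illustrative):

```python
import random
from math import erf, sqrt

random.seed(1)  # reproducible illustration

def z_test_p(sample):
    """Two-sided p-value testing H0: population mean = 0, known SD = 1."""
    n = len(sample)
    z = (sum(sample) / n) * sqrt(n)
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

# 2000 simulated null studies, 20 observations each
ps = [z_test_p([random.gauss(0, 1) for _ in range(20)]) for _ in range(2000)]
sig = [p for p in ps if p < 0.05]

# Under the null, about 5% come out "significant", spread roughly evenly
# across the .00-.05 range -- the flat curve p-curve tests against.
bins = [sum(1 for p in sig if lo <= p < lo + 0.01)
        for lo in (0.0, 0.01, 0.02, 0.03, 0.04)]
print(len(sig), bins)
```

A uniform significant-p distribution like this is what a literature with no true effects (and no selective reporting) would produce; right skew, left skew, and flatness are all judged relative to it.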
EJ and Gidi: Thanks for these comments. I’m intrigued by the idea that (a) a left-skewed p-curve is evidence that researchers HAVE exercised researcher degrees of freedom to get results that have p values just low enough to be publishable; but (b) a right-skewed p-curve is NOT evidence that researchers HAVE NOT exercised researcher degrees of freedom to get results that have p values just low enough to be publishable. Where could one read more about this? Has Uri Simonsohn or anyone else chimed in on this? Thanks for posting your ideas!
My Faith is As Yet Unrestored(1)
Here is why. I do not trust your class’s “findings” — not saying they are wrong, just saying I do not trust them, at least not without A LOT more info. The undergrads in my lab, with heavy oversight by a grad student and a postdoc, are currently undertaking pcurve analyses of three literatures. They have been at it for months. Not done yet. When my “overseers” reviewed their preliminary pcurve tables, they found that the undergrads were only about 50% successful at identifying, for each study, the first p-value associated with the first test of a main hypothesis.
Of course, your p-value selection rule could be “just take the first.” But that could mean you would sweep up manipulation checks, preliminary analyses designed to simplify things, and the like, rather than tests of main hypotheses.
In fact, identifying the key research hypothesis often turns out to be far more difficult than most of us might imagine. Many researchers do not state their key hypotheses or research questions in the intro. In the results, it is often not clear which analyses are central to the hypotheses.
This has become interesting to me in its own right. If researchers are (even unintentionally) vague in their specification of their hypotheses, they have massive degrees of freedom, but not in the Simonsohn/Nelson/Simmons sense. They have massive degrees of freedom to *interpret* their results as supporting their theories, because they have not clearly articulated their hypotheses!
For example, consider the ever-charismatic stereotype threat research. What, exactly, is the key prediction? Is it that the Stigmatized Group does worse under threat than under nonthreat? Or is it that the Stigmatized Group does just as well as the NonStigmatized Group under nonthreat? Or is it the overall two-way interaction (SG does worse under threat than under no threat, while threat has no effect on NSG)? As long as researchers do not clearly state which, and simply make vague pronouncements about the power of threat, they can, without any phacking, simply declare whichever of the above results they find (as long as they find at least one) as supporting their theory.
Which gets to limits to pcurving. Pcurving is a great tool, but it has its own limitations, which are not yet well understood. First, it strikes me as conservative. You need either A LOT of studies, or a VERY LEFT SKEWED curve to produce a significant pcurve showing phacking.
Maybe that is fine, we do not want to be in the business of unduly trashing our colleagues unless we are really sure. But the important issue is not trashing our colleagues, it is figuring out what is true and what is not. In that spirit, a conservative test is a problem.
And of course showing “evidentiary value” is itself the sort of problematic dichotomy inherent to pvalues themselves. How much evidentiary value? As of right now, it does not tell us (though Simonsohn recently emailed me that they are working on a version that will allow pcurve to be used to estimate effect sizes).
I recently made up data yielding a U shaped curve of p-values — and all three tests were significant (showing evidentiary value, intense phacking, and lack of evidentiary value). What does this mean? IDK.
Last, pcurve is conservative because it is not even intended to catch lots of forms of distortion. Sometimes, we do not want a significant effect, we want a nonsignificant one. Sometimes, ala above, if the main analysis is weak, we do lots more analyses in the hope of finding something supportive enough to strengthen our case, and such ancillary analyses might then be reframed as central, whereas they would not even have been reported had the main analysis worked out.
The point is pcurve is not only conservative because it takes A LOT to register phacking and lack of evidentiary value, it is conservative in that there are lots of degrees of freedom it is not even designed to identify.
Between that and the difficulty undergrads have in identifying the appropriate p-values in my lab*, I am compelled to take your class project results as indicating widespread validity with a whole shaker full of salt. I am not saying I know or even think there is a widespread lack of validity — and, indeed, my intuitions coincide with yours that there probably is less distortion in less charismatic** areas and in areas where large Ns are more typical. Still, I do not know what to make of your class results.
(1) On the other hand, I just completed work on a study resoundingly replicating (conceptually) the fundamental attribution error in the context of stereotyping. Gives me some confidence that our field is not all hot air, smoke, and mirrors.
* I also have a class project where students have to fill out a table summarizing a study. Not for pcurving, just for them to communicate their understanding. They need to report the key hypotheses, the stats used to test that hypothesis, the result testing that hypothesis, and whether the result supported the hypothesis. The first two times they did this, many — and these were seniors in an advanced psych class, all of whom had stats and lab courses — did not even understand that: 1. The main hypothesis referred to variables; 2. The variables in the main analysis had to be the same as the ones in the hypothesis; 3. The key analysis had to involve a relation between the IVs and DVs (or predictors and outcomes) identified in 1 and 2 above.
Given this, without lots more detail indicating undergrads held accountable for the decisions about which p-values to include actually get it right, I have deep reservations about unleashing undergrads on pcurve analyses.
** I love the term “charismatic” as a replacement for “hot” or “faddish.”
Thanks, Lee, for your comments. A few in response.
To be clear, these were graduate students and not undergraduates. I think this probably matters at least a little as it may suggest enhanced statistical competence in comparison to undergraduates. Having said that, I appreciate your points and for the most part agree with you. I am puzzled by my students’ results as well. If I do this exercise again next year (which I probably will), I will most likely work with them a bit more closely to make sure they are pulling the correct statistical tests.
Thank you, too, for sharing your thoughts about the p-curve technique, which I intend to study a bit more carefully before repeating this exercise when I teach Graduate Social Psychology next time. I think I will also insist that they find literatures for which they can produce at least 20 statistical tests (as a way of dealing with the power issue).