
Trust in the Time of Coronavirus: Low Trusters are Particularly Skeptical of Local Officials and Their Own Neighbors

A few days ago, I saw the results of a new Pew poll on Americans’ trust in the wake of the Coronavirus outbreak. The poll, based on a random sample of 11,537 U.S. adults, addressed two questions: Which groups of people and societal institutions do Americans trust right now? And how do their background levels of generalized trust influence their trust in those specific groups of people and institutions?

The takeaway is troubling: High trusters and low trusters have comparable amounts of trust in our federal agencies and national institutions, but they have vastly different amounts of trust in the responses and judgments of their local officials and neighbors.

To examine these issues, the Pew researchers first divided the sample into three groups based on their responses to three standard questions for measuring generalized trust. Helpfully, they called these three subgroups Low Trusters, Medium Trusters, and High Trusters.

As many other researchers have found, generalized trust was associated with ethnicity (white Americans have higher levels of generalized trust than Black and Hispanic Americans do), age (the more you have of one, the more you have of the other), education (ditto), and income (ditto). These results are hardly surprising–ethnicity, age, education, and income are among the most robust predictors of trust in survey after survey–but they do nevertheless provide an interpretive backdrop for the study’s more important findings.

What really struck me were the associations of people’s levels of generalized trust and their sentiments toward public institutions and groups of other people. Low, medium, and high trusters had fairly similar evaluations of how the CDC, the news media, and even Donald Trump were responding: On average, people at all three levels of generalized trust had favorable evaluations of the CDC; on average, people at all three levels of generalized trust had lukewarm evaluations of Trump’s response.

Where the three groups of trusters differed more conspicuously was in their evaluations of their state officials, their local officials, and–most strikingly–ordinary people in their communities. About 80% of high trusters thought their local and state officials were doing an excellent or good job of responding to the outbreak. Only 57% of low trusters said the same.

But the biggest gulf in the sentiments of high trusters and low trusters was in their evaluations of ordinary people in their communities. Eighty percent of high trusters said that ordinary people in their community were doing an excellent or good job in responding to the outbreak. Only 44% of low trusters approved.

 

High trusters, medium trusters, and low trusters also had widely divergent opinions about the responses of ordinary people–both across the country and in their local communities.

Most people, regardless of how much generalized trust they had, thought their state governments, local governments, and local school systems were responding with the right amount of urgency to the outbreak. However, high trusters and low trusters differed greatly in their attitudes toward the responses of their neighbors. Whereas 16% of high trusters thought ordinary people in their local communities were overreacting, 35% of low trusters–more than twice as many–thought so.

What I find troubling about these statistics is that all epidemics, like all politics, are local. The people who should be best equipped to tell you what’s going on in your community are the people who are paid to know what’s going on in your community and the people who actually live there. We’re entitled to clear and accurate information from local officials, and we should be ashamed that local people cannot always trust their judgment. But local officials are not the only source of information that people should be able to trust. An ordinary person in your community should, in principle, be able to tell you whether a teacher at your kid’s school or a cashier at your local grocery store tested positive. How much unnecessary risk do we expose ourselves to when some of us inhabit communities or worldviews that cause us to perceive our local officials and neighbors as liars, incompetents, or Chicken Littles?

(Following an interesting interchange on Twitter with Cameron Brick and Dave Pietraszewski about essentialism in psychology and the hazards it creates for scientific progress, I thought I would re-post this 2017 blog entry, which might be useful and/or interesting to some, and perhaps even entertaining for an extremely small subset of that small group. I daresay the concerns I raise here aren’t any less concerning in 2021. ~M)

Two years ago, I idly surfed my way to a harmless-seeming article from 2004 by Denny Borsboom, Gideon Mellenbergh, and Jaap van Heerden entitled The Concept of Validity. More than a decade had passed since its publication, and I had never heard of it. Egocentrically, this seemed like reason enough to surf right past it. Then I skimmed the abstract. Intrigued, I proceeded to read the first few paragraphs. By that point, I was hooked: I scrapped my plans for the next couple of hours so I could give this article my complete attention. This was a paper I needed to read immediately.

I’ve thought about The Concept of Validity every day for the past two years. I have mentioned or discussed or recommended The Concept of Validity hundreds of times. My zeal for The Concept of Validity is the zeal of an ex-smoker. The concept of validity in The Concept of Validity has led to a complete reformatting of my understanding of validity, and of measurement in general—and not just in the psychological sciences, but in the rest of the sciences, too. And those effects have oozed out to influence just about everything else I believe about science. The Concept of Validity is the most important paper you’ve probably never heard of.*

The concept of validity in The Concept of Validity is so simple that it’s a bit embarrassing even to write it down, but its simplicity is what makes it so diabolical, and so very different from what most in the social sciences have believed validity to be for the past 60 years.

According to Borsboom and colleagues, a scientific device (let’s label it D) validly measures a trait or substance (which we will label T), if and only if two conditions are fulfilled:

(1) T must exist;

(2) T must cause the measurements on D.

That’s it. That is the concept of validity in The Concept of Validity.

This is a Device. There are invisible forces in the world that cause changes in the physical state of this Device. Those physical changes can be read off as representations of the states of those invisible forces. Thus, this Device is a valid measure of those invisible forces.

What is most conspicuous about the concept of validity in The Concept of Validity is what it lacks. There is no talk of score meanings and interpretations (à la Cronbach and Meehl). There is no talk of integrative judgments involving considerations of the social or ethical consequences of how scores are put to use (à la Messick). There’s no talk of multitrait-multimethod matrixes (à la Campbell and Fiske), nomological nets (Cronbach and Meehl again), or any of the other theoretical provisos, addenda, riders, or doo-dads with which psychologists have been burdening their concepts of validity since the 1950s. Instead, all we need—and all we must have—for valid measurement is the fulfillment of two conditions: (1) a real force or trait or substance (2) whose presence exerts a causal influence on the physical state of a device. Once those conditions are fulfilled, a scientist can read off the physical changes to the device as measurements of T. And voila: We’ve got valid measurement.

Borsboom and colleagues’ position is such a departure from 20th-century notions of validity precisely because they are committed to scientific realism—a stance to which many mid-20th-century philosophers of science were quite allergic. Most philosophers of science have since gotten over that aversion; in general, they’re comfortable with the idea that there could be hidden realities responsible for observable experience. Realism seemed like a lot to swallow in 1950. It doesn’t in 2017.

As soon as you commit to scientific realism, there is a kind of data you will prize more highly than any other for assessing validity, and that’s causal evidence. What a realist wants more than anything else on earth or in the heavens is evidence that the hypothesized invisible reality (the trait, or substance, or whatever) is causally responsible for the measurements the device produces. Every other productive branch of science is already working from this definition of validity. Why aren’t the social sciences?

For some of the research areas I’ve messed around with over the past few years, the implications of embracing the concept of validity in The Concept of Validity are profound, and potentially nettlesome: If we follow Borsboom and colleagues’ advice, we can discover that some scientific devices do indeed provide valid measurement, precisely because the trait or substance T they supposedly measure actually seems to exist (fulfilling Condition #1) and because there is good evidence that T is causally responsible for physical features of the device that can be read off as measurements of T (fulfilling Condition #2). In other areas, the validity of certain devices as measures looks less certain because even though we can be reasonably confident that the trait or substance T exists, we cannot be sure that changes in T are responsible for the physical changes in the device. In still other areas, it’s not clear that T exists at all, in which case there’s no way that the device can be a measure of T.

I will look at some of these scenarios more closely in an upcoming post.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061-1071.

*Weirdly, The Concept of Validity does not come up in Google Scholar. I’ve seen this before, actually. Why does this happen?

A P-Curve Exercise That Might Restore Some of Your Faith in Psychology

I teach my university’s Graduate Social Psychology course, and I start off the semester (as I assume many other professors who teach this course do) by talking about research methods in social psychology. Over the past several years, as the problems with reproducibility in science have become more and more central to the discussions going on in the field, my introductory lectures have gradually become more dismal. I’ve come to think that it’s important to teach students that most research findings are likely false, that there is very likely a high degree of publication bias in many areas of research, and that some of our most cherished ideas about how the mind works might be completely wrong.

In general, I think it’s hard to teach students what we have learned about the low reproducibility of many of the findings in social science without leaving them with a feeling of anomie, so this year, I decided to teach them how to do p-curve analyses so that they would at least have a tool that would help them to make up their own minds about particular areas of research. But I didn’t just teach them from the podium: I sent them away to form small groups of two to four students who would work together to conceptualize and conduct p-curve analysis projects of their own.

I had them follow the simple rules that are specified in the p-curve user’s guide, which can be obtained here, and I provided a few additional ideas that I thought would be helpful in a one-page rubric. I encouraged them to make sure they were sampling from the available population of studies in a representative way. Many of the groups cut down their workload by consulting recent meta-analyses to select the studies to include. Others used Google Scholar or Medline. They were all instructed to follow the p-curve manual chapter-and-verse, and to write a little paper in which they summarized their findings. The students told me that they were able to produce their p-curve analyses (and the short papers that I asked them to write up) in 15-20 person-hours or less. I cannot recommend this exercise highly enough. The students seemed to find it very empowering.
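For readers who want the gist of what the students were computing, here is a bare-bones sketch of the core p-curve logic in Python. It uses only the simple binomial test for right skew (the p-curve app also reports continuous tests based on pp-values), and the p-values in it are invented for illustration, not drawn from any of the literatures described below.

```python
# A bare-bones sketch of the core p-curve logic, using the simple binomial
# test for right skew. The p-values below are invented placeholders --
# substitute the ones harvested from a real literature.
from math import comb

reported_p = [0.003, 0.012, 0.021, 0.001, 0.044, 0.008, 0.030, 0.002]

significant = [p for p in reported_p if p < 0.05]   # p-curve uses only p < .05
k_low = sum(p < 0.025 for p in significant)         # the "highly significant" half

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) when X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Evidential value predicts right skew: well over half of the significant
# p-values should fall below .025. Intense p-hacking of null effects tends
# to pile them up just under .05 instead.
print(f"{k_low} of {len(significant)} significant p-values fall below .025; "
      f"binomial p = {binomial_upper_tail(k_low, len(significant)):.3f}")
```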

This past week, all ten groups of students presented the results of their analyses, and their findings were surprisingly (actually, puzzlingly) rosy: All ten of the analyses indicated that the literatures under consideration possessed evidential value. Ten out of ten. None of them showed evidence of intense p-hacking. On the basis of their conclusions (coupled with the conclusions that previous meta-analysts had made about the size of the effects in question), it does seem to me that there really is license to believe a few things about human behavior:

(1) Time-outs really do reduce undesirable behavior in children (parents with young kids take notice);

(2) Expressed Emotion (EE) during interactions between people with schizophrenia and their family members really does predict whether the patient will relapse in the subsequent 9-12 months (based on a p-curve analysis of a sample of the papers reviewed here);

(3) The amount of psychological distress that people with cancer experience is correlated with the amounts of psychological distress that their caregivers manifest (based on a p-curve analysis of a sample of the papers reviewed here);

and

(4) Men really do report more distress when they imagine their partners’ committing sexual infidelity than women do (based on a p-curve analysis of a sample of the papers reviewed here; caveats remain about what this finding actually means, of course…)

I have to say that this was a very cheering exercise for my students as well as for me. But frankly, I wasn’t expecting all ten of the p-curve analyses to provide such rosy results, and I’m quite sure the students weren’t either. Ten non-p-hacked literatures out of ten? What are we supposed to make of that? Here are some ideas that my students and I came up with:

(1) Some of the literatures my students reviewed involved correlations between measured variables (for example, emotional states or personality traits) rather than experiments in which an independent variable was manipulated. They were, in a word, personality studies rather than “social psychology experiments.” The major personality journals (Journal of Personality, Journal of Research in Personality, and the “personality” section of JPSP) tend to publish studies with conspicuously higher statistical power than do the major journals that publish social psychology-type experiments (e.g., Psychological Science, JESP, and the two “experimental” sections of JPSP), and one implication of this fact, as Chris Fraley and Simine Vazire just pointed out, is that the lower-powered, experiment-friendly journals are more likely, ceteris paribus, to have higher false positive rates than the higher-powered personality journals (see the sketch after this list).

(2) Some of the literatures my students reviewed were not particularly “sexy” or “faddish”–at least not to my eye (Biologists refer to the large animals that get the general public excited about conservation and ecology as the “charismatic megafauna.” Perhaps we could begin talking about “charismatic” research topics rather than “sexy” or “faddish” ones? It might be perceived as slightly less derogatory…). Perhaps studies on less charismatic topics generate less temptation among researchers to capitalize on undisclosed researcher degrees of freedom? Just idle speculation…

(3) The students went into the exercise without any a priori prejudice against the research areas they chose. They wanted to know whether the literatures they focused on were p-hacked because they cared about the research topics and wanted to base their own research upon what had come before–not because they had read something seemingly fishy on a given topic that gave them impetus to do a full p-curve analysis. I wonder if this subjective component to the exercise of conducting a p-curve analysis is going to end up being really significant as this technique becomes more popular.
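On point (1), a back-of-the-envelope way to see why lower-powered journals should carry a larger share of false positives among their published findings is the standard positive predictive value calculation. Here is a minimal sketch; the power values and the prior probability that a tested hypothesis is true are assumptions I’ve picked purely for illustration.

```python
# A rough sketch of the link between statistical power and the share of
# false positives among *published* (significant) findings, via the
# standard positive predictive value (PPV) calculation.
# The prior and power values below are illustrative assumptions only.
def ppv(power, alpha=0.05, prior=0.25):
    """Probability that a 'significant' finding reflects a true effect."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

for power in (0.35, 0.80):   # roughly low- vs high-powered designs
    print(f"power = {power:.2f}: PPV = {ppv(power):.2f}, "
          f"false-positive share = {1 - ppv(power):.2f}")

# Holding alpha and the prior constant, lower power means a larger share
# of the significant, publishable results are false positives.
```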

If you teach a graduate course in psychology and you’re into research methods, I cannot recommend this exercise highly enough. My students loved it, they found it extremely empowering, and it was the perfect positive ending to the course. If you have used a similar exercise in any of your courses, I’d love to hear about what your students found.

By the way, Sunday will be the 1-year anniversary of the Social Science Evolving Blog. I have appreciated your interest.  And if I don’t get anything up here before the end of 2014, happy holidays.

The Myth of Moral Outrage

This year, I am a senior scholar with the Chicago-based Center for Humans and Nature. If you are unfamiliar with this Center (as I was until recently), here’s how they describe their mission:

The Center for Humans and Nature partners with some of the brightest minds to explore humans and nature relationships. We bring together philosophers, biologists, ecologists, lawyers, artists, political scientists, anthropologists, poets and economists, among others, to think creatively about how people can make better decisions — in relationship with each other and the rest of nature.

In the year to come, I will be doing some writing for the Center, starting with a piece that has just appeared on their web site. In The Myth of Moral Outrage, I attack the winsome idea that humans’ moral progress over the past few centuries has ridden on the back of a natural human inclination to react with a special kind of anger–moral outrage–in response to moral violations against unrelated third parties:

It is commonly believed that moral progress is a surfer that rides on waves of a peculiar emotion: moral outrage. Moral outrage is thought to be a special type of anger, one that ignites when people recognize that a person or institution has violated a moral principle (for example, do not hurt others, do not fail to help people in need, do not lie) and must be prevented from continuing to do so . . . Borrowing anchorman Howard Beale’s tag line from the film Network, you can think of the notion that moral outrage is an engine for moral progress as the “I’m as mad as hell and I’m not going to take this anymore” theory of moral progress.

I think the “Mad as Hell” theory of moral action is probably quite flawed, despite the popularity that it has garnered among many social scientists who believe that humans possess “prosocial preferences” and a built-in (genetically group-selected? culturally group-selected?) appetite for punishing norm-violators. I go on to describe the typical experimental result that has given so many people the impression that we humans do indeed possess prosocial preferences that motivate us to spend our own resources for the purpose of punishing norm violators who have harmed people whom we don’t know or otherwise care about. Specialists will recognize that the empirical evidence that I am taking to task comes from that workhorse of experimental economics, the third-party punishment game:

…[R]esearch subjects are given some “experimental dollars” (which have real cash value). Next, they are informed that they are about to observe the results of a “game” to be played by two other strangers—call them Stranger 1 and Stranger 2. For this game, Stranger 1 has also been given some money and has the opportunity to share none, some, or all of it with Stranger 2 (who doesn’t have any money of her own). In advance of learning about the outcome of the game, subjects are given the opportunity to commit some of their experimental dollars toward the punishment of Stranger 1, should she fail to share her windfall with Stranger 2.

Most people who are put in this strange laboratory situation agree in advance to commit some of their experimental dollars to the purpose of punishing Stranger 1’s stingy behavior. And it is on the basis of this finding that many social scientists believe that humans have a capacity for moral outrage: We’re willing to pay good money to “buy” punishment for scoundrels.
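To make the structure of the game concrete, here is a toy sketch of its payoff logic. The endowments and the punishment multiplier are hypothetical numbers of my own choosing, not parameters from any particular experiment.

```python
# A toy illustration of the third-party punishment game's payoff structure.
# All endowments and the punishment multiplier are hypothetical.
ENDOWMENT_SUBJECT = 10      # experimental dollars given to the third party
ENDOWMENT_STRANGER1 = 10    # dollars Stranger 1 may share with Stranger 2
PUNISHMENT_MULTIPLIER = 3   # each dollar the subject spends removes 3 from Stranger 1

def payoffs(amount_shared, punishment_committed):
    """Return (subject, stranger1, stranger2) payoffs for one round."""
    stingy = amount_shared == 0
    spent = punishment_committed if stingy else 0   # punishment triggers only if Stranger 1 shares nothing
    subject = ENDOWMENT_SUBJECT - spent
    stranger1 = ENDOWMENT_STRANGER1 - amount_shared - spent * PUNISHMENT_MULTIPLIER
    stranger2 = amount_shared
    return subject, stranger1, stranger2

# If Stranger 1 shares nothing and the subject pre-committed 2 dollars:
print(payoffs(amount_shared=0, punishment_committed=2))   # (8, 4, 0)
```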

In the rest of the piece, I go on to point out the rather serious inferential limitations of the third-party punishment game as it is typically carried out in experimental economists’ labs. I also point to some contradictory (and, in my opinion, better) experimental evidence, both from my lab and from other researchers’ labs, that gainsay the widely accepted belief in the reality of moral outrage. I end the piece with a proposal for explaining what the appearance of moral outrage might be for (in a strategic sense), even if moral outrage is actually not a unique emotion (that is, a “natural kind” of the type that we assume anger, happiness, grief, etc. to be) at all.

I don’t want to steal too much thunder from the Center‘s own coverage of the piece, so I invite you to read the entire piece over on their site. Feel free to post a comment over there, or back over here, and I’ll be responding in both places over the next few days.

As I mentioned above, I’ll be doing some additional writing for the Center in the coming six months or so, and I’ll be speaking at a Center event in New York City in a couple of months, which I will announce soon.

Why Do Honor Killings Defy the First Law of Homicide? And Will Smaller Families Lead to Fewer Of Them?

Few categories of human rights violations more deeply scandalize the liberal (with a little-L) moral sensibility than honor killings do. Reliable numbers are hard to come by, but by most credible accounts it seems likely that several thousand Muslim women each year (and more than a few men) are stoned, burned, hanged, strangled, beheaded, stabbed, or shot to death for the sins of getting raped, falling in love, or dressing immodestly. But to anyone who thinks about human behavior from an evolutionary point of view, honor killings are not just morally outrageous: They’re also really puzzling.

As Martin Daly and Margo Wilson documented in their marvelous book Homicide, killers are very rarely the genetic relatives of their victims. Instead, they’re most often strangers, or rivals, or cuckolded lovers (who, of course, are not each other’s kin even if married—at least, not in the sense that matters to natural selection). Indeed, the typically low level of kinship between the victims of homicides and the people who kill them is so predictable that we could get away with calling it “The First Law of Homicide.” When two genetic relatives are involved in a homicide, it’s usually either as co-victims or co-perpetrators, not as victim and perpetrator.

In a sense, a general reluctance to harm or kill one’s genetic relatives is not exactly breaking news. We’ve understood since William Hamilton’s 1963 and 1964 papers that natural selection creates organisms that appear designed to maximize their inclusive fitness (which incorporates the reproductive success of the individual in whom the gene is physically located, as well as the reproductive success of other individuals who are carrying copies of that gene around) rather than their simple direct fitness. Genes “want” to maximize the total number of copies of themselves that are floating around in the world, even if some of those copies are located in other individuals’ gonads. The principle of kin selection virtually guarantees that we’re walking around with instincts that restrain us from harming our relatives, even when they’ve irritated us. To be clear, I’m not saying people never kill their kin (mental illness is a real wild card here), but the fitness disincentives of doing so were so high as our psychology was evolving that the perceived incentives to do so now have to be very high indeed.

Which is what makes honor killings so puzzling. In a recent article, Andrzej Kulczycki and Sarah Windle summarized data on the circumstances behind more than 300 honor killings across Northern Africa and the Middle East. What jumps off the page when you look at their data is how flagrantly honor killings flout the First Law of Homicide: About three-quarters of honor killings are carried out by family members of the victim. To be specific, the victims’ brothers carry out 29% of them, fathers (and, to a much lesser extent, mothers) carry out about 25%, and “other male relatives” carry out an additional 19%. Of the remaining quarter, virtually all are carried out by the victims’ husbands or ex-husbands.

I’m really interested in that 75% that violate the First Law of Homicide. For the perpetrators of honor killings to over-ride their intuitive aversions to killing their own daughters or sisters, the perceived costs of “dishonor” must be very high indeed. We can’t precisely measure the exact fitness value of honor for someone who lives in a so-called culture of honor, of course, but the link between fitness and honor is undeniable. If you live in an honor culture, your honor determines your (and your children’s) job prospects, marriage prospects, ability to recruit help from neighbors, ability to secure a loan, and protection against those who would otherwise do you harm. Honor is an insurance policy, a social security check, and a glowing letter of recommendation rolled into one bundle. The fitness costs of tarnished honor in an honor culture can be steep.

One of the things I came to appreciate about honor while doing research for one of my books is that honor is a sacred commodity. It doesn’t follow the laws we expect actual physical stuff to obey, or the normal laws of economics, or even the normal rules that govern our everyday psychology. It follows the laws of Sacred Things. If you feel sad one day, you can be pretty sure that the feeling won’t last forever. Dishonor doesn’t work like that. Dishonor doesn’t wash off or fade away with time. Dishonor has to be purged or atoned for. More importantly for my argument here, dishonor does not dilute. The dishonor that a “dishonorable” behavior creates for a family is not like a fixed quantity of scarlet paint that can be used to make only a finite number of scarlet letters. When a young woman “dishonors” her family, there’s enough dishonor to thoroughly cover every one of her brothers and sisters, no matter how many brothers and sisters she has.

There’s an interesting prediction waiting in the wings. If I’m right that dishonor does not dilute, then the perceived fitness-associated costs of a single act of dishonor will be larger for a father and mother with many children than for a father and mother with only a few children. This has implications for reducing honor killings. Let me illustrate with a thought experiment.

The Costs of Dishonor to a Father Are Higher in Large Families

Say I am a father with nine children and one of my daughters has done something (or, more likely, has had something done to her) that has brought dishonor upon herself and each of her eight siblings. (Believe me, I am more appalled by having to write sentences like these than you are by having to read them, but I can’t come up with a better way to think through these issues than to try to step into the shoes of someone who is actually factoring honor-related concerns into their social decision-making.) As the father of these nine children, I perceive that the dishonored daughter has reduced my fitness by 9d, because each of my children will suffer an honor-related fitness cost of d. (It might be better to quantify the hit to my fitness as 9d × 0.5 = 4.5d, because my genetic relatedness to my children with respect to a rare allele that I possess is 0.5 rather than 1.0, but that won’t change anything in what’s to come. Can we please agree to work with 9 so as to make the math prettier?) So, if I am a father of nine children and I can restore my family’s honor by murdering my dishonored daughter, I can recover 8d units of fitness (by restoring the damaged honor of my other eight children), and it costs me (I know, the thought sickens me as well) the fitness decrement I suffer through murdering one of my offspring.

If, on the other hand, I have only two children, then the perceived fitness cost of my daughter’s dishonor is 2d (a cost of d is imputed to both of my children), and I’d only be able to recover 1d unit of fitness (for my remaining, unmurdered child) by murdering the dishonored daughter. So, for a father with only two children, the calculus is not so clear: Am I better off in the long run to have two children whose honor is tarnished, or only one child whose honor is restored? For any plausible value of d, it’s hard to imagine that the decision-making scales would tilt in favor of killing the dishonored daughter if doing so would leave you with only one child. I’m betting that the father of two will stay his hand under circumstances in which the father of nine might not.
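For anyone who wants the bookkeeping laid out explicitly, here is a small sketch of the thought experiment’s arithmetic. The quantities d (the per-child cost of dishonor) and c (the cost of losing one offspring outright) are purely hypothetical placeholders, not estimates of anything.

```python
# A sketch of the perceived-fitness bookkeeping in the thought experiment.
# d and c are hypothetical placeholder quantities, not estimated parameters.
def perceived_net_benefit_of_killing(n_children, d, c):
    """
    n_children: total offspring, including the 'dishonored' daughter
    d:          perceived fitness cost of dishonor imputed to each child
    c:          perceived fitness cost of losing one offspring outright
    Returns the perceived fitness change from 'restoring the family honor'.
    """
    recovered = (n_children - 1) * d   # dishonor lifted from the surviving siblings
    return recovered - c

# With two children, the recoverable 1d can rarely outweigh the cost c;
# with nine children, the recoverable 8d can.
for n in (2, 9):
    print(f"{n} children: net = {perceived_net_benefit_of_killing(n, d=1.0, c=5.0):+.1f}")
```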

If I’m right about this, then a demographic shift toward smaller families in developing societies could eventually help to solve the problem of honor killings. I couldn’t find any direct evidence to support this prediction, but Manuel Eisner and Lana Ghuneim recently published a study in which they surveyed 856 Jordanian adolescents from 14 different schools to examine the predictors of their attitudes toward honor killings. They found that even when they controlled for the students’ sex (male vs. female), their religion (Muslim vs. non-Muslim), whether their mothers worked outside of the home (a good proxy for modernization), and the parents’ educational levels (also a good proxy for modern thinking), children with four or more siblings had more favorable attitudes toward honor killings than did children with three or fewer siblings. Not an exact test of my prediction, but to the extent that kids adopt their parents’ views, it seems to me that these results are at least tantalizingly consistent.

Do the human rights groups that want to reduce honor killings and other kinds of honor-related violence around the world ever talk about family size as a truly exogenous (and, in principle, modifiable) cause of honor killings? People are pinning their hopes for solving so many other problems around the world on reductions in family size, so perhaps I’m not being too pie-in-the-sky to add “reductions in honor-related violence” to that list of “Ways In Which We’d Be Better Off If People Had Fewer Kids.” As families shrink, I’m guessing that spared lives become subjectively more valuable than restored family honor.

Why Not Use Cap and Trade to Reduce False Positives In Science? An Elaboration

This post is a longer-form treatment of the Cap and Trade idea for controlling false positives in science that Dave Kelly and I outlined in our brief letter, which appeared in this week’s issue of Nature. It provides more background and additional details that we simply couldn’t cover in a 250-word letter.

First, the background. For the past several years, as many readers are surely aware, a replication crisis has been roiling the sciences. The problem, quite simply, is that some (probably large) proportion of published scientific findings are false. Many remedies have been proposed for addressing the replication crisis, including (1) system-wide changes in how researchers are trained in statistics and research methods; (2) exhortations to greater statistical and methodological virtue among researchers; (3) higher editorial standards for journal editors and reviewers; and (4) journal reforms that would require more transparency from investigators about hypothesis formulation, research methods, data collection, and data analysis as a condition for publication.

Most of these remedies are sensible, but Nature has suggested here and here that NIH officials have been contemplating an even more radical measure: Some sort of audit system in which independent laboratories would be tasked with trying to reproduce recently published scientific results from particular fields. An audit-based system would have its merits, but a cap and trade system might work even better. Our proposal rests on the idea that false positives are a form of pollution: I call it false positive pollution.

False Positives are Pollution

False positives fit the standard economic definition of pollution: They impose opportunity costs on others when they are emitted into the environment. If all published research findings were correct (i.e., if the false discovery rate were zero), then any single conclusion from any single research paper (“Drug X is safe and effective,” say, or “Cap and trade systems reduce false positives in scientific literatures”) could form the basis for confident decision-making. You could read a published paper and then take action on the basis of its conclusions, knowing that those conclusions reflected true states of the world.

However, the more false positive pollution a literature contains, the more costly, on average, it becomes to make decisions on the basis of any published finding. The recent Tamiflu debacle provides a vivid case study: The reason drug companies, governments, and individuals got so excited about Tamiflu as a treatment for flu was that their decision-making was distorted by irreproducible research results. The Tamiflu misadventure features false positive pollution doing what it does best: imposing costs on others, to the tune of $20 billion in wasted public expenditures (not to mention the harm the drugs might have done to their consumers, and the opportunity costs associated with not pursuing possible alternatives).

Likewise, if a published scientific article led you erroneously to believe that a particular laboratory technique was a good way to manipulate some variable in your research, and then you went on to base your PhD work on that technique—only to find that it did not work for you (because it actually doesn’t work for anybody)—then false positive pollution would have caused you to devote time and resources to hocus-pocus rather than the pursuit of something that could have produced actual scientific knowledge. This is one of the costs of false positive pollution that should really bother graduate students, post-docs, and anyone who cares about their career development: Trainees with just as much scientific promise as any other end up wasting their valuable training time on illusions. False positive pollution sets careers back.

A cap and trade system might be useful for reducing false positive pollution in the same way that cap and trade systems have, over the past 45 years, helped to reduce sulphur dioxide, nitrogen oxide, lead additives in gasoline, and even over-fishing. Below, I outline some of the steps we’d need to undertake as a society to implement a cap and trade system to control false positive pollution.

Step 1: Measuring Existing Levels of False Positive Pollution

The first step forward could be to estimate how much false positive pollution is emitted annually, which would require independent replications of random samples of published findings from the prior year. What we would be trying to estimate is the proportion of published experiments, out of the 1.5 million or so that are published each year worldwide, whose results cannot be independently reproduced even when the original protocols are followed exactly. I rather admire the way this was done in the Many Labs Replication Project: Several lab directors agree on the experimental protocol [ideally in collaboration with the investigator(s) who originally published the study] and then go back to their labs and re-run the experiment. The results from all of their independent attempts to replicate are then statistically aggregated to determine whether the original result is a true positive or a false positive.
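To give a flavor of what that aggregation step might look like, here is a minimal fixed-effect (inverse-variance-weighted) meta-analysis in Python. The Many Labs analyses are more sophisticated than this, and the effect sizes and standard errors below are invented placeholders.

```python
# A minimal sketch of aggregating independent replications with a
# fixed-effect, inverse-variance-weighted meta-analysis.
# The (effect size, standard error) pairs below are invented placeholders.
import math

lab_results = [(0.21, 0.10), (0.05, 0.12), (0.15, 0.09), (-0.02, 0.11), (0.10, 0.08)]

weights = [1 / se**2 for _, se in lab_results]
pooled = sum(w * d for (d, _), w in zip(lab_results, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
z = pooled / pooled_se

print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}, z = {z:.2f}")
# If the pooled estimate is indistinguishable from zero, the original
# result looks more like a false positive than a true one.
```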

Expensive, yes, but don’t let the expense distract you for the moment. Good research takes money, and we’re already hemorrhaging money through the production of false knowledge (keep the image of those warehouses full of Tamiflu vividly in your mind). Why not invest in trying to understand how much money we’re actually wasting and what we might do about it?

Step 2: Determining An Optimal Level of False Positive Pollution

Once we had an estimate of how much false positive pollution is emitted annually, we’d need to figure out how much false positive pollution we’d like to live with. A 100% pollution-free research literature would be nice. So would 100% pollution-free air. However, “100% pollution-free air” is an unrealistic goal. Compliance would be too expensive, and it would come with too many undesirable side effects. Likewise, a research literature that’s 100% free of false positive pollution sounds great, but that’s a goal that cannot be attained without adversely affecting the overall research enterprise. False positives are going to happen—even to scientists who have done their best to avoid them (after all, there is no such thing as a study with 100% statistical power). There must be some amount of false positive pollution we can tolerate.

One way to set an acceptable level of false positive pollution would be to measure the costs and benefits associated with the average false positive emission. How much money is wasted each time a researcher emits an erroneous “finding?” And how much would it cost to prevent such an event? These benefits and costs are likely to vary quite a lot from field to field, so I see good, plentiful work for economists here. In any case, with those data in hand, it should be possible to estimate the optimal amount of false positive pollution that we should be willing to tolerate—that is, the amount that maximizes society-wide benefits relative to costs.

But there’s actually a simpler way to set an acceptable level: Society tacitly endorses the idea that we can live with a 5% false positive pollution rate each time we accept the results of a study in which the p value threshold was set at .05. That’s what p < .05 actually means: “In a world in which the null hypothesis is true, we’d get results at least as extreme as the ones we obtained in this study in fewer than 5 out of 100 exact replications.” We could simply make a 5% FPP emissions rate our explicit society-wide ideal.

Step 3: Setting Goals

Once key stakeholders have agreed upon an acceptable annual level, whether that acceptable level is derived by measuring costs and benefits (as outlined above), or by the “5% fiat” approach, an independent regulatory body would be in a position to set goals (with stakeholder input, of course) for reducing the annual FPP emissions rate down to the acceptable level. (In the United States, the regulatory body might be the NIH, the NSF, or some agency that does the regulatory work on behalf of all of the federal agencies that sponsor scientific research; an international regulatory body might resemble the European Union’s Emissions Trading System.)

I’ll illustrate here with a simplified example that assumes a global regulatory agency and a global trading market. Let’s assume that the global production of scientific papers is 1,500,000 papers per year. Now, suppose the goal is to reduce the global false positive emission rate from, say, 50% of all research findings (I use this estimate here merely for argument’s sake; nobody knows what the field-wide FPP emission rate is, though for some journals and sub-fields it could be as high as 80%) to 5%, and we want to accomplish that goal at the rate of 1% per year over a 45-year period. (In our Nature correspondence, space limitations forced Dave and me to envision a move from the current emission levels to 5% emissions in a single year. The scenario I’m presenting here is more complex, but it’s also considerably less draconian.)

Our approach relies on the issuance of false positive pollution (FPP) permits. These permits allow research organizations to emit some false positive pollution, but the number of available permits, and thus, the total amount of pollution emitted annually, is strictly regulated. In Year 1, the Agency would distribute enough FPP permits to cover only 49% of the total global research output (or 1,500,000*.49 = 735,000 false positive permits). The number of permits distributed to each research-producing institution (universities are canonical examples of research-sponsoring institutions, as are drug companies) would be based on each institution’s total research output. Highly productive institutions would get more, and less productive ones would get fewer, but for all institutions, the goal would be to provide them with enough permits to allow a 49% emissions rate in Year 1. After the agency distributes the first year’s supply of FPP permits, it’s up to each individual research-sponsoring institution to determine how it wants to limit its false positive pollution to 49%. In Year 2, the number of permits distributed would go down a little further, a little further in the year after that, and so on until the 5% ideal was reached.
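Under the assumptions in that example (1.5 million papers per year, a 50% starting emission rate, a 5% target, and a one-percentage-point cut each year), the permit-issuance schedule would look roughly like this:

```python
# A sketch of the permit-issuance schedule described above, under the
# assumptions stated in the text: 1.5 million papers per year, a 50%
# starting emission rate, a 5% target, and a 1-percentage-point cut per year.
PAPERS_PER_YEAR = 1_500_000
START_RATE = 0.50
TARGET_RATE = 0.05
STEP = 0.01

rate = START_RATE
for year in range(1, 46):                 # 45 annual reductions
    rate = round(rate - STEP, 2)          # round to avoid floating-point drift
    permits = int(round(PAPERS_PER_YEAR * rate))
    if year in (1, 2, 45):
        print(f"Year {year:2d}: cap = {rate:.0%}, permits issued = {permits:,}")

# Year  1: cap = 49%, permits issued = 735,000
# Year  2: cap = 48%, permits issued = 720,000
# Year 45: cap =  5%, permits issued =  75,000
```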

By the way, there are lots of ways to make the distribution process fair to small businesses, independent scientists, and middle-school science fair geniuses (including, for example, exempting small research enterprises and individuals, so long as the absolute value of their contributions to FPP are trivially small) so it’s not fair to dismiss my idea on the basis of such objections. Cap and trade systems can be extremely flexible.

Step 4: Monitoring and Enforcement

Once the FPP permits have been distributed for the year, the regulatory agency would turn to another important task: Monitoring. In the carbon sector, monitoring of individual polluters can be accomplished with electronic sensors at the point of production, so the monitoring can be extremely precise and comprehensive. In the research sector, this level of precision and comprehensiveness would be impossible. We’d have to make do with random samples of research-producing institutions’ research output from the prior year. (Yes; some research studies would be difficult to replicate because the experiment or data set is literally unrepeatable. Complications like these, again, are just details; they don’t render a cap-and-trade approach unworkable by any means). If the estimated FPP emission rate for any research-sponsoring institution substantially exceeded (by some margin of error) the number of FPP permits the institution possessed at the time, the institution would be forced to purchase additional permits from other institutions that had done a better job of getting their FPP emissions under control. If you, as a research institution, could get your FPP emissions rate down to 40% in Year 1, you’d have a bunch of permits available to sell on the market to institutions that hadn’t done enough to get their emissions under control. In a cap and trade system, there is money to be made by institutions that take their own false positive pollution problems seriously.

The Virtues of a Cap and Trade System

Cap and trade systems have many virtues that suit them well to addressing the replication crisis. Here are a few examples:

  • Cap and trade systems use shame effectively. On one hand, they enable us to clearly state what is bad about false positives in a way that reduces righteous indignation, shame-based secrecy, and all of the pathologies these moralistic reactions create. On the other hand, were we to make information about institutions’ sales and purchases of false positive permits publicly available, then institutions would face the reputational consequences that would come from being identified publicly as flagrant polluters. Likewise, permit-sellers would come to be known as organizations whose research was relatively trustworthy. These reputational incentives would motivate all institutions—even Big Pharma and universities with fat endowments, which could afford to buy all the excess permits they desired on the open market—to get their emissions problems under control.
  • Cap and trade systems don’t rely on appeals to personal restraint, which are subject to public goods dilemmas. (Fewer false positives are good for everyone, of course, but I’m best off if I enjoy the benefits of your abstemiousness while I continue polluting whenever I feel like it.) Cap and trade systems do away with these sorts of free-rider problems.
  • Cap and trade systems encourage innovation: Each research-sponsoring institution is free to come up with its own policies for limiting the production of false positives. Inevitably, these innovations will diffuse out to other institutions, increasing cost-effectiveness in the entire sector.
  • A cap and trade system would be less chilling to individual investigators than a simple audit-and-sanction system would be because a cap-and-trade system would require institutions, and not just investigators, to share in the compliance burden. Research-sponsoring institutions take the glory for their scientists’ discoveries (and the overhead); they should also share the responsibility for reform.
  • Most importantly, cap and trade systems reduce pollution where it is cheapest to do so first. All of the low-hanging fruit will be picked in the first year, and harder-to-implement initiatives will be pursued in successive years. This means that we could expect tangible progress in getting our problems with false positives under control right away. Audit systems do not possess this very desirable feature.

Wouldn’t a Cap and Trade System Be Expensive?

Elizabeth Iorns estimated that it costs $25,000 to replicate a major pre-clinical experiment that involves in vitro and/or animal work. I don’t know that well-conducted laboratory-based behavioral experiments are that much cheaper (at least, once you’ve factored in the personnel time for running the study, analyzing the data properly, and writing up the paper). So all of those replications for goal-setting and monitoring purposes are going to cost a lot of money.

But bear in mind, as I already explained, that false positives are expensive, too—and they produce no societal benefit. In fact, what they produce is harm. It costs as much money to produce a false positive as it does to produce a true positive, but the money devoted to producing a false positive is wasted. If the United States spends around $70 billion per year on basic research, and even 10% of the resultant findings are false positives (which is almost surely a gross underestimate), then the U.S. alone is spending $7 billion per year to buy pollution. Also, Tamiflu. What if we used some of the money we’re currently using to buy pollution to make sure that the rest of our research budget is spent not on the production of more pollution, but instead on true positives and true negatives—that is, results that actually have value to society?

Cap and Trade: Something For Everyone (In Congress)

Here’s the final thing I like about the cap-and-trade idea: It has something for both liberals and conservatives. (I presume that enacting a project this big, which would have such a huge impact on how federal research dollars are spent, would require congressional authorization, and possibly the writing of new laws, but perhaps I am wrong about that). Liberals venerate science as a source of guidance for addressing societal problems, so they should be motivated to champion legislation that helps restore science’s damaged reputation. Conservatives, for their part, like market-based solutions, private sector engagement, and cutting fraud, waste, and abuse, so the idea should appeal to them as well. In a Congress as impotent as the 113th U.S. Congress has been, can you think of another issue that has as much to offer both sides of the aisle?

The Trouble with Oxytocin, Part II: Extracting the Truth from Oxytocin Research

Two weeks ago, the Society for Personality and Social Psychology (SPSP) held its annual meeting in Austin, TX. I tried to get there myself, as I had been invited to give a talk on the measurement of oxytocin in social science research as part of the “Social Neuroendocrinology” pre-conference. However, some things were brewing on the home front that kept me in Miami. Undeterred, the pre-conference organizers arranged for me to give my talk via Skype, which worked out reasonably well.

In this essay, I’ve turned some of that talk into the second installment in my “The Trouble with Oxytocin” series (the first installment is here). It’s a bit wonkish, focusing as it does on the importance of a bioanalytical technique called extraction, but it’s an important topic nonetheless. Many of the social scientists who are studying oxytocin have decided that they can skip this step entirely. As a result of their decision to take this shortcut, it’s quite possible that many scientific claims about the personality traits, emotions, and relationship factors that influence circulating oxytocin levels are—how to put this diplomatically?—without adequate basis in fact. I’ll substantiate this claim anon, but first, a bit of nomenclature.

A Bit of Nomenclature

Applied researchers generally measure oxytocin in bodily fluids by immunoassay—a technique so ingenious that the scientists who developed it received a Nobel Prize in 1977. Simplifying greatly, to develop an immunoassay for Substance X, you inject animals (probably rabbits) with Substance X and wait for the animal(s) to produce an immune reaction. To the extent that one of the antibodies an animal produces in response to Substance X is sensitive to Substance X, but not to other substances that can masquerade as Substance X, you may be in a position to conclude that you have successfully produced a “Substance X antibody.” With that antibody in hand, you’ve got the most important ingredient for developing an immunoassay.

Antibodies can be used to make several types of immunoassays, but two types are prominent in the oxytocin field: Radioimmunoassays (RIA) and Enzyme-Linked Immunosorbent Assays (ELISA, or EIA). Both methods are widely accepted (although ELISAs don’t require the analysts to handle radiation—a benefit to be sure). I wanted to familiarize you with these terms here at the outset only because I don’t want my toggling back and forth between them to distract you. The focal issue for our purposes here is the issue of extraction.

To Be Exact, You Must Extract

Extraction is a set of preliminary processes an analyst can use to separate Substance X from other substances in a sample of (for instance) blood plasma that might interfere with the immunoassay’s ability to quantify precisely how much Substance X is in the sample. I’m going to skip the details, but you can read up here. Antibodies can bind to all sorts of substances that are not Substance X (for example, proteins, other peptides, or their degradation products) if you’re not careful to remove that other stuff first. More relevant for our purposes here, researchers have known for a really long time that a failure to extract before conducting immunoassays for plasma oxytocin will result in profound overestimates of how much oxytocin is actually in the sample.

This is not some well-kept industry secret. The manufacturers of some of the more widely used commercial ELISAs have been admonishing the users of their assays to extract samples since at least 2007. Below is a snip from an instruction manual bearing a 2006 copyright. (The admonition gets repeated in this 2013-copyright instruction manual also):

[Excerpt from the manufacturer’s 2006 instruction manual, comparing oxytocin values from unextracted and extracted plasma]

What the manufacturers are showing here (see the two columns of data on the left) is that when they performed their oxytocin assay on a sample of human blood plasma without performing an extraction step, they read off an oxytocin concentration of 2,761 pg/ml (picograms [10⁻¹² grams] per milliliter). When they performed the extraction step on the same sample, they got a value of 3.4 pg/ml—three orders of magnitude smaller. Plain English translation: “There are some substances in human blood plasma that fool our antibody into believing they’re oxytocin molecules. You’d better get rid of those imposters before you run our assay on your sample. After you do that, we think you’ll be OK.” Keep this value of 3.4 pg/ml in mind. As I’ll show you below, it’s the sort of value, more or less, that one ought to be expecting from assays that actually measure oxytocin.

Like I say, the need for extraction is no secret. Basic biological researchers who study oxytocin have been extracting their samples since The Waltons had a prime-time slot on CBS. But extraction takes a lot of time, so it is expensive. Perhaps this is why a team of researchers started to skip the extraction step in the early 2000s.[1] In no time at all, other social scientists were following in their footsteps, and with that, a Pandora’s box was opened. Most social scientists just stopped extracting, often citing the originators of this custom to justify their choice.

In what follows, I’ll chronicle what happened to the social science literature on oxytocin as a result of this fateful methodological choice. Table 1, below, is from a paper that Armando Mendez, Pat Churchland, and I published last year.[2] It illustrates the typical oxytocin values one can expect to see in samples of extracted plasma measured by radioimmunoassay versus the values one can expect to see when using one of the commercial ELISAs on raw (i.e., unextracted) plasma.

Table 1. From McCullough, Churchland, and Mendez (2013)

A few things stand out in Table 1. First, when you measure oxytocin in blood plasma using RIA on extracted samples, you typically find that healthy, non-pregnant women and men have oxytocin levels of somewhere between 0 and 10 picograms per milliliter of blood plasma. This is consistent with that value of 3.4 pg/ml that I suggested you keep in mind from the 2006 instructions that came with that assay kit.

Below are some values that Ben Tabak, our neuroscience/biochemistry colleagues, and I obtained on 35 women whose oxytocin we measured in five different samples of plasma. Mean values were in the 1-2 picogram range.[3]

Adapted from Tabak et al. (2011)

The Tabak et al. (2011) sample was small. We had oxytocin values for only a few dozen women, so I won’t be offended if you don’t want to place too much trust in them, but here are some values that Tim Smith and his colleagues obtained with an RIA on extracted samples from 180 male-female couples: Again, their mean values hovered around 1-2 picograms per milliliter. [4]

From Smith et al. (2013)

So this is very reassuring.  The values that we got, and the values that Smith and his colleagues got, are very consistent with the 1-10 pg/ml range that we’ve come to expect over the past 35 years.

Table 1 (repeated). From McCullough, Churchland, and Mendez (2013)

But now take a look at the right side of Table 1 above to see what happens when you assay plasma for oxytocin using commercial ELISAs without extraction. It doesn’t matter whether you’re studying healthy non-pregnant women, healthy non-pregnant men, pregnant women, or new mothers: You’re going to get mean oxytocin values in the 200-400 pg/ml range, that is, values that are 100 to 200 times higher than what you get with RIAs on extracted samples.

Consider, for instance, the data below, which come from this paper, which the authors accurately described in the abstract as “[u]tilizing the largest sample of plasma OT to date (N = 473).” They found a mean value for men of approximately 400 pg/ml and a mean value for women of around 359 pg/ml.[5]

From Weisman et al. (2013)

Mean values of 200, 300, and 400 pg/ml for oxytocin in unextracted plasma are not exceptions to an otherwise orderly corpus of findings. They are what you should expect to find if you perform an oxytocin assay without extraction. For instance, the data below, from this paper, show the sorts of oxytocin values you can expect to find in the plasma of pregnant and recently pregnant women when you use ELISA on raw plasma:[6]

From Feldman et al. (2007)

The values above are measured in picomolars rather than in pg/ml, but oxytocin has a molecular mass of 1007 Daltons, so by sheer coincidence one picomolar of oxytocin is roughly equivalent to one pg/ml. In other words, these authors also got mean values for oxytocin using an ELISA on raw plasma that are way too high—and look at the upper end of those ranges—3,648 pg/ml! There’s just no good reason for believing that there could be 300 picograms of OT—much less 3,648—in a milliliter of blood plasma.
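Spelled out, the unit conversion behind that claim (using the 1007-Dalton molecular mass mentioned above) is:

```latex
1\ \mathrm{pM}
  \;=\; 1\ \frac{\mathrm{pmol}}{\mathrm{L}} \times 1007\ \frac{\mathrm{g}}{\mathrm{mol}}
  \;=\; 1007\ \frac{\mathrm{pg}}{\mathrm{L}}
  \;\approx\; 1\ \frac{\mathrm{pg}}{\mathrm{mL}}
```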

Why are these ELISAs giving such high values? There’s nothing wrong in principle with using an ELISA to measure OT in plasma, even though some of the commercial assays have used antibodies whose sensitivity and specificity are far from ideal. (This is an extremely important issue, by the way, but not the one to tackle here.) Instead, the predominant reason why researchers are getting such wacky values from these ELISAs is that they’re skipping the extraction step.

How do I know? Because I know what happens if you do extract your samples before you assay them via ELISA. Our research group found that when you extract your samples before you analyze them with a certain commercial ELISA kit, the mean values drop from somewhere around 358 pg/ml to somewhere around 1.8 pg/ml—just as you’d expect, given the admonitions in the manufacturer’s instructions.[7] And here are some extracted values that Karen Grewen and her colleagues got for 20 healthy breastfeeding mothers when they used the same ELISA that gave Weisman et al. those values in the 300-400 pg/ml range for raw plasma.[8] ELISAs can give plausible values if you extract first.

From Grewen, Davenport, and Light (2010)

Estimating OT from Unextracted Samples: Is There Any Signal Amidst the Noise?

Of course, none of this would matter very much if there were some way to statistically transform the OT values you obtain from unextracted plasma into the values you would have obtained from extracted plasma, but that doesn’t seem to be the case: The evidence currently available suggests that the values from the two methods are, quite possibly, uncorrelated.

We looked at this issue in our 2011 paper.[7] We had 39 plasma samples, which we analyzed with one of the most widely used commercial ELISAs, both before and after extraction. The correlation coefficients ranged from .09 to -.14, depending on distributional assumptions. Kelly Robinson and her colleagues just came to the same conclusion with their own data—52 samples of blood plasma from seals.[9] In fairness, I have to acknowledge another study that revealed a very high correlation between the oxytocin values derived from extracted samples versus those obtained from unextracted samples (0.89), but that study was based on very little data (11 samples of blood serum, rather than plasma, from Rhesus monkeys), so it would be a mistake to give it too much weight.[10]
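If you have paired measurements of your own and want to run the same check, the analysis is straightforward. Here is a minimal sketch (the numbers are invented for illustration, not data from any of the studies above) that computes both a Pearson and a Spearman coefficient, since the answer can shift depending on your distributional assumptions:

```python
# Minimal sketch: correlating oxytocin values obtained from the same plasma
# samples with and without extraction. All numbers are invented for illustration.
import numpy as np
from scipy.stats import pearsonr, spearmanr

extracted = np.array([1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 1.1, 2.4])                    # pg/ml, post-extraction
unextracted = np.array([310.0, 415.0, 287.0, 398.0, 352.0, 301.0, 440.0, 366.0])  # pg/ml, raw plasma

r, p_r = pearsonr(extracted, unextracted)        # linear (parametric) association
rho, p_rho = spearmanr(extracted, unextracted)   # rank-based association

print(f"Pearson r = {r:.2f} (p = {p_r:.2f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.2f})")
# If the two methods indexed the same underlying signal, both coefficients
# should be large and positive; coefficients near zero are what the published
# extraction comparisons have tended to find.
```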

Conclusion

So, what shall we conclude about oxytocin assays on unextracted plasma, given the data we have to go on at this point? Well, on the plus side, raw plasma is cheaper and quicker to assay than extracted plasma. Nobody disputes that. On the minus side, if you don’t extract those samples before you assay them, you apparently convert those ingenious oxytocin assays into random number generators, and there are cheaper ways to generate random numbers.

For ten years, many social scientists who study oxytocin have been side-stepping an expensive but evidently crucial extraction step. If you’ve come to believe that the trust of a stranger, or sharing a secret, or sensitive parenting, or mother-infant bonding, or your mental health, can influence (or is influenced by) how much oxytocin is coursing through your veins, you might want to take a second look. Chances are, those findings came from studies that used immunoassays on unextracted plasma (it’s easy to know for sure: just check the papers’ Method sections), and if so, there’s little compelling reason to think the results are accurate.

Now, if any researchers out there have data showing that we should take the results from immunoassays on unextracted samples at face value, they would do the field a great favor by making those results public, and at that point I will happily concede that all my worrying has been for nought. Even better, perhaps someone could conduct a large, pre-registered study on the correlation of OT values from extracted versus raw plasma. Pre-registration is easy (for example, here), and it would increase the inferential value of such a study immensely. In any case, more data on this topic would be most welcome. I, for one, would love to know whether we should be taking the results of studies on raw plasma seriously, or whether we’d be better off dragging them into the recycle folder.

References

1.         Kramer, K.M., et al., Sex and species differences in plasma oxytocin using an enzyme immunoassay. Canadian Journal of Zoology, 2004. 82: p. 1194-1200.

2.         McCullough, M.E., P.S. Churchland, and A.J. Mendez, Problems with measuring peripheral oxytocin: Can the data on oxytocin and human behavior be trusted? Neuroscience and Biobehavioral Reviews, 2013. 37: p. 1485-1492.

3.         Tabak, B.A., et al., Oxytocin indexes relational distress following interpersonal harms in women. Psychoneuroendocrinology, 2011. 36: p. 115-122.

4.         Smith, T.W., et al., Effects of couple interactions and relationship quality on plasma oxytocin and cardiovascular reactivity: Empirical findings and methodological considerations. International Journal of Psychophysiology, 2013. 88: p. 271-281.

5.         Weisman, O., et al., Plasma oxytocin distributions in a large cohort of women and men and their gender-specific associations with anxiety. Psychoneuroendocrinology, 2013. 38: p. 694-701.

6.         Feldman, R., et al., Evidence for a neuroendocrinological foundation of human affiliation: Plasma oxytocin levels across pregnancy and the postpartum period predict mother-infant bonding. Psychological Science, 2007. 18: p. 965-970.

7.         Szeto, A., et al., Evaluation of enzyme immunoassay and radioimmunoassay methods for the measurement of plasma oxytocin. Psychosomatic Medicine, 2011. 73: p. 393-400.

8.         Grewen, K.M., R.E. Davenport, and K.C. Light, An investigation of plasma and salivary oxytocin responses in breast- and formula-feeding mothers of infants. Psychophysiology, 2010. 47: p. 625-632.

9.         Robinson, K.J., et al., Validation of an enzyme-linked immunoassay (ELISA) for plasma oxytocin in a novel mammal species reveals potential errors induced by sampling procedure. Journal of Neuroscience Methods, in press.

10.       Michopoulos, V., et al., Estradiol effects on behavior and serum oxytocin are modified by social status and polymorphisms in the serotonin transporter gene in female rhesus monkeys. Hormones and Behavior, 2011. 58: p. 528-535.

A Refreshingly Human-Sounding Public Radio Interview: Yours Truly on Morality, Revenge, Forgiveness and Evolution

I have a friend who won’t listen to public radio in the U.S. It’s not that he objects to public radio programming or public radio values: It’s just that he doesn’t like the sonic quality of public radio programs. In the United States, at least, public radio is very heavily produced. I generally cannot be on a radio show that is syndicated to NPR (National Public Radio) stations unless I’m willing to schlep myself over to an ISDN studio, because NPR requires “that noiseless ISDN sound.” Turn your radio right now to an NPR station and you’ll get a decent sampling of what I’m describing. Sometimes I like that sound, but I must agree with my friend: It does sound rather sterile.

Ever since my friend mentioned this to me, I have been struck by how slick I generally sound (relative to real life) when I am on public radio shows in the United States. It’s not always a kind of slick that I like. Some of it has to do with the ISDN sound, but some of it also has to do with the editing after the interview is finished. Everyone involved ends up, I think, sounding smarter and more eloquent than they did during the interview itself. That’s not always a bad thing–nobody wants to sound like an idiot if he or she can help it–but as a listener, all of that sweet perfection can make you wonder if you’re at risk of getting a cavity.

I therefore found my recent interview with Charlotte Graham of Radio New Zealand, for her show Summer Nights, quite refreshing–particularly (though not only) from an aural point of view. It’s really just an uninterrupted and unedited phone call between me in Miami (at 9:00 PM my time) and Charlotte in New Zealand (where it was 3:00 in the afternoon of the following day). The phone line wasn’t, to say the least, ISDN quality, and both of us (though I to a rather greater extent than Charlotte) exhibited a healthy dose of the errors and disfluencies that characterize most people’s real conversations. Even so, we managed to cover some decent conceptual territory on evolution, culture, morality, revenge, and forgiveness.

Here’s a link to the interview. Hope you enjoy it.

The Trouble with Oxytocin, Part I:
Does OT Actually Increase Trusting Behavior?

It’s the holiday season, when many people try to clear a little mental space for thoughts about peace on earth and good will toward humanity. In this spirit, I thought I’d inaugurate this blog with a close look at an endocrine hormone that, according to some researchers, can promote trust, generosity, empathy, and, yes, even world peace. I’m referring, of course, to oxytocin (OT).

I’ve been involved with a few research projects on OT over the past few years, mostly in collaboration with my former PhD student Ben Tabak (plus some other colleagues here in Miami), but I’ve made no secret of my concerns about the validity of the techniques that scientists use to measure and manipulate OT experimentally. I also remain unconvinced that intranasally administered OT even makes it into the human brain in the first place. (Many experts think the brain is involved in the control of behavior, so this particular gap in our scientific knowledge seems to me like a problem that OT researchers should be taking a lot more seriously.)

I’ll probably write about these issues in the future, but for now I want to look closely at a much more circumscribed OT-related idea that took the scientific world by storm a few years back. This is the notion that spraying a little OT up people’s noses causes them to become more trusting toward strangers. Let’s look at the initial test of this hypothesis, as well as the evidence that emerged in the wake of the initial experiment, with the goal of estimating the strength of the evidence both for, and against, this charming idea.

The Kosfeld (2005) Experiment

In the very first experiment on oxytocin’s effect on trusting behavior, which bore the definitive title “Oxytocin increases trust in humans” [1], Kosfeld and colleagues randomly assigned 58 healthy men to receive either OT, or an equivalent amount of placebo, via a nasal spray. After the sprays had been given a chance to “kick in” (50 minutes), participants played four rounds (each time with different partners) of the Trust Game—one of the workhorses of experimental economics. The Trust Game is a two-player game in which one player takes on the role of the Investor (these are the subjects whose oxytocin-influenced behavior matters for our purposes here), and the other takes on the role of the Trustee. The Trust Game is hard to describe succinctly, but the Kosfeld paper has a helpful illustration.

Trust Game illustration, from Kosfeld et al. (2005)

The Trust Game is a two-stage game. In Stage 1, the Investor chooses how much money (in the Kosfeld experiment, either 0, 4, 8, or 12 “monetary units,” or “MU”) from a bolus of 12 MUs (which the experimenter provides) to transfer to an anonymous Trustee. (Participants are told that these MUs will be converted into real cash after the experiment ends.) The experimenters typically triple the transfer on its way to the Trustee. As a consequence, if the Investor sends 4 MU to the Trustee from her bolus of 12 MU (second branch from the left, marked “4”), the Trustee will finish Stage 1 with her original 12 MU, plus the additional 4 MU * 3 = 12 MU that result from the tripled transfer. The Investor, in contrast, will be left with 12 – 4 = 8 MU at the end of Stage 1.

In Stage 2, the Trustee is given a choice to send as much or as little of her 24 MU back to the Investor as she wishes. This is called a back-transfer. If the Trustee chooses to send 0 back, she keeps all 24 MU for herself. Anything she does send back to the Investor gets subtracted from the Trustee’s 24 MU and added to the 8 MU that remained in the Investor’s account at the end of Stage 1. The game is called the trust game on the assumption that people generally like money and prefer to have as much of it as possible. Under this assumption, it does make sense to conceptualize Investors’ choices about how much to send to their Trustees during Stage 1 as measures of their trust that the Trustees will reciprocate during Stage 2.
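If it helps to see that arithmetic spelled out, here is a minimal sketch of the payoffs under the rules just described; the function and names are mine, not Kosfeld’s:

```python
# Minimal sketch of one round of the Trust Game as described above.
ENDOWMENT = 12   # MUs held by each player at the start of the round
MULTIPLIER = 3   # the experimenters triple the Investor's transfer

def trust_game_payoffs(transfer: int, back_transfer: int):
    """Return (investor_payoff, trustee_payoff) in MUs for one round."""
    assert transfer in (0, 4, 8, 12), "Kosfeld et al. allowed only these transfers"
    trustee_pot = ENDOWMENT + transfer * MULTIPLIER   # Trustee's total after Stage 1
    assert 0 <= back_transfer <= trustee_pot, "back-transfer must come from the Trustee's pot"
    investor_payoff = ENDOWMENT - transfer + back_transfer
    trustee_payoff = trustee_pot - back_transfer
    return investor_payoff, trustee_payoff

# The worked example from the text: the Investor sends 4 MU, the Trustee returns nothing.
print(trust_game_payoffs(4, 0))    # (8, 24)
# A reciprocating Trustee who returns half of her 24-MU pot:
print(trust_game_payoffs(4, 12))   # (20, 12)
```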

So, the key question is this: Did OT increase Investors’ Stage 1 transfers in the Kosfeld experiment? That is, did OT increase their trusting behavior? Here’s what the authors wrote: “The investors’ average transfer is 17% higher in the oxytocin group (Mann-Whitney U-test; z = -1.897, P = 0.029, one-sided), and the median transfer in the oxytocin group is 10MU, compared to a median of only 8MU for subjects in the placebo group” (p. 674). The figure below, also from the Kosfeld paper, shows the distribution of transfers for the OT group and the placebo group.

Distribution of Investors’ transfers in the oxytocin and placebo groups, from Kosfeld et al. (2005)

Look at the far right side of the figure: The difference in the percentages of participants in the OT and placebo conditions who transferred all of their MUs (12) to their four Trustees is really quite arresting. The authors summarize this result on p. 674: “Out of the 29 subjects, 13 (45%) in the oxytocin group showed the maximal trust level [that is, they entrusted all of their MUs to their Trustees on all 4 rounds], whereas only 6 of the 29 subjects (21%) in the placebo group showed maximal trust.” Mind you, a statistical purist would likely have winced at the researchers’ use of a one-tailed statistical test—especially since the difference in the distributions for the two groups would not have registered as statistically significant at p < .05 (which signals that the results would be expected less than 5% of the time in a world in which the null hypothesis is true) with a two-tailed test. Nevertheless, just by looking at the figure you can understand why the authors got excited by their data.
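To make the one-tailed versus two-tailed distinction concrete, here is a minimal sketch using SciPy’s Mann-Whitney U test. The transfer vectors are invented for illustration; they are not Kosfeld’s raw data.

```python
# Minimal sketch: one-tailed vs. two-tailed Mann-Whitney U test on invented data.
from scipy.stats import mannwhitneyu

oxytocin_transfers = [12, 12, 10, 8, 12, 10, 12, 8, 12, 10]   # invented MUs
placebo_transfers  = [8, 10, 8, 4, 12, 8, 10, 8, 4, 8]        # invented MUs

# One-sided test: is the OT distribution shifted above the placebo distribution?
_, p_one_tailed = mannwhitneyu(oxytocin_transfers, placebo_transfers,
                               alternative="greater")
# Two-sided test: is there a difference in either direction?
_, p_two_tailed = mannwhitneyu(oxytocin_transfers, placebo_transfers,
                               alternative="two-sided")

print(f"one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
# A one-tailed p just under .05 (like Kosfeld's .029) roughly doubles when you
# switch to a two-tailed test, which is why the purists wince.
```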

The Kosfeld paper has become a citation classic. Google Scholar tells me that it has been cited 1,673 times as of today (by way of comparison, Watson and Crick’s 1953 Nature paper on the structure of DNA, which has also been sort of important for science, has been cited 9,130 times). But is it correct? That is to say, are the Kosfeld findings robust enough to license the conclusion that oxytocin really does increase trust in humans? Allow me to lay out the post-Kosfeld evidence so you can make up your own mind. I have located five post-Kosfeld experiments that examined the effects of intranasal OT on trusting behavior in the trust game, and I restrict my remarks to those experiments only. (I’m ignoring studies on people’s self-reported trust of strangers, for example, as well as a few other experiments that have used experimental games other than the trust game.) I have scored each of these five replication experiments as either a successful replication or a failure to replicate (or some admixture of success and failure). (Caveat lector: None of these studies is an exact replication of Kosfeld.)

The Post-Kosfeld Experiments

Replication 1: Baumgartner et al. (2008). In 2008, Baumgartner and colleagues ran a reasonably close replication of the Kosfeld experiment, though they modified the protocol so participants could play the trust games while their brains were being scanned via fMRI.[2] Forty-nine men, randomly assigned to receive either OT or placebo, played a series of six trust games (interleaved with six other kinds of games, which I’m ignoring) with anonymous partners. At the end of the first six trust games, Investors received the feedback that only 50% of their Trustees had made back-transfers. After this disappointing feedback, the Investors played six new trust games (interleaved with some other games) with six new anonymous partners. The figure below, from the supplemental online materials for the paper, shows the main results.

Figure from the supplemental online materials of Baumgartner et al. (2008)

As you can see on the left side of the figure, OT did not meaningfully increase trust during the first six “Pre-Feedback” rounds. Baumgartner and colleagues mostly ignored those results, however, focusing their discussion instead on the right side of the figure: In the six “Post-Feedback” trust games, OT participants entrusted significantly more money to their Trustees, on average, than did the placebo participants.

But it seems to me that we, as dispassionate consumers, are ill-advised to discount the lack of OT-vs.-placebo differences on the Pre-Feedback rounds: I myself am going to score them as an unambiguous “failure to replicate.” Nevertheless, it’s nearly Christmas, and science would stop progressing if we were unwilling to open our minds to new ideas, so I’m happy to score the results from the post-feedback rounds as a “successful replication” of Kosfeld. I am going to score Baumgartner, then, as a 50% successful replication and a 50% failure to replicate.

Replication 2: Mikolajczak et al. (2010).[3] Mikolajczak and colleagues randomly assigned 60 healthy men to either OT or placebo, and then had them play ten trust games with partners who had been described as “reliable,” and ten with partners who had been described as “unreliable” (and some other trials that aren’t directly relevant here). Men in the OT group entrusted more money, on average, to partners who had been described as “reliable” than did men in the placebo group, although there was no OT-vs.-placebo difference in the amounts entrusted to partners who had been described as “unreliable.” The results for the “reliable” partners can be interpreted as a reasonably successful replication of Kosfeld, and a good story can be told for why the results for “unreliable” partners are not a failure to replicate Kosfeld, but I’m not sure whether we can just ignore the lack of OT effects for unreliable partners entirely. I am going to score Mikolajczak as a 75% successful replication and a 25% failure to replicate. I admit that this is a hard one to call, though, and other people of good will could come to different conclusions about how to score this study.

Replication 3: Barraza (2010). Jorge Barraza [4] found that 44 healthy men who received OT did not invest more money in four consecutive trust games than did 22 men who received placebo (disclosure: I was an outside reader of Jorge’s dissertation, and co-authored a paper based on some of the results he obtained during that work). I’m calling this one a 100% failure to replicate. Take note that Investors played their four games with a single anonymous partner, with feedback on the back-transfers after each game, which makes this experiment a bit different from the others included here. Even so, it’s a mistake to exclude Barraza if we want to know whether Kosfeld and colleagues were right to claim that “Oxytocin increases trust in humans.”

Replications 4 and 5: Klackl et al. (2012) and Ebert et al. (2013). Only two more to go. Klackl and colleagues performed a fairly close replication of the 2008 Baumgartner paper with 40 healthy men (sans fMRI) and found that participants who received OT did not, on average, send more money to partners during six pre-feedback games, or during six post-feedback games.[5] (This study, therefore, is not only a failure to replicate Kosfeld, but also a failure to replicate Baumgartner.) I’m scoring Klackl as a 100% failure to replicate. Finally, Ebert et al. found that 26 people (13 who had been diagnosed with Borderline Personality Disorder and 13 non-diagnosed controls; mostly women) were no more trusting of 20 strangers in a series of trust games following OT administration than they were following administration of a placebo (all 26 participants did OT trials on one occasion, and placebo trials on another occasion, with counterbalancing).[6] On this basis, I’m calling Ebert, too, a 100% failure to replicate.

Summing Up

So, does OT increase trust in humans? The Kosfeld experiment found a faint statistical signal (remember, p = .029, one-tailed) for an effect of OT across a series of trust games with different Trustees, but statistical hard-liners who would insist on a p value less than .05—two-tailed—might reasonably argue that Kosfeld did not even find a phenomenon in need of replication to begin with. That said, the post-feedback rounds from Baumgartner look quite consistent with the claim that OT increases trusting behavior, as do Mikolajczak’s results for “reliable” partners (though I can’t convince myself to call Mikolajczak a 100% successful replication because of the failure to find effects for the “unreliable” partners). On the other hand, the pre-feedback rounds from Baumgartner, and the results from Barraza, Klackl, and Ebert, look to me like out-and-out failures to replicate Kosfeld.  (Plus, I’m going to weight 25% of the Mikolajczak results as a failure to replicate; again, I don’t think we can just ignore the lack of effects for unreliable partners, or pretend that the original Kosfeld hypothesis explicitly entails such a pattern.)

Adding up these scores, then, leads me to conclude that the original Kosfeld results have been succeeded by 1.25 studies’ worth of successful replications and 3.75 studies’ worth of failures to replicate. Here’s the box score for the replications:

 

Replication        Success   Failure
1. Baumgartner       .50       .50
2. Mikolajczak       .75       .25
3. Barraza             0       1.0
4. Klackl              0       1.0
5. Ebert               0       1.0
Total                1.25      3.75
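For anyone who wants to double-check my bookkeeping, here is a trivial sketch of the tally; the fractional scores are simply the ones I assigned above.

```python
# Trivial sketch of the replication box score tallied above.
# Each value is the fraction of that study scored as a successful replication.
success_scores = {
    "Baumgartner": 0.50,
    "Mikolajczak": 0.75,
    "Barraza": 0.0,
    "Klackl": 0.0,
    "Ebert": 0.0,
}

successes = sum(success_scores.values())                  # 1.25
failures = sum(1.0 - s for s in success_scores.values())  # 3.75
print(f"successes = {successes:.2f}, failures = {failures:.2f}")
print(f"failures outnumber successes {failures / successes:.0f}:1")  # 3:1
```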

With the relevant post-Kosfeld data favoring failures to replicate by 3:1, I think a dispassionate reader is justified in not believing that OT increases trusting behavior–at least not in the context of the trust game. Should we do a few more studies just to make sure? Fine by me, but it seems to me that we, as a field, should have some sort of stop-rule that would tell us when to turn away from this hypothesis entirely–as well, of course, as how much data in support of the hypothesis we would need to justify our acceptance of it. In addition, I’m struck by the fact that no one has ever gotten around to reporting the results of an exact replication of Kosfeld. In light of the Many Labs Projects’ recent successes in identifying experimental results that do and do not replicate, I’d personally be content to believe the results of several (five, perhaps?) large-N, coordinated, pre-registered exact replications of the Kosfeld experiment. But until then, or until new data come in that are relevant to this question, I know what I am going to believe.

By the way, if you don’t like how I scored the studies, I would be curious to know how you would synthesize these results to come to your own conclusion. Also, there could be other data on this topic out there that I have failed to include. If you’ll let me know about them, I’ll get around to incorporating them here and updating my box score accordingly.

References

1.         Kosfeld, M., et al., Oxytocin increases trust in humans. Nature, 2005. 435: p. 673-676.

2.         Baumgartner, T., et al., Oxytocin shapes the neural circuitry of trust and trust adaptation in humans. Neuron, 2008. 58: p. 639-650.

3.         Mikolajczak, M., et al., Oxytocin makes people trusting, not gullible. Psychological Science, 2010. 21: p. 1072-1074.

4.         Barraza, J.A., The physiology of empathy: Linking oxytocin to empathic responding. 2010, Unpublished Doctoral Dissertation, Claremont Graduate University: Claremont, CA.

5.         Klackl, J., et al., Who’s to blame? Oxytocin promotes nonpersonalistic attributions in response to a trust betrayal. Biological Psychology, 2012. 92: p. 387-394.

6.         Ebert, A., et al., Modulation of interpersonal trust in borderline personality disorder by intranasal oxytocin and childhood trauma. Social Neuroscience, 2013. 8: p. 305-313.