Tools for Those Who Summarize the Evidence Base
I'd noticed and browsed this very interesting piece by Jonah Lehrer in the New Yorker in December, but had not read it carefully until yesterday. It discusses publication bias (a.k.a. reporting bias) and the frustrations of researchers trying to replicate their own effects. The story about Jonathan Schooler is vividly told, as is the early work on ESP by J. B. Rhine. Yet of all the examples of the decline effect the article lists, the one that most piques my interest is John Crabbe's effort to conduct the exact same experiment at the same time in three different locations in the northern hemisphere (Albany, NY; Edmonton, Alberta; and Portland, Oregon). The experiment had to do with the effects of exposure to cocaine on the behavior of mice. The researchers attempted to hold seemingly every variable constant across the three settings. As Lehrer states,
The same strains of mice were used in each lab, shipped on the same day from the same supplier. The animals were raised in the same kind of enclosure, with the same brand of sawdust bedding. They had been exposed to the same amount of incandescent light, were living with the same number of litter mates, and were fed the exact same type of chow pellets. When the mice were handled, it was with the same kind of surgical glove, and when they were tested it was on the same equipment, at the same time in the morning.
(I'm guessing even more things were standardized than the ones listed.) Yet how did the results look? Disturbingly different. Lehrer reports (emphasis added):
In Portland the mice given the drug moved, on average, six hundred centimetres more than they normally did; in Albany they moved seven hundred and one additional centimetres. But in the Edmonton lab they moved more than five thousand additional centimetres. Similar deviations were observed in a test of anxiety. Furthermore, these inconsistencies didn't follow any detectable pattern. In Portland one strain of mouse proved most anxious, while in Albany another strain won that distinction.
The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise.
Of course, Lehrer reports these findings in a pop-science way that does not permit us to get a sense of the magnitude of the effects described (e.g., if standard deviations on the critical variables were large, then we are just seeing small wavering in the data), and I'd be curious how movement and anxiety were measured (surely questionnaires were not used!). It would be valuable to know whether the observed variation was within the range expected from sampling error or exceeded it. (Perhaps the original Crabbe report addresses this issue?)
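One way to formalize that question is a standard heterogeneity test. Here is a minimal sketch using Cochran's Q and I-squared; the site means echo the centimetre figures Lehrer quotes, but the standard errors are made-up placeholders (the article reports no dispersion), so the output is purely illustrative:

```python
# Hypothetical site means (extra cm moved) and standard errors.
# The SEs below are invented for illustration, NOT Crabbe's data.
sites = {
    "Portland": (600.0, 150.0),
    "Albany":   (701.0, 150.0),
    "Edmonton": (5000.0, 150.0),
}

# Fixed-effect (inverse-variance) pooled mean across sites.
weights = {k: 1.0 / se**2 for k, (m, se) in sites.items()}
pooled = sum(w * sites[k][0] for k, w in weights.items()) / sum(weights.values())

# Cochran's Q: weighted squared deviations from the pooled mean.
# Under homogeneity, Q follows a chi-square distribution with k-1
# degrees of freedom (here 2), whose 95th percentile is about 5.99.
Q = sum(w * (sites[k][0] - pooled) ** 2 for k, w in weights.items())

# I^2: the share of total variation attributable to between-site
# heterogeneity rather than to sampling error.
k = len(sites)
I2 = max(0.0, (Q - (k - 1)) / Q) * 100

print(f"pooled mean = {pooled:.0f} cm, Q = {Q:.1f}, I^2 = {I2:.0f}%")
```

With SEs this small relative to the between-site spread, Q dwarfs its critical value, which is exactly the "more than sampling error" scenario; with very large SEs, the same means could look like ordinary sampling noise.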
Those of us who work with human populations--which seem even more complex than mice--can surely relate to seemingly random results appearing in even very carefully conducted investigations. But I wonder whether what we label "noise" or "randomness" is just shorthand for the longer but more precise statement, "we know the results are heterogeneous and have no clue what is causing the deviations." Maybe it just takes a different expert (from the one who did the original research) to figure out why the results deviate. Just thinking out loud: maybe the mice had to travel farther to the Alberta site than to the other sites, or maybe something about the local ecologies is at work. Maybe Einstein was right when he opined, "God does not play dice with the universe."
Any discussion? I'd love to hear similar reports, or hear from people who know the Crabbe or Schooler work well enough to inform the discussion.
Three short thoughts:
First, difficulty in replicating results reminds me of what some have dubbed the "Proteus phenomenon" or the winner's curse -- basically, the tendency for early studies of a given effect to yield more extreme results than subsequent studies. (Perhaps this is discussed in Lehrer's piece, which I admittedly haven't read.) John Ioannidis and his colleagues have gotten quite a bit of press for their investigations into this, such as the following article:
Ioannidis, J. P., & Trikalinos, T. A. (2005). Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. Journal of Clinical Epidemiology, 58, 543-549. doi:10.1016/j.jclinepi.2004.10.019
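To make the winner's-curse idea concrete, here is a small simulation under assumed parameters: a modest true effect, an early literature that publishes only nominally significant estimates, and a later literature that publishes everything. All numbers are illustrative, not drawn from Ioannidis and Trikalinos:

```python
import random
import statistics

random.seed(0)

# Assumed setup: true standardized effect of 0.2, per-study SE of 0.15.
TRUE_EFFECT, SE = 0.2, 0.15

def study():
    """One study's effect estimate: truth plus sampling error."""
    return random.gauss(TRUE_EFFECT, SE)

# Early phase: only estimates reaching nominal significance (|z| > 1.96)
# make it into print -- a crude model of selective publication.
early = [e for e in (study() for _ in range(5000)) if abs(e / SE) > 1.96]

# Later phase: results are published regardless of significance.
later = [study() for _ in range(5000)]

print(f"early published mean = {statistics.mean(early):.2f}")
print(f"later published mean = {statistics.mean(later):.2f}")
```

The early published mean lands well above the true effect, while the later mean recovers it, so the effect appears to "decline" over time even though nothing about the underlying phenomenon changed.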
Second, I agree that tracking down Crabbe's original report(s) is essential for understanding these apparent failures to replicate -- partly for additional statistical info (e.g., measures of dispersion or uncertainty) and partly for details about study features. That's especially good advice in light of Lehrer's troubles with plagiarism earlier this year, which (in my opinion) reduces the credibility of his account.
Third, my bibliography on methodology for research synthesis includes several items related to replication. Here is a small sample of items I've moved to this bibliography's CiteULike home:
My larger bibliography is less user-friendly -- which is why I'm moving it to CiteULike as time, funding, and other resources permit -- but this blog page describes how to access it via the 'Article Alerts' feature section in Research Synthesis Methods:
This larger collection includes several more items that address aspects of replication, especially if we consider topics such as cumulative meta-analysis or meta-analysis versus single large trials.