Trust scientists less; trust humanity more

Michael Zhang
Aug 1, 2023


In this post, I review “Science Fictions” by Stuart Ritchie. This book took me down memory lane, exposing much of what I learned in class and in the media as shoddy science.

By now, the replication crisis is famous. Everyone knows that many of the landmark studies in psychology and social science, among other fields, have gone down in flames. But how did this crisis happen, and how do we make sure it doesn’t happen again? As a young scientist and long-time science enthusiast, I have double the reason to be interested in these questions, and so eagerly read Science Fictions by Stuart Ritchie.

Science Fictions is part storytelling, part cautionary tale, and part call to action. After a pithy guide to how science actually works and a brief history of the replication crisis, Ritchie skewers the four cardinal sins of science (which he calls “faults and flaws”): fraud, bias, negligence, and hype, as well as the woes they produce. In the last part of the book, he analyzes the perverse incentives that lead to bad science and how we can fix them. Let’s take a look at each part before reflecting on the merits of the whole.

Storytelling

The replication crisis began with the discrediting of three studies. On January 31, 2011, Daryl Bem, a top psychology professor at Cornell University, announced that he had found evidence of psychic precognition. Psychic precognition is, of course, physically impossible. When Stuart Ritchie and two colleagues reran Bem’s experiments three times, they failed to replicate the results, but the journal which published the original study refused to publish the replication on the grounds that they never publish replications of any kind. So much for reproducibility.

In April 2011, Science, one of the most prestigious journals in the world, published “Coping with Chaos” by Diederik Stapel. According to the paper, people show more prejudice and endorse more stereotypes in dirtier environments. His research had “clear policy implications”, he wrote: clean up the streets and you get rid of at least some racism! Unfortunately for clean freaks, his data had a major flaw: all of it was fabricated. As Stapel later confessed:

I faked research data and invented studies that had never happened. I worked alone, knowing exactly what I was doing…I didn’t feel anything: no disgust, no shame, no regrets. […] I invented entire schools where I’d done my research, teachers with whom I’d discussed the experiments, lectures that I’d given, social-studies lessons that I’d contributed to, gifts that I’d handed out as thanks for people’s participation.

If the first tremors of the replication crisis were caused by an obviously unphysical result, and the second by obvious fraud, the third was more serious: the downfall of an entire psychological concept, called priming. I vividly remember learning about priming in my university psychology class during the early 2010s. I found it bizarre but fascinating that people walk more slowly after reading words like old, grey, wise, knits, and Florida, as the original priming study from 1996 found. Later priming studies found even more interesting effects, like the “Macbeth effect”, where research participants were more likely to take an antiseptic wipe after recalling something unethical they had done.

The only problem is that these priming effects don’t replicate. The original 1996 study was done with a stopwatch controlled by research assistants who knew who was “supposed” to walk slower. When a bigger sample was measured with much more accurate infrared beams in a replication study, the effect disappeared. The Macbeth effect likewise failed to replicate, as did a ton of other priming effects. It turns out that plotting two points far apart on a piece of paper doesn’t make you feel more distant from loved ones; that writing moral dilemmas on black-and-white graph paper doesn’t push you toward more polarized judgments; and that people might not be more judgemental when their disgust is primed.

After the downfall of priming, many famous results in psychology fell in quick succession. Power posing probably doesn’t work: standing in a “powerful” position, with legs apart and hands on the hips, might make people feel more powerful, but does not change testosterone levels, cortisol levels, or financial risk-taking. The Stanford Prison Experiment is a staple of every psychology class, and the sadistic punishments the guards (who were supposed to represent ordinary men) spontaneously dreamt up stuck with me long after the class. It turns out that part of the reason the guards behaved the way they did is that the researchers told them to, including giving tips on how to dehumanize the prisoners.

Psychology can take some comfort from the fact that it is not alone in the replication crisis. A 2016 effort to replicate 18 laboratory experiments in economics had a success rate of 61%. Scientists at Amgen, a biotechnology company, successfully replicated only 6 of 53 high-impact cancer studies published in top journals. Another attempt to replicate 51 cancer studies was stymied because none of the papers contained enough technical detail to even attempt a replication. After painstakingly tracking down the original authors and their collaborators, the replicators managed to attempt 14 replications (with another 4 forthcoming): 5 were successful, 4 were partially successful, 3 were clear failures, and 2 had uninterpretable results. In 2019, Prasad, Cifu, and colleagues found that 396 studies out of 3000 in three top medical journals overturned the consensus on a medical practice. One familiar example of medical reversal concerns peanut allergies: the guidelines used to recommend not giving babies peanuts until they were at least 3, which turned out (at least according to a trial in 2015) to increase the probability of developing an allergy by a factor of 7.

The Sins

The deadly sins of science are not pride, greed, wrath, lust, gluttony, envy, and sloth, but fraud, bias, negligence, and hype.

This part of the book is not short on stories. The section on fraud is particularly juicy. We have already encountered one case of fraud in this review, that of Diederik Stapel, but Stapel is practically a saint compared to the Italian surgeon Paolo Macchiarini. In 2011, Macchiarini announced that he had successfully transplanted a synthetic trachea. This was thought to be impossible, because the trachea is not a sterile environment: the real tracheal tissue does not have time to “grow onto” the artificial trachea before bacteria from the outside world infect the suture point, causing complications and eventually death. Macchiarini’s strategy for dealing with this problem was to cover up the deaths of his patients. To give you an idea of how horrific this was, here is what Julia Tuulik, a Russian ballet dancer, told a documentary maker about her operation:

Everything is very, very bad for me. More than half a year I spent in Krasnodar in the hospital. Around 30 operations were performed on me under general anaesthesia. Three weeks after the first operation, I had a purulent fistula in my throat. And my throat is still rotting, even now. My weight is 47 kilograms. I can hardly walk. It’s hard to breathe, and now I have no voice. And the stench from me is such that people are recoiling in disgust. The second patient also stinks the same way. The artificial trachea is total crap.

Unfortunately, Tuulik died two years after her operation. To make the story all the more tragic, she didn’t even have a life-threatening condition. Her trachea was damaged by artificial ventilation after a car accident. Doctors damaged her trachea; other doctors then condemned her to a horrific death while trying to fix the damage.

Other patients didn’t fare much better:

One other Russian patient died in what was described as a ‘bicycle accident’, another died in uncertain circumstances the year after the operation, and another survived but only after the synthetic trachea had been removed. Macchiarini also operated on a Canadian-South Korean toddler at a hospital in Peoria, Illinois in the US in 2013, amid substantial media attention. She died just a few months later.

Most scientific fraudsters don’t cause quite this level of pain and suffering. They are more like the South Korean scientist Woo-Suk Hwang, who used doctored images to claim he had cloned human embryos; or the Japanese scientist Haruko Obokata, who photoshopped her DNA blots to claim she had found an efficient way to induce pluripotency in mature adult cells. America has its own rogues’ gallery of fraudsters, including Michael LaCour, a graduate student at UCLA who “showed” that gay canvassers are far more effective than straight canvassers at convincing people to support gay marriage. The only catch is that the survey never occurred: all the data was made up out of thin air, complete with stories and anecdotes to bamboozle his PhD advisor.

If fraud seems beyond the pale for most of us, the next sin, bias, is intimately familiar to all of us, whether we want to admit it or not. I have certainly not been innocent of it in my scientific career. There are two main types of bias: we want to confirm our pet theory and refute our competitor’s, and we want to discover new and exciting phenomena rather than null results.

Ritchie doesn’t make this point, but I think there is a perfect continuum between the first type of bias and scientific prudence. If you measured a speed faster than light, would you publish it right away, or would you scrutinize it inside and out to discover what your mistake was, because there almost certainly is one? Extraordinary claims require extraordinary evidence, and it makes perfect sense to scrutinize a priori implausible claims extra closely. But scientific plausibility depends on the theory you subscribe to, and when that is controversial, it is not clear where good sense ends and bias begins.

The second type of bias is part of human nature. Scientists, journals, funding agencies, the government, the public (basically, everyone with a pulse) are more interested in an effective Alzheimer’s treatment than in the 1,283rd unsuccessful one. We are more interested in exciting new discoveries than in yet another null result. Sometimes this leads us to publish the positive results while keeping the null results in the file drawer. Other times, we p-hack: we analyze the data in 20 different ways until one analysis yields p < 0.05, then claim a result barely related to the hypothesis we originally set out to test.
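To see how quickly this inflates false positives, here is a minimal simulation of my own (not from the book), with the simplifying assumption that the 20 analyses are independent:

```python
# Sketch: how testing the same null question 20 different ways inflates the
# chance of a "significant" result, even when no effect exists at all.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2_000    # simulated studies with no true effect
n_analyses = 20          # different ways of slicing the question
n_per_group = 30         # participants per group

false_positives = 0
for _ in range(n_experiments):
    p_values = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_analyses)
    ]
    if min(p_values) < 0.05:   # report whichever analysis "worked"
        false_positives += 1

print(f"Chance of at least one p < 0.05: {false_positives / n_experiments:.2f}")
# Roughly 1 - 0.95**20, or about 0.64, versus the nominal 5%.
```

In a real p-hacked paper the 20 analyses reuse the same data and are therefore correlated, so the inflation is somewhat smaller than in this sketch, but the qualitative point stands: analytic flexibility buys “significance” for free.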

Ritchie certainly isn’t wrong that the file drawer effect and p-hacking distort the scientific literature. Unfortunately, just as with the first type of bias, they blend seamlessly into good sense. Does it really make sense to spend all your time writing up that A, B, C, and D are not cures for cancer, if it makes you five times slower at going through the rest of the alphabet and finding an actual cure? Are other scientists really interested in wading through a hundred papers with null results to find the one they can build on? As for p-hacking, it is all well and good to insist that scientists only test their original hypothesis, but some of the greatest scientific discoveries (penicillin, vaccination, the cosmic microwave background, microorganisms) happened by chance. It seems foolish to collect a large, high-quality dataset and not even look at it to see if interesting patterns pop up. Where does data mining end and p-hacking begin? There are no easy answers to these questions, but I suspect that, as usual, the inscription outside the Delphic oracle is as wise today as it was 2,500 years ago: nothing in excess.

A final form of bias that Ritchie briefly mentions is political bias. 90% of psychologists are liberal; how likely are they to endorse politically incorrect findings, or to carefully scrutinize findings supporting their beliefs? According to Ritchie, it could be partially due to this political bias that the field came to believe in stereotype threat: the idea that test takers’ performance suffers when they are reminded that, according to certain stereotypes, they are not “supposed to” do well because they are female (for math), black (for most academic subjects), or white (for athletic ability). Ritchie claims the evidence for stereotype threat is quite weak, citing a 2015 meta-analysis that shows evidence of publication bias. More recent meta-analyses that I have found generally report that if stereotype threat exists, its magnitude is much smaller than initial studies indicated. For example, Warne 2021 reports no evidence of stereotype threat in females; Lewis & Michalak 2019 summarize the field and confirm that the reported effect size has been shrinking over time; and while it is not clear what Picho-Kiroga et al 2021’s conclusion is, their funnel plot certainly doesn’t inspire confidence that stereotype threat has been robustly detected:

Above: a funnel plot from Picho-Kiroga et al 2021, summarizing studies on stereotype threat. The bigger the study (higher the inverse standard error), the smaller the reported effect, with the highest-powered study finding nearly no effect. The small studies finding large ST effects have no counterparts finding equally large reverse ST effects, which could be due to publication bias.
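To make the funnel-plot logic concrete, here is a toy simulation of my own (not Picho-Kiroga et al.’s data): if there is no true effect but small studies are published mainly when they find a positive, “significant” result, the surviving literature shows exactly this kind of lopsided funnel, and a naive average of published effects is biased upward.

```python
# Toy model of publication bias: no true effect, but selective publication
# of "significant" positive results from small studies.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.0
published_d, published_precision = [], []

for _ in range(500):
    n = rng.integers(20, 400)          # per-group sample size
    se = np.sqrt(2 / n)                # approximate standard error of the effect size d
    d = rng.normal(true_effect, se)    # observed (noisy) effect
    significant = abs(d) / se > 1.96
    # Large studies get published regardless; small ones mostly only if they
    # are significant and in the "right" direction (plus a few lucky exceptions).
    if n > 200 or (significant and d > 0) or rng.random() < 0.1:
        published_d.append(d)
        published_precision.append(1 / se)

print(f"True effect: {true_effect}")
print(f"Mean published effect: {np.mean(published_d):.2f}")
# Plotting published_d against published_precision reproduces the asymmetric
# funnel: large effects show up only among the imprecise, small studies.
```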

Moving on, Ritchie deals with negligence. Scientists are human, which means that sometimes they make mistakes: either typos, transcription errors, arithmetic errors, and other banalities that could nonetheless fundamentally change the conclusions, or errors in the very design of the study, which could be due to apathy, poor training, or incompetence. How common are the banal errors? In 2016, Michèle Nuijten and colleagues used an algorithm to check the consistency of the statistics reported in 30,000 psychology papers, finding that half the papers had at least one numerical inconsistency. 13% had a serious mistake that could change the interpretation, e.g. by turning p > 0.05 into p < 0.05. Unsurprisingly, these mistakes tend to support rather than weaken the authors’ hypotheses, combining with bias to form an even less savory cocktail.
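The algorithm in question is, as far as I can tell, statcheck, which is distributed as an R package. The core idea is simple enough to sketch in a few lines of Python (my own illustration, not the authors’ code): parse a reported result, recompute the p-value from the test statistic and degrees of freedom, and flag any mismatch.

```python
# Sketch of a statcheck-style consistency test for a reported t-test result.
import re
from scipy import stats

def is_consistent(reported: str, tolerance: float = 0.005) -> bool:
    """Check a result reported in the form 't(28) = 2.20, p = .036'."""
    match = re.match(r"t\((\d+)\)\s*=\s*(-?[\d.]+),\s*p\s*=\s*([\d.]+)", reported)
    if not match:
        raise ValueError(f"Could not parse: {reported}")
    df, t_value, reported_p = int(match[1]), float(match[2]), float(match[3])
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)   # two-sided p-value
    return abs(recomputed_p - reported_p) < tolerance

print(is_consistent("t(28) = 2.20, p = .036"))  # True: p matches the statistic
print(is_consistent("t(28) = 2.20, p = .01"))   # False: the numbers don't add up
```

Run over tens of thousands of papers, even a crude check like this turns up a remarkable number of results whose reported p-values do not follow from their own test statistics.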

According to Ritchie, another particularly embarrassing example of negligence is the candidate gene literature. In the 2000s, scientists found genes apparently linked to everything from depression to schizophrenia to cognitive test scores, sometimes with effect sizes as large as 21%. Since then, improved technology has allowed scientists to simultaneously measure thousands of genetic variations in hundreds of thousands of people, completely blowing the early studies out of the water in terms of statistical power. Ritchie quotes a blogger to describe what scientists discovered:

Reading through the candidate gene literature is, in hindsight, a surreal experience: they were building a massive edifice of detailed studies on foundations that we now know to be completely false. As Scott Alexander of the blog Slate Star Codex put it: ‘This isn’t just an explorer coming back from the Orient and claiming there are unicorns there. It’s the explorer describing the life cycle of unicorns, what unicorns eat, all the different subspecies of unicorn, which cuts of unicorn meat are tastiest, and a blow-by-blow account of a wrestling match between unicorns and Bigfoot.’

Is it fair to call doing low-powered studies “negligence”? Ritchie makes the point that complex traits were known to be massively polygenic more than a century ago. As early as 1918, Ronald Fisher made a simple argument: complex traits like intelligence follow bell curves, while a single gene gives rise to a binary outcome (you have a certain variant or you don’t). To get a bell curve from these binary outcomes, thousands of them must be combined. If this argument is valid, and it’s hard for me to see why it wouldn’t be, it truly is surprising that scientists spent a decade chasing large-effect candidate genes (their studies were insensitive to small-effect genes) when it was mathematically impossible that they would find one. Were all those scientists really so stupid or ignorant for so long, or are there subtleties that Ritchie is missing? Not being a geneticist, I don’t know, and unfortunately, neither does GPT-4.
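Fisher’s argument is easy to demonstrate with a toy simulation (mine, not Fisher’s): sum many small binary contributions and you get a smooth bell curve; let a single gene carry a large effect and the distribution develops lumps that real traits like height or IQ simply don’t have.

```python
# Toy version of Fisher's 1918 argument: many small-effect variants produce a
# bell-shaped trait; one large-effect gene produces a visibly lumpy one.
import numpy as np

rng = np.random.default_rng(2)
n_people = 100_000

# Polygenic trait: the sum of 1,000 variants, each carried with probability 0.5
# and each adding one unit to the trait.
polygenic = rng.binomial(1000, 0.5, size=n_people)

# "Candidate gene" trait: one variant with a huge effect, plus a little noise.
single_gene = 10 * rng.binomial(1, 0.5, size=n_people) + rng.normal(0, 1, size=n_people)

for name, trait in [("polygenic", polygenic), ("single gene", single_gene)]:
    counts, _ = np.histogram(trait, bins=30)
    print(f"{name:12s}", counts)
# The polygenic histogram rises and falls smoothly around a single peak (a bell
# curve, courtesy of the central limit theorem); the single-gene histogram
# splits into two well-separated clumps, unlike any real-world complex trait.
```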

Moving on, we come to the last sin: hype. We are all familiar with hype; 99% of the time, hype is how the public hears about scientific discoveries. Sometimes, hype comes in the form of the unmentioned cross-species leap, best illustrated by the @justsaysmice Twitter account, which appends a deadpan “IN MICE” to breathless headlines about findings that were only ever demonstrated in mice.

Sometimes, hype comes in the form of unwarranted advice, or of conflating correlation with causation. Over the years, I have noticed that nutrition science is reliably heavy on hype and thin on actual evidence. Ritchie shares this impression:

Fads like microbiome mania wax and wane, but there’s one field of research that consistently generates more hype, inspires more media interest and suffers more from the deficiencies outlined in this book than any other. It is, of course, nutrition. The media has a ravenous appetite for its supposed findings: ‘The Scary New Science That Shows Milk is Bad For You’; ‘Killer Full English: Bacon Ups Cancer Risk’; ‘New Study Finds Eggs Will Break Your Heart’. Given the sheer volume of coverage, and the number of conflicting assertions about how we should change our diets, little wonder the public are confused about what they should be eating. After years of exaggerated findings, the public now lacks confidence and is sceptical of the field’s research.

The public is right to lack confidence, given that nutrition scientists keep saying that everything either causes or prevents cancer:

In a now-classic paper entitled ‘Is everything we eat associated with cancer?’, researchers Jonathan Schoenfeld and John Ioannidis randomly selected fifty ingredients from a cookbook, then checked the scientific literature to see whether they had been said to affect the risk of cancer. Forty of them had, including bacon, pork, eggs, tomatoes, bread, butter and tea (essentially all the aspects of that Killer Full English).

Why is nutrition science so terrible? Bias is one factor: studies are often funded by the food industry. Heavy reliance on observational studies is another. Running a randomized controlled trial over 30 years to study the long-term effects of a diet simply isn’t practical, so scientists instead ask people what they ate. Of course, survey participants don’t necessarily remember what they ate. Even if they had perfect memory, they might not want to admit to eating McDonald’s every day. And even if they had perfect memory and were perfectly honest, culture, ethnicity, location, social habits, socioeconomic status, religion, and virtually every other aspect of being human all affect what we eat. Disentangling causation from this morass of confounding factors is extremely difficult, if not impossible. Not that randomized controlled trials in nutrition science are immune: one of the poster children of rigor, the 7,500-participant PREDIMED study, which claimed to show cardiovascular benefits from a Mediterranean diet, was retracted because about 21% of the participants had not been properly randomized.

Nutrition science is of course not the only field plagued by hype; psychology is not immune either. Back in 2015, I remember reading about growth mindset: the idea that you can achieve anything just by saying “I can’t do X yet” instead of “I can’t do X” (I’m only mildly exaggerating!). Around the same time, the Microsoft campus was plastered with posters advertising growth mindset.

These posters spread like wildfire thanks to the promotion of Stanford University psychologist Carol Dweck, the idea’s originator. According to Ritchie’s Substack, which has the most recent updates on growth mindset, Dweck gave TED talks viewed by tens of millions in which she said:

Let’s not waste any more lives, because once we know that abilities are capable of such growth, it becomes a basic human right for children, all children, to live in places that create that growth, to live in places filled with ‘yet’.

She also lamented that growth mindset is not yet a national education priority, and she even published a paper in Science claiming that growth mindset can help solve the Israeli-Palestinian conflict, because it helped Israeli and Palestinian children build a higher spaghetti tower together (!).

Surprisingly, growth mindset does replicate, sometimes. A 2018 meta-analysis found an average effect size of 0.08 (95% confidence interval: 0.02–0.14). A 2019 Nature paper studying 6,300 students came up with an effect size of 0.11 standard deviations (95% confidence interval: 0.04–0.18). Another 2019 study, of 5,018 English pupils, found no effect whatsoever (effect size below 0.03 at 95% confidence). Growth mindset might yet be real (although confidence intervals that barely exclude zero are hardly the most slam-dunk evidence I’ve ever seen), and if so, it might yet improve your life incrementally. Just don’t expect it to be a magic wand.
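To put those numbers in perspective, here is a quick back-of-the-envelope calculation (mine, not from the studies cited) of what an effect of 0.08–0.11 standard deviations means for a typical student:

```python
# What does an effect of ~0.1 standard deviations mean in percentile terms?
from scipy.stats import norm

for d in (0.08, 0.11):
    new_percentile = norm.cdf(d) * 100   # where a previously median student ends up
    print(f"Effect of {d:.2f} SD: median student moves from percentile 50 "
          f"to about percentile {new_percentile:.0f}")
# Effect of 0.08 SD: median student moves from percentile 50 to about percentile 53
# Effect of 0.11 SD: median student moves from percentile 50 to about percentile 54
```

Real but modest, in other words, assuming the effect is there at all.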

The sagas of nutrition science and growth mindset illustrate another one of Ritchie’s points that I fully agree with: usually, hype is the fault of the scientist, not of the journalist. It wasn’t journalists who ran endless flawed observational studies. It wasn’t journalists who told Dweck to go on TED and claim growth mindset is a human right. In most cases, even science news articles that go far beyond the original paper in their exuberant claims are not the fault of journalists.

I have personal experience with how scientific results end up in newspapers. A few years ago, I had a result that I wanted press coverage for. I contacted my university’s press office, who sent a press strategist to interview me and hired an artist to make the illustrations. The strategist then wrote the article, put in the illustrations, and sent me the draft. I made changes to the draft, including both scientific corrections and stylistic suggestions. The strategist incorporated the changes and sent me a second draft. This repeated a few more times until both of us were satisfied, after which we set a date for the press release. The very day the release happened, I could easily find dozens of websites that carried the story. Almost all of them copied the press release word for word, the illustrations pixel by pixel. Not a word of these stories appeared without my review and approval.

To be fair, not every science news article is like this. After another paper of mine was published, I did not put out a press release or solicit publicity in any way. Instead, a journalist contacted me and said he was interested in writing a story. He interviewed me and asked a clarifying follow-up question afterwards, but did not allow me to see the draft before publication, as was standard policy at his newspaper. When the article came out, I found it to be incredibly accurate and even-handed. The journalist even quoted an expert in the field, who fairly and accurately mentioned my study’s limitations.

The Call to Action

In Part III of the book, Ritchie catalogs the perverse incentives that lead to all the scientific sins, and suggests how to fix them. The discussion is rich with juicy examples, but the perverse incentives essentially boil down to two:

  • Publish or perish. Where scientists aren’t literally paid per paper (which happens regularly in China), they are rewarded with job opportunities and/or tenure for publishing papers and attracting citations.
  • Having to constantly apply for grant funding

According to Ritchie, the perverse behaviors that these incentives lead to are legion. My inbox is bombarded with ads from predatory journals, which will publish anything for an exorbitant fee. Desperate to publish papers, many scientists either publish subpar results in predatory journals, or “salami slice” their results into the smallest possible publishable units to generate more papers. Others artificially boost their citation counts by citing themselves, or by pressuring others into citing them while serving as peer reviewers.

I am unconvinced that these really are perverse incentives. Surely we want scientists to make discoveries and publish papers? Surely we want to allocate scarce funding to the projects that are most exciting and promising? It is hard to imagine any set of incentives that rewards good science without also rewarding those who convincingly simulate good science. With a reliable, fast, and un-gameable oracle that could objectively rate the quality of a scientist, we would not need to count papers and citations or review grant proposals. Since we don’t have such an oracle (GPT-5 being some time away), citation counts are the next best thing. They are not immune to gaming, but citations are harder to manufacture than Ritchie implies. As he says himself:

…huge numbers of these papers receive barely any attention from other scientists. One analysis showed that in the five years following publication, approximately 12 per cent of medical research papers and around 30 per cent of natural- and social-science papers had zero citations.

In my experience, landmark papers in my field routinely have very high citation counts, the most talented scientists routinely have many highly cited papers, and dodgy papers (even by top scientists) routinely have few citations. The correlation is certainly not perfect, but nobody has demonstrated that any other objective metric does better.

However scientists and their work are evaluated, it is not surprising that many scientists game the rules when their status and their livelihoods are on the line. This is not a technical problem, or even a structural one. It is an inescapable feature of the sinful human condition. Liars, bloviators, and cheaters have always been with us, and always will be. Improving ways of detecting and punishing them is possible. Removing the incentive to lie, bloviate, and cheat is not.

The very last chapter of the book is entitled “Fixing Science”. Anyone expecting a plan to remove the perverse incentives altogether will be disappointed. Instead, the chapter is a laundry list of ideas. Most of them are good; many are increasingly being implemented; few strike me as truly revolutionary. These ideas include:

  • Automated algorithms to catch statistical irregularities
  • Journals dedicated to null results
  • More willingness by journals to accept replication studies
  • Doing away with statistical significance
  • Using Bayesian statistics
  • Having a panel of independent statisticians, not the researchers, analyze the data
  • Having a government panel, not the university, investigate allegations of misconduct
  • Analyzing the data in many different ways (multiverse analysis)
  • Pre-registration of studies
  • Open science: publishing all data and code used to get the result
  • Open access: preprints
  • Considering “good scientific citizenship” (openness and transparency, quality over quantity) in hiring and tenure decisions
  • Funding scientists instead of specific projects
  • For grant proposals above a certain quality threshold, randomly selecting which ones to fund

As the book notes, some of these suggestions are already being implemented. Journals are already accepting more replication studies. Pre-registration is already required in clinical trials. In many fields, scientists put their papers on arXiv, medRxiv, or other preprint servers before publication. Open science is becoming more and more the norm as storage costs decrease and open source software proliferates. There are already journals dedicated to null results, although nobody wants to publish in them.

Other suggestions have debatable risk-reward tradeoffs. Funding scientists instead of projects risks letting a small clique of senior, influential scientists monopolize all the funding, while junior scientists and those with unpopular ideas end up even more out of luck than they already are. Considering good scientific citizenship in hiring and tenure decisions might sound like a good idea, but evaluating even one applicant’s “complexity of building international collaborations, arduousness of collecting and sharing data, honesty of publishing every study whether null or otherwise, and the unglamorous but necessary process of running replication studies” is extremely non-trivial. Doing so fairly and quickly across many applicants in different fields sounds nearly impossible.

Reflections

For me, Science Fictions was a trip down memory lane. Countless studies that I had read about and believed during the past ten years are all there, all ignoble examples of how not to do science. One moment I’d be back in a university psychology lecture, learning about power posing and stereotype threat. Another moment, I would be staring at the innumerable “growth mindset” posters at Microsoft, where I worked a decade ago. Then I’d be arguing on Facebook about the gay canvassing study, or reading Slate Star Codex’s brutal takedown of 5-HTTLPR studies.

Did the book change my view about science? I’m not sure. The replication crisis has been well known and endlessly written about for a decade now. It is in the air we breathe. While the book was useful for updating my beliefs on specific findings, it did not make me significantly more skeptical of science, because reading about failed replications on the Internet and elsewhere has already cured me of any instinct to blindly trust scientists.

So has becoming a scientist myself. Real data is filled with anomalies that nobody understands. Real data analysis, especially on data from a new instrument, depends on a multitude of defensible choices that often give different results, without any way of knowing which one is better. Real interpretation is a mixture of guessing which data analysis method is better, which points are outliers, which model is more plausible, and what effects are or are not important. Real paper writing involves telling a consistent story while sweeping all the dead ends and paths not taken under the carpet, to avoid turning the paper into a novel. Real consensus-building involves people obsessively defending their pet theories, people with egos wanting to prove how smart they are, people getting jealous, and people arguing because they just don’t like each other. Nothing has made me trust science less than becoming a scientist and seeing how the sausage is made. Scientists are human too.

If the book didn’t substantially change my opinion about science, it did reinforce my humanism. Think back to all the psychology and social science studies that failed to replicate. What do they say about humanity? In nearly every case, the replication failure means that we are not as manipulable as scientists once thought. Having temporary power over others in a prison doesn’t make us abusive sadists. Hearing about old people doesn’t make us walk slower. Being reminded of a stereotype doesn’t make us conform to it. A single gene or food doesn’t make us depressed. Overall, we are stronger, more resilient, more rational, and more consistent than psychologists wanted us to believe. If that means there is no magic wand that can effortlessly make us succeed in life, it also means that we have a coherent self that is our own, one that random passers-by cannot easily mold to their liking. If one lesson from the replication crisis is that scientists are as flawed as other humans, an equally valid lesson is that humanity might not be so flawed after all.
