What Is This Thing Called ‘Evidence’?
Back in 2013, I received a humbling email. I was being invited to be a guest on “The Joe Rogan Experience”. Unlike other guests, who are brought on for a free-wheeling three-hour conversation, however, I was being invited to engage in a specific task: to debate a man named Dr. Peter Duesberg, who was famous for refusing to accept the causal link between HIV infection and the development of AIDS.
At the time, I was a fan of Rogan’s show. I appreciated his open format, his conversational style, his selection of guests, his open-mindedness, and his intellectual humility. I was also a fan of Rogan the comedian, the martial arts commentator, the marijuana-liberalization activist, and the navigator of various self-development strategies about which I would otherwise not know.
Fast forward 10 years, and Rogan is now a media behemoth, his podcast attracting more listeners and viewers than CNN. In the era of COVID, the show seems to have embraced and championed some troubling anti-science positions, has platformed some truly rancid characters, seems to revel in amplifying some tiresome talking points from a particular political ideology, and Rogan himself has made some public statements that make me uncomfortable.
But while I no longer watch or listen to his show regularly, and no longer consider myself a “fan” (whatever that word means), I still appreciate the role that the Rogan podcast plays in greater society, and am still impressed by what the man himself has accomplished, pretty much all by himself. And if the topic or guest is of particular interest, I will take the time to tune in. (I’m looking at you, Graham Hancock!) Therefore, please don’t count me among those knee-jerk Rogan haters. I’m not. I’m just disappointed with some of his choices of late. Very disappointed.
So why did I say no to that once-in-a-lifetime invitation? I did so for a number of reasons, among them that I am not an expert in HIV/AIDS, and it would have been unethical to present myself as one, solely for the thrill of smoking pot with Joe Rogan in his California home. As well, I did not feel comfortable contributing to the platforming of a man (Duesberg) whose ideas I felt were dangerous and best debated within the pages of scientific peer review.
I replied to the booking agent who had contacted me that I am more an expert in discussing the nuances of causal inference, which to me is the meat-and-potatoes of Epidemiology. So if that was of interest, to discuss how scientists weigh and define evidence and thereby determine what causes what, then we could work something out. I didn’t hear back from them. Thus my window of opportunity to be catapulted before a seriously huge audience unceremoniously closed.
I’m thinking of that episode again today because paediatrician and vaccine developer Dr Peter Hotez has also been invited to debate a problematic character, anti-vaxxer Presidential candidate Robert F Kennedy Jr, on the Joe Rogan podcast. And Hotez’s reluctance to accept the challenge has riled up some truly awful people online.
I can’t help but wonder, if I had broached the subject on that particular growing platform nearly a decade ago, whether we would be in the predicament we’re in now. Not that my words would have been so impactful. But perhaps I would have contributed to a desire among laypeople to learn scientific methodology and the art of critical appraisal, rather than just other people’s opinions and conclusions.
What predicament am I talking about? Well, one in which public understanding of what truly constitutes “proper evidence” in medical research is so poor that the cracks between ideologies have been prised apart into gaping chasms by disingenuous uncertainty merchants in the wake of our current unprecedented public health crisis.
So that is what today’s topic will be: what is “evidence”?
Pyramid of Evidence
In medical science, what we’re really talking about here is the so-called hierarchy or pyramid of evidence. It looks something like this:
The study designs at the bottom are considered low quality, while those at the top are of higher quality. When making clinical practice guidelines, we summarize the evidence put forth in the very highest quality studies. You will find that the stuff inexpert people tweet about all the time is usually of the lowest-quality evidence types. So let’s briefly go through the different design types.
1. Animal or Laboratory Studies
At the very bottom are animal or laboratory studies. You know the type. A bunch of mice ate rotten cheese and now lasers come out of their eyes. Always makes the news. The problem is that it’s difficult to extrapolate from animal studies to human outcomes.
Animal studies often suffer from poor sample size, variable conditions, selection bias (having a sample that is not representative of the diversity of the greater population), and general poor study quality. Moreover, a lot of drug receptor sites in mice do not translate to humans.
Animal studies are great for establishing a mechanism for a drug action, for example, or for establishing some degree of safety, or even for getting some raw numbers for sample size calculations for human studies. Sometimes they can be used to make human-level decisions if the threshold for action is low. For example, when the bivalent COVID vaccines were updated to account for a new substrain of Omicron, the updated formulation was tested on animals to get a sense of how many antibodies it would generate. A whole new set of human trials wasn’t needed because the change to the vaccine was so tiny that re-running them would have been a waste of money.
In general, though, animal studies allow us to pose more questions for human studies.
2. Case Reports & Expert Opinion
This one surprises some people, but it shouldn’t. One of the lowest forms of evidence is the personal experience of an expert and his/her resulting opinion. And yet this is the type that moves most people and that possibly most directs policy. As someone who has given expert testimony in a court of law, I know that my personal opinion can carry much weight in such venues. And yet this is not the best kind of evidence.
Why? Because opinions are subjective. Two experts can disagree on the same subject. But two experts disagree much less frequently when assessing the results of the studies higher up on the pyramid.
3. Observational Studies
Let’s be clear. Observational studies are great. Sometimes people refer to them as “epidemiological studies”, which irritates me. Almost everything we know about how smoking increases your risk of lung cancer, for example, comes from observational studies. They are great for establishing the magnitude of the relationship between variables. For example, smoking x packs of cigarettes increases your risk of developing cancer by y percent.
But observational studies sometimes cannot determine temporality (i.e., whether an exposure came before an outcome) and certainly are not perfect in establishing causality. Ice cream sales are closely correlated with the frequency of shark attacks, for example, but one clearly does not cause the other.
What are observational studies? The big three types are cross-sectional, cohort, and case-control. Not to get too epi nerdy, but they are distinguished by when the exposure and outcomes are ascertained.
For example, a cross-sectional design ascertains exposure and outcome simultaneously. Imagine you’re answering a survey that asks you (a) are you hungry, and (b) do you have a headache. Such a design can compute the statistical relationships between hunger and headaches, but cannot determine which caused which, if either was indeed causal.
A cohort design ascertains exposure first, then waits to see if the outcome manifests. For example, imagine following 100 people for one year. You see which ones smoke cigarettes, then wait to see how many are diagnosed with lung cancer at the end of the year. Again, you can’t reliably determine if it was the smoking that caused the cancer, but you can certainly see if the smokers were more likely to get cancer than were the non-smokers.
And a case-control design is the opposite of a cohort. It ascertains outcome first, then “looks back in time” to see how many people had the exposure. So if you’ve got 50 people in a hospital with lung cancer (your “cases”), you find 50 similar people without lung cancer (your “controls”), then check their medical records or interview them to establish their history of smoking. A comparison of the smoking rates between the cancer patients and the cancer-free patients gives us the strength of association. That’s a case-control study.
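To make those distinctions concrete, here is a toy sketch in Python, with every number invented for illustration. It computes the natural measure for each of the three designs described above: a prevalence comparison for the cross-sectional survey, a relative risk for the cohort, and an odds ratio for the case-control, since fixing the number of cases ourselves means we cannot estimate risk directly.

```python
# Toy 2x2 tables for the three designs described above (made-up numbers).

# Cross-sectional survey: hunger and headache measured at the same time.
hungry_headache, hungry_fine = 30, 70
full_headache, full_fine = 15, 85
prev_hungry = hungry_headache / (hungry_headache + hungry_fine)
prev_full = full_headache / (full_headache + full_fine)
print(f"headache prevalence: hungry {prev_hungry:.0%}, not hungry {prev_full:.0%}")

# Cohort: ascertain exposure (smoking) first, then follow up for cancer.
smokers_cancer, smokers_total = 4, 40
nonsmokers_cancer, nonsmokers_total = 1, 60
relative_risk = (smokers_cancer / smokers_total) / (nonsmokers_cancer / nonsmokers_total)
print(f"cohort relative risk: {relative_risk:.1f}")  # 6.0

# Case-control: 50 cases and 50 controls selected on outcome, then we
# look back at exposure. Because the case count was fixed by us, risk
# is meaningless here; the odds ratio is the valid measure.
cases_smoked, cases_not = 35, 15
controls_smoked, controls_not = 20, 30
odds_ratio = (cases_smoked / cases_not) / (controls_smoked / controls_not)
print(f"case-control odds ratio: {odds_ratio:.1f}")  # 3.5
```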
Taken as a group, these three designs constitute the lion’s share of observational studies in medical research. Do not underestimate or dismiss them. They are powerful, and a lot can be derived from their careful application.
4. Experiments or RCTs
An “experiment” is technically a study in which the investigator controls and applies the intervention, or exposure. Remember, with observational studies we didn’t tell the people to smoke or not to smoke. We just observed as they made their own choices. In an experiment, we would actually make the subjects become exposed to something, like cigarette smoke.
The most powerful kind of experiment in the medical world is the randomized controlled trial, or clinical trial, or RCT. It’s the classic case that most people learn about in school. Randomize people into two groups or “arms”. One arm gets the thing we’re testing, and the other arm does not. In a classical RCT, the other arm would get a placebo, or something that resembles the intervention so that the subjects don’t know if they’re getting the real thing or not.
The power of the RCT is that randomization means most unaccounted-for factors are equally distributed between the two arms. We can also tell which came first, the intervention or the outcome. And we can control for all sorts of extraneous factors. Remember that an observational study really can’t 100% confirm that smoking causes cancer, though the very strong, repeated association heavily suggests that it does. An RCT would tell us with almost complete certainty whether that’s the case.
It would look like this. Take 100 five-year-old kids and randomize them into two equal groups. One group you compel to smoke regularly for 40 years. The other group, you make sure that they don’t smoke for 40 years. Oh, and you take great pains to make sure that neither group is exposed to confounding factors, like industrial smoke or other carcinogens. Obviously, such an RCT would never take place in an ethical world. That’s why we rely on a combination of observational and animal studies to draw causal inferences in this case.
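A quick simulation shows why randomization is so powerful. This sketch is mine, purely illustrative: every subject carries a hidden risk score the investigators never measured, yet a simple coin flip balances it across the arms.

```python
import random

random.seed(42)

# Hypothetical: 10,000 subjects, each with a hidden "risk factor" score
# that the investigators never thought to measure.
subjects = [random.gauss(50, 10) for _ in range(10_000)]

# Randomize each subject to an arm with a coin flip.
treatment, control = [], []
for risk in subjects:
    (treatment if random.random() < 0.5 else control).append(risk)

mean = lambda xs: sum(xs) / len(xs)
print(f"mean hidden risk, treatment arm: {mean(treatment):.2f}")
print(f"mean hidden risk, control arm:   {mean(control):.2f}")
# With a large sample, the two means come out nearly identical:
# randomization balances even the factors nobody accounted for.
```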
Now, the RCT is considered the “gold standard” of medical evidence. That’s what we teach medical students. They then go forth and only use published RCTs to direct their clinical practice. And when they advise governments and policymakers, they lean heavily on RCT evidence. Because that’s what we taught them. Some people (mostly physicians and economists) are so married to the idea of RCTs being the pinnacle of evidence that they will reject any health policy if it is not backed by an RCT. I won’t name any names, but maybe you can guess them.
But the RCT has very important limitations, which I will discuss later.
5. Meta-analyses of RCTs
At the top of most people’s evidence pyramids is the meta-analysis of existing high-quality studies, most obviously of RCTs. “Systematic reviews” are meta-studies that may or may not include a statistical meta-analysis.
The idea behind these reviews is that a given medical question has been interrogated so often via RCTs that there is a need to distill the summary guidance offered by the best of them. This is because not all RCTs on a given subject are exactly the same. So if one can phrase a very specific research question, then one can collect all the RCTs relevant to that question, select the highest-quality ones, then statistically combine their estimates into a pooled estimate.
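The statistical combination at the heart of a meta-analysis is conceptually simple. Here is a minimal sketch of fixed-effect, inverse-variance pooling with three invented trial results; real reviews, Cochrane’s included, add considerably more machinery (heterogeneity testing, bias assessment, and so on).

```python
import math

# Three hypothetical RCTs of the same question: each reports a log risk
# ratio and its standard error (all numbers invented for illustration).
trials = [
    (-0.22, 0.10),
    (-0.10, 0.08),
    (-0.35, 0.15),
]

# Fixed-effect inverse-variance pooling: weight each trial by 1/SE^2,
# so the most precise trials count the most.
weights = [1 / se**2 for _, se in trials]
pooled = sum(w * est for (est, _), w in zip(trials, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled risk ratio: {math.exp(pooled):.2f}")
print(f"95% CI: {math.exp(low):.2f} to {math.exp(high):.2f}")
```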
These kinds of reviews are the bedrock of most clinical guidelines. Organizations like the Cochrane Collaboration specialize in creating them.
The Problem With RCTs
It is very important that people understand that RCTs are almost entirely the domain of Medicine. Almost no other science conducts them. There’s a reason for this. The RCT is the highest form of evidence because it is a fantastic tool for measuring a causal relationship when there is a poor signal-to-noise ratio.
The signal is the relationship we are seeking, whether it’s the association between smoking and lung cancer, or the extent to which mask-wearing reduces COVID-19 transmission risk. The noise is the high degree of individual variation. The noise is so high among people because people are so very different from each other. Our bodies are different. Our behaviours are different. The ways in which we react to medications or other interventions are different. The ways in which we experience and report discomfort are also different.
An RCT attempts to hold all variables constant via its controlled nature, accepting that the only major source of variance will be individual variation. And with a large enough sample size, even that individual variation can be smoothed out.
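Here is what that smoothing looks like in a toy simulation (my own invented numbers, not any real trial): the true average benefit is a modest +2 units, buried under person-to-person noise ten times larger.

```python
import random
import statistics

random.seed(1)

# Hypothetical drug: true average benefit of +2 units, but individual
# responses vary wildly (standard deviation of 20 -- the "noise").
def estimated_effect(n):
    treated = [random.gauss(2, 20) for _ in range(n)]   # signal + noise
    control = [random.gauss(0, 20) for _ in range(n)]   # noise only
    return statistics.mean(treated) - statistics.mean(control)

for n in (10, 100, 10_000):
    print(f"n = {n:>6}: estimated effect = {estimated_effect(n):+.2f}")
# Small trials bounce all over; as n grows, the estimate converges on
# the true +2. That is individual variation being "smoothed out".
```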
Engineers and astronomers don’t need RCTs. When testing whether a seat belt works, you just need to build it into a car and crash the car to see if the belt holds. You can crash 100 cars, but the variation will be minimal. Sure, you can vary the type of car, the speed of the crash, etc. But you don’t need to randomize anything to account for variation. You just write down how the belt performs in different scenarios.
An RCT for seatbelt use in humans would be completely unethical, of course. And even if you were to conduct one, all you would be really finding out is whether people were using the belts correctly, because that would be the source of individual variation. After all, we already know from the engineering tests that the belts work.
And an RCT for parachutes? Well that’s the joke we always make. Sort of.
The Question of Masks
RCTs are not always feasible, are not always ethical, and frankly might not be answering the question you think they’re answering. And just because it’s an RCT, it doesn’t mean it’s a good RCT. Consider the question of whether masks work to slow transmission and impact of COVID-19.
Last Fall, mask skeptic Dr Matt Strauss was on TVOntario and cited a famous randomized trial from Bangladesh. He said that the data he pays most attention to is that trial, where they randomized 380,000 people and found “about a 10% reduction in transmission in those villages that were randomized to masks…. a 10% effectiveness rate is not terribly effective.”
Everything about that quote is wrong. The actual study did not randomize masks to half of 380,000 village residents. It randomized masking recommendations to half of 600 villages. Presumably as a result of the recommendation, 42.3% of people in the intervention villages wore masks, compared with only 13.3% in the control villages.
Did you catch that? In both arms of the study, only a minority of people wore masks, and the difference between the two arms was just 29 percentage points. Yet, despite that modest difference in masking, symptomatic seroprevalence of COVID was reduced by about 10%. Given how small the difference in masking between the two arms was, that 10% effect size is actually pretty big.
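To see why, here is a back-of-envelope calculation using the figures above. The linear extrapolation at the end is my own crude illustration, not anything the study claims, and real masking effects are unlikely to scale linearly:

```python
# Figures quoted above from the Bangladesh trial.
mask_control = 0.133        # proportion masking in control villages
mask_intervention = 0.423   # proportion masking in intervention villages
observed_reduction = 0.10   # ~10% drop in symptomatic seroprevalence

extra_masking = mask_intervention - mask_control
print(f"extra masking achieved: {extra_masking:.0%} of the population")  # 29%

# Crude linear scaling (illustrative only): if 29 extra percentage
# points of masking bought a ~10% reduction, pushing masking from the
# 13.3% baseline toward 100% might buy roughly three times as much.
implied = observed_reduction * (1 - mask_control) / extra_masking
print(f"naive extrapolation to near-universal masking: ~{implied:.0%}")
```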
Further, given that only a minority of villagers in the intervention arm were masked, it stands to reason that some maskers and non-maskers were in fact living under the same roof, where all masks were removed. So any infection acquired by a non-masker could be transmitted to a masker once at home. This surely diluted the effect size further. And yet it was still appreciable.
In fact, the authors concluded that, “Mask distribution with promotion was a scalable and effective method to reduce symptomatic SARS-CoV-2 infections.” The RCT was asking the very specific question of whether masking recommendations could increase mask usage (and, through that, reduce infection). Its intent was misread as testing whether masking per se reduced transmission.
Other studies asking the same question reach favourable conclusions, i.e. that the implementation of mask mandates in community settings is associated with statistically significant reductions in COVID-19 case growth.
Observational studies routinely show strong associations between mask wearing and reduced COVID transmission. A 2021 case-control study showed that the odds of testing positive were reduced by more than 50%, with greater reductions for higher-quality masks. And a clever study looked at the results of lifting mask mandates in US schools, finding that doing so resulted in an additional 44.9 cases per 1,000 students and staff.
A Cochrane Collaboration meta-analysis made the news some months ago for concluding that there is little evidence that masks prevent COVID transmission. The problem with this meta-study is that it combined RCTs that probably should not have been combined to answer a question that is not sufficiently related to the specific questions asked by the individual RCTs. A fantastic Scientific American article explains in detail why this systematic review was wrongheaded. In short, there may be few RCTs examining evidence of mask effectiveness, but it’s not an RCT question. It’s an engineering question! And there is plenty of engineering evidence that the masks work as advertised. Scientific American concluded, “It is therefore deeply concerning that prominent medical figures have misrepresented the protection provided by masks, when the evidence supports N95 respirators or better, ideally with two-way masking.”
Famously, there’s this study, which technically was an RCT seeking to compare whether N95 masks were any better than baggy blue surgical masks at preventing COVID infections. Yes, it’s an RCT. And if that’s all you knew about it, you’d take to heart its conclusion that there is no advantage to wearing an N95 mask.
The problem is that the study is, in my opinion, quite poor. For example, it appears that N95 mask wearing was intermittent, even though the protocols describe it as continuous. It further appears that some surgical-mask wearers could choose to wear N95s at times, making this what’s called an unplanned crossover trial.
I appreciate Michael Osterholm’s response to the study: “We just don’t need another poorly designed and conducted study on this.”
In short, I have three criticisms of RCTs on mask wearing:
(1) There is low signal-to-noise when it comes to the question of whether masks work to prevent COVID transmission. It’s a simple question that can be answered in an engineering lab.
(2) Because the mask is a technology that in and of itself experiences minimal individual variation, and RCTs are meant to control for individual human variation, the only sensible RCT application to mask efficacy is to test whether mask mandates have an effect on transmission. That is a very different question from whether the masks work per se.
(3) Just because something is an RCT doesn’t mean it’s a good RCT. Most people, especially physicians, just cite the conclusions and do not critically appraise the study’s methodology.
RCTs are great. But they are not meant for everything. And they are not a panacea.