Biostatistics & Population Health
Causation criteria: Bradford Hill
— Originally articulated in Hill's 1965 presidential address to the Section of Occupational Medicine of the Royal Society of Medicine, on the environment–disease relationship (smoking and lung cancer was the motivating case)
— Not a checklist, not a scoring system, not a statistical test — they are heuristics for inference applied after an association has been demonstrated
— Interpreting an epidemiologic study (cohort, case-control) that reports an exposure–disease link
— Public health / occupational medicine vignettes (lead, asbestos, vaping, opioid prescribing patterns)
— Pharmacovigilance signals (drug A → adverse event B)
— Distinguishing correlation from causation when bias, confounding, or chance could explain findings
— A valid statistical association must first be established
— Chance ruled out (p-value, confidence interval)
— Bias addressed (selection, information, recall)
— Confounding addressed (stratification, multivariable regression, matching)
— Only then is it meaningful to ask: is this causal?
— Mnemonic: "Strong Coffee Should Taste Better, Please Consider Extra Aroma"

— A study is described (often cohort or case-control), an RR/OR is given, and the examinee is asked which Hill criterion is best illustrated, violated, or missing
— Alternatively, a public health official or clinician must decide whether an exposure–outcome link is "causal enough" to act on
— Strength: "Smokers had a 20-fold increase in lung cancer" → large RR/OR favors causation; small effects more vulnerable to residual confounding
— Consistency: "Findings replicated across multiple populations, study designs, and investigators" → reproducibility
— Specificity: "Exposure causes one specific disease in a specific population" → weakest criterion (most exposures cause multiple outcomes; smoking causes many cancers)
— Temporality: "Exposure clearly preceded outcome onset" → the only mandatory criterion
— Biological gradient: "Higher dose or longer duration → higher risk" (dose–response)
— Plausibility: Fits known biological mechanism
— Coherence: Does not conflict with known facts about the disease's natural history and biology
— Experiment: Removing exposure reduces incidence (smoking cessation programs ↓ lung cancer); or RCT evidence
— Analogy: Similar exposures cause similar effects (thalidomide → suspicion of other teratogens)
— Duration and intensity of exposure → biological gradient
— Sequence of exposure and symptom onset → temporality
— Family/community clustering with same exposure → consistency

— RR ≥ 2–3 generally considered "strong"; RR 1.1–1.5 considered weak and easily explained by unmeasured confounding
— Hill's smoking–lung cancer example: RR ~9–20 depending on intensity
— Weak associations are not non-causal — they simply require more rigorous control of bias
— Check for meta-analyses, systematic reviews, multi-country cohorts
— Heterogeneity (I²) helps quantify inconsistency
— Watch for publication bias — funnel plot asymmetry weakens the consistency argument
— Graded exposure (pack-years, mSv, mg/day) should yield graded risk
— Threshold effects and non-monotonic curves (J-shape for alcohol/CV mortality) complicate but don't refute causation
— Absence of dose–response is a red flag, especially for chemical or radiation exposures
— In cross-sectional studies, temporality is often unknowable → causal inference is severely limited
— Cohort studies establish temporality best; case-control studies are vulnerable to recall and reverse causation
— Plausibility = fits known biology/mechanism
— Coherence = fits known epidemiology and natural history of disease
— Both are limited by current scientific knowledge — early in a discovery, plausibility may be low yet the association still causal (e.g., H. pylori and PUD initially dismissed)
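As a rough illustration of how heterogeneity (I²) quantifies inconsistency across studies, the sketch below computes Cochran's Q and I² from study-level log relative risks using fixed-effect inverse-variance weights. The function name `heterogeneity` and the four study values are hypothetical and chosen only for illustration; they are not drawn from any real meta-analysis.

```python
import math

def heterogeneity(log_effects, std_errors):
    """Return (pooled_log_effect, Q, I2_percent) using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations of each study from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    df = len(log_effects) - 1
    # I2 = share of total variation attributable to between-study differences
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

# Hypothetical relative risks and standard errors of log(RR) from four studies
rrs = [1.8, 2.1, 1.9, 2.4]
ses = [0.20, 0.25, 0.15, 0.30]
pooled, q, i2 = heterogeneity([math.log(r) for r in rrs], ses)
print(f"pooled RR = {math.exp(pooled):.2f}, Q = {q:.2f}, I2 = {i2:.0f}%")
```

A low I² supports the consistency argument; a high I² suggests the studies are not estimating the same effect and the pooled estimate should be read with caution.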

— Compute measure of effect appropriate to study design:
— Cohort study → relative risk (RR) = incidence in exposed / incidence in unexposed
— Case-control study → odds ratio (OR), which approximates the RR when the disease is rare (roughly <10%)
— Cross-sectional study → prevalence ratio or prevalence OR
— RCT → RR or absolute risk reduction (ARR), NNT = 1/ARR
— p-value < 0.05 conventionally
— 95% CI for RR/OR: if it includes 1.0 (the null value for a ratio), the association is not statistically significant at α = 0.05
— Wide CI = imprecise (small sample, low power)
— Type I error (α) = false positive; Type II (β) = false negative; power = 1 − β, typically ≥ 0.80
— Selection bias (Berkson, healthy worker effect, non-response)
— Information bias (recall, interviewer, misclassification)
— Lead-time / length-time bias in screening studies
— A confounder is associated with both exposure and outcome, and is not on the causal pathway
— Address by: randomization (design), restriction, matching (design), stratification, multivariable regression, propensity scores (analysis)
— Effect modification (interaction) ≠ confounding — it should be reported, not adjusted away
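The measures of effect and the confidence-interval rule above can be sketched from a single 2×2 table. This is a minimal example, assuming the standard large-sample (log-scale) standard errors for RR and OR; the function name `two_by_two` and the cell counts are hypothetical.

```python
import math

def two_by_two(a, b, c, d, z=1.96):
    """a = exposed with disease, b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease."""
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    # Large-sample standard errors on the log scale
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    def ci(estimate, se):
        return (math.exp(math.log(estimate) - z * se),
                math.exp(math.log(estimate) + z * se))

    return {"RR": rr, "RR_CI": ci(rr, se_log_rr),
            "OR": odds_ratio, "OR_CI": ci(odds_ratio, se_log_or)}

# Hypothetical cohort: 30/100 exposed and 10/100 unexposed develop disease
result = two_by_two(30, 70, 10, 90)
print(result)
```

In this made-up table the outcome is common (30% in the exposed), so the OR (about 3.9) overstates the RR (3.0), which is why the OR is read as an RR estimate only when the disease is rare.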

— Causation defined as: outcome in the exposed differs from what would have occurred in the same individuals had they been unexposed
— RCTs approximate this through randomization, which balances measured and unmeasured confounders
— Visual tool to identify confounders, mediators, and colliders
— Mediator: on causal pathway (don't adjust if estimating total effect)
— Collider: common effect of two variables — adjusting for a collider induces bias (collider-stratification bias)
— Uses genetic variants as instrumental variables (random at meiosis → mimics randomization)
— Strengthens causal inference for exposures like LDL, BMI, alcohol
— Systematic review/meta-analysis of RCTs > single RCT > cohort > case-control > cross-sectional > case series > expert opinion
— RCT is the gold standard because randomization handles unmeasured confounding
— Specificity often inappropriate (smoking causes many cancers; one cause → many outcomes is the norm)
— Plausibility is knowledge-dependent and can be circular
— No weighting between criteria
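Collider-stratification bias is easier to see in a small simulation. The sketch below generates an exposure and an outcome that are independent by construction, then restricts the sample on a common effect of both (for example, hospital admission). The selection rule `x + y > 1.0`, the sample size, and all names are assumptions made for illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def collider_demo(n=20000, seed=0):
    rng = random.Random(seed)
    exposure = [rng.gauss(0, 1) for _ in range(n)]
    outcome = [rng.gauss(0, 1) for _ in range(n)]  # independent of exposure
    # Collider: selection depends on BOTH exposure and outcome
    selected = [(x, y) for x, y in zip(exposure, outcome) if x + y > 1.0]
    r_all = pearson(exposure, outcome)
    r_selected = pearson([x for x, _ in selected], [y for _, y in selected])
    return r_all, r_selected

r_all, r_selected = collider_demo()
print(f"whole population r = {r_all:.2f}; selected on collider r = {r_selected:.2f}")
```

The correlation is near zero in the full population but clearly negative among the selected, even though no causal link exists. This is the same mechanism behind Berkson's bias in hospital-based samples.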

— Temporality: Exposure must precede outcome. Without it, no causal claim is possible. Reverse causation (disease causes the "exposure") is the classic alternative explanation
— Example violation: low cholesterol "causes" cancer — but occult cancer may lower cholesterol (reverse causation)
— Strength: Large RR/OR harder to explain by residual confounding
— Biological gradient: Dose–response is mechanistically reassuring
— Consistency: Replication across populations, designs, investigators
— Experiment: Intervention reverses outcome (cessation, regulation, RCT)
— Plausibility: Depends on current biological knowledge
— Coherence: Compatible with natural history of disease
— Analogy: Similar exposures cause similar effects
— Specificity: Most exposures cause multiple outcomes; most outcomes have multiple causes — violating specificity is the norm, not evidence against causation
— Identify which criteria the study demonstrates, which it fails, and which are unaddressable by the study design
— Cross-sectional studies cannot demonstrate temporality
— Single-study results cannot demonstrate consistency
— Animal models contribute plausibility but not human consistency
— Strong, consistent, dose–responsive association with temporality and plausibility → act, even without RCT (smoking, asbestos, lead)
— Weak, inconsistent association → demand more evidence before policy/clinical change

— Temporal relationship of event to drug administration
— Improvement on dechallenge (drug stopped → event resolves) → supports causation = Hill's "Experiment"
— Recurrence on rechallenge → strong causal evidence (rarely ethical)
— Dose–response → biological gradient
— Plausibility from pharmacology
— Alternative explanations excluded (confounding by indication)
— Patients prescribed drug X differ systematically from those not prescribed
— Example: antidepressants and suicide — depression itself is the confounder
— Addressed by active comparator design, propensity score matching, instrumental variables
— Sicker patients channeled to newer/safer-perceived drugs
— Can create spurious associations in either direction
— SSRI use and GI bleeding
— Temporality (use precedes bleed) ✓
— Strength (OR ~2) — moderate
— Consistency (multiple cohorts) ✓
— Dose–response (higher serotonin affinity → higher risk) ✓
— Plausibility (platelets depend on serotonin uptake) ✓
— Experiment (discontinuation reduces risk) ✓
— Conclusion: likely causal; labeling and clinical guidance reflect this

— Randomization balances measured and unmeasured confounders
— Blinding minimizes information bias
— Limits: cost, ethics (can't randomize to harm), generalizability (efficacy vs effectiveness), short follow-up
— Best for establishing Experiment criterion directly
— Exposure measured before outcome → satisfies temporality cleanly
— Calculates incidence and RR
— Vulnerable to loss to follow-up, confounding
— Best for common outcomes and rare exposures; long-latency diseases require decades of follow-up (Framingham, Nurses' Health Study) or a retrospective design
— Exposure and outcome both already occurred; investigator looks back through records
— Faster, cheaper; risk of incomplete data
— Best for rare outcomes (cancers, rare AEs)
— Selects on outcome → cannot calculate incidence, only OR
— Vulnerable to recall bias and selection bias
— Snapshot — measures prevalence, not incidence
— Cannot establish temporality → poor for causation
— Group-level data → risk of ecological fallacy (group-level association ≠ individual-level association)
— Useful for hypothesis generation only
— Interrupted time series, difference-in-differences, regression discontinuity — strengthen causal inference when RCT impossible (policy evaluation)

— A causal relationship demonstrated in one population may not transfer
— Hill's Consistency criterion partially addresses this — replication across diverse populations strengthens generalizability
— Magnitude of effect differs across subgroups (age, sex, genotype, comorbidity)
— Example: oral contraceptives + smoking → multiplicative VTE risk
— Effect modification should be reported and stratified, not adjusted away
— Contrasts with confounding, which should be controlled
— Often excluded from RCTs → causal inferences may not apply
— Competing risks (death from other causes) complicate outcome ascertainment
— Confounding by frailty/indication is severe
— Pharmacokinetic differences mean drug–outcome associations may have different dose–response curves in these groups
— Hill's biological gradient must be re-examined per subgroup
— Causation established in adults often extrapolated to children — but mechanisms (developmental biology) may differ
— FDA pediatric extrapolation framework formally evaluates whether adult causal evidence applies
— Almost universally excluded from RCTs → causation for pregnancy outcomes often relies on registries, case-control, and cohort studies
— Teratogenicity assessment leans heavily on Hill (thalidomide: strong, consistent, dose–responsive, temporal, plausible, analogous)
— Pharmacogenomics (CYP2C19 and clopidogrel, HLA-B*57:01 and abacavir) → effect modification at genotype level
— Single-ancestry studies limit consistency claims

— Smoking and lung cancer (Doll & Hill, 1950s): strong RR, dose–response with pack-years, consistent across countries, plausible (carcinogens in smoke), reversible with cessation (experiment), analogous to other tobacco cancers
— Asbestos and mesothelioma: near-specific exposure for a near-specific disease (rare instance where specificity holds)
— Lead and pediatric cognitive deficits: dose–response, consistency, plausibility, experiment (lead abatement → improved scores)
— Aspirin and Reye syndrome: temporal, strong, experiment (warning labels → ↓incidence)
— H. pylori and PUD/gastric cancer (Marshall): initially low plausibility but strong other criteria; experiment (eradication cures) clinched causation
— Workers are healthier than general population at baseline → biases occupational cohort studies toward the null
— Use internal comparisons (high-exposure vs low-exposure workers) rather than general-population SMRs
— Many environmental/occupational exposures have decades-long latency (asbestos → mesothelioma 20–40 yr)
— Studies must have adequate follow-up; short studies may falsely reject causation
— Certain causal links trigger public health reporting (occupational lead, communicable disease, suspected cluster of cancers)
— Clinicians have a role in surveillance
— When evidence is suggestive but not conclusive, public health may act to limit exposure (BPA, PFAS) — a policy decision that goes beyond strict Hill satisfaction

— Outcome causes the apparent exposure
— Example: physical inactivity "causes" obesity vs obesity causes inactivity
— Mitigated by: prospective design, lag analyses, Mendelian randomization
— Third variable associated with both exposure and outcome
— Classic: coffee and lung cancer (smoking confounds)
— Mitigated by: randomization, restriction, matching, stratification, regression, propensity scores
— Berkson's bias (hospitalized controls), non-response, loss to follow-up, healthy-worker
— Distorts the apparent association in unpredictable directions
— Differential: misclassification differs by group → biases toward or away from null unpredictably
— Non-differential: equal across groups → typically biases toward the null
— Cases remember exposures differently than controls (case-control studies)
— Mitigated by structured interviews, records-based exposure ascertainment
— Inferring individual-level causation from group-level data
— Famous example: per capita fat intake and breast cancer correlate at country level but not at individual level
— Opposite error: inferring population effects from individual-level data alone
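— Immortal time bias (pharmacoepidemiology): follow-up time before drug initiation, during which the patient by definition could not have died, is misclassified as treated time, falsely favoring survival in the "treated" group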
— Lead-time bias (screening studies): apparent survival benefit from earlier detection, not actual mortality reduction
— Using criteria as a checklist with scoring (Hill explicitly warned against this)
— Demanding all 9 criteria be met before accepting causation
— Treating specificity as essential (it usually isn't)
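The coffee, smoking, and lung cancer example of confounding can be worked through with stratification. The sketch below compares a crude risk ratio with a Mantel-Haenszel risk ratio pooled across smoking strata. All counts are invented so that coffee has no effect within either stratum; the function names are hypothetical.

```python
def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)

def mantel_haenszel_rr(strata):
    """strata: iterable of (cases_exposed, n_exposed, cases_unexposed, n_unexposed)."""
    numerator = denominator = 0.0
    for cases_exp, n_exp, cases_unexp, n_unexp in strata:
        total = n_exp + n_unexp
        numerator += cases_exp * n_unexp / total
        denominator += cases_unexp * n_exp / total
    return numerator / denominator

# Hypothetical data: exposure = coffee, outcome = lung cancer, stratified by smoking
strata = {
    "smokers":     (80, 800, 20, 200),   # 10% risk with or without coffee
    "non-smokers": (2, 200, 8, 800),     # 1% risk with or without coffee
}

# Crude analysis collapses the strata and ignores smoking
crude = risk_ratio(
    sum(s[0] for s in strata.values()), sum(s[1] for s in strata.values()),
    sum(s[2] for s in strata.values()), sum(s[3] for s in strata.values()),
)
adjusted = mantel_haenszel_rr(strata.values())
print(f"crude RR = {crude:.2f}, smoking-adjusted RR = {adjusted:.2f}")
```

With these made-up counts the crude RR is about 2.9 while the adjusted RR is 1.0: the whole apparent association comes from coffee drinkers being more likely to smoke.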

— Strong, consistent causal evidence + favorable risk-benefit + alternatives considered → change practice
— Example: COX-2 inhibitors and CV events (rofecoxib withdrawal 2004)
— Systematic reviews and meta-analyses synthesizing Hill-satisfied evidence → guideline committees (USPSTF, AHA/ACC, ADA) update recommendations
— GRADE framework formalizes evidence quality and recommendation strength
— FDA black-box warnings, drug withdrawals
— EPA exposure limits
— OSHA workplace standards
— Often requires lower causal certainty than clinical practice change because of population-scale stakes
— Strong temporal, consistent association during an outbreak may trigger action before full Hill satisfaction (precautionary principle)
— Example: Legionnaires' disease (1976) — cooling tower link acted on before full mechanistic confirmation
— Vaping-associated lung injury (EVALI, 2019) — vitamin E acetate identified rapidly using case-control and dechallenge logic
— Adverse event noticed → MedWatch/FAERS report
— Cluster noticed → notify public health department
— Patient harm from systems issue → root cause analysis, patient safety officer
— Epidemiologist / biostatistician for study design and analysis
— Public health authority for cluster investigation
— Risk management / legal for potentially preventable harms

— Each disease has multiple sufficient causes, each composed of component causes
— A component is necessary if it appears in every sufficient cause (e.g., HIV in AIDS)
— Useful for understanding multifactorial disease (CAD, cancer)
— Reframes "the cause" as a set of interacting components, not a single agent
— Causation = difference between observed outcome and counterfactual outcome had exposure differed
— Foundation of modern causal inference, RCTs, propensity scores
— Average treatment effect (ATE) and average treatment effect on the treated (ATT) are formal estimands
— Graphical representation of causal assumptions
— Identifies confounders to adjust, mediators to leave alone, colliders to avoid
— Formalizes when an observational study can estimate a causal effect (back-door criterion)
— Historical framework for infectious causation:
— Organism present in all cases
— Isolated and grown in pure culture
— Reproduces disease when inoculated
— Re-isolated from new host
— Limited in modern era (asymptomatic carriers, non-culturable organisms, viruses, multifactorial disease) → updated with molecular Koch's postulates (Falkow)

— A statistically non-significant finding (CI crosses null) may reflect insufficient power
— A significant finding may be a Type I error, especially with multiple comparisons
— Bonferroni correction or false discovery rate methods for multiple testing
— Most important non-causal explanation in observational research
— Residual / unmeasured confounding is always possible in observational data; randomization is the only design that balances unmeasured confounders, and only on average
— Cannot be fixed in analysis; must be prevented by design
— Always consider before accepting causation
— Particularly relevant in cross-sectional studies and biomarker associations
— Mediator: exposure → mediator → outcome (don't adjust if estimating total effect; do decompose for indirect effects)
— Confounder: not on pathway; adjust for it
— Effect modifier: changes magnitude of effect across subgroups; report stratified results
— Ice cream sales and drowning (confounder: summer)
— Storks and birth rates (rural areas)
— Hormone replacement therapy and CHD (observational studies suggested protective; WHI RCT showed harm — confounding by healthy-user bias)
— Surrogate endpoint correlation with clinical outcome does not guarantee that intervention effects on the surrogate translate to clinical benefit
— CAST trial: antiarrhythmics suppressed PVCs (surrogate) but increased mortality (clinical)
— Extreme values tend toward the mean on repeat measurement — can falsely suggest treatment effect
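The multiple-comparisons corrections mentioned above can be sketched in a few lines. This is a minimal illustration of the Bonferroni rule and the Benjamini-Hochberg step-up procedure for false discovery rate control; the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR procedure: find the largest rank k with p_(k) <= (k/m) * alpha,
    then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from five exposure-outcome comparisons in one study
p_values = [0.001, 0.012, 0.025, 0.045, 0.20]
print("naive p < 0.05:    ", [p < 0.05 for p in p_values])
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Four of the five comparisons look significant at an uncorrected 0.05 threshold, but only one survives Bonferroni and three survive Benjamini-Hochberg, which is the less conservative of the two.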

— Primordial: prevent risk factor development (built environment, food policy)
— Primary: prevent disease in those with risk factors (statins for hyperlipidemia, vaccines)
— Secondary: detect early disease (screening — mammography, colonoscopy)
— Tertiary: limit disability from established disease (cardiac rehab post-MI)
— Quaternary: prevent overmedicalization and iatrogenic harm
— Grade A/B: high/moderate certainty of net benefit → offer/provide
— Grade C: small net benefit → individualize
— Grade D: no benefit or net harm → discourage
— Grade I: insufficient evidence
— Underlying logic incorporates causal strength of exposure → outcome and intervention → outcome reduction
— Causation established (Doll & Hill case-control and cohort studies, 1950s; U.S. Surgeon General's report, 1964)
— Cessation reverses risk over years (Experiment criterion satisfied)
— Step 3 management: ask about tobacco use at every visit, advise to quit, assess readiness, assist (varenicline, bupropion, NRT, counseling), arrange follow-up (5 As)
— Causation of LDL → ASCVD supported by RCTs, Mendelian randomization, dose–response (LDL lowering → linear event reduction)
— Risk-based prescribing by 10-yr ASCVD risk (7.5% to <20% intermediate; ≥20% high)
— Diet, exercise, alcohol, sun protection — all grounded in causal epidemiology

— Rare AEs only detectable after widespread use (1 in 10,000)
— FDA MedWatch, FAERS, sentinel networks
— Hill criteria reapplied as signals emerge
— Continuously updated as new trials publish
— Cochrane reviews, GRADE updates
— HRT and CHD: observational studies suggested benefit, WHI RCT overturned → demonstrates limits of consistency criterion when underlying bias (healthy user) is shared across observational studies
— Saturated fat and CVD: ongoing re-evaluation with refined dietary epidemiology methods
— Statins: lipid panel, LFTs if symptomatic, CK if symptomatic
— Anticoagulants: INR (warfarin), renal function (DOACs), bleeding assessment
— Diabetes meds: A1c q3 months until at goal, then q6 months
— Antihypertensives: BP, K+, Cr (ACEi/ARB, diuretics)
— Shared decision-making: present causal evidence quality (RCT-derived vs observational), NNT/NNH, patient values
— Number needed to treat (NNT) translates causal effect into patient-level utility
— Absolute risk reduction is more meaningful than relative risk reduction for patients
— Frame risks in absolute terms and natural frequencies (5 of 1000 vs 0.5%)
— Avoid causal overstatement from weak observational data ("eggs cause heart disease")
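To show why absolute risk matters more to patients than relative risk, the sketch below applies the same relative risk to two different baseline risks and reports ARR, NNT, and a natural-frequency phrasing. The function name, the baseline risks, and the relative risk of 0.5 are hypothetical.

```python
import math

def absolute_effect(baseline_risk, relative_risk):
    """Translate a relative risk into absolute terms for a given baseline risk."""
    treated_risk = baseline_risk * relative_risk
    arr = baseline_risk - treated_risk           # absolute risk reduction
    rrr = 1 - relative_risk                      # relative risk reduction
    # Round before ceil to guard against floating-point noise such as 20.000000001
    nnt = math.ceil(round(1 / arr, 9)) if arr > 0 else None
    return {
        "arr": arr,
        "rrr": rrr,
        "nnt": nnt,
        "events_prevented_per_1000": round(arr * 1000),
    }

# Same 50% relative risk reduction, two hypothetical baseline risks
for baseline in (0.10, 0.01):
    e = absolute_effect(baseline, relative_risk=0.5)
    print(f"baseline {baseline:.0%}: RRR {e['rrr']:.0%}, ARR {e['arr']:.1%}, "
          f"NNT {e['nnt']}, prevents {e['events_prevented_per_1000']} events per 1000 treated")
```

An identical "50% risk reduction" corresponds to an NNT of 20 at a 10% baseline risk and an NNT of 200 at a 1% baseline risk, which is the reason to frame the benefit in absolute terms.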

— Clinical equipoise (genuine uncertainty about which arm is better) is required to ethically conduct an RCT
— Loss of equipoise (interim analyses showing strong benefit or harm) triggers DSMB to consider stopping
— Disclosure of risks, benefits, alternatives, right to withdraw
— Special protections for vulnerable populations (children, prisoners, pregnant patients, cognitively impaired)
— Tuskegee, Henrietta Lacks, Willowbrook — historical violations driving modern IRB oversight
— Suspected occupational disease (varies by state — silicosis, asbestosis)
— Communicable diseases per state list
— Suspected child/elder/intimate partner abuse (causation of injury)
— Adverse vaccine events → VAERS
— Adverse drug events → FAERS / MedWatch (voluntary for clinicians, mandatory for manufacturers)
— Daubert standard governs admissibility of expert scientific testimony in U.S. federal court
— Courts increasingly use Hill-like criteria to assess causation in toxic tort cases (asbestos, talc, glyphosate)
— Legal "more likely than not" (preponderance) is a lower bar than scientific consensus
— Overstating causation harms autonomy (unnecessary anxiety, avoidance)
— Understating causation harms beneficence (preventable disease)
— Frame evidence quality honestly
— Adverse events disproportionately occur at care transitions (hospital → home, primary → specialist)
— Documenting suspected drug–event causation in transfer summaries prevents recurrent harm — failure to document a probable adverse drug reaction at discharge is a classic Step 3 patient safety vignette


— Stem: "Patients exposed to chemical X had a 4-fold higher rate of disease Y; this pattern was observed in cohorts from three countries"
— Answer: Consistency (replication); not strength (which is the 4-fold)
— Trap: choosing strength because RR is mentioned — focus on the replication clause
— Stem: "In a cross-sectional survey, depressed adults reported higher alcohol use than non-depressed adults"
— Missing: Temporality — cross-sectional design cannot determine sequence
— Possibility of reverse causation (depression → drinking, or drinking → depression)
— Stem: "Researcher wants to determine if drug X causes rare hepatic failure"
— Answer: Case-control (rare outcome); RCT not feasible/ethical
— Stem: "Coffee drinkers had higher MI rates"
— Confounder: smoking (associated with both coffee consumption and MI, not on causal pathway)
— Stem: "Cases of lung cancer recalled occupational asbestos exposure more thoroughly than controls"
— Answer: Recall bias (information bias), not confounding
— OR 1.8, 95% CI 0.9–3.5 → not statistically significant; causal claim premature
— Given event rates of 5% (treatment) and 10% (control): ARR = 5%, NNT = 1/0.05 = 20
— Infectious agent (novel pathogen) → Koch/molecular Koch
— Environmental exposure (chronic disease) → Hill
— Strong, consistent, dose–responsive, temporal, plausible, but no RCT → public health action appropriate (smoking analogy)
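The two calculation-style vignettes above (the OR with a confidence interval spanning 1, and the NNT from 5% vs 10% event rates) can be checked with a short sketch. The helper names are hypothetical; the numbers are the ones given in the vignettes.

```python
def ci_excludes_null(lower, upper, null_value=1.0):
    """A ratio measure is statistically significant only if its CI excludes the null."""
    return not (lower <= null_value <= upper)

def nnt_from_event_rates(control_rate, treatment_rate):
    """Return (ARR, NNT) when treatment lowers the event rate."""
    arr = control_rate - treatment_rate
    if arr <= 0:
        raise ValueError("no absolute risk reduction; NNT is undefined")
    return arr, 1 / arr

# OR 1.8, 95% CI 0.9 to 3.5: the interval contains 1.0
print("significant:", ci_excludes_null(0.9, 3.5))

# Event rates of 10% (control) and 5% (treatment)
arr, nnt = nnt_from_event_rates(control_rate=0.10, treatment_rate=0.05)
print(f"ARR = {arr:.0%}, NNT = {nnt:.0f}")
```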

Bradford Hill's nine criteria (strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, and analogy) are heuristics, not a checklist, for judging whether an established statistical association reflects true causation. They are applied only after chance, bias, and confounding have been addressed, and temporality is the only universally required element.

