Biostatistics & Population Health
Multiple comparisons and Bonferroni correction
— With k independent tests at α = 0.05 and all nulls true, FWER = 1 − (1 − 0.05)^k
— 5 tests → ~23% chance of a spurious "significant" result
— 20 tests → ~64%; 100 tests → ~99.4%
— Study reports many subgroup analyses (age, sex, race, comorbidity strata) and flags one as "significant"
— Trial has multiple secondary or exploratory endpoints
— Genome-wide, proteomic, or "-omics" studies (thousands of comparisons)
— Post hoc pairwise comparisons after ANOVA (e.g., comparing 4 treatment arms two-at-a-time = 6 tests)
— Repeated interim analyses of the same trial over time
— Dredging through EHR data for associations

— "A trial randomized patients to drug vs placebo. The primary endpoint was negative, but in post hoc analysis of 12 subgroups, mortality was reduced in patients over 75 (p = 0.03). What is the best interpretation?"
— "Investigators tested associations between a SNP and 50 phenotypes; one reached p < 0.05."
— "A study compared 4 antihypertensives pairwise (6 comparisons) and reported one significant difference at p = 0.04."
— "A registry analysis examined 30 dietary factors and cancer risk; coffee was associated at p = 0.02."
— Words like "post hoc," "exploratory," "subgroup," "secondary endpoint," "data-driven," "hypothesis-generating"
— Borderline p-values (0.02–0.049) — these are the ones most likely to evaporate after correction
— No mention of pre-specification of the analysis plan
— No mention of α adjustment (Bonferroni, Holm, FDR)
— "The investigators pre-specified three primary endpoints and applied a Bonferroni correction (α = 0.0167 per test)"
— "The trial used a hierarchical (gatekeeping) testing procedure"
— "FDR was controlled at 5% using the Benjamini-Hochberg method"

— For k independent tests at α: FWER = 1 − (1 − α)^k
— k = 1: 0.05
— k = 2: 0.0975
— k = 5: 0.226
— k = 10: 0.401
— k = 20: 0.642
— k = 100: 0.994
— Per-comparison error rate (PCER): α applied to each test independently — no correction
— Family-wise error rate (FWER): probability of ≥1 false positive across the family — controlled by Bonferroni, Holm, Šidák
— False discovery rate (FDR): expected proportion of false positives among all rejected nulls — controlled by Benjamini-Hochberg
— Bonferroni: extremely conservative; "every test must clear a high bar"
— Holm: stepwise, slightly more powerful than Bonferroni, still controls FWER
— Benjamini-Hochberg: controls FDR, far more powerful, appropriate for screening/discovery (genomics, biomarkers)
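The FWER values tabulated above can be reproduced directly from 1 − (1 − α)^k. The following is a minimal sketch that assumes independent tests with every null hypothesis true; the function name `fwer` is illustrative and does not come from any library.

```python
def fwer(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across k independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 2, 5, 10, 20, 100):
    # k * alpha is the union (Bonferroni) bound: never below the exact FWER.
    bound = min(1.0, k * 0.05)
    print(f"k = {k:>3}: FWER = {fwer(k):.3f}   (upper bound k*alpha = {bound:.2f})")
```

The printed bound column shows why k × α is only a rough guide: it tracks the exact value for small k and overshoots as k grows.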

— Adjusted α = α / k
— Equivalently: multiply each p-value by k (capping at 1) and compare to the original α
— 5 comparisons, want FWER ≤ 0.05 → each test must reach p < 0.01
— 10 comparisons → p < 0.005 per test
— 20 comparisons → p < 0.0025
— 100 comparisons → p < 0.0005
— Genome-wide significance (~1 million SNPs) → p < 5 × 10⁻⁸ (a famous Bonferroni-style threshold)
— 4 comparisons → 98.75% CIs
— The CIs widen; fewer will exclude the null
— Strong control of FWER under any dependence structure among tests
— Simple, transparent, easy to defend to regulators (FDA accepts it)
— Conservative — actual FWER is often well below the nominal α, especially when tests are correlated
— Loses power rapidly as k grows
— Treats all hypotheses as equally important (no weighting)
— Assumes the "family" is well-defined — a judgment call
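The Bonferroni arithmetic above (per-test threshold, adjusted p-values, and the matching confidence level) is easy to check in code. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the p-values in the example are made up for illustration.

```python
from math import comb
from statistics import NormalDist


def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps FWER <= alpha."""
    return alpha / k


def bonferroni_adjust(p_values):
    """Adjusted p-values: multiply by the number of tests, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def bonferroni_ci_level(k: int, alpha: float = 0.05) -> float:
    """Confidence level of each interval so that all k jointly cover at 1 - alpha."""
    return 1.0 - alpha / k


def ci_z_multiplier(k: int, alpha: float = 0.05) -> float:
    """Two-sided normal critical value for a Bonferroni-adjusted interval."""
    return NormalDist().inv_cdf(1.0 - alpha / (2 * k))


n_pairs = comb(4, 2)  # 4 treatment arms compared two at a time
print("pairwise comparisons:", n_pairs)
print("per-test threshold:", bonferroni_threshold(n_pairs))
print("adjusted p-values:", bonferroni_adjust([0.004, 0.03, 0.2]))
print("CI level for 4 comparisons:", bonferroni_ci_level(4))
print("z multiplier, 1 vs 4 comparisons:", ci_z_multiplier(1), ci_z_multiplier(4))
```

The larger z multiplier for 4 comparisons is what widens each interval relative to an unadjusted 95% CI.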

— Rank p-values smallest to largest: p₁ ≤ p₂ ≤ ... ≤ pₖ
— Compare p₁ to α/k, p₂ to α/(k−1), p₃ to α/(k−2), ...
— Stop at first non-significant result; reject all earlier
— Always at least as powerful as Bonferroni, controls FWER, still simple
— α_adj = 1 − (1 − α)^(1/k)
— Slightly less conservative than Bonferroni; assumes independence
— Rarely tested; conceptually the same idea as Bonferroni (split α across the k tests)
— Specifically for all pairwise comparisons after ANOVA
— Controls FWER, more powerful than Bonferroni for this use case
— Rank p-values; find largest i such that p₍ᵢ₎ ≤ (i/k) × α
— Reject all hypotheses up to that rank
— Workhorse of genomics, microbiome, biomarker discovery
— Controls expected proportion of false discoveries, not the probability of any
— Pre-specify an order: test endpoint 1 at full α; only if significant, test endpoint 2; etc.
— Common in cardiovascular outcome trials (e.g., MACE → CV death → all-cause mortality)
— No α adjustment needed because testing stops at the first failure
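The Šidák, Holm, and Benjamini-Hochberg rules described above can be sketched in a few lines each. This is an illustrative standard-library implementation, not a substitute for a validated statistics package; the function names and the five example p-values are assumptions made for this sketch.

```python
def sidak_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test threshold under the Sidak correction (assumes independent tests)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


def holm_reject(p_values, alpha: float = 0.05):
    """Holm step-down: compare sorted p-values to alpha/k, alpha/(k-1), ...
    and stop at the first one that fails."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[i] > alpha / (k - rank):
            break
        reject[i] = True
    return reject


def benjamini_hochberg_reject(p_values, alpha: float = 0.05):
    """BH step-up: find the largest rank i with p_(i) <= (i/k) * alpha
    and reject every hypothesis up to that rank."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / k * alpha:
            cutoff_rank = rank
    reject = [False] * k
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject


p_values = [0.001, 0.012, 0.014, 0.035, 0.2]  # hypothetical results
print("Sidak threshold (k=5):", sidak_threshold(5))
print("Holm rejections:      ", holm_reject(p_values))
print("BH rejections:        ", benjamini_hochberg_reject(p_values))
```

On these example values Holm rejects three hypotheses and Benjamini-Hochberg rejects four, which illustrates the trade: BH gains power by controlling the expected share of false discoveries rather than the chance of any.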

— Use Bonferroni or Holm, or hierarchical gatekeeping
— FDA generally requires a pre-specified strategy in the Statistical Analysis Plan (SAP)
— Pre-specify; apply Bonferroni/Holm or label as exploratory
— Unadjusted secondary endpoints should be reported as hypothesis-generating
— Pre-specified, limited number, with test for interaction → can be interpreted cautiously
— Post hoc → essentially exploratory; require replication
— Tukey HSD (all pairs) or Dunnett (vs control)
— Use alpha-spending functions (O'Brien-Fleming, Pocock) — a temporal analog of Bonferroni
— Benjamini-Hochberg FDR, typically at 5% or 10%
— No correction needed. This is the cleanest design — and the reason RCTs emphasize one primary endpoint.
— Across separate trials → no correction (different families)
— Within one trial across multiple endpoints → one family
— Multiple publications from one dataset → still one family conceptually; honest reporting required

— 15 subgroups → adjusted α = 0.05 / 15 = 0.0033
— Observed p = 0.04 → NOT significant after correction
— Expected number of false positives by chance: 15 × 0.05 = 0.75; with independent tests the chance of at least one is 1 − 0.95^15 ≈ 54%, so a single "significant" subgroup is unsurprising
— The primary endpoint failed — the drug, as tested, does not reduce MACE
— The subgroup finding is hypothesis-generating only
— Cannot ethically or scientifically prescribe based on this; need a prospective confirmatory trial in diabetic CKD patients
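The arithmetic behind this subgroup example can be checked in a few lines. This sketch assumes the 15 subgroup tests are independent, which real subgroups rarely are, so the probability it prints is a rough guide rather than an exact figure.

```python
k = 15             # number of subgroups examined
alpha = 0.05       # nominal per-test significance level
observed_p = 0.04  # the flagged subgroup result

threshold = alpha / k                    # Bonferroni per-test threshold
expected_false_pos = k * alpha           # expected chance findings if no effect exists
p_at_least_one = 1 - (1 - alpha) ** k    # chance of one or more false positives

print(f"Bonferroni threshold:        {threshold:.4f}")
print(f"Significant after correction: {observed_p <= threshold}")
print(f"Expected false positives:    {expected_false_pos:.2f}")
print(f"P(at least one by chance):   {p_at_least_one:.3f}")
```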

— HbA1c: −0.8%, p = 0.001 → significant (< 0.0125) ✓
— Weight: −2.1 kg, p = 0.008 → significant ✓
— SBP: −3 mmHg, p = 0.02 → NOT significant after correction ✗
— LDL: −5 mg/dL, p = 0.04 → NOT significant after correction ✗
— Rank: 0.001, 0.008, 0.02, 0.04
— Compare to: 0.05/4 = 0.0125, 0.05/3 = 0.0167, 0.05/2 = 0.025, 0.05/1 = 0.05
— 0.001 < 0.0125 ✓; 0.008 < 0.0167 ✓; 0.02 < 0.025 ✓; 0.04 < 0.05 ✓
— All four are significant under Holm. Holm recovered the two endpoints (SBP, LDL) that Bonferroni left non-significant.
— HbA1c p = 0.001 → significant, proceed
— Weight p = 0.008 → significant, proceed
— SBP p = 0.02 → significant, proceed
— LDL p = 0.04 → significant, all four endpoints "win"
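The three analyses of the four-endpoint example (Bonferroni, Holm, and hierarchical testing in a pre-specified order) can be compared side by side. This is a minimal sketch; the function names are illustrative, and the second data set with a failing first endpoint is a hypothetical added to show what happens when the gatekeeper fails.

```python
ALPHA = 0.05

# Endpoints in their pre-specified order, with the p-values from the example above.
endpoints = [("HbA1c", 0.001), ("Weight", 0.008), ("SBP", 0.02), ("LDL", 0.04)]


def bonferroni(results, alpha=ALPHA):
    """Every endpoint must clear alpha / k."""
    k = len(results)
    return {name: p <= alpha / k for name, p in results}


def holm(results, alpha=ALPHA):
    """Sort by p-value; thresholds relax from alpha/k to alpha; stop at first failure."""
    k = len(results)
    decisions = {name: False for name, _ in results}
    for rank, (name, p) in enumerate(sorted(results, key=lambda r: r[1])):
        if p > alpha / (k - rank):
            break
        decisions[name] = True
    return decisions


def fixed_sequence(results, alpha=ALPHA):
    """Hierarchical testing: each endpoint at full alpha, in the listed order,
    but nothing after the first failure can be claimed."""
    decisions = {name: False for name, _ in results}
    for name, p in results:
        if p > alpha:
            break
        decisions[name] = True
    return decisions


for label, method in (("Bonferroni", bonferroni), ("Holm", holm),
                      ("Hierarchical", fixed_sequence)):
    print(f"{label:>12}: {method(endpoints)}")

# Hypothetical trial where the first endpoint in the hierarchy fails.
gatekeeper_fails = [("HbA1c", 0.20), ("Weight", 0.001), ("BP", 0.03)]
print("Gatekeeper fails:", fixed_sequence(gatekeeper_fails))
```

The last line shows the price of the hierarchical approach: a very small p-value downstream cannot be claimed once an earlier endpoint misses.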

— Approaches: principal components, composite endpoints, multivariate tests (Hotelling's T²)
— Composite endpoints (e.g., MACE = CV death + MI + stroke) inherently solve multiplicity by collapsing endpoints into one — at the cost of dilution if components don't all move together
— Bonferroni further erodes already-limited power
— Consider whether the study is fit for purpose; sometimes the right answer is "this trial is underpowered to address k endpoints"
— Comparing groups at multiple timepoints inflates error
— Solutions: mixed-effects models with a single time × treatment interaction term, or area-under-the-curve summary measures, or alpha-spending for interim analyses
— O'Brien-Fleming boundaries: very stringent early (e.g., p < 0.001 at the first look), close to the nominal level at the final look; overall α stays at 0.05
— Pocock boundaries: the same nominal threshold at every look (e.g., p < 0.022 at each of 3 looks)
— Both are temporal Bonferroni-like approaches; DSMB uses these to decide early stopping
— Multiple subgroup or sensitivity analyses warrant adjustment or explicit "exploratory" labeling
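To make the contrast between the two interim-analysis approaches concrete, the sketch below evaluates the Lan-DeMets alpha-spending functions that approximate O'Brien-Fleming and Pocock boundaries. These formulas are not given in the notes above; they are the commonly cited forms and are included here as an assumption. The output is the cumulative α spent by each information fraction t, not the nominal p-value threshold at each look (deriving those requires numerical integration that is beyond this sketch).

```python
from math import e, log, sqrt
from statistics import NormalDist

_norm = NormalDist()


def obf_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    z = _norm.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _norm.cdf(z / sqrt(t)))


def pocock_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    return alpha * log(1.0 + (e - 1.0) * t)


# Four equally spaced looks (information fractions are an illustrative choice).
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: OBF-type spent = {obf_alpha_spent(t):.5f}, "
          f"Pocock-type spent = {pocock_alpha_spent(t):.5f}")
```

Both functions reach the full α at t = 1; the O'Brien-Fleming-type function spends almost nothing early, while the Pocock-type function spends it much more evenly.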

— ~1 million common SNPs tested → Bonferroni threshold p < 5 × 10⁻⁸
— This is the canonical "genome-wide significance" threshold
— Suggestive associations at p < 10⁻⁵ are flagged for replication, not declared discoveries
— Test one exposure against thousands of phenotypes in EHR
— Almost always uses Benjamini-Hochberg FDR (5–10%) rather than Bonferroni — discovery context
— Hundreds to thousands of taxa/proteins; BH-FDR standard
— Testing many model variants on the same validation set inflates apparent performance
— Held-out test sets and pre-registration of the final model address this
— Step 3 increasingly tests recognition that "the model with the best AUC out of 50 attempts" is not the model's true AUC — analogous to subgroup p-hacking
— Easy to test thousands of associations; garden of forking paths problem
— Pre-registered protocols (e.g., on ClinicalTrials.gov or OSF) are the gold standard for credibility
— Small n + many endpoints = severe multiplicity tradeoff
— Often use Bayesian methods or borrowing strength across related populations to gain power without inflating α
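The "best of 50 attempts" problem in model selection can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses accuracy rather than AUC to keep the code short, and all 50 hypothetical models are given the same true accuracy (0.70), so any difference between them on the validation set is pure noise.

```python
import random


def simulate_best_of(n_models=50, true_acc=0.70, n_val=200, n_test=200, seed=0):
    """Return (best validation accuracy, accuracy of a same-quality model on fresh data)."""
    rng = random.Random(seed)

    def observed_accuracy(n):
        # Each case is classified correctly with probability true_acc.
        return sum(rng.random() < true_acc for _ in range(n)) / n

    validation_scores = [observed_accuracy(n_val) for _ in range(n_models)]
    best_validation = max(validation_scores)
    # The selected model is no better than the others, so its score on an
    # untouched test set is just one more draw at the true accuracy.
    fresh_test = observed_accuracy(n_test)
    return best_validation, fresh_test


best_val, fresh = simulate_best_of()
print(f"Best of 50 on validation set: {best_val:.3f}")
print(f"Same model on held-out test:  {fresh:.3f}")
print("True accuracy of every model: 0.700")
```

Picking the maximum of many noisy scores is the model-selection counterpart of reporting the one significant subgroup; the held-out test set plays the role of the confirmatory trial.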

— Running many tests, reporting only significant ones
— Equivalent to inflating Type I error without correction
— Has driven much of the reproducibility crisis in biomedical literature
— Presenting a post hoc finding as if it were the pre-specified hypothesis
— Even more pernicious because it hides the multiplicity entirely
— Changing the primary endpoint after seeing data; pre-specified endpoint becomes "secondary"
— COMPare project documented this widely; journals now demand SAP comparison
— Reporting only significant subgroups or endpoints, omitting null results
— Early stopping for "efficacy" based on a chance fluctuation
— Trials stopped early for benefit tend to overestimate effect sizes
— "Drug X works in biomarker-positive patients" based on uncorrected subgroup analysis → leads to failed confirmatory trials
— Excessive Bonferroni in correlated families → loss of true findings (false negatives)
— Patients may miss out on effective therapies when underpowered trials are over-penalized
— Premature adoption of ineffective interventions → harm, cost, opportunity cost
— Delayed adoption of effective interventions → preventable morbidity

— Before data collection, not after — design > rescue
— Any trial with > 1 primary endpoint
— Adaptive designs, interim analyses, group sequential trials
— High-dimensional data (genomics, imaging, EHR)
— Subgroup-driven hypotheses requiring pre-specification
— FDA requires a Statistical Analysis Plan (SAP) locked before unblinding for pivotal trials
— ICH E9 guidance (paraphrased): multiplicity adjustment may be needed when several primary endpoints could each support a regulatory claim
— CONSORT statement requires reporting of all pre-specified outcomes and any adjustments
— Many journals require pre-registration (ClinicalTrials.gov) and SAP availability
— Pre-specifies stopping rules using alpha-spending
— Reviews interim data; sponsors and investigators remain blinded
— As a practicing physician, you are a consumer of statistics — you must recognize multiplicity issues when reading papers, hearing pharma reps, or counseling patients
— When a pharma rep cites a "positive subgroup," ask: pre-specified? corrected? replicated?
— Grant reviewers expect explicit multiplicity strategy for any multi-endpoint or high-dimensional proposal
— IRB may ask about statistical rigor as part of risk/benefit assessment

— Even one well-conducted test can produce a chance-positive finding 5% of the time
— Small studies with large effects often fail to replicate — winner's curse / regression to the mean
— A "significant" association may be driven by unmeasured confounders, not multiplicity
— Distinguish: confounding biases effect estimates (in either direction); multiplicity inflates the chance of any positive finding
— Non-differential misclassification (random measurement error) → typically biases toward the null in exposure-outcome studies
— Differential misclassification → can go either direction
— Repeatedly testing as data accrue and stopping when p < 0.05 → inflates Type I error well above 5% (roughly 14% with 5 looks, approaching 30% with dozens of looks)
— A temporal form of multiplicity
— Even without explicit multiple tests, analytic choices (covariate selection, outlier handling, transformation) create a hidden multiplicity
— Trying 10 regression specifications and reporting the "best" — same problem as 10 hypothesis tests
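The effect of optional stopping can be estimated by simulation. The sketch below assumes a deliberately simple setting: outcomes are standard normal with a true mean of zero (so every "significant" result is a false positive), the variance is known, and an unadjusted two-sided z-test at α = 0.05 is run after every batch of 10 observations. The batch size, number of looks, and number of simulated trials are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def optional_stopping_type1(n_looks=20, batch=10, n_trials=2000, alpha=0.05, seed=1):
    """Fraction of null trials that cross the unadjusted threshold at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    false_positives = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            # z-statistic for mean = 0 with known unit variance
            if abs(total / sqrt(n)) > z_crit:
                false_positives += 1
                break  # the trial "stops for efficacy" at this look
    return false_positives / n_trials


single_look = optional_stopping_type1(n_looks=1)
many_looks = optional_stopping_type1(n_looks=20)
print(f"Type I error with 1 look:   {single_look:.3f}")
print(f"Type I error with 20 looks: {many_looks:.3f}")
```

With a single look the estimate sits near the nominal 5%; with repeated unadjusted looks it climbs several-fold, which is the inflation that alpha-spending rules are designed to prevent.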

— No correction needed — only one test
— Classical RCT design favors this for clarity
— Each study controls its own α; no cross-study Bonferroni
— Otherwise, the more research a field produces, the harder it becomes to find anything — absurd
— Reporting means and confidence intervals without inferential claims doesn't require multiplicity correction
— Honesty about exploratory intent is the key safeguard
— Bayesian framework does not rely on α/Type I error in the frequentist sense
— Multiple comparisons handled via hierarchical models / shrinkage rather than p-value adjustment
— Posterior probabilities directly express evidence; no need for Bonferroni
— Some methodologists (e.g., Rothman) argue against routine correction, favoring effect-size interpretation
— Used in modern epidemiology, particularly observational research
— As discussed — preserves α without Bonferroni
— Collapse multiplicity into one test by design

— Was the primary endpoint pre-specified? Was it positive on its own?
— How many secondary/subgroup endpoints were tested?
— Was a multiplicity strategy described (Bonferroni, Holm, hierarchical, FDR)?
— Did the adjusted analyses remain significant?
— Was there an interaction test for any reported subgroup effect?
— Were the results replicated in an independent cohort?
— "The trial showed benefit only in a subgroup; we need confirmatory studies before changing your treatment."
— "The finding had a borderline p-value and many comparisons were made; it may not hold up."
— "Your overall guideline-based care remains the best evidence-based plan."
— Multi-analyte panels generate many comparisons; positive findings are common by chance
— Bayesian reframing: pre-test probability × test characteristics → post-test probability; many "positive" panel results have low positive predictive value
— Patients may bring news of "breakthrough" findings; your role is to contextualize evidence quality
— Use accessible language: "Out of 20 things they checked, one looked promising — but that's what we'd expect by luck."
— Stick with guideline-based therapy anchored in confirmatory RCTs with positive primary endpoints
— Update when meta-analyses or confirmatory trials replicate exploratory signals
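The Bayesian reframing of panel results mentioned above can be made concrete with a short calculation. This is a minimal sketch; the prevalence, sensitivity, and specificity values are hypothetical, and the panel calculation assumes each analyte is independently flagged abnormal in 5% of healthy people.

```python
def post_test_probability(pre_test: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease given a positive result (positive predictive value)."""
    true_pos = sensitivity * pre_test
    false_pos = (1.0 - specificity) * (1.0 - pre_test)
    return true_pos / (true_pos + false_pos)


def prob_any_false_positive(n_analytes: int, specificity: float = 0.95) -> float:
    """Chance that a healthy patient has at least one abnormal flag on a panel."""
    return 1.0 - specificity ** n_analytes


# Hypothetical test: 90% sensitive, 95% specific, disease prevalence 1%.
ppv = post_test_probability(pre_test=0.01, sensitivity=0.90, specificity=0.95)
print(f"Post-test probability after one positive: {ppv:.3f}")

# Hypothetical 20-analyte panel in a healthy patient.
print(f"Chance of >= 1 abnormal flag on 20 analytes: {prob_any_false_positive(20):.3f}")
```

A low pre-test probability keeps the post-test probability modest even with a good test, and a wide panel makes at least one abnormal flag more likely than not in a healthy patient.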

— Today's exploratory subgroup → tomorrow's confirmatory trial (sometimes) → eventual guideline update (rarely)
— Most exploratory findings do not replicate — the base rate of replication for subgroup claims is roughly 10–30% in cardiology and oncology
— Follow guideline updates (ACC/AHA, ADA, USPSTF, NCCN) rather than chasing individual papers
— Trust meta-analyses and systematic reviews (Cochrane, AHRQ) that quantitatively pool evidence
— Be skeptical of single-trial findings, particularly subgroup or secondary outcomes
— Major findings are sometimes retracted years later when shown to be artifacts of multiple testing or fraud
— Tools like Retraction Watch and journal alerts help
— Industry-sponsored education may emphasize subgroup wins from negative trials — Step 3 expects you to recognize this rhetorical pattern
— Periodically review whether your prescribing has shifted toward agents whose evidence base relies on unreplicated subgroups
— Discuss with peers in journal club; teaching multiplicity concepts to trainees reinforces practice
— When a patient is on a therapy supported only by subgroup data, document the rationale, monitor outcomes, and be prepared to deprescribe if confirmatory data are negative

— Misrepresenting post hoc findings as pre-specified is scientific misconduct
— Outcome switching without disclosure violates ICMJE and CONSORT standards
— IRBs increasingly require pre-registration; failure to report negative primary endpoints is unethical
— Subjects consent to participate in a study with specific endpoints; investigators have a duty to analyze and report those endpoints honestly
— Selective reporting violates the implicit contract with research participants
— Pharma marketing based on uncorrected subgroup claims has triggered FDA warning letters and DOJ false-claims settlements
— Off-label promotion grounded in exploratory findings is legally actionable
— A common Step 3 vignette: a patient discharged on a medication that was approved based on a subgroup of a negative trial. The receiving outpatient physician should:
— (1) Review the indication and evidence base
— (2) Discuss with the patient the strength of evidence
— (3) Make a shared decision about continuation, considering side effects, cost, and alternatives
— (4) Document the rationale clearly in the chart
— When recommending or discussing a therapy with weak evidence, ethically you should convey the uncertainty
— "This is based on a subgroup analysis and isn't as well-established as our standard treatment" — supports autonomy
— Investigators with financial ties to a sponsor are statistically more likely to emphasize favorable subgroup findings — disclose and weight accordingly
— Results of applicable clinical trials must be reported on ClinicalTrials.gov within 12 months of the primary completion date (FDAAA 801); failure can incur civil penalties

— Bonferroni threshold: α / k
— 5 tests at α = 0.05 → p < 0.01
— 10 tests → p < 0.005
— 20 tests → p < 0.0025
— GWAS genome-wide significance: p < 5 × 10⁻⁸
— FWER with 20 independent tests at α = 0.05: ~64%
— Approximation: FWER ≈ k × α when k × α is small (k × α is always an upper bound)
— Bonferroni → simple, conservative, any dependency, few comparisons
— Holm → uniformly more powerful than Bonferroni with the same FWER control; generally preferred when simultaneous confidence intervals are not needed
— Šidák → independent tests, slightly less conservative
— Tukey HSD → all pairwise comparisons post-ANOVA
— Dunnett → multiple treatments vs single control
— Benjamini-Hochberg (FDR) → high-dimensional discovery (genomics, biomarkers)
— Hierarchical/gatekeeping → pre-specified endpoint ordering in RCTs
— O'Brien-Fleming / Pocock → interim analyses in group-sequential trials
— "Post hoc," "exploratory," "subgroup," "data-driven" → uncorrected, hypothesis-generating
— "Pre-specified," "primary endpoint," "Bonferroni," "hierarchical testing" → confirmatory
— "p < 5 × 10⁻⁸" → GWAS
— "FDR controlled at 5%" → Benjamini-Hochberg, high-dimensional
— Type I error: decreases
— Type II error: increases
— Power: decreases
— Point estimate: unchanged
— CI width: increases (e.g., 95% → 98.75% for 4 comparisons)
— Multiple comparisons + reproducibility crisis
— Subgroup analyses + interaction tests
— Bonferroni + statistical power
— Pre-specification + outcome switching
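The power cost and CI widening listed above can be quantified with a normal approximation. This sketch assumes a two-sided z-test and ignores the negligible chance of rejecting in the wrong direction; the standardized effect of 2.8 is a hypothetical choice that gives about 80% power at an unadjusted α of 0.05.

```python
from statistics import NormalDist

_norm = NormalDist()


def approx_power(effect_z: float, alpha: float) -> float:
    """Approximate power of a two-sided z-test for a standardized effect."""
    z_crit = _norm.inv_cdf(1.0 - alpha / 2.0)
    return _norm.cdf(effect_z - z_crit)


effect_z = 2.8          # hypothetical true effect, in standard-error units
k = 4                   # number of comparisons
power_unadjusted = approx_power(effect_z, 0.05)
power_bonferroni = approx_power(effect_z, 0.05 / k)

# Ratio of CI half-widths: 98.75% interval versus 95% interval.
width_ratio = _norm.inv_cdf(1.0 - 0.05 / (2 * k)) / _norm.inv_cdf(1.0 - 0.05 / 2)

print(f"Power at alpha = 0.05:       {power_unadjusted:.3f}")
print(f"Power at alpha = 0.05 / {k}:   {power_bonferroni:.3f}")
print(f"Type II error rises from {1 - power_unadjusted:.3f} to {1 - power_bonferroni:.3f}")
print(f"CI width ratio (98.75% vs 95%): {width_ratio:.3f}")
```

The point estimate does not appear anywhere in these calculations, which is why correction changes the threshold and interval width but never the estimate itself.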

— "Trial negative for primary endpoint; in a subgroup of X, treatment showed benefit (p = 0.03). What is the best interpretation?"
— Answer: Likely chance finding due to multiple subgroup testing; requires confirmatory trial.
— "Investigators tested 5 endpoints and used Bonferroni correction at FWER = 0.05. What is the per-comparison threshold?"
— Answer: 0.05 / 5 = 0.01.
— "What happens to Type I and Type II error when Bonferroni correction is applied?"
— Answer: Type I decreases, Type II increases, power decreases.
— "If 10 independent tests are conducted at α = 0.05 with no correction, what is the approximate probability of at least one false positive?"
— Answer: 1 − 0.95^10 ≈ 0.40 (k × α = 0.50 is a quick upper bound, not a close approximation here).
— "Researchers screen 10,000 genes for association with disease. Which method is most appropriate?"
— Answer: Benjamini-Hochberg FDR (discovery context) or Bonferroni (0.05 / 10,000 = 5 × 10⁻⁶); the p < 5 × 10⁻⁸ threshold applies to GWAS with ~1 million SNPs.
— "A trial reports a positive secondary endpoint, but the SAP did not list it. How should this be interpreted?"
— Answer: Hypothesis-generating; not confirmatory.
— "A patient asks about starting a drug based on a recent news report of a subgroup benefit. What is the best response?"
— Answer: Explain that subgroup findings need replication; continue guideline-based therapy.
— "Trial reports positive MACE composite, driven by hospitalization-for-HF; CV death not different. How interpret?"
— Answer: The composite avoids a multiplicity penalty, but the components must still be examined; here the benefit is driven by HF hospitalization, with no demonstrated effect on CV death.
— "Trial pre-specifies hierarchical testing: HbA1c → weight → BP. HbA1c fails. Weight p = 0.001. Can weight be claimed?"
— Answer: No — once the gatekeeper fails, downstream endpoints cannot be claimed.
— "Why do many published findings fail to replicate?" → Multiple testing, small samples, publication bias, selective reporting.

— Math: Bonferroni α_adjusted = α / k; with k independent tests, FWER = 1 − (1 − α)^k ≤ k × α (close when k × α is small); 5 tests at 0.05 → threshold 0.01; GWAS uses p < 5 × 10⁻⁸.
— Effects: Bonferroni reduces Type I error and power; increases Type II error; leaves point estimates unchanged; widens CIs.
— Method choice: Bonferroni (simple, conservative), Holm (uniformly better), Tukey HSD (pairwise post-ANOVA), Dunnett (vs control), Benjamini-Hochberg (FDR for high-dimensional screens), hierarchical gatekeeping (pre-specified RCT endpoint order), O'Brien-Fleming/Pocock (interim analyses).
— Clinical translation: Treat unreplicated subgroup or secondary findings as hypothesis-generating; continue guideline-based therapy until confirmatory trials with positive pre-specified primary endpoints emerge; counsel patients on evidence quality with humility and transparency.

