Biostatistics & Population Health

Multiple comparisons and Bonferroni correction

Clinical Overview and When to Suspect Multiple Comparisons Problems

Core concept: Every statistical test carries a Type I error probability (α, typically 0.05). When multiple hypotheses are tested simultaneously, the family-wise error rate (FWER) — the probability of at least one false positive — inflates dramatically.

— With k independent tests at α = 0.05, FWER ≈ 1 − (1 − 0.05)^k

— 5 tests → ~23% chance of a spurious "significant" result

— 20 tests → ~64%; 100 tests → ~99.4%

When to suspect a multiple comparisons problem on Step 3 stems:

— Study reports many subgroup analyses (age, sex, race, comorbidity strata) and flags one as "significant"

— Trial has multiple secondary or exploratory endpoints

— Genome-wide, proteomic, or "-omics" studies (thousands of comparisons)

— Post hoc pairwise comparisons after ANOVA (e.g., comparing 4 treatment arms two-at-a-time = 6 tests)

— Repeated interim analyses of the same trial over time

— Dredging through EHR data for associations

Why this matters clinically: A spurious subgroup finding (e.g., "drug works only in left-handed diabetics") can drive practice change, marketing claims, or follow-up trials that waste resources and harm patients.

Board pearl: If a stem says "the investigators performed 20 subgroup analyses and found benefit in one subgroup (p = 0.04)," the best interpretation is almost always "likely a chance finding": with 20 tests at α = 0.05, about one false positive is expected from Type I error inflation alone. Always demand a pre-specified primary endpoint and an adjustment method.

Key distinction: Multiple comparisons inflate Type I (false positive) error, not Type II. Correction methods reduce false positives but cost statistical power (raise Type II/β).

Step 3 application: When counseling a patient about a "newly discovered subgroup benefit," explain that subgroup analyses are hypothesis-generating, not confirmatory, unless pre-specified with adjusted α.
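The inflation described in this section can be checked by simulation. The sketch below assumes every null hypothesis is true and the tests are independent, so each p-value is uniform on [0, 1]; the function name, trial count, and seed are illustrative choices, not part of the source.

```python
import random

def simulate_fwer(k, alpha=0.05, n_trials=20_000, seed=0):
    """Estimate P(at least one p < alpha) across k independent true-null tests."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Under a true null hypothesis, a p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(k)):
            hits += 1
    return hits / n_trials

# Compare the simulated rate with the closed form 1 - (1 - alpha)^k.
results = {k: simulate_fwer(k) for k in (1, 5, 20)}
for k, simulated in results.items():
    exact = 1 - (1 - 0.05) ** k
    print(f"k={k:>2}  simulated={simulated:.3f}  exact={exact:.3f}")
```

The simulated rates should land close to the 5%, ~23%, and ~64% figures quoted above; correlated tests would inflate less than this.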

Presentation Patterns and Key History in Exam Stems

Typical stem architectures that signal a multiple comparisons issue:

— "A trial randomized patients to drug vs placebo. The primary endpoint was negative, but in post hoc analysis of 12 subgroups, mortality was reduced in patients over 75 (p = 0.03). What is the best interpretation?"

— "Investigators tested associations between a SNP and 50 phenotypes; one reached p < 0.05."

— "A study compared 4 antihypertensives pairwise (6 comparisons) and reported one significant difference at p = 0.04."

— "A registry analysis examined 30 dietary factors and cancer risk; coffee was associated at p = 0.02."

Red-flag history elements in the stem:

— Words like "post hoc," "exploratory," "subgroup," "secondary endpoint," "data-driven," "hypothesis-generating"

— Borderline p-values (0.02–0.049) — these are the ones most likely to evaporate after correction

— No mention of pre-specification of the analysis plan

— No mention of α adjustment (Bonferroni, Holm, FDR)

Contextual clues pointing toward proper handling:

— "The investigators pre-specified three primary endpoints and applied a Bonferroni correction (α = 0.0167 per test)"

— "The trial used a hierarchical (gatekeeping) testing procedure"

— "FDR was controlled at 5% using the Benjamini-Hochberg method"

Board pearl: A p-value of 0.04 in a single pre-specified test is meaningful; the same p-value as the best of 20 subgroup analyses is essentially noise. Context — not the number itself — determines significance.

Key distinction: Pre-specified subgroup analyses (declared before unblinding) are stronger evidence than post hoc analyses, even when both technically use the same statistical test. Step 3 expects you to recognize this hierarchy when critiquing a paper or counseling on new evidence.

Conceptual "Exam" — Quantifying the Inflation of Error

Family-wise error rate (FWER) math:

— For k independent tests at α: FWER = 1 − (1 − α)^k

— k = 1: 0.05

— k = 2: 0.0975

— k = 5: 0.226

— k = 10: 0.401

— k = 20: 0.642

— k = 100: 0.994

Approximation worth memorizing: When k × α is small, FWER ≈ k × α, and k × α is always an upper bound on the FWER. For 10 tests at 0.05 the shortcut gives 0.50 against an exact 0.401, so treat it as a ceiling rather than an estimate once k grows. This is the back-of-envelope version a Step 3 stem expects.

Per-comparison vs family-wise vs false discovery:

— Per-comparison error rate (PCER): α applied to each test independently — no correction

— Family-wise error rate (FWER): probability of ≥1 false positive across the family — controlled by Bonferroni, Holm, Šidák

— False discovery rate (FDR): expected proportion of false positives among all rejected nulls — controlled by Benjamini-Hochberg

Visual intuition:

— Bonferroni: extremely conservative; "every test must clear a high bar"

— Holm: stepwise, slightly more powerful than Bonferroni, still controls FWER

— Benjamini-Hochberg: controls FDR, far more powerful, appropriate for screening/discovery (genomics, biomarkers)

Effect on power: Adjusting α downward reduces statistical power to detect true effects → increased Type II error (β). At a fixed sample size the tradeoff is unavoidable: tighter control of false positives costs power, and recovering that power requires a larger sample.

Board pearl: If a stem asks "what happens to power when you Bonferroni-correct?" → power decreases. If it asks about Type I error → decreases. Type II error → increases.

Key distinction: FWER controls any false positive; FDR controls the proportion of false positives. Use FWER for confirmatory trials, FDR for exploratory screens.
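The FWER table and the k × α shortcut from this section can be reproduced in a few lines. This is a minimal sketch assuming independent tests; the function names are illustrative.

```python
def fwer_exact(k, alpha=0.05):
    """Exact family-wise error rate for k independent tests."""
    return 1 - (1 - alpha) ** k

def fwer_upper_bound(k, alpha=0.05):
    """The k * alpha shortcut, capped at 1 because it is a probability bound."""
    return min(1.0, k * alpha)

print(" k   exact   k*alpha")
for k in (1, 2, 5, 10, 20, 100):
    print(f"{k:>3}   {fwer_exact(k):.3f}   {fwer_upper_bound(k):.3f}")
```

The two columns agree closely for small k and drift apart as k grows, with the shortcut always on the high side.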

The Bonferroni Correction — Mechanics

The rule: Divide the desired family-wise α by the number of comparisons (k).

— Adjusted α = α / k

— Equivalently: multiply each p-value by k and compare to original α

Worked examples:

— 5 comparisons, want FWER ≤ 0.05 → each test must reach p < 0.01

— 10 comparisons → p < 0.005 per test

— 20 comparisons → p < 0.0025

— 100 comparisons → p < 0.0005

— Genome-wide significance (~1 million SNPs) → p < 5 × 10⁻⁸ (a famous Bonferroni-style threshold)

Confidence interval analog: For k comparisons, construct (1 − α/k) × 100% CIs instead of 95% CIs.

— 4 comparisons → 98.75% CIs

— The CIs widen; fewer will exclude the null

Properties:

— Strong control of FWER under any dependence structure among tests

— Simple, transparent, easy to defend to regulators (FDA accepts it)

— Conservative — actual FWER is often well below the nominal α, especially when tests are correlated

Limitations:

— Loses power rapidly as k grows

— Treats all hypotheses as equally important (no weighting)

— Assumes the "family" is well-defined — a judgment call

Board pearl: If a Step 3 stem says "investigators performed 4 comparisons and corrected with Bonferroni; what is the threshold for significance?" → 0.05 / 4 = 0.0125. Compute this quickly; it is among the most testable arithmetic in biostatistics.

Key distinction: Bonferroni is applied to the α threshold (or equivalently, the p-values are multiplied by k). It does not change the underlying point estimate or effect size — only the inferential threshold.

Step 3 management: When critiquing a manuscript, ask: (1) How many hypotheses? (2) Was correction applied? (3) Is the adjusted p-value still significant? Only then accept the conclusion.
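The Bonferroni arithmetic in this section (threshold, adjusted p-values, and the matching CI level) can be sketched as below. The function names and the sample p-values are illustrative; capping adjusted p-values at 1 is a standard convention added here.

```python
def bonferroni_threshold(k, alpha=0.05):
    """Per-test significance threshold: alpha / k."""
    return alpha / k

def bonferroni_adjust(p_values):
    """Equivalent view: multiply each p-value by k (capped at 1), compare to alpha."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

def bonferroni_ci_level(k, alpha=0.05):
    """Confidence level per interval so the family keeps 1 - alpha coverage."""
    return 1 - alpha / k

for k in (4, 5, 10, 20, 100):
    print(f"k={k:>3}  threshold={bonferroni_threshold(k):.4f}  "
          f"CI level={100 * bonferroni_ci_level(k):.2f}%")

# Hypothetical p-values from a family of three tests.
print(bonferroni_adjust([0.01, 0.04, 0.5]))
```

Both views give the same decision: a raw p-value clears α / k exactly when its adjusted value clears α.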

Alternative Correction Methods — Comparison Studies

Holm-Bonferroni (step-down):

— Rank p-values smallest to largest: p₁ ≤ p₂ ≤ ... ≤ pₖ

— Compare p₁ to α/k, p₂ to α/(k−1), p₃ to α/(k−2), ...

— Stop at first non-significant result; reject all earlier

— Always at least as powerful as Bonferroni, controls FWER, still simple

Šidák correction:

— α_adj = 1 − (1 − α)^(1/k)

— Slightly less conservative than Bonferroni; assumes independence

— Rarely tested but conceptually equivalent

Tukey's HSD (honestly significant difference):

— Specifically for all pairwise comparisons after ANOVA

— Controls FWER, more powerful than Bonferroni for this use case

Dunnett's test: All treatments vs single control (k − 1 comparisons, not all pairwise)

Scheffé's method: All possible contrasts, most conservative; used when exploring unplanned linear combinations

Benjamini-Hochberg (BH) — FDR control:

— Rank p-values; find largest i such that p₍ᵢ₎ ≤ (i/k) × α

— Reject all hypotheses up to that rank

— Workhorse of genomics, microbiome, biomarker discovery

— Controls expected proportion of false discoveries, not the probability of any

Hierarchical (gatekeeping / fixed-sequence) testing:

— Pre-specify an order: test endpoint 1 at full α; only if significant, test endpoint 2; etc.

— Common in cardiovascular outcome trials (e.g., MACE → CV death → all-cause mortality)

— No α adjustment needed because testing stops at the first failure

Board pearl: A trial that pre-specifies a hierarchical testing sequence preserves full α (0.05) for each endpoint as long as prior endpoints are positive — an elegant alternative to Bonferroni often seen in modern RCTs (e.g., SGLT2 inhibitor trials).

Key distinction: FWER methods (Bonferroni, Holm) for confirmatory trials with few endpoints; FDR methods (BH) for high-dimensional discovery.
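The Holm, Šidák, and Benjamini-Hochberg rules described in this section can be sketched in plain Python as follows. The function names and the five example p-values are hypothetical, chosen only so the three procedures give different answers; in practice a vetted statistics package would normally be used instead.

```python
def sidak_threshold(k, alpha=0.05):
    """Per-test threshold 1 - (1 - alpha)^(1/k); assumes independent tests."""
    return 1 - (1 - alpha) ** (1 / k)

def holm_reject(p_values, alpha=0.05):
    """Step-down Holm: compare sorted p-values to alpha/k, alpha/(k-1), ..."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):              # rank 0 = smallest p-value
        if p_values[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break                                 # stop at the first failure
    return reject

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """BH: find the largest rank i with p_(i) <= (i/k) * alpha, reject ranks 1..i."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / k * alpha:
            cutoff_rank = rank                    # keep the largest qualifying rank
    reject = [False] * k
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject

p = [0.001, 0.012, 0.025, 0.045, 0.2]             # hypothetical p-values
bonferroni = [x <= 0.05 / len(p) for x in p]
print("Bonferroni rejects:", sum(bonferroni))
print("Holm rejects:      ", sum(holm_reject(p)))
print("BH rejects:        ", sum(benjamini_hochberg_reject(p)))
print("Sidak threshold:   ", round(sidak_threshold(len(p)), 5))
```

On these inputs Bonferroni rejects one hypothesis, Holm two, and BH three, which mirrors the ordering of stringency described above. BH buys the extra rejection by controlling FDR rather than FWER.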

Decision Logic — When to Apply Which Correction

Confirmatory RCT with multiple co-primary endpoints (2–5):

— Use Bonferroni or Holm, or hierarchical gatekeeping

— FDA generally requires a pre-specified strategy in the Statistical Analysis Plan (SAP)

Multiple secondary endpoints:

— Pre-specify; apply Bonferroni/Holm or label as exploratory

— Unadjusted secondary endpoints should be reported as hypothesis-generating

Subgroup analyses:

— Pre-specified, limited number, with test for interaction → can be interpreted cautiously

— Post hoc → essentially exploratory; require replication

Pairwise post-ANOVA:

— Tukey HSD (all pairs) or Dunnett (vs control)

Repeated/interim analyses over time:

— Use alpha-spending functions (O'Brien-Fleming, Pocock) — a temporal analog of Bonferroni

High-dimensional screens (genomics, metabolomics, EHR phenome-wide):

— Benjamini-Hochberg FDR, typically at 5% or 10%

Single primary endpoint, single test:

— No correction needed. This is the cleanest design — and the reason RCTs emphasize one primary endpoint.

Family definition pitfalls:

— Across separate trials → no correction (different families)

— Within one trial across multiple endpoints → one family

— Multiple publications from one dataset → still one family conceptually; honest reporting required

Board pearl: Step 3 favors the answer that emphasizes pre-specification and transparency over any specific method. "Apply Bonferroni correction to the 4 pre-specified secondary endpoints" beats "report all p-values without adjustment."

Step 3 management: When a patient asks about a "new study showing benefit," ask yourself — was this the primary, pre-specified endpoint? If yes, take seriously. If a subgroup/secondary without correction, counsel skepticism and await confirmatory data.
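How large the "family" is depends on the design, which is part of why the choice between all-pairs and versus-control comparisons matters. The sketch below counts comparisons for a trial with a given number of arms (the control arm is assumed to be included in the count) and shows the Bonferroni threshold only as a familiar reference point; Tukey and Dunnett use their own critical values, which are not computed here.

```python
from math import comb

def n_pairwise(arms):
    """All two-at-a-time comparisons among the arms (the Tukey setting)."""
    return comb(arms, 2)

def n_vs_control(arms):
    """Each active arm against one control (the Dunnett setting)."""
    return arms - 1

for arms in (3, 4, 5, 6):
    pairs, vs_ctrl = n_pairwise(arms), n_vs_control(arms)
    print(f"{arms} arms: {pairs:>2} pairwise (Bonferroni alpha {0.05 / pairs:.4f}), "
          f"{vs_ctrl} vs control (Bonferroni alpha {0.05 / vs_ctrl:.4f})")
```

Restricting the family to the comparisons that answer the clinical question keeps k, and therefore the power penalty, as small as possible.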

Worked Example #1 — Subgroup Analysis Trap

Scenario: A randomized trial of a novel antiplatelet vs aspirin enrolls 8,000 post-MI patients. Primary endpoint (composite MACE at 1 year): HR 0.96, 95% CI 0.88–1.05, p = 0.38 — negative.

Post hoc analysis: Investigators stratify by 15 baseline variables (age, sex, diabetes, CKD, smoking, prior stroke, statin use, EF, LDL, BMI, race, region, troponin tertile, time-to-PCI, stent type).

Finding: In patients with diabetes and CKD (n = 412), HR 0.71, p = 0.04. Headline: "Novel agent reduces MACE in high-risk diabetics."

Bonferroni analysis:

— 15 subgroups → adjusted α = 0.05 / 15 = 0.0033

— Observed p = 0.04 → NOT significant after correction

— Expected number of false positives by chance: 15 × 0.05 = 0.75; if the tests are independent, the chance of at least one is 1 − 0.95^15 ≈ 54%, so a single "significant" subgroup is unsurprising

Clinical interpretation:

— The primary endpoint failed — the drug, as tested, does not reduce MACE

— The subgroup finding is hypothesis-generating only

— Cannot ethically or scientifically prescribe based on this; need a prospective confirmatory trial in diabetic CKD patients

What the FDA would say: Subgroup-only benefit after a negative primary endpoint is generally insufficient for approval or label expansion.

What a journal editor would say: Demand the analysis be labeled exploratory, report a test for interaction (was the interaction p < 0.05?), and present adjusted p-values.

Board pearl: When you see "negative primary, positive subgroup" on Step 3, the answer is almost always "the finding is likely due to chance; confirmatory trial is needed." Do not be lured by the impressive-sounding subgroup hazard ratio.

Key distinction: A test for interaction (was the treatment effect different across subgroups?) is far more informative than a within-subgroup p-value — and is typically the missing piece in misleading subgroup claims.
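The numbers in this worked example can be verified directly. This sketch uses only the figures given in the scenario (15 subgroups, α = 0.05, observed p = 0.04) and assumes independent subgroup tests for the "at least one" calculation.

```python
alpha = 0.05
k = 15                 # subgroups examined post hoc
p_observed = 0.04      # diabetes + CKD subgroup

threshold = alpha / k                          # Bonferroni per-test threshold
expected_false_positives = k * alpha           # mean count if every null is true
p_at_least_one = 1 - (1 - alpha) ** k          # FWER under independence

print(f"Bonferroni threshold:        {threshold:.4f}")
print(f"Significant after correction: {p_observed < threshold}")
print(f"Expected false positives:    {expected_false_positives:.2f}")
print(f"P(at least one by chance):   {p_at_least_one:.3f}")
```

A subgroup p-value of 0.04 misses the corrected threshold by roughly an order of magnitude, and a chance "hit" somewhere among 15 looks is more likely than not.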

Worked Example #2 — Pre-Specified Multiple Endpoints

Scenario: A diabetes trial pre-specifies four primary endpoints: (1) HbA1c reduction, (2) weight loss, (3) systolic BP, (4) LDL change. Analysis plan: Bonferroni correction with α = 0.05/4 = 0.0125 per test.

Results:

— HbA1c: −0.8%, p = 0.001 → significant (< 0.0125) ✓

— Weight: −2.1 kg, p = 0.008 → significant ✓

— SBP: −3 mmHg, p = 0.02 → NOT significant after correction ✗

— LDL: −5 mg/dL, p = 0.04 → NOT significant after correction ✗

Holm-Bonferroni comparison:

— Rank: 0.001, 0.008, 0.02, 0.04

— Compare to: 0.05/4 = 0.0125, 0.05/3 = 0.0167, 0.05/2 = 0.025, 0.05/1 = 0.05

— 0.001 < 0.0125 ✓; 0.008 < 0.0167 ✓; 0.02 < 0.025 ✓; 0.04 < 0.05 ✓

— All four significant under Holm! Holm rescued two endpoints that Bonferroni rejected.

Lesson: Holm is uniformly more powerful than Bonferroni while controlling the same FWER. If the SAP specifies "Holm," more endpoints may legitimately reach significance.

Hierarchical alternative: If the trial pre-specified the order HbA1c → weight → SBP → LDL, each tested at α = 0.05 sequentially:

— HbA1c p = 0.001 → significant, proceed

— Weight p = 0.008 → significant, proceed

— SBP p = 0.02 → significant, proceed

— LDL p = 0.04 → significant, all four endpoints "win"

Tradeoff of hierarchical: Order matters enormously — if the first endpoint fails, no later endpoint can be claimed, regardless of its p-value.

Board pearl: Bonferroni is simplest and most conservative; Holm is uniformly better; hierarchical gatekeeping is most powerful when investigators are confident in the order.

Step 3 management: When reading a label claim, look for whether the endpoint was within the pre-specified testing hierarchy — claims outside the hierarchy carry less weight, regardless of p-value.
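The three analyses of this example (Bonferroni, Holm, fixed-sequence) can be run side by side on the four reported p-values. The helper names are illustrative, and the final scenario, where the first endpoint has p = 0.06, is a hypothetical variation added to show why the ordering matters.

```python
ALPHA = 0.05
endpoints = {"HbA1c": 0.001, "Weight": 0.008, "SBP": 0.02, "LDL": 0.04}
order = ["HbA1c", "Weight", "SBP", "LDL"]        # pre-specified hierarchy

def bonferroni(pvals, alpha=ALPHA):
    k = len(pvals)
    return {name: p < alpha / k for name, p in pvals.items()}

def holm(pvals, alpha=ALPHA):
    k = len(pvals)
    result = {name: False for name in pvals}
    ranked = sorted(pvals.items(), key=lambda item: item[1])
    for step, (name, p) in enumerate(ranked):
        if p < alpha / (k - step):               # alpha/k, alpha/(k-1), ...
            result[name] = True
        else:
            break                                # later endpoints cannot be claimed
    return result

def fixed_sequence(pvals, sequence, alpha=ALPHA):
    result = {name: False for name in sequence}
    for name in sequence:
        if pvals[name] < alpha:                  # each test uses the full alpha
            result[name] = True
        else:
            break                                # gate closes at the first failure
    return result

print("Bonferroni:    ", bonferroni(endpoints))
print("Holm:          ", holm(endpoints))
print("Fixed-sequence:", fixed_sequence(endpoints, order))

# Hypothetical variation: the first endpoint in the hierarchy narrowly fails.
failed_first = dict(endpoints, HbA1c=0.06)
print("Gate closed:   ", fixed_sequence(failed_first, order))
```

With the reported p-values, Bonferroni claims two endpoints while Holm and the fixed sequence claim all four; in the hypothetical variation the fixed sequence claims none, even though three p-values are below 0.05.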

Special Considerations — Small Samples and Correlated Tests

Correlated outcomes: When multiple endpoints are highly correlated (e.g., SBP and DBP, or fasting glucose and HbA1c), Bonferroni is overly conservative because the "effective number of independent tests" is less than k.

— Approaches: principal components, composite endpoints, multivariate tests (Hotelling's T²)

— Composite endpoints (e.g., MACE = CV death + MI + stroke) inherently solve multiplicity by collapsing endpoints into one — at the cost of dilution if components don't all move together

Small sample sizes:

— Bonferroni further erodes already-limited power

— Consider whether the study is fit for purpose; sometimes the right answer is "this trial is underpowered to address k endpoints"

Repeated measures over time:

— Comparing groups at multiple timepoints inflates error

— Solutions: mixed-effects models with a single time × treatment interaction term, or area-under-the-curve summary measures, or alpha-spending for interim analyses

Interim analyses in adaptive trials:

— O'Brien-Fleming boundaries: very stringent early (e.g., p < 0.001 at first look), liberal later — preserves overall α near 0.05

— Pocock boundaries: constant threshold across looks (e.g., p < 0.022 at each of three looks; the threshold shrinks as looks are added)

— Both are temporal Bonferroni-like approaches; DSMBs use these to decide early stopping

Cluster-randomized trials and meta-analyses:

— Multiple subgroup or sensitivity analyses warrant adjustment or explicit "exploratory" labeling

Board pearl: Composite endpoints trade multiplicity for interpretability — but always inspect component-level results. A "positive" composite driven entirely by hospitalization for HF, with no signal on mortality, is a weaker claim than one with concordant components.

Key distinction: Multiplicity correction addresses Type I error across hypotheses; alpha spending addresses Type I error across time (interim looks). Conceptually parallel, statistically distinct.
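Alpha spending can be made concrete with the Lan-DeMets spending functions, which are the usual continuous approximations to O'Brien-Fleming and Pocock boundaries. Note the swap: the sketch below reports the cumulative α spent by each information fraction t, not the per-look nominal p-value boundaries quoted above (those require numerical integration and are not attempted here). Function names are illustrative.

```python
from math import e, log, sqrt
from statistics import NormalDist

def obf_type_spent(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type spending: 2 * (1 - Phi(z_{1-alpha/2} / sqrt(t)))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))

def pocock_type_spent(t, alpha=0.05):
    """Lan-DeMets Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    return alpha * log(1 + (e - 1) * t)

print("info fraction   OBF-type   Pocock-type")
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"{t:>13.2f}   {obf_type_spent(t):.5f}    {pocock_type_spent(t):.5f}")
```

Both functions reach the full 0.05 at t = 1. The O'Brien-Fleming type spends almost nothing early, which is why early stopping under that rule demands overwhelming evidence, while the Pocock type spends α more evenly.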

Special Populations — Genomics, EHR Studies, AI/ML Models

Genome-wide association studies (GWAS):

— ~1 million common SNPs tested → Bonferroni threshold p < 5 × 10⁻⁸

— This is the canonical "genome-wide significance" threshold

— Suggestive associations at p < 10⁻⁵ are flagged for replication, not declared discoveries

Phenome-wide association studies (PheWAS):

— Test one exposure against thousands of phenotypes in EHR

— Almost always uses Benjamini-Hochberg FDR (5–10%) rather than Bonferroni — discovery context

Microbiome and proteomics:

— Hundreds to thousands of taxa/proteins; BH-FDR standard

Machine learning / AI clinical prediction:

— Testing many model variants on the same validation set inflates apparent performance

— Held-out test sets and pre-registration of the final model address this

— Step 3 increasingly tests recognition that "the model with the best AUC out of 50 attempts" is not the model's true AUC — analogous to subgroup p-hacking

Real-world evidence and EHR research:

— Easy to test thousands of associations; garden of forking paths problem

— Pre-registered protocols (e.g., on ClinicalTrials.gov or OSF) are the gold standard for credibility

Pediatrics and rare diseases:

— Small n + many endpoints = severe multiplicity tradeoff

— Often use Bayesian methods or borrowing strength across related populations to gain power without inflating α

Board pearl: 5 × 10⁻⁸ is the GWAS Bonferroni threshold worth memorizing — it appears in stems about genetic discovery, biomarker panels, and precision medicine.

Step 3 management: When a patient brings a direct-to-consumer genetic "risk panel" suggesting a disease association, ask whether the variant reached genome-wide significance in replicated cohorts. Many DTC reports rely on suggestive associations that have not survived multiplicity correction.
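The "best of 50 attempts" problem for prediction models can be illustrated with a small simulation. As a simplifying assumption, the sketch uses classification accuracy rather than AUC, and gives all 50 model variants the same true accuracy (0.70) on a 200-patient validation set; every number and name here is hypothetical.

```python
import random

def best_of_n_accuracy(n_models=50, true_accuracy=0.70, n_validation=200, seed=0):
    """Return (best observed accuracy, mean observed accuracy) across model variants."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_models):
        # Each variant is equally good; differences are validation-set noise only.
        correct = sum(rng.random() < true_accuracy for _ in range(n_validation))
        observed.append(correct / n_validation)
    return max(observed), sum(observed) / len(observed)

best, mean = best_of_n_accuracy()
print(f"true accuracy: 0.70   mean observed: {mean:.3f}   best of 50: {best:.3f}")
```

The average across variants stays near the truth, while the reported "winner" looks better than any of the models really is. A held-out test set that played no part in selection removes that optimism.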

Complications and Pitfalls — What Goes Wrong

p-hacking / data dredging:

— Running many tests, reporting only significant ones

— Equivalent to inflating Type I error without correction

— Has driven much of the reproducibility crisis in biomedical literature

HARKing (Hypothesizing After Results are Known):

— Presenting a post hoc finding as if it were the pre-specified hypothesis

— Even more pernicious because it hides the multiplicity entirely

Outcome switching:

— Changing the primary endpoint after seeing data; pre-specified endpoint becomes "secondary"

— COMPare project documented this widely; journals now demand SAP comparison

Selective reporting / publication bias:

— Reporting only significant subgroups or endpoints, omitting null results

Multiple looks at data without alpha-spending:

— Early stopping for "efficacy" based on a chance fluctuation

— Trials stopped early for benefit tend to overestimate effect sizes

Spurious precision medicine claims:

— "Drug X works in biomarker-positive patients" based on uncorrected subgroup analysis → leads to failed confirmatory trials

Cost of over-correction:

— Excessive Bonferroni in correlated families → loss of true findings (false negatives)

— Patients may miss out on effective therapies when underpowered trials are over-penalized

Consequences for patient care:

— Premature adoption of ineffective interventions → harm, cost, opportunity cost

— Delayed adoption of effective interventions → preventable morbidity

Board pearl: The 2005 Ioannidis paper ("Why Most Published Research Findings Are False") attributed much of the literature's noise to unaccounted multiple testing and small effect sizes. This conceptual framing appears in Step 3 stems about EBM and skepticism toward isolated findings.

Key distinction: Multiplicity correction is preventive; replication is the ultimate remedy. No statistical adjustment substitutes for independent confirmation in a new cohort.

When to Escalate — Statistical Consultation and Trial Design

When the clinician/researcher should involve a biostatistician early:

— Before data collection, not after — design > rescue

— Any trial with > 1 primary endpoint

— Adaptive designs, interim analyses, group sequential trials

— High-dimensional data (genomics, imaging, EHR)

— Subgroup-driven hypotheses requiring pre-specification

Regulatory and journal requirements:

— FDA requires a Statistical Analysis Plan (SAP) locked before unblinding for pivotal trials

— ICH E9 guidance: when multiplicity is present (e.g., multiple primary endpoints that could each support a regulatory claim), the adjustment procedure, or the reason none is needed, should be set out in the analysis plan

— CONSORT statement requires reporting of all pre-specified outcomes and any adjustments

— Many journals require pre-registration (ClinicalTrials.gov) and SAP availability

DSMB (Data and Safety Monitoring Board) role:

— Pre-specifies stopping rules using alpha-spending

— Reviews interim data; sponsors and investigators remain blinded

Step 3 systems thinking:

— As a practicing physician, you are a consumer of statistics — you must recognize multiplicity issues when reading papers, hearing pharma reps, or counseling patients

— When a pharma rep cites a "positive subgroup," ask: pre-specified? corrected? replicated?

Institutional review and grant review:

— Grant reviewers expect explicit multiplicity strategy for any multi-endpoint or high-dimensional proposal

— IRB may ask about statistical rigor as part of risk/benefit assessment

CCS pearl: Although CCS cases don't ask you to perform Bonferroni arithmetic, they may test your judgment in ordering: e.g., when offered a "novel biomarker panel" that screens for 30 conditions simultaneously, the right move is to decline routine ordering and pursue targeted, pre-test-probability-driven testing — clinically the same logic as multiplicity control.

Board pearl: The phrase "the trial achieved its primary endpoint" carries the most weight; secondary and subgroup claims carry progressively less unless they were pre-specified and properly adjusted.

Key Differentials — Other Sources of Spurious "Significance"

Multiple comparisons is one of several drivers of false-positive findings. Related/competing issues to distinguish:

Small sample size + significant result:

— Even one well-conducted test can produce a chance-positive finding 5% of the time

— Small studies with large effects often fail to replicate — winner's curse / regression to the mean

Selection bias and confounding:

— A "significant" association may be driven by unmeasured confounders, not multiplicity

— Distinguish: confounding biases the effect estimate (in either direction); multiplicity inflates the chance of any positive finding

Measurement error:

— Non-differential (random) misclassification → usually biases toward the null in exposure-outcome studies

— Differential misclassification → can go either direction

Optional stopping (without alpha-spending):

— Repeatedly testing as data accrue and stopping when p < 0.05 → inflates Type I error to ~30% with frequent looks

— A temporal form of multiplicity

Garden of forking paths (Gelman):

— Even without explicit multiple tests, analytic choices (covariate selection, outlier handling, transformation) create a hidden multiplicity

Multiple models tested on same data:

— Trying 10 regression specifications and reporting the "best" — same problem as 10 hypothesis tests

Board pearl: The chance-driven items above (optional stopping, forking paths, multiple models) share a common root: the effective number of hypotheses tested exceeds the reported number of tests. Multiplicity correction addresses only the explicit tests; the hidden ones require pre-registration to control.

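Key distinction: Bias (confounding, selection) systematically distorts the estimate; chance (multiplicity, small n) inflates random false positives. Both can produce misleading "significant" p-values, but the remedies differ: bias is addressed mainly through study design (randomization, careful selection and measurement), whereas chance is addressed through multiplicity adjustment, adequate sample size, and replication.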
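Optional stopping can be simulated to see how uncorrected interim looks inflate Type I error. The sketch assumes a one-sample z-test on standard-normal data with no true effect, equally spaced looks, and stopping the first time |z| exceeds the unadjusted 1.96 cutoff; sample sizes, look counts, and names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def optional_stopping_error(n_looks, per_look=20, alpha=0.05, n_trials=2000, seed=0):
    """Fraction of null trials declared 'significant' at any of n_looks looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _trial in range(n_trials):
        total, n = 0.0, 0
        for _look in range(n_looks):
            for _obs in range(per_look):
                total += rng.gauss(0.0, 1.0)     # null is true: mean 0, SD 1
            n += per_look
            z = total / sqrt(n)                  # z statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1
                break                            # investigator stops and reports
    return false_positives / n_trials

error_by_looks = {looks: optional_stopping_error(looks) for looks in (1, 5, 10)}
for looks, rate in error_by_looks.items():
    print(f"{looks:>2} look(s): Type I error ~ {rate:.3f}")
```

A single look holds the error near 5%, and each additional unadjusted look pushes it higher, which is the behavior that alpha-spending rules are designed to prevent.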

Differentials — Legitimate Reasons to NOT Correct

Not every situation calls for correction. Knowing when not to adjust is equally board-relevant.

Single pre-specified primary endpoint:

— No correction needed — only one test

— Classical RCT design favors this for clarity

Independent studies addressing different questions:

— Each study controls its own α; no cross-study Bonferroni

— Otherwise, the more research a field produces, the harder it becomes to find anything — absurd

Descriptive / exploratory reporting clearly labeled as such:

— Reporting means and confidence intervals without inferential claims doesn't require multiplicity correction

— Honesty about exploratory intent is the key safeguard

Bayesian analyses:

— Bayesian framework does not rely on α/Type I error in the frequentist sense

— Multiple comparisons handled via hierarchical models / shrinkage rather than p-value adjustment

— Posterior probabilities directly express evidence; no need for Bonferroni

Effect estimates and confidence intervals as primary inference:

— Some methodologists (e.g., Rothman) argue against routine correction, favoring effect-size interpretation

— Used in modern epidemiology, particularly observational research

Pre-specified hierarchical testing:

— As discussed — preserves α without Bonferroni

Composite endpoints:

— Collapse multiplicity into one test by design

Board pearl: The correct exam answer when asked "should this analysis be Bonferroni-corrected?" depends on context: confirmatory & multiple endpoints → yes; single pre-specified primary endpoint → no; exploratory & properly labeled → optional but transparency required.

Key distinction: Don't confuse regulatory rigor (FDA wants strict FWER control for label claims) with scientific inference (a Bayesian or effect-estimate framework may be more appropriate for understanding biology). Step 3 favors the regulatory/RCT framing.

Practical Application — Reading Papers and Counseling Patients

A checklist for evaluating a study's multiplicity handling:

— Was the primary endpoint pre-specified? Was it positive on its own?

— How many secondary/subgroup endpoints were tested?

— Was a multiplicity strategy described (Bonferroni, Holm, hierarchical, FDR)?

— Did the adjusted analyses remain significant?

— Was there an interaction test for any reported subgroup effect?

— Were the results replicated in an independent cohort?

Counseling patients about "new study" headlines:

— "The trial showed benefit only in a subgroup; we need confirmatory studies before changing your treatment."

— "The finding had a borderline p-value and many comparisons were made; it may not hold up."

— "Your overall guideline-based care remains the best evidence-based plan."

Counseling on direct-to-consumer testing and biomarker panels:

— Multi-analyte panels generate many comparisons; positive findings are common by chance

— Bayesian reframing: pre-test probability × test characteristics → post-test probability; many "positive" panel results have low positive predictive value

Shared decision-making implications:

— Patients may bring news of "breakthrough" findings; your role is to contextualize evidence quality

— Use accessible language: "Out of 20 things they checked, one looked promising — but that's what we'd expect by luck."

Long-term plan:

— Stick with guideline-based therapy anchored in confirmatory RCTs with positive primary endpoints

— Update when meta-analyses or confirmatory trials replicate exploratory signals

Step 3 management: When the stem describes a patient asking about a subgroup-based finding, the correct answer is usually NOT to change therapy based on unreplicated exploratory data. Continue evidence-based standard of care and reassess as confirmatory evidence emerges.

Board pearl: "Replication in an independent cohort" is the single most reassuring phrase in any reported finding — it transcends statistical correction.
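The Bayesian reframing for multi-analyte panels can be put into numbers. In the sketch below the pre-test probability (1%), sensitivity (90%), specificity (95%), and the assumption that the 30 analytes err independently are all hypothetical values chosen for illustration.

```python
def positive_predictive_value(pretest_prob, sensitivity, specificity):
    """Post-test probability of disease given a positive result (Bayes' rule)."""
    true_pos = pretest_prob * sensitivity
    false_pos = (1 - pretest_prob) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

def prob_any_false_positive(n_tests, specificity):
    """Chance a person with none of the conditions gets at least one positive."""
    return 1 - specificity ** n_tests

ppv = positive_predictive_value(pretest_prob=0.01, sensitivity=0.90, specificity=0.95)
any_fp = prob_any_false_positive(n_tests=30, specificity=0.95)

print(f"PPV of a single positive analyte:      {ppv:.1%}")
print(f"P(at least one false positive of 30):  {any_fp:.1%}")
```

Under these assumptions most healthy patients would receive at least one abnormal flag, and any single flag would still be far more likely false than true. This is the same arithmetic as the family-wise error rate, applied to testing rather than hypothesis tests.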

Follow-Up — Tracking Evidence and Updating Practice

Longitudinal stewardship of evidence:

— Today's exploratory subgroup → tomorrow's confirmatory trial (sometimes) → eventual guideline update (rarely)

— Most exploratory findings do not replicate — the base rate of replication for subgroup claims is roughly 10–30% in cardiology and oncology

How to keep current responsibly:

— Follow guideline updates (ACC/AHA, ADA, USPSTF, NCCN) rather than chasing individual papers

— Trust meta-analyses and systematic reviews (Cochrane, AHRQ) that quantitatively pool evidence

— Be skeptical of single-trial findings, particularly subgroup or secondary outcomes

Monitoring for retraction or correction:

— Major findings are sometimes retracted years later when shown to be artifacts of multiple testing or fraud

— Tools like Retraction Watch and journal alerts help

Continuing medical education context:

— Industry-sponsored education may emphasize subgroup wins from negative trials — Step 3 expects you to recognize this rhetorical pattern

Personal practice audit:

— Periodically review whether your prescribing has shifted toward agents whose evidence base relies on unreplicated subgroups

— Discuss with peers in journal club; teaching multiplicity concepts to trainees reinforces practice

Patient-level monitoring:

— When a patient is on a therapy supported only by subgroup data, document the rationale, monitor outcomes, and be prepared to deprescribe if confirmatory data are negative

Board pearl: A meta-analysis pooling many trials provides more reliable evidence than any single trial's subgroup analysis — borrowing strength across studies is a legitimate way to address multiplicity concerns while gaining power.

Key distinction: "Statistically significant" and "clinically significant" are different; multiplicity correction addresses the former, but a tiny effect size (even if real) may not warrant treatment changes.

Ethical, Legal, and Patient Safety Considerations

Research ethics:

— Misrepresenting post hoc findings as pre-specified is scientific misconduct

— Outcome switching without disclosure violates ICMJE and CONSORT standards

— IRBs increasingly require pre-registration; failure to report negative primary endpoints is unethical

Informed consent in clinical trials:

— Subjects consent to participate in a study with specific endpoints; investigators have a duty to analyze and report those endpoints honestly

— Selective reporting violates the implicit contract with research participants

Regulatory and legal exposure:

— Pharma marketing based on uncorrected subgroup claims has triggered FDA warning letters and DOJ false-claims settlements

— Off-label promotion grounded in exploratory findings is legally actionable

Patient safety transition-of-care issue:

— A common Step 3 vignette: a patient discharged on a medication that was approved based on a subgroup of a negative trial. The receiving outpatient physician should:

— (1) Review the indication and evidence base

— (2) Discuss with the patient the strength of evidence

— (3) Make a shared decision about continuation, considering side effects, cost, and alternatives

— (4) Document the rationale clearly in the chart

Disclosure to patients:

— When recommending or discussing a therapy with weak evidence, ethically you should convey the uncertainty

— "This is based on a subgroup analysis and isn't as well-established as our standard treatment" — supports autonomy

Conflict of interest:

— Investigators with financial ties to a sponsor are statistically more likely to emphasize favorable subgroup findings — disclose and weight accordingly

Mandatory reporting analog:

— Results of applicable clinical trials must be reported on ClinicalTrials.gov within 12 months of the primary completion date (FDAAA 801); failure can incur civil penalties

Board pearl: Selectively reporting only "significant" subgroups while burying the negative primary endpoint can constitute scientific fraud and violates federal reporting requirements for registered trials.

Step 3 management: When asked about an ethical dilemma around publishing or prescribing based on a subgroup finding, choose the answer that emphasizes transparency, pre-specification, and replication before practice change.

High-Yield Associations and Rapid-Fire Facts

— Bonferroni threshold: α / k

— 5 tests at α = 0.05 → p < 0.01

— 10 tests → p < 0.005

— 20 tests → p < 0.0025

— GWAS genome-wide significance: p < 5 × 10⁻⁸

— FWER with 20 independent tests at α = 0.05: ~64%

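— Approximation: FWER ≈ k × α only when k × α is small; otherwise k × α is just an upper bound (exact value for independent tests: 1 − (1 − α)^k)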

— Bonferroni → simple, conservative, any dependency, few comparisons

— Holm → step-down procedure; uniformly more powerful than Bonferroni, valid under any dependency, and just as easy to compute

— Šidák → independent tests, slightly less conservative

— Tukey HSD → all pairwise comparisons post-ANOVA

— Dunnett → multiple treatments vs single control

— Benjamini-Hochberg (FDR) → high-dimensional discovery (genomics, biomarkers)

— Hierarchical/gatekeeping → pre-specified endpoint ordering in RCTs

— O'Brien-Fleming / Pocock → interim analyses in group-sequential trials

— "Post hoc," "exploratory," "subgroup," "data-driven" → uncorrected, hypothesis-generating

— "Pre-specified," "primary endpoint," "Bonferroni," "hierarchical testing" → confirmatory

— "p < 5 × 10⁻⁸" → GWAS

— "FDR controlled at 5%" → Benjamini-Hochberg, high-dimensional

— Type I error: decreases

— Type II error: increases

— Power: decreases

— Point estimate: unchanged

— CI width: increases (each interval is built at confidence level 1 − α/k, e.g., 95% → 98.75% for 4 comparisons)

— Multiple comparisons + reproducibility crisis

— Subgroup analyses + interaction tests

— Bonferroni + statistical power

— Pre-specification + outcome switching

Memorize these numbers:

Method-to-use-case map:

Vocabulary alerts in stems:

Effects of Bonferroni correction:

Concepts that pair well on exams:
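The figures under "Memorize these numbers" and the CI example under "Effects of Bonferroni correction" can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming independent tests for the exact FWER and a normal (z-based) confidence interval for the CI multiplier; the function names are illustrative and not taken from any particular library.

```python
from statistics import NormalDist


def bonferroni_threshold(alpha, k):
    """Per-test significance threshold under Bonferroni: alpha / k."""
    return alpha / k


def fwer_independent(alpha, k):
    """Exact family-wise error rate for k independent tests, each run at alpha."""
    return 1.0 - (1.0 - alpha) ** k


def fwer_upper_bound(alpha, k):
    """Union (Boole) bound k * alpha, capped at 1; holds under any dependency."""
    return min(1.0, k * alpha)


def bonferroni_ci(alpha, k):
    """Per-interval confidence level and two-sided z multiplier after correction.

    Assumes a normal-theory interval (estimate +/- z * SE).
    """
    level = 1.0 - alpha / k
    z = NormalDist().inv_cdf(1.0 - alpha / (2 * k))
    return level, z


for k in (5, 10, 20, 100):
    print(
        f"k={k:>3}  threshold={bonferroni_threshold(0.05, k):.4f}  "
        f"FWER={fwer_independent(0.05, k):.3f}  "
        f"bound={fwer_upper_bound(0.05, k):.2f}"
    )

level, z = bonferroni_ci(0.05, 4)
print(f"4 comparisons: CI level={level:.4f}, z multiplier={z:.2f} (vs 1.96 uncorrected)")
```

Note how the k × α bound tracks the exact FWER reasonably at k = 5 but overshoots badly by k = 20, which is why it is best treated as a quick upper bound rather than an estimate.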

Board pearl: When in doubt on a Step 3 biostatistics question about a borderline subgroup finding → the answer is almost always "chance / requires confirmatory study."
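To make the method-to-use-case map concrete, here is a small sketch comparing three of the listed procedures (Bonferroni, Holm, Benjamini-Hochberg) on one set of p-values. The p-values are hypothetical and chosen only for illustration, the function names are not from any library, and the comparisons use "p ≤ threshold" by convention.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ...

    Stops at the first p-value that fails its threshold.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: find the largest rank r with p_(r) <= r*q/m,
    then reject the r smallest p-values. Controls FDR, not FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for five endpoints (illustration only).
pvals = [0.001, 0.011, 0.02, 0.03, 0.2]
print("Bonferroni:", sum(bonferroni_reject(pvals)), "rejections")
print("Holm:      ", sum(holm_reject(pvals)), "rejections")
print("BH (FDR):  ", sum(bh_reject(pvals)), "rejections")
```

On these inputs Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four, which mirrors the ordering in the map: Holm is never less powerful than Bonferroni at the same FWER, and FDR control is more permissive because it tolerates a small proportion of false discoveries.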

Board Question Stem Patterns

— "Trial negative for primary endpoint; in a subgroup of X, treatment showed benefit (p = 0.03). What is the best interpretation?"

— Answer: Likely chance finding due to multiple subgroup testing; requires confirmatory trial.

— "Investigators tested 5 endpoints and used Bonferroni correction at FWER = 0.05. What is the per-comparison threshold?"

— Answer: 0.05 / 5 = 0.01.

— "What happens to Type I and Type II error when Bonferroni correction is applied?"

— Answer: Type I decreases, Type II increases, power decreases.

— "If 10 independent tests are conducted at α = 0.05 with no correction, what is the approximate probability of at least one false positive?"

— Answer: 1 − 0.95^10 ≈ 0.40 (k × α = 0.50 is an upper bound, usable only as a rough check).

— "Researchers screen 10,000 genes for association with disease. Which method is most appropriate?"

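— Answer: Benjamini-Hochberg FDR (discovery context); if strict FWER control is required, Bonferroni at 0.05 / 10,000 = 5 × 10⁻⁶. The p < 5 × 10⁻⁸ threshold is specific to GWAS, where roughly one million independent variants are tested.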

— "A trial reports a positive secondary endpoint, but the SAP did not list it. How should this be interpreted?"

— Answer: Hypothesis-generating; not confirmatory.

— "A patient asks about starting a drug based on a recent news report of a subgroup benefit. What is the best response?"

— Answer: Explain that subgroup findings need replication; continue guideline-based therapy.

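— "Trial reports a positive MACE composite driven by hospitalization for HF; CV death is not different. How should this be interpreted?"

— Answer: A composite avoids multiplicity by collapsing several outcomes into one test, but its components must still be examined; here the benefit is attributable to HF hospitalization only, not to CV death.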

— "Trial pre-specifies hierarchical testing: HbA1c → weight → BP. HbA1c fails. Weight p = 0.001. Can weight be claimed?"

— Answer: No — once the gatekeeper fails, downstream endpoints cannot be claimed.

— "Why do many published findings fail to replicate?"

— Answer: Multiple testing, small samples, publication bias, selective reporting.

Pattern 1 — The Subgroup Trap:

Pattern 2 — The Bonferroni Calculation:

Pattern 3 — Effect on Errors:

Pattern 4 — FWER Inflation:

Pattern 5 — Method Selection:

Pattern 6 — Pre-specification:

Pattern 7 — Counseling:

Pattern 8 — Composite endpoint critique:

Pattern 9 — Hierarchical testing:

Pattern 10 — Reproducibility:

Board pearl: Almost every Step 3 biostatistics stem on this topic resolves to one of: (a) calculate adjusted α, (b) recognize the subgroup trap, or (c) identify the right correction method for the context.
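The gatekeeping logic in Pattern 9 can be sketched as a fixed-sequence procedure, the simplest form of hierarchical testing: each endpoint is tested at the full α in its pre-specified order, and testing stops at the first failure. This is an illustrative sketch only; the function name is made up, and the p-values for HbA1c and BP are hypothetical because the stem does not give them.

```python
def fixed_sequence_test(ordered_endpoints, alpha=0.05):
    """Fixed-sequence (hierarchical) testing.

    ordered_endpoints: list of (name, p_value) in the pre-specified order.
    Each endpoint is tested at the full alpha, but only while every earlier
    endpoint has succeeded. Returns {name: True if the claim is allowed}.
    """
    claims = {}
    gate_open = True
    for name, p in ordered_endpoints:
        if gate_open and p <= alpha:
            claims[name] = True
        else:
            gate_open = False  # once a gate fails, nothing downstream counts
            claims[name] = False
    return claims


# Pattern 9 scenario: HbA1c fails (hypothetical p = 0.08), weight p = 0.001.
failed_gate = fixed_sequence_test(
    [("HbA1c", 0.08), ("weight", 0.001), ("BP", 0.03)]
)
print(failed_gate)

# Contrast: the gatekeeper succeeds, so weight can be claimed; BP then fails.
passed_gate = fixed_sequence_test(
    [("HbA1c", 0.01), ("weight", 0.001), ("BP", 0.20)]
)
print(passed_gate)
```

In the first call no endpoint can be claimed despite the weight p-value of 0.001, because the α was never "passed down" from the failed HbA1c test. That is the trade the design makes: no α splitting across endpoints, in exchange for strict dependence on the order.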

One-Line Recap

— Math: Bonferroni α_adjusted = α / k; with k independent tests, FWER = 1 − (1 − α)^k ≤ k × α (close to k × α only when k × α is small); 5 tests at 0.05 → threshold 0.01; GWAS uses p < 5 × 10⁻⁸.

— Effects: Bonferroni reduces Type I error and power; increases Type II error; leaves point estimates unchanged; widens CIs.

— Method choice: Bonferroni (simple, conservative), Holm (uniformly better), Tukey HSD (pairwise post-ANOVA), Dunnett (vs control), Benjamini-Hochberg (FDR for high-dimensional screens), hierarchical gatekeeping (pre-specified RCT endpoint order), O'Brien-Fleming/Pocock (interim analyses).

— Clinical translation: Treat unreplicated subgroup or secondary findings as hypothesis-generating; continue guideline-based therapy until confirmatory trials with positive pre-specified primary endpoints emerge; counsel patients on evidence quality with humility and transparency.

One-liner: Multiple comparisons inflate the family-wise Type I error rate; the Bonferroni correction controls it by setting the per-test α to the family-wise α (typically 0.05) divided by the number of comparisons, at the cost of statistical power, which makes pre-specification, a transparent multiplicity strategy, and replication the foundation of trustworthy evidence.

Rapid recap bullets:

Board pearl: When a Step 3 stem describes a negative primary endpoint with a "significant" subgroup finding, the correct answer is overwhelmingly that the result is likely due to chance and requires confirmatory study before practice change — this single heuristic resolves the majority of multiplicity-related test questions.

Step 3 management: As the physician of record, your duty is to apply evidence-based, guideline-concordant care, recognize multiplicity-driven hype when you encounter it, document shared decisions transparently, and update practice only when robust, replicated, properly-adjusted evidence supports change.