Biostatistics & Population Health

Hypothesis testing: type I and type II error, power

Clinical Overview and When to Suspect Inference Errors

Hypothesis testing is the formal framework for deciding whether observed data are compatible with a prespecified "null" claim (H₀, usually "no effect") versus an alternative (H₁, "an effect exists").

Every test produces a binary decision (reject H₀ vs fail to reject H₀), and that decision can be wrong in two distinct ways:

— Type I error (α): rejecting H₀ when it is actually true → a false-positive conclusion ("the drug works" when it doesn't).

— Type II error (β): failing to reject H₀ when H₁ is actually true → a false-negative conclusion ("the drug doesn't work" when it actually does).

Power = 1 − β = probability of correctly detecting a true effect of a specified size. Conventionally set at 80–90% in trial design.

When to "suspect" inference errors on Step 3:

— A small study (n low) reports "no difference" → suspect type II error / underpowered.

— A study with many subgroup or interim analyses reports a "significant" finding → suspect type I error inflation (multiple comparisons).

— A p-value just under 0.05 with wide confidence intervals → fragile result, both errors plausible.

— Replication failure of a prior "positive" trial → original may have been a type I error or had publication bias.

Step 3 stems frame this as: a physician reads a journal article and must judge whether the conclusion is trustworthy enough to change practice, or counsels a patient/colleague about why a "negative" trial isn't proof of equivalence.

Diagnostic-test analogy that exam writers love: type I error mirrors a false-positive test result; type II error mirrors a false-negative. In that analogy, 1 − α plays the role of specificity and power (1 − β) plays the role of sensitivity of the decision.

Board pearl: "Absence of evidence is not evidence of absence." A non-significant p-value in a small trial usually means insufficient power, not proven equivalence; showing equivalence requires a formal non-inferiority or equivalence trial with a prespecified margin.

Presentation Patterns and Key History

Step 3 questions present hypothesis-testing concepts inside a clinical-research vignette. Recognize the stem archetypes:

— "A new antihypertensive was compared to lisinopril in 60 patients. Mean SBP reduction did not differ (p=0.21). The authors conclude the drugs are equivalent." → tests recognition of type II error / inappropriate equivalence claim.

— "Investigators tested 20 dietary variables for association with colon cancer; one (p=0.04) was significant." → multiple-comparisons type I inflation.

— "A trial with 90% power to detect a 10-mmHg difference found no effect." → adequately powered negative trial; a true difference of that size is unlikely, so the negative conclusion is more defensible (confirm that the CI excludes 10 mmHg).

— "The p-value was 0.001 but the confidence interval crossed the minimally important difference." → statistical vs clinical significance distinction.

Key "history" elements to extract from any research stem:

— Sample size (n) and event rate: together these drive power.

— Effect size the study was designed to detect (the "delta").

— α level chosen (usually 0.05, sometimes 0.01 with multiple testing).

— Number of comparisons / interim looks: each inflates type I error.

— Whether the design is superiority, non-inferiority, or equivalence: they have different null hypotheses.

Directionality:

— Two-sided test: H₁ is "different" (default; α split between tails).

— One-sided test: H₁ specifies direction; rarely accepted unless prespecified and justified. Examiners treat an unjustified one-sided test as a red flag for p-hacking.

Key distinction: A superiority trial has H₀ "no difference"; failing to reject it does not prove equivalence. A non-inferiority trial has H₀ "new is worse by more than margin Δ"; rejecting it supports non-inferiority. Confusing these two designs is a common Step 3 biostatistics trap.

Board pearl: When a stem says "the difference was not statistically significant," your first reflex should be to ask "what was the power?" rather than to accept the null.
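The superiority versus non-inferiority distinction in this section can be made concrete by reading the confidence interval against a margin. The sketch below is illustrative only: the function name, the sign convention (difference = new minus standard, higher is better), and the example intervals are assumptions, not values from any trial.

```python
def interpret_difference_ci(ci_low: float, ci_high: float, margin: float) -> str:
    """Classify a trial from the 95% CI of (new minus standard).

    Assumes higher values favour the new treatment and that margin > 0 is the
    prespecified non-inferiority margin (hypothetical convention for this sketch).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_low > 0:
        return "superior"        # whole CI lies above zero
    if ci_low > -margin:
        return "non-inferior"    # whole CI lies above -margin
    if ci_high < -margin:
        return "inferior"        # whole CI lies below -margin
    return "inconclusive"        # CI spans the margin: cannot resolve the question


# Hypothetical differences in SBP reduction (mmHg) with a 5-mmHg margin.
for label, ci in {"small trial": (-8.0, 3.0), "larger trial": (-3.0, 2.0)}.items():
    print(label, "->", interpret_difference_ci(*ci, margin=5.0))
```

Both hypothetical intervals include zero, so both would be "not significant" in a superiority test, yet only the narrower one supports non-inferiority. That is the gap between failing to reject H₀ and demonstrating equivalence.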

Physical Exam Findings (Conceptual "Exam" of a Study)

In biostatistics, the "physical exam" is the structured appraisal of a study's design for risk of type I and type II error. Walk through it like a system review:

Sample size adequacy (the "vital signs" of a study):

— Was an a priori power calculation reported? Look for "we calculated that 240 patients per arm would provide 80% power to detect a 15% relative risk reduction at α=0.05."

— Underpowered studies inflate type II error and produce inflated effect estimates if positive (winner's curse / type M error).

α handling (the "rhythm" check):

— Single primary endpoint with α=0.05 → standard.

— Multiple primary endpoints, subgroup analyses, or interim analyses without correction → family-wise type I error rises rapidly. With k independent tests at α=0.05, P(≥1 false positive) ≈ 1−(0.95)ᵏ; 10 tests → ~40% chance of a spurious "hit."

— Corrections: Bonferroni (α/k, conservative), Holm–Bonferroni, Benjamini–Hochberg FDR, O'Brien–Fleming for interim looks.

Effect size and confidence interval inspection (the "auscultation"):

— A 95% CI that excludes the null (1.0 for ratios, 0 for differences) corresponds to p<0.05.

— A wide CI signals imprecision → low power even if point estimate looks impressive.

— A CI that includes both a clinically trivial and a clinically important effect means the trial cannot resolve the question.

Outcome selection check:

— Composite endpoints can manufacture significance if a soft component drives the result.

— Surrogate endpoints (e.g., HbA1c instead of MI) may achieve statistical significance without clinical meaning.

Step 3 management: When asked to advise a colleague on whether to adopt a new therapy, "examine" the trial for power, α handling, CI width, and clinical (not just statistical) significance before recommending practice change.
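The family-wise error arithmetic quoted in the α-handling check, 1−(0.95)ᵏ for k independent tests, can be reproduced in a few lines. This is a minimal sketch assuming independent tests; the function names are illustrative.

```python
def familywise_error(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / k


def expected_false_positives(alpha: float, k: int) -> float:
    """Expected number of spurious 'hits' when every null hypothesis is true."""
    return alpha * k


for k in (1, 5, 10, 20):
    uncorrected = familywise_error(0.05, k)
    corrected = familywise_error(bonferroni_alpha(0.05, k), k)
    print(f"k={k:2d}  uncorrected={uncorrected:.3f}  with Bonferroni={corrected:.3f}")
```

With 10 tests the uncorrected rate is about 0.40, matching the figure in the text, while the Bonferroni threshold of 0.005 per test holds the family-wise rate just under 0.05.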

Diagnostic Workup — Defining α, β, and Power Quantitatively

• α (significance level / type I error rate):
— Prespecified probability of rejecting H₀ when H₀ is true.
— Conventional: 0.05 (two-sided). Stricter in genomics (5×10⁻⁸), interim analyses, or multiple comparisons.
— The p-value is the probability, assuming H₀ is true, of observing data as extreme or more extreme than what was seen. p < α → reject H₀.
• β (type II error rate):
— Probability of failing to reject H₀ when H₁ is true.
— Conventional acceptable ceiling: 0.20 (i.e., power ≥ 0.80).
• Power (1 − β): probability of detecting a true effect of a specified magnitude.
• Four levers that determine power — memorize these; they appear repeatedly:
— Sample size (n): ↑n → ↑power (biggest modifiable lever).
— Effect size (Δ): ↑true difference → ↑power. Smaller effects need larger trials.
— α: loosening α (e.g., 0.05 → 0.10) → ↑power, but ↑type I error.
— Variance (σ²): ↓variability (better measurement, homogeneous population) → ↑power.
• Relationship to disease prevalence: unlike PPV/NPV, α and β do not depend on the underlying prevalence of the effect; they are conditional probabilities given H₀ or H₁.
• The 2×2 truth table (mirror of diagnostic 2×2):

                     H₀ true          H₀ false (H₁ true)
Reject H₀            Type I (α)       Correct (1−β, power)
Fail to reject H₀    Correct (1−α)    Type II (β)

Board pearl: Power calculations are done before the study. Post-hoc power based on the observed effect is statistically meaningless and a recognized exam distractor. If a stem says "post-hoc power was 25%, so the trial was underpowered," that reasoning is invalid; instead, examine the confidence interval.
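The four levers listed above (n, Δ, α, σ) can be explored with a normal-approximation power function for comparing two means. This is a sketch under stated assumptions: equal group sizes, known common σ, a two-sided z-test, and hypothetical blood-pressure numbers.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_per_group: int, delta: float, sigma: float,
                    alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standardized true difference: delta divided by the SE of the difference.
    shift = abs(delta) / (sigma * sqrt(2 / n_per_group))
    # Probability of landing in either rejection region when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical baseline: detect a 5-mmHg difference, SD 10 mmHg, 63 per arm.
base = power_two_means(63, delta=5, sigma=10)
print(f"baseline power          {base:.2f}")
print(f"double n                {power_two_means(126, 5, 10):.2f}")
print(f"halve the effect size   {power_two_means(63, 2.5, 10):.2f}")
print(f"loosen alpha to 0.10    {power_two_means(63, 5, 10, alpha=0.10):.2f}")
print(f"double sigma            {power_two_means(63, 5, 20):.2f}")
```

Each line changes one lever while holding the others fixed, so the direction of every arrow in the list can be checked directly.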

Diagnostic Workup — Advanced Concepts and Confirmatory Frameworks

One-sided vs two-sided tests:

— Two-sided α=0.05 places 0.025 in each tail.

— One-sided α=0.05 puts all 0.05 in one tail → easier to reach significance, but only justified when an effect in the opposite direction is truly impossible or irrelevant. Default to two-sided on the exam.

Multiple comparisons and α inflation:

— Bonferroni: use α/k per test. Simple, conservative, reduces power.

— Holm step-down and Hochberg step-up: more powerful than Bonferroni while controlling family-wise error.

— Benjamini–Hochberg FDR: controls expected proportion of false discoveries; preferred in high-dimensional data (genomics, imaging).

— Hierarchical testing: prespecify endpoint order; test secondary only if primary is significant. This preserves α without correction.

Interim analyses (group sequential designs):

— Each look at accumulating data is another chance for type I error.

— O'Brien–Fleming boundaries: very strict early, near-nominal at end.

— Pocock boundaries: equal threshold each look.

— DSMB may stop trials early for efficacy, futility, or harm; early efficacy stopping tends to overestimate effect size.

Equivalence and non-inferiority testing:

— H₀ is reframed as "the new treatment is worse by ≥ Δ" (the non-inferiority margin).

— Rejecting this H₀ supports non-inferiority. The CI must lie entirely on the favorable side of the margin.

— Margin must be clinically justified and prespecified, typically ≤50% of the active comparator's historical effect.

Bayesian framing (briefly): posterior probability that H₁ is true, given prior and data. Increasingly cited in adaptive trials but not the Step 3 default.

Key distinction: "Statistically significant" (p<α) ≠ "clinically significant." A trial of 50,000 patients can yield p<0.001 for a 1-mmHg BP drop that nobody should prescribe a drug for. Always compare the CI to the minimal clinically important difference (MCID).
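To see why Holm and Benjamini–Hochberg are described as more powerful than Bonferroni, the three procedures can be run on the same set of p-values. The sketch below uses the standard textbook definitions of each procedure; the p-values are hypothetical.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject each test whose p-value is at or below alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ...; stop at first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """BH step-up: find the largest rank r with p_(r) <= alpha * r / m; reject ranks 1..r."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


pvals = [0.001, 0.012, 0.015, 0.035, 0.30]  # hypothetical endpoints
print("Bonferroni:", sum(bonferroni_reject(pvals)), "rejections")
print("Holm:      ", sum(holm_reject(pvals)), "rejections")
print("BH (FDR):  ", sum(benjamini_hochberg_reject(pvals)), "rejections")
```

On these values Bonferroni rejects one hypothesis, Holm three, and Benjamini–Hochberg four. The first two control the family-wise error rate; the last controls the false discovery rate, which is a weaker guarantee.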

Risk Stratification — Choosing Acceptable Error Rates

Setting α depends on the cost of a false positive:

— Lower α (0.01, 0.001) when a type I error is catastrophic: adopting a toxic therapy, regulatory drug approval with safety signal, genome-wide screens.

— Higher α (0.10) sometimes acceptable in pilot/exploratory studies where missing a signal (type II error) is costlier than chasing a false lead.

Setting β depends on the cost of a false negative:

— Lower β (higher power, 0.90+) when missing a true effect is harmful: vaccine efficacy trials, oncology survival, rare disease therapeutics, non-inferiority trials (typically 90% power).

— Standard 0.20 (80% power) for most superiority trials.

Trade-off principle: For a fixed sample size and effect size, lowering α raises β (and vice versa). The only way to lower both simultaneously is to ↑n or ↓variance.

Clinical analogy (Step 3 favorite):

— Screening test with high cost of missing disease → tolerate more false positives (analogous to higher α, lower β) → high sensitivity / high power.

— Confirmatory test before morbid intervention → tolerate more false negatives → high specificity / low α.

Regulatory framing:

— FDA typically requires two independent phase III trials at α=0.05; requiring both to be positive gives a joint false-positive rate of roughly 0.05² = 0.0025 (about 0.0006 if only errors in the favorable direction are counted), a structural safeguard against type I error.

— Post-marketing surveillance addresses residual type II errors for rare adverse events that phase III trials are underpowered to detect.

Step 3 management: When counseling a patient about a new therapy, recognize that rare serious adverse events are systematically underdetected in pivotal trials (type II error for safety). Incorporate post-marketing data and shared decision-making about uncertainty, especially in the first 1–2 years after approval.

Pharmacotherapy — Sample Size Calculation as the "First-Line Regimen"

Sample size is the primary therapeutic lever for controlling both error types. The canonical formula for comparing two means:

n per group ≈ (2σ² × (Z_{α/2} + Z_β)²) / Δ²

— σ²: outcome variance.

— Δ: smallest clinically meaningful difference.

— Z_{α/2}: 1.96 for two-sided α=0.05.

— Z_β: 0.84 for 80% power; 1.28 for 90% power.

Practical implications (memorize directions):

— Halving Δ quadruples required n (because Δ is squared).

— Doubling σ quadruples required n.

— Going from 80% → 90% power increases n by ~34%.

— Tightening α from 0.05 → 0.01 increases n by ~50%.

Proportions / event-driven trials:

— For time-to-event outcomes, number of events, not number of patients, drives power. Trials may extend follow-up rather than enroll more patients.

— Low event rates → very large n required. This is why cardiovascular outcome trials of statins or SGLT2 inhibitors enroll 10,000–17,000 patients.

Cluster randomized trials: must inflate n by the design effect = 1 + (m−1)ρ, where m is cluster size and ρ is intracluster correlation. Ignoring clustering causes artifactually low p-values (type I inflation).

Adjustments for dropout: if expected attrition is 20%, enroll n/0.80. Underestimating dropout → underpowered trial → type II error.

Pilot studies estimate σ but should not be used to declare efficacy; their CIs are too wide.

Board pearl: When a stem reports a "negative" trial, check three things in order: (1) was n based on a prespecified power calculation? (2) what effect size did they power for? (3) does the 95% CI exclude clinically important effects? Only if all three pass can you reasonably accept the null.
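The two-means formula in this section, together with the design-effect and dropout adjustments, can be checked numerically. The sketch below is a normal-approximation calculation with illustrative function names and hypothetical inputs (Δ = 5 mmHg, σ = 10 mmHg); real trial planning would usually rely on validated software.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sigma: float, alpha: float = 0.05,
                power: float = 0.80) -> float:
    """n per group = 2*sigma^2*(Z_{alpha/2} + Z_beta)^2 / delta^2 (unrounded)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for two-sided alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    return 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


def design_effect(cluster_size: int, icc: float) -> float:
    """Inflation factor for cluster randomization: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def inflate_for_dropout(n: float, dropout: float) -> float:
    """Enrollment needed so that n participants remain after attrition."""
    return n / (1 - dropout)


base = n_per_group(delta=5, sigma=10)
print("base n per group:       ", ceil(base))
print("halve delta:            ", round(n_per_group(2.5, 10) / base, 2), "x")
print("80% -> 90% power:       ", round(n_per_group(5, 10, power=0.90) / base, 2), "x")
print("alpha 0.05 -> 0.01:     ", round(n_per_group(5, 10, alpha=0.01) / base, 2), "x")
print("clusters of 20, rho 0.05:", ceil(base * design_effect(20, 0.05)))
print("plus 20% dropout:       ", ceil(inflate_for_dropout(base, 0.20)))
```

The ratios reproduce the memorized directions: about 4× for halving Δ, about 1.34× for moving to 90% power, and about 1.49× for tightening α to 0.01.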

Procedures — Performing and Interpreting the Test

Workflow of a hypothesis test (the "procedure"):

1. State H₀ and H₁ (prespecified, written in the protocol).

2. Choose α and the test statistic appropriate to the data:

— Continuous, normal, 2 groups: Student t-test (paired vs unpaired).

— Continuous, non-normal: Wilcoxon rank-sum / Mann–Whitney.

— Continuous, >2 groups: ANOVA; Kruskal–Wallis if non-normal.

— Categorical, 2×2: Chi-square; Fisher exact if expected cell <5.

— Time-to-event: log-rank test; Cox proportional hazards for adjusted.

— Paired binary: McNemar.

3. Compute test statistic and p-value.

4. Compare p to α; report effect size with 95% CI.

5. Interpret in clinical context.

Common procedural errors that inflate type I error:

— Switching the primary endpoint after seeing data (outcome switching).

— HARKing (Hypothesizing After Results are Known).

— p-hacking: trying multiple analyses until one yields p<0.05.

— Repeated significance testing during accrual without α adjustment.

Common procedural errors that inflate type II error:

— Inadequate sample size, high dropout, poor adherence (dilutes effect).

— Crossover between treatment arms in intention-to-treat analyses.

— Measurement error increasing σ.

Intention-to-treat (ITT) vs per-protocol (PP):

— ITT preserves randomization. It is conservative for superiority (biases toward the null → may increase type II error) but liberal for non-inferiority, where both ITT and PP analyses should be reported and should agree.

Confidence interval interpretation: A 95% CI means that, under repeated sampling, 95% of such intervals would contain the true parameter. It does not mean there is a 95% probability the true value lies in this specific interval (that is the Bayesian credible interval).

CCS pearl: Choosing the wrong test for the data type is a recognized exam distractor, e.g., applying a t-test to ordinal pain scores or chi-square to a 2×2 with an expected cell of 2 (use Fisher exact).
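The "expected cell <5 → Fisher exact" rule from the test-selection step can be illustrated with a small standard-library implementation. This is a sketch: the two-sided p-value sums all tables no more probable than the observed one (one common convention), and the counts are hypothetical.

```python
from math import comb


def min_expected_count(a: int, b: int, c: int, d: int) -> float:
    """Smallest expected cell count under independence for the table [[a, b], [c, d]]."""
    rows, cols = (a + b, c + d), (a + c, b + d)
    n = a + b + c + d
    return min(r * col / n for r in rows for col in cols)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value from the hypergeometric distribution."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of x in the top-left cell with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


table = (1, 9, 5, 5)  # hypothetical: events / non-events in two arms of 10
if min_expected_count(*table) < 5:
    print("expected cell < 5, use Fisher exact: p =",
          round(fisher_exact_two_sided(*table), 3))
```

Here the smallest expected count is 3, so the chi-square approximation is unreliable and the exact test is the appropriate choice.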

Special Populations — Small Samples, Rare Events, Skewed Data

Small-sample settings (n < 30 per group, or sparse events):

— t-distribution accounts for small-n uncertainty; degrees of freedom matter.

— Fisher exact test preferred over chi-square when expected cell counts <5.

— Permutation/exact methods avoid asymptotic assumptions.

— Power is generally very low → high type II error risk. Pilot studies should be labeled hypothesis-generating, never confirmatory.

Rare-event trials (orphan diseases, rare adverse events):

— Conventional RCTs may be infeasible.

— Use Bayesian designs with informative priors, adaptive designs, basket/umbrella trials, or external/historical controls.

— Regulatory agencies may accept higher α or smaller n for breakthrough/orphan designations, accepting greater type I risk in exchange for access.

Skewed or non-normal data:

— Log-transform highly skewed outcomes (e.g., LOS, biomarker levels) before t-tests.

— Or use nonparametric tests (Wilcoxon, Kruskal–Wallis); these compare ranks rather than means, have slightly lower power than parametric tests when normality holds, and are robust when it doesn't.

Heterogeneous populations:

— High variance ↑ required n. Stratified randomization by key prognostic variables reduces σ within strata and increases power.

— Covariate adjustment in analysis (ANCOVA, adjusted regression) also boosts power if covariates are prespecified.

Cluster and hierarchical data:

— Patients within clinics, eyes within patients, repeated measures: must use mixed-effects models or GEE. Ignoring correlation shrinks p-values artificially → type I inflation.

Key distinction: A trial showing no benefit in a rare disease is far more likely to reflect type II error than a similarly negative trial in a common disease. Apply much more skepticism to "negative" rare-disease trials before concluding equivalence.

Special Populations — Subgroup Analyses and Heterogeneity of Effect

Subgroup analyses are a major source of type I error inflation in clinical research.

Each subgroup test is another opportunity to find a spurious "significant" result. Twenty subgroups at α=0.05 → expect ~1 false positive by chance alone.

Rules for credible subgroup findings (familiar exam checklist):

— Prespecified, not post-hoc.

— Small number of clinically motivated subgroups.

— Tested via formal interaction term (treatment × subgroup), not by comparing within-subgroup p-values.

— Consistent across related subgroups and biologically plausible.

— Effect direction matches the overall trial.

Forest plot interpretation:

— Within-subgroup CIs almost always cross the null because subgroups are underpowered (type II error).

— An "apparent" benefit in one subgroup with non-significance in the overall trial is almost always a chance finding.

Pediatric, geriatric, pregnant patients:

— Often excluded from pivotal trials → effect estimates extrapolated, with unrecognized type II error for these groups.

— Step 3 ethical thread: equitable inclusion vs vulnerable-population protections.

Sex-specific and race-stratified analyses:

— FDA encourages reporting, but interpret cautiously; these are usually exploratory.

Heterogeneity of treatment effect (HTE):

— Real biological heterogeneity exists (e.g., EGFR-mutant NSCLC and EGFR inhibitors).

— Distinguish from statistical noise; require replication and mechanism.

Board pearl: The classic teaching example is the ISIS-2 trial, which famously showed (tongue-in-cheek) that aspirin "didn't work" in patients born under Gemini or Libra. The lesson: any large enough trial sliced into enough subgroups will produce nonsensical "significant" findings. Interpret subgroup analyses with extreme skepticism and demand prespecification.
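The rule that subgroup claims need a formal interaction test, rather than a comparison of within-subgroup p-values, can be sketched with a simple z-test on the difference between two log hazard ratios. The assumptions here are independent subgroups, approximately normal log HRs, and standard errors recovered from the reported 95% CIs; all numbers are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

Z = NormalDist()


def se_from_ci(ci_low: float, ci_high: float, level: float = 0.95) -> float:
    """Standard error of log(HR) recovered from a ratio-scale confidence interval."""
    z = Z.inv_cdf(0.5 + level / 2)
    return (log(ci_high) - log(ci_low)) / (2 * z)


def subgroup_p(hr: float, ci: tuple) -> float:
    """Two-sided p-value for HR = 1 within one subgroup."""
    z = log(hr) / se_from_ci(*ci)
    return 2 * (1 - Z.cdf(abs(z)))


def interaction_p(hr_a: float, ci_a: tuple, hr_b: float, ci_b: tuple) -> float:
    """Two-sided p-value for the difference between two independent log HRs."""
    se_diff = sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z = (log(hr_a) - log(hr_b)) / se_diff
    return 2 * (1 - Z.cdf(abs(z)))


# Hypothetical forest-plot rows: one subgroup "significant", the other not.
a = (0.70, (0.50, 0.98))
b = (0.90, (0.65, 1.25))
print("subgroup A p:  ", round(subgroup_p(*a), 3))
print("subgroup B p:  ", round(subgroup_p(*b), 3))
print("interaction p: ", round(interaction_p(a[0], a[1], b[0], b[1]), 3))
```

Subgroup A crosses the 0.05 threshold and subgroup B does not, yet the interaction p-value is nowhere near significant. The data give no real evidence that the treatment works differently in the two groups.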

Complications and Adverse Outcomes of Misinterpreting Tests

Clinical consequences of type I error (false positive):

— Adoption of ineffective or harmful therapy (e.g., historical examples: hormone replacement for CV prevention, antiarrhythmics post-MI in the CAST trial).

— Wasted resources, exposure to side effects, opportunity cost of not pursuing better therapies.

— Erosion of trust in medical literature when results fail to replicate (replication crisis).

Clinical consequences of type II error (false negative):

— Effective therapies abandoned or delayed.

— Patients denied beneficial treatment.

— Particularly damaging in rare diseases and safety signals (an underpowered trial may miss a real but uncommon adverse event).

Related statistical pathologies:

— Type M error (magnitude): when underpowered studies do find significance, effect estimates are inflated ("winner's curse").

— Type S error (sign): in low-power settings, the significant effect can be in the wrong direction.

— Publication bias: positive trials published preferentially → meta-analyses overestimate effect → systemic type I error at the literature level.

— Outcome reporting bias: investigators report endpoints that "worked," suppress others.

Real-world fallout:

— Guidelines based on a single positive trial that fails to replicate.

— Reversal of practice when larger trials emerge (e.g., tight glycemic control in ICU, where NICE-SUGAR overturned earlier smaller trials).

Mitigation strategies:

— Prospective trial registration (ClinicalTrials.gov) before enrollment.

— Prespecified statistical analysis plans.

— Independent DSMB oversight.

— Replication and meta-analysis.

— Reporting checklists (CONSORT for RCTs, STROBE for observational).

Step 3 management: Before changing practice based on a new trial, ask: is it registered, prespecified, adequately powered, and ideally replicated? A single underpowered or unreplicated trial, positive or negative, should rarely change established care.
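The type S and type M errors named in this section can be quantified for a given true effect and standard error. The sketch below follows the general "retrodesign" idea associated with Gelman and Carlin: power and type S come from the normal distribution, and type M (the exaggeration ratio) is estimated by simulation. The function name, the seed, and the inputs are illustrative assumptions.

```python
import random
from statistics import NormalDist


def retrodesign(true_effect: float, se: float, alpha: float = 0.05,
                n_sims: int = 20000, seed: int = 0):
    """Return (power, type_s, type_m) for a positive true effect and given SE."""
    if true_effect <= 0 or se <= 0:
        raise ValueError("this sketch assumes true_effect > 0 and se > 0")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ratio = true_effect / se
    p_right = 1 - z.cdf(z_crit - ratio)   # significant with the correct sign
    p_wrong = z.cdf(-z_crit - ratio)      # significant with the wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power
    # Type M: average size of significant estimates relative to the truth.
    rng = random.Random(seed)
    significant = [abs(est) for est in
                   (rng.gauss(true_effect, se) for _ in range(n_sims))
                   if abs(est) > z_crit * se]
    type_m = (sum(significant) / len(significant)) / true_effect
    return power, type_s, type_m


for label, se in (("underpowered", 8.0), ("well powered", 0.7)):
    power, type_s, type_m = retrodesign(true_effect=2.0, se=se)
    print(f"{label}: power={power:.2f}  type S={type_s:.2f}  exaggeration={type_m:.1f}x")
```

In the underpowered scenario a "significant" result has roughly a one-in-four chance of pointing the wrong way and overstates the effect several-fold; with adequate power both problems nearly vanish.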

When to Escalate — Demanding Replication, Meta-Analysis, or DSMB Action

Escalation framework for evaluating evidence before adopting practice change:

Single small trial, p<0.05:

— Low prior probability of truth (especially if novel mechanism); high risk of type I error.

— Action: await replication, do not change practice unless effect is large and consistent with mechanism.

Single large, well-powered, prespecified trial:

— Reasonable basis for practice change in superiority contexts.

— Still prefer replication for high-stakes interventions.

Multiple trials with mixed results:

— Escalate to systematic review and meta-analysis with assessment of heterogeneity (I² statistic).

— Check for publication bias (funnel plot, Egger test).

— GRADE framework rates certainty of evidence.

Interim analyses during ongoing trials:

— DSMB may recommend early stopping for efficacy (rare; risks effect-size inflation), futility (low conditional power; stopping accepts some risk of a type II error), or harm.

— Step 3 stems may ask: should the trial continue? Apply O'Brien–Fleming-type thinking: very high bar to stop early.

Safety signals:

— Even nonsignificant trends toward harm in a single trial warrant escalation if biologically plausible; type II error for safety is asymmetric in cost.

— Post-marketing pharmacovigilance and FAERS reporting catch rare events that phase III trials are underpowered to detect.

Guideline interpretation:

— Class I / Level A recommendations rest on multiple consistent high-power RCTs.

— Class IIb / Level C: expert consensus, much higher residual uncertainty.

CCS pearl: When a Step 3 stem describes a hospital P&T committee considering a new drug after one positive trial, the "right answer" is usually to wait for confirmatory data, restrict to a formulary subset, or require ongoing outcome monitoring, not immediate broad adoption.
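For the "multiple trials with mixed results" step, the I² statistic can be computed from trial-level effects and standard errors. The sketch below uses a fixed-effect inverse-variance pooled estimate and Cochran's Q, with I² = max(0, (Q − df)/Q); the log risk ratios are hypothetical.

```python
def heterogeneity(effects, std_errors):
    """Return (pooled_effect, cochran_q, i_squared) using inverse-variance weights."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two trials with matching inputs")
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributable to between-trial heterogeneity.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical log risk ratios from three trials in each scenario.
consistent = heterogeneity([-0.20, -0.25, -0.18], [0.10, 0.12, 0.15])
mixed = heterogeneity([-0.50, 0.10, -0.05], [0.10, 0.10, 0.10])
print("consistent trials: I^2 = {:.0%}".format(consistent[2]))
print("mixed trials:      I^2 = {:.0%}".format(mixed[2]))
```

A high I² in the mixed scenario signals that a single pooled estimate may hide real differences between trials, which is when the funnel plot and GRADE assessment become most important.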

Key Differentials — Same-Category Statistical Errors

Errors within the inferential family that get confused with type I / II:

Type I error (α) vs the p-value:

— α is the prespecified threshold; p is the observed probability under H₀.

— A p of 0.04 does not mean 4% chance H₀ is true. It means: if H₀ were true, data this extreme or more would occur 4% of the time.

Type II error (β) vs effect size:

— β depends on the specific alternative (effect size) you power against. Power is always "power to detect Δ."

Type III error:

— Correctly rejecting H₀ but for the wrong reason / wrong direction, e.g., concluding drug A > drug B when the true effect is drug B > drug A (a sign error).

Type S (sign) and Type M (magnitude) errors (Gelman):

— S: significant finding in the wrong direction.

— M: significant finding with grossly inflated magnitude.

— Both worsen as power drops.

Misclassification of test results:

— False-positive diagnostic test ≠ type I error, though analogous. Type I/II refer to inferential decisions about hypotheses, not individual patient test results.

Confidence interval misinterpretation:

— A 95% CI does not mean "95% probability the parameter lies here" in frequentist statistics. It means the procedure captures the truth 95% of the time across repeated samples.

"Failure to reject" vs "accept" H₀:

— You never accept H₀; you only fail to reject it. Acceptance requires an equivalence framework with a margin.

Key distinction: A non-significant result and a result demonstrating equivalence are statistically very different. Many published trials cross this line incorrectly, which makes it a Step 3 favorite trap.

Key Differentials — Bias and Confounding (Non-Inferential Threats)

Hypothesis-testing errors are random (α, β); these alternatives are systematic and not fixed by larger sample sizes:

Selection bias:

— Non-random sampling or differential loss to follow-up distorts effect estimates.

— Increasing n does not fix it.

Information / measurement bias:

— Misclassification of exposure or outcome.

— Non-differential → biases toward null (false type II–like effect).

— Differential → can bias in either direction.

Recall bias (case-control), interviewer bias, detection bias (more imaging in one group → more diagnoses).

Confounding:

— A third variable associated with both exposure and outcome, not on the causal pathway.

— Addressed by randomization (in RCTs) or adjustment / matching / stratification / propensity scores (in observational studies).

— Residual confounding always possible in observational data; this is a major reason observational "positive" findings often fail in RCTs.

Effect modification (interaction):

— Different effect sizes in subgroups: a real biological phenomenon, not a bias.

Regression to the mean:

— Extreme baseline values tend toward the mean on retesting, mimicking treatment effect; addressed by control groups.

Hawthorne effect, placebo effect: addressed by blinding.

Why this matters for type I/II thinking:

— A study can be perfectly powered (low β) and still produce wrong conclusions due to bias.

— Large n with bias produces precisely wrong estimates: narrow CIs around a biased point estimate.

Board pearl: Bigger studies don't fix bias. When a stem describes a massive observational cohort with a tiny p-value (e.g., 500,000 patients, p<0.0001) for an effect that has failed in RCTs (e.g., vitamin E, HRT for CV prevention), the correct answer addresses confounding, not power.

Secondary Prevention — Designing the Next Study to Avoid Both Errors

Lessons translate directly into better future trial design, a common Step 3 framing where the resident/PI is asked "how would you improve this study?":

Prevent type I error:

— Prespecify one primary endpoint; place others in a hierarchical sequence.

— Adjust α for multiple comparisons, interim analyses, multiple arms.

— Register the trial and analysis plan before data lock.

— Blind investigators and outcome assessors.

— Replicate in an independent cohort before guideline change.

— Use two-sided tests by default.

Prevent type II error:

— Perform an a priori power calculation based on realistic effect size and variance.

— Inflate enrollment for anticipated dropout and crossover.

— Use continuous outcomes rather than dichotomized versions when possible (preserves information).

— Reduce measurement error (standardized protocols, central adjudication).

— Use stratified randomization and prespecified covariate adjustment to reduce residual variance.

— Consider adaptive designs that adjust sample size based on interim variance estimates (with α control).

Reproducibility infrastructure:

— Open data, code sharing, preregistration.

— CONSORT, SPIRIT, STROBE reporting checklists.

Clinical translation:

— Even a beautifully designed positive trial should be evaluated for external validity before applying to your population.

— Effectiveness (pragmatic) trials complement efficacy (explanatory) trials.

Step 3 management: When advising a junior colleague designing a QI project or trial, your "discharge plan" is: prespecify the hypothesis and primary endpoint, power adequately for the clinically meaningful Δ, plan for multiplicity, register, and prespecify the analysis. These few decisions prevent the majority of inferential errors.

Follow-Up — Monitoring Literature and Updating Practice

Hypothesis-testing thinking continues after publication, as evidence accumulates:

Post-publication surveillance:

— Watch for failed replications, retractions, errata.

— Track real-world effectiveness data (registries, EHR-based studies).

— Monitor adverse event reports; phase III trials are systematically underpowered for rare harms (type II error for safety).

Living guidelines and meta-analytic updates:

— Cumulative meta-analyses can show when evidence first crossed the threshold for confidence, sometimes years before guidelines updated.

— Sequential meta-analysis methods (trial sequential analysis) control α across accumulating trials.

De-implementation:

— When a once-positive practice is overturned (e.g., perioperative β-blockers after POISE, IV tPA in low-NIHSS strokes refined post hoc), recognize the original may have been a type I error or biased finding.

— De-adoption is harder than adoption because of anchoring bias in clinical practice.

Patient counseling under uncertainty:

— Explain that "the studies show benefit on average" hides residual uncertainty.

— Shared decision-making incorporates effect size, CI width, and patient values.

Continuing education cadence:

— Read the primary trial for high-impact practice changes, not just the press release or abstract.

— Examine the CI, primary endpoint, prespecification, and population.

Tracking your own decisions:

— Personal audit of cases where you applied trial evidence: outcome tracking is a form of n=1 hypothesis checking, susceptible to both type I (one good outcome → overconfidence) and type II (one bad outcome → premature abandonment) errors.

Board pearl: A statin trial with NNT=50 over 5 years to prevent one MI is "true" (rejected H₀ correctly) but the clinical conversation must include that 49/50 patients receive no individual benefit. Separating population-level inference from individual decision-making is core Step 3 thinking.

Ethical, Legal, and Patient Safety Considerations

Ethical dimensions of inferential errors:

Type I error → harm by exposure:

— Approving and prescribing an ineffective or harmful drug exposes real patients to risk based on a statistical fluke.

— Ethical duty of non-maleficence demands rigorous α control, especially for therapies that displace established care.

Type II error → harm by omission:

— Failing to detect a real benefit denies patients effective therapy.

— Particular concern in underrepresented populations (women, elderly, minorities, pregnant patients) where subgroup power is low.

Informed consent for research:

— Subjects must be told the trial may not detect benefit (type II) or may yield a false positive (type I); uncertainty is intrinsic to the enterprise.

— In non-inferiority trials, patients must understand they may receive a treatment that is allowed to be modestly worse than standard care, within a prespecified margin.

Equipoise:

— Genuine uncertainty about which arm is better is the ethical justification for randomization. If interim data eliminate equipoise, the DSMB has an obligation to consider stopping (early efficacy or harm).

Publication and data-sharing ethics:

— Suppressing negative trials (publication bias) is a research misconduct issue and contributes to systemic type I error in the literature.

— FDA Amendments Act mandates ClinicalTrials.gov reporting.

Patient safety / transition-of-care application:

— When a new therapy is added based on a single trial, document the rationale and uncertainty in the discharge summary and communicate it to the primary care physician. Premature adoption that requires later de-prescribing is a recognized transition-of-care hazard, especially in polypharmacy elderly patients.

Conflicts of interest:

— Industry-sponsored trials show systematic effect-size inflation; disclosure and independent analysis mitigate.

Step 3 management: When a patient asks about a "breakthrough" drug from a press release, your ethical obligation includes explaining what kind of error the underlying trial is most vulnerable to, and revisiting the decision as more evidence accrues.

High-Yield Associations and Rapid-Fire Clinical Facts

Board pearl: One mnemonic covers both errors: "α is the alarm that cries wolf (type I); β is the boy who misses the wolf (type II)." Power is the probability the boy actually spots the wolf when it's really there.

Type I error = α = false positive = "convicting the innocent" (H₀ = innocent).

Type II error = β = false negative = "letting the guilty go free."

Power = 1 − β; conventional ≥ 0.80, often 0.90 for non-inferiority.

Two-sided α = 0.05 → Z = 1.96; 80% power → Z = 0.84.

n ∝ 1/Δ² — halving the detectable effect quadruples sample size.

n ∝ σ² — doubling variance doubles required n.

A (1 − α) CI excludes the null value ↔ p < α (same test, same α level).

Wider CI = less precision = lower power.

Post-hoc power is meaningless — use the CI instead.

Bonferroni = α/k; conservative; reduces power.

Non-inferiority margin must be prespecified; analyze both ITT and PP.

ITT is conservative for superiority, liberal for non-inferiority.

Stopping early for efficacy inflates effect size estimates.

Subgroup analyses inflate type I error; require prespecification and interaction tests.

Fisher exact test when any expected cell count is <5; McNemar test for paired binary data.

Log-rank for survival curve comparison; Cox for adjusted HR.

Increasing α decreases β (and vice versa) at fixed n.

Rare-event safety detection requires post-marketing surveillance — phase III is underpowered.

Replication crisis = systemic type I error from publication bias, p-hacking, and underpowered original studies.

Type S (sign) and Type M (magnitude) errors worsen with low power.

Equivalence ≠ failure to reject H₀ — separate test design.

Hazard ratio CI crossing 1.0 = nonsignificant; for risk difference, null = 0.

Number of events, not patients, drives power in time-to-event trials.

Cluster designs require design-effect inflation.

Adjusting for prespecified covariates boosts power; post-hoc adjustment risks α inflation.
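
The Z values and the two proportionality rules in the list above (n ∝ 1/Δ² and n ∝ σ²) can be checked with a short calculation. The sketch below assumes the usual normal-approximation formula for comparing two means with equal variance and 1:1 allocation, n per arm ≈ 2σ²(z₁₋α/₂ + z₁₋β)²/Δ². The function name `n_per_arm` and the example numbers (Δ = 5, σ = 10) are illustrative assumptions, not values from any trial.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided comparison of two means.

    Assumes a normal approximation, equal variances, and 1:1 allocation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for two-sided alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


# Hypothetical design: detect a 5-unit difference when the SD is 10.
base = n_per_arm(delta=5.0, sigma=10.0)
half_effect = n_per_arm(delta=2.5, sigma=10.0)             # halve the detectable effect
double_var = n_per_arm(delta=5.0, sigma=10.0 * 2 ** 0.5)   # double the variance

print("baseline n per arm:", ceil(base))
print("ratio after halving the effect:", round(half_effect / base, 2))
print("ratio after doubling the variance:", round(double_var / base, 2))
```

Under these assumptions the ratios come out to 4 and 2, matching the quadrupling and doubling rules; real trial planning would also account for dropout and the specific test used.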

Board Question Stem Patterns

Pattern 1 — Underpowered negative trial:

— "60 patients, no significant difference (p=0.18), authors conclude drugs are equivalent." → Answer: type II error / inadequate power; cannot claim equivalence from a superiority trial.

Pattern 2 — Multiple comparisons:

— "Investigators tested 25 SNPs; one was associated with disease (p=0.03)." → Answer: type I error inflation; the Bonferroni-corrected threshold would be 0.05/25 = 0.002.

Pattern 3 — Subgroup spurious finding:

— "Overall trial negative, but benefit in patients aged 65–75 (p=0.04)." → Answer: post-hoc subgroup; likely chance; requires prespecification and replication.

Pattern 4 — Post-hoc power:

— "After negative results, authors compute power was 30%." → Answer: post-hoc power is invalid; assess the confidence interval instead.

Pattern 5 — Statistical vs clinical significance:

— "n=20,000; SBP reduction 1.2 mmHg, p<0.001." → Answer: statistically significant but not clinically meaningful.

Pattern 6 — Non-inferiority misinterpreted:

— "ITT analysis shows non-inferiority, PP analysis does not." → Answer: non-inferiority is not established; the ITT and PP analyses should both support it before it is claimed.

Pattern 7 — Effect-size inflation from early stopping:

— "DSMB stopped trial early for efficacy; observed HR 0.45." → Answer: effect likely overestimated; await replication.

Pattern 8 — Confounding vs power:

— "Large observational study, p<0.0001, contradicts RCT." → Answer: confounding / bias, not power. Bigger n does not fix systematic error.

Pattern 9 — Choosing the right test:

— "Two groups, ordinal pain scores, non-normal distribution." → Answer: Mann–Whitney U test, not t-test.

— "2×2 table, expected cell count = 3." → Answer: Fisher exact test, not chi-square.

Pattern 10 — Lever question:

— "Which would most increase the study's power?" → Answer: the largest gain typically comes from ↑n, a larger targeted effect size, or ↓variance, not from loosening α (which would be flagged as inappropriate).

Key distinction: When a stem asks for the best response to a "negative" trial, distinguish among (a) accept the null, (b) declare equivalence, and (c) suspect underpowering. The correct exam answer is almost always (c): suspect type II error and examine the CI and sample size justification, unless the trial was explicitly powered as a non-inferiority/equivalence study.
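
The arithmetic behind the multiple-comparisons stem (25 SNPs, one hit at p = 0.03) can be sketched as follows. The family-wise error formula 1 − (1 − α)^k assumes the 25 tests are independent, which is a simplification (nearby SNPs are often correlated); the function names are illustrative.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(alpha, k):
    """Per-test significance threshold that keeps the family-wise error at or below alpha."""
    return alpha / k


k = 25                                       # SNPs tested in the stem
fwer = familywise_error(0.05, k)             # uncorrected family-wise error rate
threshold = bonferroni_threshold(0.05, k)    # corrected per-test threshold
survives = 0.03 < threshold                  # does the reported p-value pass?

print(f"family-wise error without correction: {fwer:.2f}")
print(f"Bonferroni threshold: {threshold:.3f}")
print("p = 0.03 significant after correction:", survives)
```

With 25 uncorrected tests, at least one "significant" result is more likely than not even if no SNP is truly associated, which is why the single p = 0.03 finding does not survive the 0.002 threshold.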

One-Line Recap

Hypothesis testing trades off two errors: type I (α, false positive — rejecting a true null) and type II (β, false negative — missing a real effect), with power (1 − β) driven primarily by sample size, effect size, variance, and α — and clinical decision-making demands awareness of which error is operative in any given study.

Board pearl: α is the alarm that cries wolf (false positive, type I); β is the boy who misses the wolf (false negative, type II); power is whether he actually sees it when it comes. Every clinical study is an attempt to balance these two errors against the cost of being wrong in either direction.

The 2×2 truth table: reject true H₀ → type I (α); fail to reject false H₀ → type II (β); the other two cells are correct decisions. Power = correctly rejecting a false H₀.
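
The cells of the truth table can be estimated by simulation. This is a minimal sketch, assuming a two-arm trial analyzed with a two-sided z-test and a known SD; the sample size (63 per arm), SD (10), true difference (5), and the function name `rejection_rate` are hypothetical choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_diff, n=63, sigma=10.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated two-arm trials in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n)  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        control = fmean(rng.gauss(0.0, sigma) for _ in range(n))
        treated = fmean(rng.gauss(true_diff, sigma) for _ in range(n))
        if abs(treated - control) / se > z_crit:
            rejections += 1
    return rejections / trials


# H0 true (no real difference): every rejection is a type I error.
type_i_rate = rejection_rate(true_diff=0.0)

# H0 false (real difference of 5): rejections are correct, misses are type II errors.
power = rejection_rate(true_diff=5.0)
type_ii_rate = 1 - power

print(f"type I rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, type II rate ~ {type_ii_rate:.3f}")
```

Under these assumptions the simulated type I rate should land near α = 0.05 and the power near 0.80, with some Monte Carlo noise.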

Power levers, in descending order of modifiability: sample size (biggest), targeted effect size, measurement precision (variance), and α. The trade-off is that loosening α to gain power inflates false positives.
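
To see how each lever moves power, the sketch below uses the normal approximation for a two-sided comparison of two means with n patients per arm. The baseline design (n = 30, Δ = 5, SD = 10) and the function name `power_two_means` are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two means, n per arm."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * sqrt(2 / n))  # expected test statistic when H1 is true
    # Probability the statistic falls beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


baseline = power_two_means(n=30, delta=5, sigma=10)
levers = {
    "double n": power_two_means(n=60, delta=5, sigma=10),
    "larger targeted effect": power_two_means(n=30, delta=7, sigma=10),
    "lower SD": power_two_means(n=30, delta=5, sigma=8),
    "looser alpha (0.10)": power_two_means(n=30, delta=5, sigma=10, alpha=0.10),
}

for name, value in levers.items():
    print(f"{name}: {baseline:.2f} -> {value:.2f}")
```

Every lever raises power in this toy design, but only the last one does so by accepting more false positives.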

Most common Step 3 traps: equating non-significance with equivalence (it isn't — that needs a non-inferiority design with a prespecified margin); accepting subgroup or multiple-comparison findings without α adjustment; computing post-hoc power instead of inspecting confidence intervals; and assuming bigger sample size fixes bias (it doesn't — it produces precisely wrong estimates).
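
One way to see why post-hoc power adds nothing: when "observed power" is computed by plugging the observed effect back into the power formula, it becomes a fixed function of the p-value. The sketch below assumes a two-sided z-test; `observed_power` is an illustrative name.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post-hoc 'observed power' implied by a two-sided p-value under a z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_obs = z.inv_cdf(1 - p_value / 2)  # |z| statistic implied by the p-value
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.18, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

Any non-significant result necessarily yields observed power below about 50%, so reporting it restates the p-value; the confidence interval shows which effect sizes remain plausible.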

Clinical application: before changing practice on a new trial, verify prespecification, registration, adequate a priori power, appropriate α control for multiplicity, intention-to-treat analysis, and ideally independent replication — and recognize that rare adverse events are systematically under-detected in phase III (type II error for safety), making post-marketing surveillance and longitudinal vigilance essential.
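
The point about rare adverse events can be made concrete with a simple probability sketch. It assumes independent patients with a constant per-patient risk; the event rate (1 in 10,000) and trial size (3,000 exposed patients) are hypothetical numbers, and the function names are illustrative.

```python
from math import ceil, log


def prob_at_least_one(event_rate, n):
    """Probability of observing at least one event among n exposed patients."""
    return 1 - (1 - event_rate) ** n


def n_for_detection(event_rate, confidence=0.95):
    """Smallest n giving the stated probability of seeing at least one event."""
    return ceil(log(1 - confidence) / log(1 - event_rate))


rate = 1 / 10_000                          # hypothetical rare adverse event
p_seen = prob_at_least_one(rate, 3_000)    # hypothetical phase III exposure
n_needed = n_for_detection(rate)

print(f"chance of seeing even one event in 3,000 patients: {p_seen:.2f}")
print(f"patients needed for a 95% chance of at least one event: {n_needed}")
```

Under these assumptions the required exposure is roughly 3 divided by the event rate (the "rule of three"), far beyond a typical phase III trial, which is why post-marketing surveillance carries the burden of rare-event detection.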