Biostatistics & Population Health
Hypothesis testing: type I and type II error, power
— Type I error (α): rejecting H₀ when it is actually true → a false-positive conclusion ("the drug works" when it doesn't).
— Type II error (β): failing to reject H₀ when H₁ is actually true → a false-negative conclusion ("the drug doesn't work" when it actually does).
— A small study (n low) reports "no difference" → suspect type II error / underpowered.
— A study with many subgroup or interim analyses reports a "significant" finding → suspect type I error inflation (multiple comparisons).
— A p-value just under 0.05 with wide confidence intervals → fragile result, both errors plausible.
— Replication failure of a prior "positive" trial → original may have been a type I error or had publication bias.
Board pearl: "Absence of evidence is not evidence of absence." A non-significant p-value in a small trial usually means insufficient power, not proven equivalence — that requires a formal non-inferiority or equivalence trial with a prespecified margin.

— "A new antihypertensive was compared to lisinopril in 60 patients. Mean SBP reduction did not differ (p=0.21). The authors conclude the drugs are equivalent." → tests recognition of type II error / inappropriate equivalence claim.
— "Investigators tested 20 dietary variables for association with colon cancer; one (p=0.04) was significant." → multiple-comparisons type I inflation.
— "A trial with 90% power to detect a 10-mmHg difference found no effect." → adequately powered negative trial; a true 10-mmHg effect is unlikely, but a formal equivalence claim still requires a prespecified margin and a CI that lies inside it.
— "The p-value was 0.001 but the confidence interval crossed the minimally important difference." → statistical vs clinical significance distinction.
— Sample size (n) and event rate — drives power.
— Effect size the study was designed to detect (the "delta").
— α level chosen (usually 0.05, sometimes 0.01 with multiple testing).
— Number of comparisons / interim looks — each inflates type I error.
— Whether the result is superiority, non-inferiority, or equivalence in design — they have different null hypotheses.
— Two-sided test: H₁ is "different" (default; α split between tails).
— One-sided test: H₁ specifies direction; rarely accepted unless prespecified and justified. Examiners treat an unjustified one-sided test as a red flag for p-hacking.
Key distinction: A superiority trial has H₀ "no difference"; failing to reject it does not prove equivalence. A non-inferiority trial has H₀ "new is worse by more than margin Δ"; rejecting it supports non-inferiority. Confusing these two designs is the single most common Step 3 biostatistics trap.
Board pearl: When a stem says "the difference was not statistically significant," your first reflex should be to ask "what was the power?" — not to accept the null.

— Was an a priori power calculation reported? Look for "we calculated that 240 patients per arm would provide 80% power to detect a 15% relative risk reduction at α=0.05."
— Underpowered studies inflate type II error and produce inflated effect estimates if positive (winner's curse / type M error).
— Single primary endpoint with α=0.05 → standard.
— Multiple primary endpoints, subgroup analyses, or interim analyses without correction → family-wise type I error rises rapidly. With k independent tests at α=0.05, P(≥1 false positive) ≈ 1−(0.95)ᵏ; 10 tests → ~40% chance of a spurious "hit."
— Corrections: Bonferroni (α/k, conservative), Holm–Bonferroni, Benjamini–Hochberg FDR, O'Brien–Fleming for interim looks.
— A 95% CI that excludes the null (1.0 for ratios, 0 for differences) corresponds to p<0.05.
— A wide CI signals imprecision → low power even if point estimate looks impressive.
— A CI that includes both a clinically trivial and a clinically important effect means the trial cannot resolve the question.
— Composite endpoints can manufacture significance if a soft component drives the result.
— Surrogate endpoints (e.g., HbA1c instead of MI) may achieve statistical significance without clinical meaning.
Step 3 management: When asked to advise a colleague on whether to adopt a new therapy, "examine" the trial for power, α handling, CI width, and clinical (not just statistical) significance before recommending practice change.
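Worked sketch: the family-wise error arithmetic above can be checked in a few lines of Python. The function names are illustrative (not from any library), and the calculation assumes the k tests are independent, as the formula itself does.

```python
def family_wise_error(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-test threshold that keeps the family-wise error at or below alpha."""
    return alpha / k


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        fwer = family_wise_error(0.05, k)
        per_test = bonferroni_threshold(0.05, k)
        print(f"k={k:2d}  family-wise error={fwer:.3f}  Bonferroni per-test alpha={per_test:.4f}")
```

With 10 tests the family-wise error is about 0.40, matching the "~40%" figure quoted above.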

• α (significance level / type I error rate):
— Prespecified probability of rejecting H₀ when H₀ is true.
— Conventional: 0.05 (two-sided). Stricter in genomics (5×10⁻⁸), interim analyses, or multiple comparisons.
— The p-value is the probability, assuming H₀ is true, of observing data as extreme or more extreme than what was seen. p < α → reject H₀.
• β (type II error rate):
— Probability of failing to reject H₀ when H₁ is true.
— Conventional acceptable ceiling: 0.20 (i.e., power ≥ 0.80).
• Power (1 − β): probability of detecting a true effect of a specified magnitude.
• Four levers that determine power (memorize these; they appear repeatedly):
— Sample size (n): ↑n → ↑power (biggest modifiable lever).
— Effect size (Δ): ↑true difference → ↑power. Smaller effects need larger trials.
— α: loosening α (e.g., 0.05 → 0.10) → ↑power, but ↑type I error.
— Variance (σ²): ↓variability (better measurement, homogeneous population) → ↑power.
• Relationship to disease prevalence: unlike PPV/NPV, α and β do not depend on the underlying prevalence of the effect; they are conditional probabilities given H₀ or H₁.
• The 2×2 truth table (mirror of diagnostic 2×2):
| | H₀ true | H₀ false (H₁ true) |
| Reject H₀ | Type I (α) | Correct (1−β, power) |
| Fail to reject H₀ | Correct (1−α) | Type II (β) |
Board pearl: Power calculations are done before the study. Post-hoc power based on the observed effect is statistically meaningless and a recognized exam distractor. If a stem says "post-hoc power was 25%, so the trial was underpowered," that reasoning is invalid; instead, examine the confidence interval.
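A rough sketch of how the four levers move power, using a normal approximation for a two-group comparison of means. The function name `approx_power` and the example numbers (10-mmHg difference, SD 15 mmHg, 36 patients per arm) are assumptions chosen for illustration, and the negligible chance of rejecting in the wrong tail is ignored.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def approx_power(n_per_group: int, delta: float, sigma: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test for a true difference delta."""
    se = sigma * sqrt(2.0 / n_per_group)          # standard error of the difference in means
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)   # 1.96 for two-sided alpha = 0.05
    return _STD_NORMAL.cdf(delta / se - z_crit)


if __name__ == "__main__":
    print("baseline            ", round(approx_power(36, 10, 15), 2))
    print("double n            ", round(approx_power(72, 10, 15), 2))
    print("halve the true delta", round(approx_power(36, 5, 15), 2))
    print("loosen alpha to 0.10", round(approx_power(36, 10, 15, alpha=0.10), 2))
    print("reduce SD to 10     ", round(approx_power(36, 10, 10), 2))
```

Each line changes one lever at a time, so the direction of every arrow in the list above can be read off directly.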

— Two-sided α=0.05 splits 0.025 into each tail.
— One-sided α=0.05 puts all 0.05 in one tail → easier to reach significance, but only justified when an effect in the opposite direction is truly impossible or irrelevant. Default to two-sided on the exam.
— Bonferroni: use α/k per test. Simple, conservative, reduces power.
— Holm step-down and Hochberg step-up: more powerful than Bonferroni while controlling family-wise error.
— Benjamini–Hochberg FDR: controls expected proportion of false discoveries; preferred in high-dimensional data (genomics, imaging).
— Hierarchical testing: prespecify endpoint order; test secondary only if primary is significant — preserves α without correction.
— Each look at accumulating data is another chance for type I error.
— O'Brien–Fleming boundaries: very strict early, near-nominal at end.
— Pocock boundaries: equal threshold each look.
— DSMB may stop trials early for efficacy, futility, or harm — early efficacy stopping tends to overestimate effect size.
— H₀ is reframed as "the new treatment is worse by ≥ Δ" (the non-inferiority margin).
— Rejecting this H₀ supports non-inferiority. The entire CI for the treatment difference must lie on the favorable side of the margin (above or below it, depending on how the difference is defined).
— Margin must be clinically justified and prespecified, typically ≤50% of the active comparator's historical effect.
Key distinction: "Statistically significant" (p<α) ≠ "clinically significant." A trial of 50,000 patients can yield p<0.001 for a 1-mmHg BP drop that nobody should prescribe a drug for. Always compare the CI to the minimal clinically important difference (MCID).
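To make the multiplicity corrections listed above concrete, here is a minimal pure-Python sketch of Bonferroni, Holm step-down, and Benjamini–Hochberg applied to the same set of p-values. The five p-values are invented for illustration only; the function names are not from any library.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / k (controls family-wise error)."""
    k = len(pvals)
    return [p <= alpha / k for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/k, alpha/(k-1), ...; stop at first failure."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for rank, i in enumerate(order):          # rank = 0 for the smallest p-value
        if pvals[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: find the largest rank with p <= q * rank / k; reject everything up to it."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / k:
            max_rank = rank
    reject = [False] * k
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            reject[i] = True
    return reject


if __name__ == "__main__":
    p = [0.001, 0.012, 0.014, 0.03, 0.20]     # hypothetical p-values
    print("Bonferroni:", bonferroni(p))
    print("Holm:      ", holm(p))
    print("BH (FDR):  ", benjamini_hochberg(p))
```

On these made-up values Bonferroni rejects one hypothesis, Holm three, and Benjamini–Hochberg four, which illustrates the ordering from most to least conservative described above.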

— Lower α (0.01, 0.001) when a type I error is catastrophic: adopting a toxic therapy, regulatory drug approval with safety signal, genome-wide screens.
— Higher α (0.10) sometimes acceptable in pilot/exploratory studies where missing a signal (type II error) is costlier than chasing a false lead.
— Lower β (higher power, 0.90+) when missing a true effect is harmful: vaccine efficacy trials, oncology survival, rare disease therapeutics, non-inferiority trials (typically 90% power).
— Standard 0.20 (80% power) for most superiority trials.
— Screening test with high cost of missing disease → tolerate more false positives (analogous to higher α, lower β) → high sensitivity / high power.
— Confirmatory test before morbid intervention → tolerate more false negatives → high specificity / low α.
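— FDA typically requires two independent phase III trials, each at two-sided α=0.05. The chance that both are false positives is 0.05² = 0.0025 (about 0.025² ≈ 0.0006 if both must also favor the drug), which acts as a structural safeguard against type I error.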
— Post-marketing surveillance addresses residual type II errors for rare adverse events that phase III trials are underpowered to detect.
Step 3 management: When counseling a patient about a new therapy, recognize that rare serious adverse events are systematically underdetected in pivotal trials (type II error for safety) — incorporate post-marketing data and shared decision-making about uncertainty, especially in the first 1–2 years after approval.

n per group ≈ (2σ² × (Z_{α/2} + Z_β)²) / Δ²
— σ²: outcome variance.
— Δ: smallest clinically meaningful difference.
— Z_{α/2}: 1.96 for two-sided α=0.05.
— Z_β: 0.84 for 80% power; 1.28 for 90% power.
— Halving Δ quadruples required n (because Δ is squared).
— Doubling σ quadruples required n.
— Going from 80% → 90% power increases n by ~34%.
— Tightening α from 0.05 → 0.01 increases n by ~50%.
— For time-to-event outcomes, number of events, not number of patients, drives power. Trials may extend follow-up rather than enroll more patients.
— Low event rates → very large n required. This is why cardiovascular outcome trials of statins or SGLT2 inhibitors enroll 10,000–17,000 patients.
Board pearl: When a stem reports a "negative" trial, check three things in order: (1) was n based on a prespecified power calculation? (2) what effect size did they power for? (3) does the 95% CI exclude clinically important effects? Only if all three pass can you reasonably accept the null.
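The sample-size formula and its rules of thumb can be verified numerically. This is a minimal sketch of the formula as written above; the function name and the example inputs (Δ = 10 mmHg, σ = 15 mmHg) are assumptions for illustration, and real trials use software that also handles dropout, unequal allocation, and non-normal outcomes.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """n per group = 2 * sigma^2 * (Z_{alpha/2} + Z_beta)^2 / delta^2 (two-sided test)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)            # 0.84 for 80% power, 1.28 for 90%
    return 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


if __name__ == "__main__":
    base = n_per_group(delta=10, sigma=15)
    print("baseline n per group:", ceil(base))
    print("halve delta   -> x", round(n_per_group(5, 15) / base, 2))
    print("double sigma  -> x", round(n_per_group(10, 30) / base, 2))
    print("90% power     -> x", round(n_per_group(10, 15, power=0.90) / base, 2))
    print("alpha = 0.01  -> x", round(n_per_group(10, 15, alpha=0.01) / base, 2))
```

The ratios reproduce the bullets above: a fourfold increase for halving Δ or doubling σ, roughly one third more patients for 90% power, and roughly half again as many for α = 0.01.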

1. State H₀ and H₁ (prespecified, written in the protocol).
2. Choose α and the test statistic appropriate to the data:
— Continuous, normal, 2 groups: Student t-test (paired vs unpaired).
— Continuous, non-normal: Wilcoxon rank-sum / Mann–Whitney.
— Continuous, >2 groups: ANOVA; Kruskal–Wallis if non-normal.
— Categorical, 2×2: Chi-square; Fisher exact if expected cell <5.
— Time-to-event: log-rank test; Cox proportional hazards for adjusted.
— Paired binary: McNemar.
3. Compute test statistic and p-value.
4. Compare p to α; report effect size with 95% CI.
5. Interpret in clinical context.
— Switching the primary endpoint after seeing data (outcome switching).
— HARKing (Hypothesizing After Results are Known).
— p-hacking: trying multiple analyses until one yields p<0.05.
— Repeated significance testing during accrual without α adjustment.
— Inadequate sample size, high dropout, poor adherence (dilutes effect).
— Crossover between treatment arms in intention-to-treat analyses.
— Measurement error increasing σ.
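— ITT preserves randomization. It is conservative for superiority (biases toward the null → may increase type II error) but anticonservative for non-inferiority, where diluted differences make the arms look more alike; non-inferiority claims should therefore hold in both ITT and per-protocol (PP) analyses.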
CCS pearl: Choosing the wrong test for the data type is a recognized exam distractor — e.g., applying a t-test to ordinal pain scores or chi-square to a 2×2 with an expected cell of 2 (use Fisher exact).
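The chi-square versus Fisher exact rule depends on expected counts, not observed counts, which is easy to mix up. A small sketch, assuming a 2×2 table given as nested lists; the helper names and the two example tables are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def categorical_test(table, min_expected=5.0):
    """Suggest Fisher exact when any expected cell count falls below min_expected."""
    smallest = min(min(row) for row in expected_counts(table))
    return "Fisher exact" if smallest < min_expected else "chi-square"


if __name__ == "__main__":
    small = [[1, 9], [5, 5]]       # hypothetical small trial: smallest expected cell is 3
    large = [[30, 20], [20, 30]]   # hypothetical larger trial: every expected cell is 25
    print(expected_counts(small), "->", categorical_test(small))
    print(expected_counts(large), "->", categorical_test(large))
```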

— t-distribution accounts for small-n uncertainty; degrees of freedom matter.
— Fisher exact test preferred over chi-square when expected cell counts <5.
— Permutation/exact methods avoid asymptotic assumptions.
— Power is generally very low → high type II error risk. Pilot studies should be labeled hypothesis-generating, never confirmatory.
— Conventional RCTs may be infeasible.
— Use Bayesian designs with informative priors, adaptive designs, basket/umbrella trials, or external/historical controls.
— Regulatory agencies may accept higher α or smaller n for breakthrough/orphan designations, accepting greater type I risk in exchange for access.
— Log-transform highly skewed outcomes (e.g., LOS, biomarker levels) before t-tests.
— Or use nonparametric tests (Wilcoxon, Kruskal–Wallis); these compare ranks rather than means, have slightly lower power than parametric tests when normality holds, but are more robust when it doesn't.
— High variance ↑ required n. Stratified randomization by key prognostic variables reduces σ within strata and increases power.
— Covariate adjustment in analysis (ANCOVA, adjusted regression) also boosts power if covariates are prespecified.
— Patients within clinics, eyes within patients, repeated measures — must use mixed-effects models or GEE. Ignoring correlation shrinks p-values artificially → type I inflation.
Key distinction: A trial showing no benefit in a rare disease is far more likely to reflect type II error than a similarly negative trial in a common disease. Apply much more skepticism to "negative" rare-disease trials before concluding equivalence.

— Prespecified, not post-hoc.
— Small number of clinically motivated subgroups.
— Tested via formal interaction term (treatment × subgroup), not by comparing within-subgroup p-values.
— Consistent across related subgroups and biologically plausible.
— Effect direction matches the overall trial.
— Within-subgroup CIs almost always cross the null because subgroups are underpowered (type II error).
— An "apparent" benefit in one subgroup with non-significance in the overall trial is almost always a chance finding.
— Often excluded from pivotal trials → effect estimates extrapolated, with unrecognized type II error for these groups.
— Step 3 ethical thread: equitable inclusion vs vulnerable-population protections.
— FDA encourages reporting, but interpret cautiously — usually exploratory.
— Real biological heterogeneity exists (e.g., EGFR-mutant NSCLC and EGFR inhibitors).
— Distinguish from statistical noise; require replication and mechanism.
Board pearl: The classic teaching example: the ISIS-2 trial famously showed (tongue-in-cheek) that aspirin "didn't work" in patients born under Gemini or Libra. The lesson: any large enough trial sliced into enough subgroups will produce nonsensical "significant" findings — interpret subgroup analyses with extreme skepticism and demand prespecification.

— Adoption of ineffective or harmful therapy (e.g., historical examples: hormone replacement for CV prevention, antiarrhythmics post-MI — CAST trial).
— Wasted resources, exposure to side effects, opportunity cost of not pursuing better therapies.
— Erosion of trust in medical literature when results fail to replicate (replication crisis).
— Effective therapies abandoned or delayed.
— Patients denied beneficial treatment.
— Particularly damaging in rare diseases and safety signals (an underpowered trial may miss a real but uncommon adverse event).
— Type M error (magnitude): when underpowered studies do find significance, effect estimates are inflated ("winner's curse").
— Type S error (sign): in low-power settings, the significant effect can be in the wrong direction.
— Publication bias: positive trials published preferentially → meta-analyses overestimate effect → systemic type I error at the literature level.
— Outcome reporting bias: investigators report endpoints that "worked," suppress others.
— Guidelines based on a single positive trial that fails to replicate.
— Reversal of practice when larger trials emerge (e.g., tight glycemic control in ICU — NICE-SUGAR overturned earlier smaller trials).
— Prospective trial registration (ClinicalTrials.gov) before enrollment.
— Prespecified statistical analysis plans.
— Independent DSMB oversight.
— Replication and meta-analysis.
— Reproducibility checklists (CONSORT for RCTs, STROBE for observational).
Step 3 management: Before changing practice based on a new trial, ask: is it registered, prespecified, adequately powered, and ideally replicated? A single underpowered or unreplicated trial — positive or negative — should rarely change established care.
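The type M and type S errors described above can be seen in a quick simulation. This is a sketch under stated assumptions: a small true effect (2 units), SD 15, 20 patients per arm, known variance, and a normal sampling distribution for the difference in means. All of these numbers and the function name are invented for illustration.

```python
import random
from math import sqrt
from statistics import mean


def simulate_low_power(true_effect, sigma, n_per_group, n_trials=5000, seed=0):
    """Simulate many small trials and summarize only those reaching p < 0.05 (two-sided)."""
    rng = random.Random(seed)
    se = sigma * sqrt(2.0 / n_per_group)           # SE of the difference in means
    significant = []
    for _ in range(n_trials):
        estimate = rng.gauss(true_effect, se)      # one trial's observed difference
        if abs(estimate) / se > 1.96:              # "statistically significant"
            significant.append(estimate)
    power = len(significant) / n_trials
    exaggeration = mean(abs(e) for e in significant) / true_effect   # type M ratio
    wrong_sign = sum(e < 0 for e in significant) / len(significant)  # type S rate
    return power, exaggeration, wrong_sign


if __name__ == "__main__":
    power, exaggeration, wrong_sign = simulate_low_power(true_effect=2, sigma=15, n_per_group=20)
    print(f"empirical power:            {power:.2f}")
    print(f"mean exaggeration (type M): {exaggeration:.1f}x the true effect")
    print(f"wrong-sign share (type S):  {wrong_sign:.2f}")
```

Because only estimates larger than 1.96 standard errors reach significance, every "positive" result from such a trial necessarily overstates a small true effect, which is the winner's curse in miniature.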

— Low prior probability of truth (especially if novel mechanism); high risk of type I error.
— Action: await replication, do not change practice unless effect is large and consistent with mechanism.
— Reasonable basis for practice change in superiority contexts.
— Still prefer replication for high-stakes interventions.
— Escalate to systematic review and meta-analysis with assessment of heterogeneity (I² statistic).
— Check for publication bias (funnel plot, Egger test).
— GRADE framework rates certainty of evidence.
— DSMB may recommend early stopping for efficacy (rare; risks effect-size inflation), futility (low conditional power: continuing is unlikely to produce a significant result, though stopping itself risks a type II error), or harm.
— Step 3 stems may ask: should the trial continue? Apply O'Brien–Fleming-type thinking: very high bar to stop early.
— Even nonsignificant trends toward harm in a single trial warrant escalation if biologically plausible — type II error for safety is asymmetric in cost.
— Post-marketing pharmacovigilance and FAERS reporting catch rare events that phase III trials are underpowered to detect.
— Class I / Level A recommendations rest on multiple consistent high-power RCTs.
— Class IIb / Level C — expert consensus, much higher residual uncertainty.
CCS pearl: When a Step 3 stem describes a hospital P&T committee considering a new drug after one positive trial, the "right answer" is usually to wait for confirmatory data, restrict to a formulary subset, or require ongoing outcome monitoring — not immediate broad adoption.

— α is the prespecified threshold; p is the observed probability under H₀.
— A p of 0.04 does not mean 4% chance H₀ is true. It means: if H₀ were true, data this extreme or more would occur 4% of the time.
— β depends on the specific alternative (effect size) you power against. Power is always "power to detect Δ."
— Type III error: correctly rejecting H₀ but for the wrong reason or in the wrong direction, e.g., concluding drug A > drug B when the true effect is drug B > drug A (a sign error).
— Type S (sign): significant finding in the wrong direction.
— Type M (magnitude): significant finding with grossly inflated magnitude.
— Both worsen as power drops.
— False-positive diagnostic test ≠ type I error, though analogous. Type I/II refer to inferential decisions about hypotheses, not individual patient test results.
— A 95% CI does not mean "95% probability the parameter lies here" in frequentist statistics. It means the procedure captures the truth 95% of the time across repeated samples.
— You never accept H₀; you only fail to reject it. Acceptance requires an equivalence framework with a margin.
Key distinction: A non-significant result and a result demonstrating equivalence are statistically very different. Many published trials cross this line incorrectly — a Step 3 favorite trap.
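Why p = 0.04 is not "a 4% chance H₀ is true": the probability that a significant finding is real depends on the prior probability of the hypothesis and on power, not just on α. A minimal sketch using Bayes' rule; the function name and the prior values are assumptions for illustration.

```python
def prob_real_given_significant(prior: float, power: float = 0.80, alpha: float = 0.05) -> float:
    """P(effect is real | test is significant), by Bayes' rule."""
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance by chance
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for prior, power in [(0.50, 0.80), (0.10, 0.80), (0.10, 0.20)]:
        prob = prob_real_given_significant(prior, power)
        print(f"prior={prior:.2f}  power={power:.2f}  P(real | significant)={prob:.2f}")
```

With a 10% prior and 80% power, roughly one in three significant findings is still a false positive, and the share grows as power falls. This mirrors how PPV depends on prevalence in the diagnostic 2×2.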

— Non-random sampling or differential loss to follow-up distorts effect estimates.
— Increasing n does not fix it.
— Misclassification of exposure or outcome.
— Non-differential → biases toward null (false type II–like effect).
— Differential → can bias in either direction.
— A third variable associated with both exposure and outcome, not on the causal pathway.
— Addressed by randomization (in RCTs) or adjustment / matching / stratification / propensity scores (in observational studies).
— Residual confounding always possible in observational data — a major reason observational "positive" findings often fail in RCTs.
— Different effect sizes in subgroups — a real biological phenomenon, not a bias.
— Extreme baseline values tend toward the mean on retesting, mimicking treatment effect — addressed by control groups.
— A study can be perfectly powered (low β) and still produce wrong conclusions due to bias.
— Large n with bias produces precisely wrong estimates — narrow CIs around a biased point estimate.
Board pearl: Bigger studies don't fix bias. When a stem describes a massive observational cohort with a tiny p-value (e.g., 500,000 patients, p<0.0001) for an effect that has failed in RCTs (e.g., vitamin E, HRT for CV prevention), the correct answer addresses confounding, not power.
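A short numerical sketch of "precisely wrong": the expected 95% CI narrows as n grows, but it stays centered on the biased value. The setup is an idealized assumption (true effect 0, a fixed confounding bias of 2 units, SD 15), and the function name is hypothetical.

```python
from math import sqrt


def expected_ci(true_effect: float, bias: float, sigma: float, n_per_group: int, z: float = 1.96):
    """Expected 95% CI for a difference in means when the estimate carries a fixed bias."""
    center = true_effect + bias                           # bias does not shrink with n
    half_width = z * sigma * sqrt(2.0 / n_per_group)      # random error does shrink with n
    return center - half_width, center + half_width


if __name__ == "__main__":
    for n in (100, 10_000, 250_000):
        low, high = expected_ci(true_effect=0.0, bias=2.0, sigma=15.0, n_per_group=n)
        covers_truth = low <= 0.0 <= high
        print(f"n per group={n:>7}  CI=({low:.2f}, {high:.2f})  covers true effect: {covers_truth}")
```

At the smallest n the interval still happens to include the truth only because it is wide; at larger n it excludes the truth with growing confidence.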

— Prespecify one primary endpoint; place others in a hierarchical sequence.
— Adjust α for multiple comparisons, interim analyses, multiple arms.
— Register the trial and analysis plan before data lock.
— Blind investigators and outcome assessors.
— Replicate in an independent cohort before guideline change.
— Use two-sided tests by default.
— Perform an a priori power calculation based on realistic effect size and variance.
— Inflate enrollment for anticipated dropout and crossover.
— Use continuous outcomes rather than dichotomized versions when possible (preserves information).
— Reduce measurement error (standardized protocols, central adjudication).
— Use stratified randomization and prespecified covariate adjustment to reduce residual variance.
— Consider adaptive designs that adjust sample size based on interim variance estimates (with α control).
— Open data, code sharing, preregistration.
— CONSORT, SPIRIT, STROBE reporting checklists.
— Even a beautifully designed positive trial should be evaluated for external validity before applying to your population.
— Effectiveness (pragmatic) trials complement efficacy (explanatory) trials.
Step 3 management: When advising a junior colleague designing a QI project or trial, your "discharge plan" is: prespecify the primary endpoint, power adequately for the clinically meaningful Δ, plan for multiplicity, register, and prespecify the analysis. These few decisions prevent the majority of inferential errors.

— Watch for failed replications, retractions, errata.
— Track real-world effectiveness data (registries, EHR-based studies).
— Monitor adverse event reports — phase III trials are systematically underpowered for rare harms (type II error for safety).
— Cumulative meta-analyses can show when evidence first crossed the threshold for confidence — sometimes years before guidelines updated.
— Sequential meta-analysis methods (trial sequential analysis) control α across accumulating trials.
— When a once-positive practice is overturned (e.g., perioperative β-blockers after POISE, IV tPA in low-NIHSS strokes refined post hoc), recognize the original may have been a type I error or biased finding.
— De-adoption is harder than adoption — anchoring bias in clinical practice.
— Explain that "the studies show benefit on average" hides residual uncertainty.
— Shared decision-making incorporates effect size, CI width, and patient values.
— Read the primary trial for high-impact practice changes, not just the press release or abstract.
— Examine the CI, primary endpoint, prespecification, and population.
— Personal audit of cases where you applied trial evidence — outcome tracking is a form of n=1 hypothesis checking, susceptible to both type I (one good outcome → overconfidence) and type II (one bad outcome → premature abandonment) errors.
Board pearl: A statin trial with NNT=50 over 5 years to prevent one MI is "true" (rejected H₀ correctly) but the clinical conversation must include that 49/50 patients receive no individual benefit — separating population-level inference from individual decision-making is core Step 3 thinking.

— Approving and prescribing an ineffective or harmful drug exposes real patients to risk based on a statistical fluke.
— Ethical duty of non-maleficence demands rigorous α control, especially for therapies that displace established care.
— Failing to detect a real benefit denies patients effective therapy.
— Particular concern in underrepresented populations (women, elderly, minorities, pregnant patients) where subgroup power is low.
— Subjects must be told the trial may not detect benefit (type II) or may yield a false positive (type I) — uncertainty is intrinsic to the enterprise.
— In non-inferiority trials, patients must understand they may receive a treatment that is allowed to be modestly worse than standard care, within a prespecified margin.
— Genuine uncertainty about which arm is better is the ethical justification for randomization. If interim data eliminate equipoise, the DSMB has an obligation to consider stopping (early efficacy or harm).
— Suppressing negative trials (publication bias) is a research misconduct issue and contributes to systemic type I error in the literature.
— FDA Amendments Act mandates ClinicalTrials.gov reporting.
— When a new therapy is added based on a single trial, document the rationale and uncertainty in the discharge summary and communicate it to the primary care physician. Premature adoption that requires later de-prescribing is a recognized transition-of-care hazard, especially in polypharmacy elderly patients.
— Industry-sponsored trials show systematic effect-size inflation; disclosure and independent analysis mitigate.
Step 3 management: When a patient asks about a "breakthrough" drug from a press release, your ethical obligation includes explaining what kind of error the underlying trial is most vulnerable to — and revisiting the decision as more evidence accrues.

Board pearl: A single, well-known mnemonic — "α is the alarm that cries wolf (type I); β is the boy who misses the wolf (type II)." Power is the probability the boy actually spots the wolf when it's really there.

— "60 patients, no significant difference (p=0.18), authors conclude drugs are equivalent." → Answer: type II error / inadequate power; cannot claim equivalence from a superiority trial.
— "Investigators tested 25 SNPs; one was associated with disease (p=0.03)." → Answer: type I error inflation; Bonferroni-corrected threshold would be 0.002.
— "Overall trial negative, but benefit in patients aged 65–75 (p=0.04)." → Answer: post-hoc subgroup; likely chance; requires prespecification and replication.
— "After negative results, authors compute power was 30%." → Answer: post-hoc power is invalid; assess the confidence interval instead.
— "n=20,000; SBP reduction 1.2 mmHg, p<0.001." → Answer: statistically significant but not clinically meaningful.
— "ITT analysis shows non-inferiority, PP analysis does not." → Answer: non-inferiority requires both to confirm.
— "DSMB stopped trial early for efficacy; observed HR 0.45." → Answer: effect likely overestimated; await replication.
— "Large observational study, p<0.0001, contradicts RCT." → Answer: confounding / bias, not power. Bigger n does not fix systematic error.
— Two groups, ordinal pain scores, non-normal → Mann–Whitney, not t-test.
— 2×2 table, expected cell = 3 → Fisher exact, not chi-square.
— "Which would most increase the study's power?" → Largest gain typically from ↑n or ↑effect size targeted or ↓variance, not from loosening α (which would be flagged as inappropriate).
Key distinction: When a stem asks for the best response to a "negative" trial, distinguish among (a) accept the null, (b) declare equivalence, (c) suspect underpowering — the correct exam answer is almost always (c) suspect type II error and examine the CI / sample size justification, unless the trial was explicitly powered as a non-inferiority/equivalence study.

Hypothesis testing trades off two errors: type I (α, false positive — rejecting a true null) and type II (β, false negative — missing a real effect), with power (1 − β) driven primarily by sample size, effect size, variance, and α — and clinical decision-making demands awareness of which error is operative in any given study.
Board pearl: α is the alarm that cries wolf (false positive, type I); β is the boy who misses the wolf (false negative, type II); power is whether he actually sees it when it comes. Every clinical study is an attempt to balance these two errors against the cost of being wrong in either direction.

