Biostatistics & Population Health
Sample size and power calculation principles
— On Step 3, this shows up when you must interpret a "negative" trial, decide whether to adopt a new therapy, or counsel a patient about whether a study's findings apply.
— A study that is underpowered cannot distinguish "no effect" from "we couldn't see the effect" — the classic source of false-negative (Type II) conclusions.
— Sample size is small (n < 100 per arm for a binary outcome).
— The confidence interval around the effect estimate is wide and crosses the null (e.g., RR 0.85, 95% CI 0.55–1.32).
— The trial was stopped early for futility without a pre-specified power analysis.
— Subgroup analyses are emphasized over the primary endpoint.
— α (typically 0.05) — acceptable false-positive rate.
— Power (1 − β) — typically 0.80 or 0.90; probability of detecting a true effect.
— Effect size — the minimum clinically important difference (MCID) you want to detect.
— Variance/baseline event rate — from prior data or pilot studies.

— A new drug for heart failure is compared to placebo in 80 patients; mortality is 18% vs 22% (p = 0.42). The question asks what you conclude.
— Trap answer: "The drug doesn't work." Correct answer: "The study is underpowered; the CI is too wide to exclude benefit."
— A residency QI project wants to detect a 15% relative reduction in readmissions (from 20% to 17%, a 3-percentage-point absolute drop). You're asked which change increases required sample size.
— Smaller effect size, lower α, higher power, greater variance → all increase n.
— A generic anticoagulant is tested against warfarin with a non-inferiority margin of 1.5% absolute risk difference. Question: why is sample size often larger than a superiority trial?
— Because the margin is narrower than typical superiority effect sizes, requiring more precision.
— Primary endpoint (binary, continuous, time-to-event) — drives the formula.
— Pre-specified α and power — were they reported?
— Assumed event rate in control arm — was it realistic?
— Dropout/attrition assumption — enrollment is typically inflated by 10–20% to offset expected losses.
— One-sided vs two-sided test — one-sided is rarely justified and inflates apparent power.
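The underpowered heart-failure vignette above can be checked numerically. The sketch below is a rough normal-approximation power calculation, assuming the 80 patients were split 1:1 (40 per arm) and that the true mortality rates really are 22% and 18%. The function name `power_two_proportions` is illustrative, not taken from any library.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1: float, p2: float, n_per_arm: int,
                          alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two proportions.

    Uses the normal approximation with unpooled variance and ignores the
    negligible chance of rejecting the null in the wrong direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    return NormalDist().cdf(abs(p1 - p2) / standard_error - z_crit)


# Assumption: 80 patients randomized 1:1, true rates 22% (placebo) vs 18% (drug).
small_trial_power = power_two_proportions(0.22, 0.18, n_per_arm=40)
print(f"Approximate power with 40 per arm: {small_trial_power:.1%}")
```

Under these assumptions the power comes out well under 10%, so a non-significant p-value says almost nothing about whether the drug works.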

— Control event rate (20%) — baseline incidence; if overestimated, the trial is underpowered.
— ARR 5% — the MCID; smaller MCID demands larger n (inverse-square relationship).
— 80% power — 20% chance of missing a true effect.
— Two-sided α 0.05 — standard; a one-sided test at the same α would cut required n by only about 20% (at 80% power), not by half, and is suspect.
— Inflation for dropout — without it, effective power drops.
— Halving the detectable effect size quadruples required sample size.
— Doubling variance doubles required n.
— Inputs become mean difference and standard deviation, expressed as Cohen's d (effect size = Δ/SD).
— d = 0.2 (small), 0.5 (medium), 0.8 (large) — small effects need huge samples.
— Sample size is driven by number of events, not total enrollment.
— Longer follow-up or higher-risk population reduces required n.
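As a minimal sketch of how the binary-outcome inputs above combine, the snippet below applies the standard two-proportion approximation to a 20% control event rate and a 5% ARR at 80% power and two-sided α = 0.05. The 15% dropout figure and the function name are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_two_proportions(p_control: float, p_treat: float,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-arm sample size for a two-sided comparison of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p_control - p_treat) ** 2


n_arr_5 = ceil(n_per_arm_two_proportions(0.20, 0.15))     # ARR 5%
n_arr_2_5 = ceil(n_per_arm_two_proportions(0.20, 0.175))  # ARR halved to 2.5%

# Assumed 15% dropout (illustrative): inflate enrollment to keep effective power.
n_enrolled = ceil(n_arr_5 / (1 - 0.15))

print(n_arr_5, n_arr_2_5, n_enrolled)
```

With these inputs the requirement is roughly 900 patients per arm, and halving the ARR raises it a little more than fourfold. The excess over exactly 4× arises because the variance term also shifts when the treatment-arm rate changes.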

— Binary outcome (death yes/no, MI yes/no) → use two-proportion formula.
— Continuous outcome (BP, HbA1c, LDL) → use two-sample t-test formula.
— Time-to-event (survival, time to relapse) → use log-rank/Cox formula based on number of events.
— Paired data (before/after same patient) → use paired formula; smaller n because within-subject variance is removed.
— n per arm ≈ [Z_α/2 + Z_β]² × [p₁(1−p₁) + p₂(1−p₂)] / (p₁ − p₂)²
— Z_α/2 = 1.96 for α=0.05 two-sided; Z_β = 0.84 for 80% power, 1.28 for 90%.
— n per arm ≈ 2 × [(Z_α/2 + Z_β) × σ / Δ]²
— Δ = mean difference; σ = pooled SD.
— Total events required ≈ 4 × (Z_α/2 + Z_β)² / [ln(HR)]²
— Then back-calculate enrollment from expected event rate and follow-up duration.
— ↑ Power, ↑ n
— ↓ α (e.g., 0.05 → 0.01), ↑ n
— ↓ Effect size, ↑ n (steeply)
— ↑ Variance, ↑ n
— ↑ Dropout, ↑ n
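The continuous-outcome formula above is easy to tabulate against Cohen's d. This is a sketch using the z-approximation (a t-based calculation would add a patient or two per arm). The function name is illustrative, and σ is set to 1 so that Δ equals d.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_continuous(delta: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-arm n for a two-sample comparison of means: 2 * ((z_a + z_b) * sigma / delta)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2


# With sigma = 1, delta is Cohen's d directly.
n_by_effect_size = {d: ceil(n_per_arm_continuous(d, 1.0)) for d in (0.2, 0.5, 0.8)}
for d, n in n_by_effect_size.items():
    print(f"d = {d}: about {n} per arm")
```

The same function reproduces the directional rules in the list: lowering α, raising power, shrinking Δ, or increasing σ each returns a larger n.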

— Narrow CI excluding null → adequate sample, real effect.
— Narrow CI including null → adequate sample, true negative or trivial effect.
— Wide CI including null → underpowered; cannot distinguish no effect from meaningful effect.
— Wide CI excluding null → effect is real but precision is poor; replicate.
— Trial A: RR 0.80, 95% CI 0.72–0.89 → real, precise benefit.
— Trial B: RR 0.80, 95% CI 0.50–1.28 → same point estimate, but inconclusive because the CI spans from a 50% risk reduction to a 28% risk increase.
— Studies with wide horizontal lines crossing the null are individually underpowered.
— Pooling them via fixed- or random-effects meta-analysis restores power and narrows the summary CI.
— I² < 25% → low heterogeneity, fixed-effects acceptable.
— I² > 50% → substantial heterogeneity, use random-effects, which inflates the CI and effectively lowers pooled power.
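To illustrate how pooling narrows the summary CI, the sketch below runs a fixed-effect inverse-variance meta-analysis on the log relative risk scale. It recovers each standard error from the reported 95% CI and pools six hypothetical replicates of Trial B. The replicates are invented purely for illustration, and the function names are not from any library.

```python
from math import exp, log, sqrt

Z95 = 1.96  # normal quantile for a 95% CI


def se_from_ci(lower: float, upper: float) -> float:
    """Standard error of log(RR), recovered from a reported 95% CI."""
    return (log(upper) - log(lower)) / (2 * Z95)


def pool_fixed_effect(trials):
    """Inverse-variance fixed-effect pooling of (rr, ci_lower, ci_upper) tuples."""
    weights = [1 / se_from_ci(lo, hi) ** 2 for _, lo, hi in trials]
    pooled_log_rr = sum(w * log(rr) for w, (rr, _, _) in zip(weights, trials)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return (exp(pooled_log_rr),
            exp(pooled_log_rr - Z95 * pooled_se),
            exp(pooled_log_rr + Z95 * pooled_se))


trial_b = (0.80, 0.50, 1.28)
# Hypothetical: six independent trials that each look exactly like Trial B.
pooled_rr, pooled_low, pooled_high = pool_fixed_effect([trial_b] * 6)
print(f"Pooled RR {pooled_rr:.2f}, 95% CI {pooled_low:.2f}-{pooled_high:.2f}")
```

Under this idealized, zero-heterogeneity scenario the pooled interval is roughly 0.66 to 0.97, which excludes the null even though every individual trial crossed it. A random-effects model would give a wider interval whenever the studies disagree.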

— Two independent arms; baseline workhorse.
— Sample size driven by between-subject variance.
— Each patient serves as their own control.
— Removes between-subject variance → can need 50–75% fewer patients for the same power.
— Limited to stable, chronic, reversible conditions (HTN, migraine) — not acute MI or mortality endpoints.
— Required when intervention is delivered at group level (handwashing programs).
— Design effect = 1 + (m − 1) × ICC inflates required n; intraclass correlation (ICC) of 0.05 with cluster size 20 inflates n by ~2×.
— Tests two interventions simultaneously; efficient if no interaction.
— Powered for main effects, usually not for interaction (which needs ~4× more patients).
— Pre-planned interim analyses with stopping rules.
— Allow early stopping for efficacy, futility, or harm while preserving α via alpha-spending functions (O'Brien–Fleming, Pocock).
— Restrict enrollment to high-event-rate subgroups (e.g., troponin-positive ACS) to reduce required n, at the cost of generalizability to lower-risk patients.
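The design-effect arithmetic for cluster-randomized trials can be sketched as follows. The individually randomized requirement of 900 per arm is a hypothetical input chosen for illustration, and the function names are likewise illustrative.

```python
from math import ceil


def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def cluster_trial_n(n_individual: int, cluster_size: int, icc: float) -> int:
    """Per-arm n after inflating an individually randomized n by the design effect."""
    inflated = n_individual * design_effect(cluster_size, icc)
    return ceil(round(inflated, 6))  # round first to avoid floating-point overshoot


deff = design_effect(cluster_size=20, icc=0.05)
n_cluster = cluster_trial_n(900, cluster_size=20, icc=0.05)  # 900 is hypothetical
clusters_per_arm = ceil(n_cluster / 20)
print(f"Design effect {deff:.2f}; {n_cluster} patients in {clusters_per_arm} clusters per arm")
```

Even a small ICC nearly doubles the requirement once clusters hold 20 patients, which is why the number of clusters usually matters more than cluster size.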

— Standard 0.05 two-sided.
— Tightened to 0.01 or 0.001 when multiple comparisons or pivotal regulatory trials → n rises sharply.
— Bonferroni correction: divide α by number of comparisons; for 5 endpoints, α = 0.01 per test.
— 80% is conventional; 90% for pivotal or safety-critical trials.
— Going from 80% → 90% power requires about one-third (~34%) more patients.
— The most powerful lever — inverse-square relationship.
— Inflating the assumed effect size to make the study feasible is the #1 reason trials end up underpowered ("optimism bias").
— Reduce via inclusion criteria (narrower population), standardized measurement, baseline adjustment (ANCOVA), or paired/crossover designs.
— 1:1 randomization is most efficient.
— 2:1 (more patients to active arm for safety data) requires ~12% more total patients for the same power.
— Standard 10–20% inflation factor: n_enrolled = n_required / (1 − dropout rate).
— Each interim look "spends" α; alpha-spending functions preserve overall Type I error but slightly increase final-analysis n.
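Several of the levers above reduce to one-line multipliers. The sketch below uses the same normal-approximation logic as the standard sample size formulas. The helper names are illustrative, and the 1,000-patient starting requirement is a hypothetical input.

```python
from math import ceil
from statistics import NormalDist


def z(p: float) -> float:
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)


def relative_n(alpha: float, power: float,
               ref_alpha: float = 0.05, ref_power: float = 0.80) -> float:
    """Sample size relative to a reference design (two-sided alpha, same effect size)."""
    return ((z(1 - alpha / 2) + z(power)) / (z(1 - ref_alpha / 2) + z(ref_power))) ** 2


def allocation_inflation(ratio: float) -> float:
    """Total-n multiplier for k:1 allocation versus 1:1 at equal power."""
    return (1 + ratio) ** 2 / (4 * ratio)


def inflate_for_dropout(n_required: int, dropout: float) -> int:
    """Enrollment needed so that n_required patients remain after dropout."""
    return ceil(n_required / (1 - dropout))


bonferroni_alpha = 0.05 / 5                  # 5 endpoints
power_90_vs_80 = relative_n(0.05, 0.90)      # raise power from 80% to 90%
alpha_01_vs_05 = relative_n(0.01, 0.80)      # tighten alpha from 0.05 to 0.01
two_to_one = allocation_inflation(2)         # 2:1 randomization
enrolled = inflate_for_dropout(1000, 0.15)   # hypothetical n = 1,000, 15% dropout

print(bonferroni_alpha, round(power_90_vs_80, 2), round(alpha_01_vs_05, 2), two_to_one, enrolled)
```

The multipliers compound: a design that tightens α, raises power, and randomizes 2:1 pays each penalty on top of the others.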

— Composite of CV death or HF hospitalization at 2 years (binary, time-to-event).
— From prior trials, ~15% per year → ~28% cumulative at 2 years.
— Clinically meaningful HR of 0.80 (20% relative reduction) — modest but practice-changing.
— Two-sided α = 0.05, power = 0.90 (pivotal trial).
— Events needed ≈ 4 × (1.96 + 1.28)² / [ln(0.80)]² ≈ 4 × 10.5 / 0.0498 ≈ 844 events.
— Expected event rate ~28% over 2 years across both arms → 844 / 0.28 ≈ 3,000 patients in total (~1,500 per arm), followed for 2 years.
— 3,000 / 0.85 ≈ 3,530 total (~1,765 per arm) after allowing for 15% dropout.
— Two interim looks using O'Brien–Fleming boundaries; minimal α penalty.
— If MCID loosens to HR 0.70 → events needed drops to ~330; trial becomes much smaller but only detects large effects.
— If event rate is overestimated and the real rate is 18% over 2 years → trial under-enrolls events and is underpowered.
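The worked example can be reproduced with a few lines, shown here as a sketch rather than a validated trial-design tool. It assumes 1:1 allocation and treats the ~28% figure as the pooled event proportion across both arms, so dividing events by that proportion yields total enrollment. Function names are illustrative.

```python
from math import ceil, log
from statistics import NormalDist


def events_required(hazard_ratio: float, alpha: float = 0.05,
                    power: float = 0.90) -> float:
    """Total events for a 1:1 log-rank comparison: 4 * (z_a + z_b)^2 / ln(HR)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2


def total_enrollment(events: float, event_proportion: float,
                     dropout: float = 0.0) -> int:
    """Total patients (both arms) needed to observe the required events."""
    return ceil(events / event_proportion / (1 - dropout))


events_hr_080 = events_required(0.80)
events_hr_070 = events_required(0.70)

n_total_planned = total_enrollment(events_hr_080, 0.28, dropout=0.15)
# Sensitivity check: the real 2-year event proportion turns out to be 18%.
n_total_if_rate_low = total_enrollment(events_hr_080, 0.18, dropout=0.15)

print(round(events_hr_080), round(events_hr_070), n_total_planned, n_total_if_rate_low)
```

Unrounded z-values give results a patient or two off from the hand calculation. The sensitivity line shows how much enrollment would have to grow to collect the same number of events if the event rate were overestimated.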

— Sample size is calculated for the overall primary endpoint.
— Splitting into subgroups (sex, age, diabetes status) divides n by 2 or more per analysis.
— Power for each subgroup drops dramatically; CIs widen.
— False negatives in subgroups that truly benefit ("the drug didn't work in women" when it really did — sample was just too small).
— False positives from multiple subgroup testing without α correction.
— Asks whether the treatment effect differs significantly between subgroups, rather than testing each subgroup separately.
— Even interaction tests are typically underpowered unless pre-specified with adequate n.
— Often excluded from pivotal trials → effect estimates have wide CIs or are extrapolated.
— Sample size in these subgroups is rarely sufficient to detect harm signals.
— Real-world data and registries fill this gap but introduce confounding.

— Often use extrapolation from adult efficacy data, with pediatric studies powered only for PK/PD and safety, not efficacy.
— Bayesian borrowing from adult data can reduce required pediatric n while preserving inferential validity.
— Pregnant patients are historically excluded → most therapies have inadequate safety power in pregnancy.
— Registries (e.g., antiepileptic drug pregnancy registries) accumulate events over years to detect teratogenicity signals.
— Conventional power calculations would require global enrollment over decades.
— N-of-1 trials, single-arm trials with historical controls, and adaptive Bayesian designs are accepted by FDA for orphan drugs.
— Wider α (e.g., 0.10) and lower power (e.g., 70%) may be pre-specified and justified.
— Trials use surrogates (LDL, viral load, tumor response) instead of clinical endpoints to reduce required n and follow-up.
— Risk: surrogates may not track clinical benefit (e.g., CAST trial — antiarrhythmics suppressed PVCs but increased mortality).

— False negatives delay adoption of effective therapies by years.
— Patients are exposed to research risk without the study being able to answer its question — an ethical violation of the principle of scientific validity (Belmont Report, Declaration of Helsinki).
— IRBs increasingly reject inadequately powered protocols as ethically deficient.
— Small positive trials get published; small negative trials don't.
— Meta-analyses that include only published studies overestimate effects; funnel plots help detect the asymmetry, and trim-and-fill methods attempt a partial correction.
— Detect statistically significant but clinically trivial effects (e.g., a 0.1 mmHg difference in BP).
— Waste resources and expose more patients than necessary.
— Encourage misinterpretation: clinicians read "p < 0.001" as "huge effect" when it may be tiny.
— Each additional comparison inflates the family-wise error rate.
— With 20 independent tests at α = 0.05, expected false positives = 1.
— Corrections: Bonferroni (conservative), Holm, Benjamini–Hochberg (FDR) for genomics-scale testing.
— Trials stopped early for benefit tend to overestimate effect size ("regression to the truth").
— Independent Data Safety Monitoring Boards (DSMBs) balance early stopping vs adequate precision.
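The multiplicity arithmetic above can be made concrete. The sketch below computes the family-wise error rate for independent tests and then applies Bonferroni, Holm, and Benjamini–Hochberg to a small set of p-values. The p-values are hypothetical and chosen only to show how the three procedures differ. The function names are illustrative.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha / (m - rank); stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject


def benjamini_hochberg_reject(p_values, q=0.05):
    """BH step-up: reject the k smallest p-values, k = largest rank with p <= rank/m * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


fwer_20 = family_wise_error_rate(20)    # chance of >= 1 false positive in 20 tests
expected_false_positives = 20 * 0.05    # expected count under the global null

hypothetical_p = [0.001, 0.012, 0.02, 0.035, 0.20]  # illustrative only
n_bonferroni = sum(bonferroni_reject(hypothetical_p))
n_holm = sum(holm_reject(hypothetical_p))
n_bh = sum(benjamini_hochberg_reject(hypothetical_p))
print(f"FWER for 20 tests: {fwer_20:.0%}; rejections: {n_bonferroni}, {n_holm}, {n_bh}")
```

On this example the procedures reject progressively more hypotheses from Bonferroni to Holm to Benjamini–Hochberg. That ordering reflects the trade-off between strict family-wise control and the looser false-discovery-rate control.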

— No a priori sample size calculation reported.
— Effect size assumption is implausibly large ("we expected a 40% relative reduction").
— One-sided α without strong justification.
— Primary endpoint changed mid-trial without pre-specification.
— Composite endpoint driven entirely by the softest component (e.g., revascularization rather than death).
— Per-protocol as the primary analysis instead of intention-to-treat.
— Heavy reliance on post hoc subgroup or sensitivity analyses.
— Designing QI projects where you need to power a readmission or infection-rate intervention.
— Interpreting a non-inferiority margin before changing prescribing practice.
— Evaluating single-arm oncology trials for off-label use decisions.
— GRADE framework for quality of evidence (downgrades for imprecision = wide CIs from underpowering).
— Cochrane reviews and TheNNT.com for pooled, transparent effect estimates.
— ClinicalTrials.gov to verify pre-specified endpoints vs published endpoints (catches outcome switching).

— Power = probability a study detects a true treatment effect (study-level).
— Sensitivity = probability a diagnostic test detects a true case (patient-level).
— Both share the "true positive rate" math but apply to different domains.
— α = pre-specified threshold for declaring significance.
— p-value = probability, assuming the null is true, of observing data at least as extreme as those actually seen.
— A p < α leads to rejecting the null; p-value is not the probability the null is true.
— CI = range likely to contain the true population parameter.
— Prediction interval = range likely to contain a future individual observation (wider).
— Large n can make trivial effects statistically significant.
— Anchor on MCID and CI, not p < 0.05.
— Superiority — new > standard.
— Non-inferiority — new is not unacceptably worse (one-sided margin).
— Equivalence — new is within a margin in either direction (two-sided).
— Fixed assumes one true underlying effect (low heterogeneity).
— Random assumes effects vary across studies (higher heterogeneity); wider CI.
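The gap between statistical and clinical significance is easy to demonstrate. In this sketch a hypothetical, clinically trivial 0.5 mmHg difference in systolic BP (SD assumed to be 15 mmHg) is tested with a two-sample z-test at two very different sample sizes. All numbers are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p-value for a difference in means, equal SD and equal n per arm."""
    standard_error = sd * sqrt(2 / n_per_arm)
    z_stat = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


# Same trivial effect, two sample sizes (hypothetical values).
p_small_trial = two_sample_z_p_value(0.5, 15, n_per_arm=100)
p_mega_trial = two_sample_z_p_value(0.5, 15, n_per_arm=100_000)
print(f"n=100 per arm: p={p_small_trial:.2f}; n=100,000 per arm: p={p_mega_trial:.1e}")
```

The effect is identical in both runs; only n changes. That is why the MCID and the CI, rather than the p-value, should anchor the interpretation.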

— Reduced by increasing sample size.
— Reflected in wide CIs and large standard errors.
— Mitigated by adequate power.
— Not reduced by sample size — a biased estimate gets more precisely wrong with more data.
— Sources:
— Selection bias (non-random enrollment, healthy worker effect).
— Information/measurement bias (recall bias, observer bias).
— Confounding (unmeasured variables distort exposure–outcome relationships).
— Attrition bias (differential dropout between arms).
— Randomization — balances measured and unmeasured confounders on average, including confounding by indication.
— Blinding (single, double, triple) — addresses observer and patient expectation bias.
— Allocation concealment — addresses selection bias at enrollment.
— Intention-to-treat analysis — addresses attrition bias.
— Standardized outcome assessment — addresses measurement bias.
— A small RCT with rigorous methodology may yield a trustworthy but imprecise estimate.
— A huge observational study may yield a precise but biased estimate.
— Mendelian randomization, instrumental variables, propensity scoring attempt to reduce bias in observational data but cannot fully replicate randomization.

— Relative risk reduction (RRR) is stable across baseline risks but exaggerates perceived benefit in low-risk patients.
— Absolute risk reduction (ARR) is patient-specific and clinically actionable.
— NNT = 1 / ARR — the most intuitive metric for shared decision-making.
— Statin trial RRR for MI = 25%.
— High-risk patient (10-year MI risk 20%): ARR = 5%, NNT = 20.
— Low-risk patient (10-year MI risk 2%): ARR = 0.5%, NNT = 200.
— Same trial, very different clinical justification.
— Was the trial population similar to your patient?
— Were elderly, women, minorities, comorbidities adequately represented?
— Underpowered subgroups → wide CIs → uncertain effect in your patient.
— Number needed to harm (NNH = 1 / absolute risk increase) is the mirror calculation for adverse events.
— Benefit:harm ratio drives the prescribing decision.
— Most RCTs follow patients 2–5 years; chronic therapies (statins, anticoagulants) are continued for decades.
— Long-term efficacy and harm are often extrapolated.
— Discontinuation rates in real-world practice exceed trial rates → diluted benefit.
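The statin arithmetic above generalizes to any baseline risk. The sketch below assumes the RRR is constant across risk strata, as the text states. The function names are illustrative, and NNT is rounded up by convention.

```python
from math import ceil


def absolute_risk_reduction(baseline_risk: float, rrr: float) -> float:
    """ARR = baseline risk * relative risk reduction (assumes constant RRR)."""
    return baseline_risk * rrr


def number_needed_to_treat(arr: float) -> int:
    """NNT = 1 / ARR, rounded up; round first to absorb floating-point noise."""
    return ceil(round(1 / arr, 9))


RRR_STATIN = 0.25  # from the example in the text

for label, baseline in (("High-risk", 0.20), ("Low-risk", 0.02)):
    arr = absolute_risk_reduction(baseline, RRR_STATIN)
    print(f"{label}: ARR = {arr:.1%}, NNT = {number_needed_to_treat(arr)}")

nnt_high = number_needed_to_treat(absolute_risk_reduction(0.20, RRR_STATIN))
nnt_low = number_needed_to_treat(absolute_risk_reduction(0.02, RRR_STATIN))
```

The same function applied to an absolute risk increase gives the NNH, so benefit and harm can be compared on one scale for the individual patient.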

— Continuously updated meta-analyses (e.g., Cochrane living reviews for COVID-19 therapies).
— Sample size grows as new trials are added; pooled CIs narrow.
— Sequential meta-analysis with trial sequential analysis (TSA) identifies when the cumulative evidence is sufficient — the meta-analytic analog of power.
— High heterogeneity (I² > 50%) → random-effects with wide CIs.
— Few studies, few events → imprecision persists.
— GRADE rating downgrades evidence quality for imprecision (sparse data, wide CIs).
— Complement RCT evidence, especially for rare adverse events that require huge sample sizes.
— Pharmacovigilance signals (e.g., FAERS) prompt post-marketing studies.
— In your own practice, power your QI projects — small pilot data may show "improvement" that is just noise.
— Pre-define MCID, baseline rate, and sample size; account for temporal trends and regression to the mean.
— When new trial data emerge, re-evaluate chronic therapies. Example: SGLT2 inhibitors expanded indications as cumulative trial evidence grew.
— Document shared decision-making each time evidence shifts.

— IRBs increasingly require a priori power calculations as part of protocol approval.
— Inadequate power = exposing subjects to risk without scientific yield → ethically deficient.
— Subjects must be told the realistic probability the study will be informative.
— In single-arm or pilot trials, this includes honest disclosure that results may be hypothesis-generating only.
— Trials are stopped early for harm, overwhelming benefit, or futility — each carries an ethical obligation.
— Stopping too early for benefit can produce inflated effect estimates that mislead practice (e.g., several early-stopped oncology trials later showed smaller true effects).
— Clinical trials must be registered in a public registry (ClinicalTrials.gov or another ICMJE-accepted registry) at or before first patient enrollment (ICMJE requirement for publication).
— Outcome switching (changing primary endpoint after seeing data) is detectable via registry comparison and is scientific misconduct.
— A patient discharged on a therapy adopted from an underpowered trial may face uncertain harm at outpatient follow-up.
— Document shared decision-making and explicit uncertainty in the discharge summary and outpatient note.
— Underrepresentation of women, minorities, elderly, and pregnant patients in trials produces systematically underpowered evidence for these groups → perpetuates disparities.
— Inclusion is both a scientific and ethical imperative.


— "A 60-patient RCT of drug X vs placebo for migraine prevention found a 15% reduction in monthly headache days (p = 0.18). What is the most appropriate conclusion?"
— Best answer: The study is underpowered to detect a clinically meaningful effect; the CI likely includes both meaningful benefit and harm. Not "drug X is ineffective."
— "Investigators want to detect a 5% absolute reduction instead of 10%. Holding other parameters constant, the required sample size will:"
— Best answer: Quadruple (inverse-square of effect size).
— "In a subgroup analysis, women had no benefit (p = 0.22). What should you conclude?"
— Best answer: Hypothesis-generating only; subgroup is likely underpowered, especially without a significant interaction test. Apply overall trial findings.
— "A trial declared non-inferiority of drug B vs drug A using a margin of 10% absolute risk difference. What is the main concern?"
— Best answer: The margin is clinically too wide; statistically declared non-inferiority does not mean clinically equivalent.
— "A trial was stopped early at the second interim analysis after showing a 40% RRR. The full planned trial was 5,000 patients; stopping occurred at 1,800. What is the concern?"
— Best answer: Effect size is likely overestimated ("regression to the truth"); replication or longer follow-up needed.
— "After a negative trial, investigators report post hoc power was 25%. What does this tell you?"
— Best answer: Post hoc power is not informative; examine the confidence interval to assess whether clinically meaningful effects were excluded.

— Four inputs, one purpose: α (typically 0.05), power (typically 0.80), effect size (MCID), variance — together they determine n; halving effect size quadruples sample size.
— CI over p-value: A non-significant result with a wide CI crossing the null is inconclusive, not negative; meta-analyses and replication restore power lost in small individual trials.
— Subgroups, post hoc analyses, and post hoc power are traps: Trust pre-specified overall primary endpoints; treat subgroup and post hoc findings as hypothesis-generating only.
— Sample size does not fix bias: Randomization, blinding, allocation concealment, and intention-to-treat analysis address systematic error; only n addresses random error. A precise but biased estimate is still wrong.
— Translate to your patient: RRR is constant across risk strata, but ARR and NNT depend on baseline risk — frame shared decisions in absolute terms specific to the individual.

