Biostatistics & Population Health
P-values: interpretation and limitations
• Definition: The p-value is the probability of observing data as extreme or more extreme than what was observed, assuming the null hypothesis (H₀) is true.
— It is a conditional probability: P(data | H₀), not P(H₀ | data)
— It does not tell you the probability the null hypothesis is true or false
— It does not measure effect size or clinical importance
• When p-value misinterpretation drives wrong answers on Step 3:
— A "statistically significant" result (p<0.05) is assumed to be clinically meaningful; it may not be
— A "non-significant" result (p>0.05) is interpreted as "no effect"; absence of evidence ≠ evidence of absence
— Multiple comparisons inflate false-positive rates without correction
— A small p-value from a huge sample may reflect a trivial effect size
— A large p-value from an underpowered study may mask a real effect
• Conceptual anchors:
— α (alpha): pre-specified threshold for type I error, conventionally 0.05
— Type I error: rejecting a true H₀ (false positive)
— Type II error (β): failing to reject a false H₀ (false negative)
— Power = 1 − β: probability of detecting a true effect; conventionally ≥0.80
• Clinical scenario triggers on the exam:
— A trial reports p=0.04 for a 0.2 mmHg BP reduction in 50,000 patients → significant but clinically irrelevant
— A small RCT shows 30% mortality reduction with p=0.18 → may be a true effect, underpowered
— Subgroup analyses show one "significant" finding among 20 tested → likely chance
Board pearl: The p-value answers "how surprising is this data if the null is true?" It never answers "is the treatment worth using?" That second question requires effect size, confidence intervals, NNT, and clinical context. Step 3 stems frequently pair a small p-value with a clinically trivial effect to test whether you conflate statistical with clinical significance.
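The interplay between sample size and p can be made concrete with a minimal sketch. It assumes a two-sample z-test (normal approximation), equal arm sizes, and a hypothetical SD of 11 mmHg in both arms; the function name and all inputs are illustrative rather than taken from any particular trial.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p(diff, sd, n_per_arm):
    """Two-sided p-value for a difference in means (normal approximation, equal SD and n per arm)."""
    se = sd * sqrt(2 / n_per_arm)          # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical inputs: the same 0.2 mmHg difference, SD of 11 mmHg, two trial sizes.
p_huge = two_sample_z_p(diff=0.2, sd=11, n_per_arm=25_000)   # 50,000 patients in total
p_small = two_sample_z_p(diff=0.2, sd=11, n_per_arm=250)     # 500 patients in total
print(f"n=50,000: p={p_huge:.3f}; n=500: p={p_small:.3f}")
```

Under these assumptions the identical 0.2 mmHg effect comes out at roughly p=0.04 in the large trial and roughly p=0.84 in the small one, which is why p alone cannot be read as a measure of effect size.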

— Journal club scenario: resident summarizes a new RCT; attending asks what the p-value means
— Drug rep scenario: pharmaceutical representative claims "statistically significant benefit" — you must interpret critically
— Quality improvement: hospital reports a "significant" reduction in readmissions after intervention
— Screening or diagnostic test studies reporting p-values for sensitivity/specificity comparisons
— Study design: RCT, cohort, case-control, cross-sectional — affects which test and which p-value matters
— Sample size (n): huge n inflates statistical significance; tiny n underpowers detection
— Pre-specified vs post-hoc analyses: post-hoc and subgroup p-values are hypothesis-generating, not confirmatory
— Number of comparisons: multiple testing without correction multiplies false positives
— Primary vs secondary endpoint: only the pre-specified primary endpoint carries the trial's stated α
— Effect size and confidence interval: always evaluate alongside p
— "Trend toward significance" (p=0.06–0.10) — not statistically significant by convention
— "Significant in a subgroup analysis" — likely chance unless pre-specified and adjusted
— "p<0.05 in at least one of 20 outcomes" — expect ~1 false positive by chance alone
— "Borderline significant" — not a real statistical category
— Was α pre-specified?
— Was the study powered for this outcome?
— Is the confidence interval narrow or wide?
— Does the magnitude of effect matter clinically?
Key distinction: A primary endpoint p-value tests the trial's main hypothesis at the stated α; secondary endpoint p-values are exploratory unless a hierarchical testing strategy was pre-specified. Step 3 will reward candidates who refuse to treat a "significant" secondary or subgroup finding as practice-changing without replication.

— The point estimate: the observed effect (relative risk, odds ratio, mean difference, hazard ratio)
— The confidence interval (CI): range of plausible values; 95% CI is the standard companion to p<0.05
— The test used: t-test, chi-square, Fisher exact, log-rank, ANOVA, regression coefficient
— One- vs two-tailed test: two-tailed is standard; one-tailed halves the p-value and is rarely justified
— If the 95% CI for a difference excludes 0, then p < 0.05
— If the 95% CI for a ratio (RR, OR, HR) excludes 1, then p < 0.05
— If the CI crosses the null value (0 or 1), p ≥ 0.05
— A narrow CI = precise estimate (often large n); wide CI = imprecise (often small n or rare event)
— Effect size large + narrow CI + small p = robust, clinically meaningful
— Effect size large + wide CI + small p = real but imprecise (need replication)
— Effect size tiny + narrow CI + tiny p = statistically significant but clinically trivial (massive n)
— Effect size large + wide CI + p=0.08 = possibly real but underpowered — do not dismiss
— RR 1.02, 95% CI 1.01–1.03, p<0.001 → significant but trivial; n is huge
— RR 0.55, 95% CI 0.28–1.08, p=0.07 → suggestive but CI crosses 1
— HR 0.70, 95% CI 0.55–0.89, p=0.003 → meaningful and significant
Board pearl: Always read the confidence interval before the p-value. The CI conveys magnitude, direction, and precision in one glance; the p-value alone discards all three. Step 3 stems frequently provide both — choose the answer that integrates CI width and clinical relevance, not the one that simply parrots "p<0.05 means it works."
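The CI-to-p relationship for ratios can be sketched as follows. This is an approximation that assumes the interval was built symmetrically on the log scale (as is usual for RR, OR, and HR); the helper name is hypothetical, and because published estimates are rounded, back-calculated p-values will only approximately match reported ones.

```python
from math import log
from statistics import NormalDist

def p_from_ratio_ci(estimate, lower, upper, level=0.95):
    """Approximate two-sided p-value for a ratio (RR, OR, HR) from its CI, on the log scale."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% CI
    se = (log(upper) - log(lower)) / (2 * z_crit)    # SE of log(ratio)
    z = log(estimate) / se                           # distance from the null value (ratio = 1)
    return 2 * (1 - NormalDist().cdf(abs(z)))

examples = [("RR", 1.02, 1.01, 1.03), ("RR", 0.55, 0.28, 1.08), ("HR", 0.70, 0.55, 0.89)]
for label, est, lo, hi in examples:
    p = p_from_ratio_ci(est, lo, hi)
    print(f"{label} {est} ({lo} to {hi}): p ~ {p:.4f}, CI excludes 1: {not (lo <= 1 <= hi)}")
```

In each of the three examples the verdict from the CI (excludes 1 or not) agrees with the verdict from the p-value (below 0.05 or not), while the CI additionally shows magnitude and precision.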

— 1. State H₀ and H₁: H₀ typically = no difference / no association; H₁ = the alternative
— 2. Set α: conventionally 0.05 (two-tailed)
— 3. Choose the appropriate test based on data type and design
— 4. Calculate the test statistic (t, z, χ², F, log-rank)
— 5. Convert to p-value using the test distribution
— 6. Compare p to α: if p < α, reject H₀
— Two means, continuous, normal: Student t-test (independent or paired)
— >2 means, continuous: ANOVA
— Non-normal continuous or ordinal: Wilcoxon rank-sum (also called Mann–Whitney U) for 2 groups; Kruskal–Wallis for ≥3 groups
— Categorical, large expected counts: chi-square (χ²)
— Categorical, small expected counts (<5): Fisher exact test
— Time-to-event: log-rank test (compares Kaplan–Meier curves)
— Correlation: Pearson (parametric) or Spearman (non-parametric)
— Effect size (larger difference → smaller p)
— Sample size (larger n → smaller p for same effect)
— Variability (smaller SD → smaller p)
— Therefore p conflates these three — you cannot reverse-engineer effect size from p alone
— Power = 1 − β, target ≥0.80
— Underpowered study + true effect → high false-negative rate
— Studies should report a pre-trial sample size calculation
Step 3 management: When evaluating a published trial or QI project result, do not stop at p<0.05. Confirm the test matched the data type, the analysis was pre-specified, and the trial was powered for the stated outcome. An "underpowered negative trial" is not evidence of no effect — it is evidence of inadequate evidence.
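A pre-trial sample size calculation for comparing two means can be sketched with the standard normal-approximation formula n = 2(z₁₋α/₂ + z₁₋β)²σ²/δ² per arm. The function name and the example inputs below are hypothetical, and a t-based calculation would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a mean difference `delta` (two-sided z-approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile matching the target power (1 - beta)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

n_moderate = n_per_group(delta=5, sd=10)   # effect of half an SD
n_small = n_per_group(delta=2, sd=10)      # effect of one fifth of an SD
print(n_moderate, n_small)
```

Shrinking the target effect from 5 to 2 units (with SD 10) raises the requirement from about 63 to about 393 patients per arm, which illustrates why small trials chasing modest effects are so often underpowered.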

• Multiple comparisons problem:
— Testing 20 independent hypotheses at α=0.05 → expected 1 false positive by chance
— Family-wise error rate (FWER) grows: 1 − (0.95)ᵏ for k tests
— Bonferroni correction: new α = 0.05/k (conservative)
— Holm-Bonferroni, Benjamini–Hochberg (false discovery rate): less conservative alternatives
— Applies to: subgroup analyses, multiple endpoints, interim analyses, genome-wide studies
• Pre-specification matters:
— Pre-specified primary endpoint with pre-stated α = confirmatory
— Post-hoc, exploratory, or data-dredged findings = hypothesis-generating only
— "p-hacking" = trying multiple tests/cutoffs until one reaches p<0.05
• Bayesian reframing (conceptual):
— Frequentist p answers P(data | H₀)
— Bayesian posterior answers P(H₀ | data), requires a prior
— A small p with low prior probability of effect (rare disease, weak biology) still implies modest posterior probability, which helps explain why many "significant" findings fail to replicate
• Replication and the "reproducibility crisis":
— Single p<0.05 result is weak evidence; replication strengthens inference
— Pre-registration and reporting all outcomes reduce publication bias
— Meta-analyses pool effect sizes and provide narrower CIs
• Interim analyses and stopping rules:
— Sequential testing inflates type I error unless α is "spent" using O'Brien–Fleming or Pocock boundaries
— DSMB (Data and Safety Monitoring Board) oversees early stopping for efficacy, futility, or harm
Board pearl: A "significant" subgroup finding from an otherwise negative trial is almost always chance. Step 3 will test whether you recommend changing practice based on it; the correct answer is "interpret cautiously, awaiting confirmatory pre-specified replication," not "adopt the new therapy in that subgroup."
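The multiple-comparison arithmetic and the three corrections named above can be sketched in a few lines. This is a minimal illustration, assuming independent tests for the FWER formula; the p-values are made up and the function names are not from any library.

```python
def fwer(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests when every null is true."""
    return 1 - (1 - alpha) ** k

def bonferroni(pvals, alpha=0.05):
    """Reject only p-values at or below alpha / (number of tests)."""
    return [p <= alpha / len(pvals) for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm step-down: compare sorted p-values with alpha / (m - rank); stop at the first failure."""
    m = len(pvals)
    reject = [False] * m
    for rank, i in enumerate(sorted(range(m), key=lambda j: pvals[j])):
        if pvals[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: reject the k smallest p-values, k = largest rank with p_(k) <= (k / m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]  # invented values
print(round(fwer(20), 3))
print(sum(p < 0.05 for p in pvals), sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))
```

With these invented values, five results look "significant" uncorrected, one survives Bonferroni or Holm, and two survive Benjamini–Hochberg, which matches the ordering from most to least conservative.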

— Quadrant 1: Statistically significant + clinically meaningful
— Quadrant 2: Statistically significant + clinically trivial
— Quadrant 3: Not significant + possibly meaningful
— Quadrant 4: Not significant + trivial effect
— Smallest change a patient perceives as beneficial
— Examples: ~10 mm on 100-mm pain VAS; ~5-point change on many QoL scales
— If the entire 95% CI lies below the MCID, the effect is clinically unimportant even when "significant"
— NNT (number needed to treat) = 1 / ARR
— NNH (number needed to harm) = 1 / ARI
— Relative risk reductions sound impressive; absolute risk reductions and NNT ground the decision
— A non-significant p in a superiority trial ≠ equivalence
— Equivalence/non-inferiority requires pre-specified margins and CI-based analysis
Step 3 management: When asked whether to adopt a new therapy based on a trial result, integrate (1) effect size, (2) CI width, (3) p-value, (4) NNT vs NNH, and (5) patient-centered MCID. The exam-correct answer recommends therapy when all align, not when p<0.05 alone.
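The gap between relative and absolute effect measures is easy to show with arithmetic. The sketch below uses hypothetical event risks and a hypothetical helper name; in practice NNT is conventionally rounded up to a whole patient, and a negative ARR (more events on treatment) would be reported as NNH = 1 / |ARR|.

```python
def risk_summary(control_risk, treated_risk):
    """Absolute and relative effect measures from two event risks (proportions between 0 and 1)."""
    arr = control_risk - treated_risk              # absolute risk reduction
    rrr = arr / control_risk                       # relative risk reduction
    nnt = 1 / arr if arr != 0 else float("inf")    # number needed to treat
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}

# Two invented trials with the same 20% relative reduction but different baseline risk.
high_risk = risk_summary(control_risk=0.20, treated_risk=0.16)
low_risk = risk_summary(control_risk=0.010, treated_risk=0.008)
for name, r in [("high baseline", high_risk), ("low baseline", low_risk)]:
    print(f"{name}: RRR={r['RRR']:.0%}, ARR={r['ARR']:.3f}, NNT={r['NNT']:.0f}")
```

Both hypothetical trials could advertise a "20% reduction", yet the NNT is about 25 in the high-risk population and about 500 in the low-risk one.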

— Point estimate (mean difference, RR, OR, HR)
— 95% confidence interval
— Exact p-value (not just "p<0.05" or "NS")
— Sample size and event counts
— Pre-specified analysis plan
— Continuous outcomes: mean ± SD or median (IQR), with mean difference and 95% CI
— Binary outcomes: event rates per group, RR or OR with 95% CI
— Survival outcomes: HR with 95% CI, Kaplan–Meier curves, log-rank p
— Diagnostic studies: sensitivity, specificity, LR+/LR−, with CIs
— Mean difference: null = 0
— RR, OR, HR: null = 1
— Correlation r: null = 0
— If the 95% CI crosses the null → p ≥ 0.05
— Default: two-sided (effect could go either direction)
— One-sided is appropriate only when the opposite direction is implausible or irrelevant — rarely justified in clinical trials
— Reporting a one-sided p without justification is a red flag for p-hacking
— A significant interaction p-value suggests the effect differs across subgroups
— Without a significant interaction, subgroup-specific point estimates should not be over-interpreted
— "Test for interaction" is the right tool, not separate subgroup p-values
— Adverse event tables often lack p-values or have low power
— A non-significant safety signal is not reassurance — it may be underpowered
Board pearl: Demand the trio: effect size, CI, and p-value. If a stem gives you only a p-value, the correct interpretive answer almost always involves acknowledging that the p alone is insufficient — pick the option that asks for the confidence interval or effect magnitude.
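The "trio" for a binary outcome can be computed directly from event counts. The sketch below assumes the common log-scale (Katz) interval and a Wald test for the relative risk; the counts are invented and the function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def relative_risk(events_tx, n_tx, events_ctl, n_ctl, level=0.95):
    """RR with a log-scale (Katz) confidence interval and Wald p-value from 2x2 counts."""
    rr = (events_tx / n_tx) / (events_ctl / n_ctl)
    se = sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctl - 1 / n_ctl)   # SE of log(RR)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(rr) - z_crit * se)
    upper = exp(log(rr) + z_crit * se)
    p = 2 * (1 - NormalDist().cdf(abs(log(rr) / se)))
    return rr, lower, upper, p

# Invented counts: 30/200 events on treatment vs 50/200 on control.
rr, lower, upper, p = relative_risk(30, 200, 50, 200)
print(f"RR {rr:.2f} (95% CI {lower:.2f} to {upper:.2f}), p = {p:.3f}")
```

Reporting all three numbers together lets a reader see that the interval excludes 1 and also how far from 1 the plausible values extend, which a bare "p<0.05" would hide.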

— Compares means of two groups, continuous outcome, approximately normal distribution
— Independent (two separate groups) vs paired (same subjects, two time points)
— Assumptions: normality, equal variance (or use Welch's correction)
— Compares means across ≥3 groups
— Significant overall F-test → follow with post-hoc pairwise tests (Tukey, Bonferroni)
— Categorical data, comparing observed vs expected frequencies
— Requires expected cell counts ≥5; use Fisher exact if smaller
— Categorical data with small samples or sparse cells
— Provides exact p-value rather than approximation
— Non-parametric alternatives for non-normal or ordinal data
— Compare medians/distributions rather than means
— Compares survival distributions between groups
— Paired with Kaplan–Meier curves and Cox proportional hazards regression (yields HR)
— Linear regression: continuous outcome; coefficient with p
— Logistic regression: binary outcome; OR with p
— Cox regression: time-to-event; HR with p
— Adjusts for confounders; coefficients are interpreted "holding others constant"
— Paired categorical data (e.g., before/after, matched case-control)
— Pearson: linear, parametric, continuous normal
— Spearman: rank-based, non-parametric, ordinal or non-normal
Key distinction: Choice of test depends on (1) outcome data type, (2) number of groups, (3) paired vs independent, (4) distributional assumptions. A "wrong test" answer choice on Step 3 often features a t-test applied to categorical data or a chi-square applied to continuous data — eliminate these first.
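The four selection criteria can be written as a small lookup function. This is only a study aid under simplifying assumptions: the function and argument names are hypothetical, ordinal data are treated as non-normal continuous data, the paired non-normal case is mapped to the Wilcoxon signed-rank test, and repeated-measures designs with three or more groups are not handled.

```python
def choose_test(outcome, groups=2, paired=False, normal=True, small_counts=False):
    """Map a study description to a test name, following the selection rules in these notes."""
    if outcome == "time-to-event":
        return "log-rank test"
    if outcome == "categorical":
        if paired:
            return "McNemar test"
        return "Fisher exact test" if small_counts else "chi-square test"
    if outcome == "continuous":
        if groups >= 3:
            return "ANOVA" if normal else "Kruskal-Wallis test"
        if paired:
            return "paired t-test" if normal else "Wilcoxon signed-rank test"
        return "independent t-test" if normal else "Wilcoxon rank-sum (Mann-Whitney) test"
    raise ValueError(f"unknown outcome type: {outcome!r}")

print(choose_test("continuous", groups=3, normal=False))
print(choose_test("categorical", small_counts=True))
```

Walking a vignette through the arguments in order (outcome type, number of groups, pairing, distribution) mirrors the elimination strategy for "wrong test" answer choices.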

— Parametric tests (t-test, ANOVA) lose validity when assumptions fail with small n
— Use non-parametric tests (Wilcoxon, Mann–Whitney) for small or non-normal samples
— Use Fisher exact instead of chi-square when expected cell counts <5
— Small samples → wide CIs → high type II error risk
— Standard chi-square/logistic regression unstable when events <10 per variable
— Consider exact methods, Firth's penalized logistic regression, or Poisson regression with offset for person-time
— Zero events in one arm → cannot compute OR/RR directly; use continuity correction or exact CI
— Income, length of stay, biomarker concentrations often right-skewed
— Options: log-transform then t-test, or non-parametric test on raw data
— Report median (IQR) rather than mean ± SD
— Repeated measures within patients, patients within clinics
— Standard tests assume independence — violations inflate type I error
— Use mixed-effects models, GEE (generalized estimating equations), or paired tests
— Patients lost to follow-up or event-free at study end → censored
— Use Kaplan–Meier and Cox models, not simple proportions
— Informative censoring (loss related to outcome) biases results
— Small samples reduce statistical power and make large-sample approximations unreliable; compensate by choosing exact or non-parametric tests
— Skewed data distort means and SDs; transform or use rank-based methods
Step 3 management: When a stem describes a study with 20 patients, rare outcomes, or markedly skewed labs, the correct analytic choice is almost always a non-parametric or exact test. A standard t-test or chi-square in these settings is a distractor.
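For a sparse 2x2 table, the exact test can be computed from the hypergeometric distribution without any large-sample approximation. The sketch below uses one common two-sided definition (sum the probabilities of all tables with the same margins that are no more likely than the observed one); the function name and the counts are illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is as likely as, or less likely than, the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))

# Invented 16-patient study: 8/10 responders on treatment vs 1/6 on control.
print(round(fisher_exact_two_sided(8, 2, 1, 5), 4))
```

Because several expected cell counts here are below 5, a chi-square approximation would be unreliable, whereas the exact calculation remains valid at any sample size.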

— Smaller eligible populations → often underpowered; use Bayesian designs or extrapolation from adult data
— Composite endpoints common to maintain power; interpret each component
— Age-stratified analyses pre-specified to detect effect modification
— Explanatory (efficacy): ideal conditions, strict inclusion, internal validity — answers "can it work?"
— Pragmatic (effectiveness): real-world conditions, broad inclusion, external validity — answers "does it work in practice?"
— Pragmatic trials often show smaller effect sizes; p-values must be interpreted with absolute risk reduction and NNT
— Pre-specified, biologically plausible, limited in number, with formal interaction tests = trustworthy
— Post-hoc, numerous, no interaction test = likely chance
— A "positive" subgroup in an overall-negative trial is not practice-changing
— Increase event rates and power but can mislead if driven by softer components (e.g., revascularization rather than mortality)
— Always examine individual components
— Surrogate (LDL, HbA1c, BP) may not translate to clinical benefit
— Significant p on surrogate ≠ significant p on mortality (cf. niacin, ezetimibe pre-IMPROVE-IT debates)
— If trial enrolled mostly one demographic, p-value applies to that population
— Step 3 emphasizes assessing whether trial population matches your patient
Board pearl: A trial showing a "significant" benefit in a subgroup (e.g., women, diabetics) when the overall trial was neutral should prompt the answer "hypothesis-generating, requires confirmatory trial" — not adoption. This is one of the most reliably tested principles in Step 3 biostatistics vignettes.
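A formal interaction test asks whether two subgroup estimates differ from each other, which is a different question from whether each differs from the null. The sketch below assumes independent subgroups and estimates on the log scale; the hazard ratios and standard errors are invented for illustration.

```python
from math import log, sqrt
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def interaction_p(log_est_1, se_1, log_est_2, se_2):
    """p-value for the difference between two independent subgroup estimates (log scale)."""
    z = (log_est_1 - log_est_2) / sqrt(se_1 ** 2 + se_2 ** 2)
    return two_sided_p(z)

# Invented subgroup results: HR 0.70 in women (SE 0.15) and HR 0.95 in men (SE 0.12).
women, se_w = log(0.70), 0.15
men, se_m = log(0.95), 0.12
p_women = two_sided_p(women / se_w)
p_men = two_sided_p(men / se_m)
p_int = interaction_p(women, se_w, men, se_m)
print(f"women p={p_women:.3f}, men p={p_men:.3f}, interaction p={p_int:.3f}")
```

In this made-up example the women's subgroup is "significant" and the men's is not, yet the interaction p-value is above 0.05, so the data do not show that the treatment effect truly differs by sex.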

• Top misinterpretations to recognize and reject:
— "p-value is the probability the null is true": WRONG. It is P(data | H₀), not P(H₀ | data).
— "p>0.05 means no effect": WRONG. Absence of evidence ≠ evidence of absence; may be underpowered.
— "p<0.05 means clinically important": WRONG. Statistical ≠ clinical significance.
— "Smaller p = larger effect": WRONG. p reflects effect size, variability, AND sample size combined.
— "p<0.05 means the result will replicate": WRONG. Replication probability depends on power and prior plausibility.
• P-hacking and HARKing:
— P-hacking: trying many analyses until one yields p<0.05
— HARKing: Hypothesizing After Results are Known (recasting an exploratory finding as primary)
— Both inflate false-positive rates dramatically
• Publication bias:
— "Positive" trials (p<0.05) more likely published
— Meta-analyses must search for unpublished data; funnel plots and Egger test detect asymmetry
• Garden of forking paths:
— Multiple defensible analytic choices (covariate selection, outcome definition, cutoffs) inflate type I error even without explicit p-hacking
• Misuse of "trend toward significance":
— p=0.06 and p=0.04 are nearly identical evidence; the 0.05 threshold is arbitrary
— Avoid dichotomizing; report the actual p and CI
• Confounding the p-value with clinical decision:
— Even a robust p<0.001 does not override patient preferences, comorbidities, costs, or harms
Key distinction: The p-value is a decision aid about the null hypothesis, not a measure of truth, importance, or replicability. Step 3 distractors often phrase p-values as if they answered questions they do not; always reject the option that states "the p-value is the probability the treatment works."

— Study design phase: sample size calculation, randomization scheme, primary endpoint selection
— Complex data structures: longitudinal, clustered, missing-not-at-random
— Survival analyses, competing risks
— Adaptive or Bayesian trial designs
— Multiple comparisons strategies
— Interim analyses with formal stopping rules
— Single-center small trial with surprising "significant" finding
— Subgroup-only significance in an overall-negative trial
— Industry-sponsored trial with multiple endpoints and one "winner"
— Observational study with unmeasured confounding
— Surrogate endpoint result without hard-outcome confirmation
— Novel mechanism, no prior supporting evidence
— Effect size implausibly large given biology
— Single trial despite multiple prior negative trials
— Post-hoc or exploratory analyses
— DSMB triggers: futility, harm, overwhelming efficacy at interim
— Early stopping inflates effect size estimates ("regression to the mean" upon replication)
— Run charts and statistical process control (SPC) often preferred over p-values
— Special-cause variation detected by control rules, not t-tests
Step 3 management: In a journal club or QI scenario, the correct "escalation" is often (1) consult biostatistics for proper analytic plan, (2) require pre-specification, and (3) await replication. The wrong answer is "implement now because p<0.05." Recognize that biostatistics consultation, like ID or cardiology consult, is a legitimate management step in evidence-based practice questions.
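One common SPC tool for proportions such as readmission rates is a p-chart with 3-sigma limits. The following is a simplified sketch, assuming the centre line is estimated from all periods shown (real SPC practice usually fixes a baseline period first and applies additional run rules); the monthly counts and the function name are invented.

```python
from math import sqrt

def p_chart_signals(events, denominators):
    """Indices of periods whose proportion falls outside 3-sigma p-chart limits."""
    p_bar = sum(events) / sum(denominators)          # centre line: overall proportion
    signals = []
    for i, (e, n) in enumerate(zip(events, denominators)):
        sigma = sqrt(p_bar * (1 - p_bar) / n)        # binomial SD for this period's denominator
        lower = max(0.0, p_bar - 3 * sigma)
        upper = min(1.0, p_bar + 3 * sigma)
        if not lower <= e / n <= upper:
            signals.append(i)
    return signals

readmissions = [30, 28, 33, 29, 31, 27, 32, 30, 12]   # made-up monthly counts
discharges = [200] * 9
print(p_chart_signals(readmissions, discharges))
```

Only the final month falls outside the control limits in this invented series, so it is flagged as possible special-cause variation, while the month-to-month wobble before it is treated as common-cause noise rather than tested repeatedly with p-values.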

• Same-category "differentials": what else describes evidence?
— Confidence interval (95% CI):
• Range of plausible values for the true parameter
• Conveys precision and magnitude; preferred over p alone
• CI excludes null ↔ p<0.05
— Effect size measures:
• Cohen's d for continuous outcomes (0.2 small, 0.5 medium, 0.8 large)
• RR, OR, HR for ratios; ARR for absolute
• Independent of sample size, unlike p
— NNT / NNH:
• Absolute, patient-centered translations of effect size
• Smaller NNT = more efficient therapy
— Likelihood ratio (LR+ / LR−):
• Diagnostic test performance
• Updates pre-test to post-test probability via Bayes' theorem
• Unrelated to hypothesis-test p-values
— Bayes factor:
• Ratio of evidence favoring H₁ vs H₀
• Directly addresses what p-values cannot: relative support for hypotheses
— Posterior probability:
• Bayesian P(H | data), the quantity clinicians intuitively want
• Hierarchy of evidence integration:
— Effect size + CI > p-value alone
— Pre-specified primary endpoint > secondary/subgroup
— Replicated finding > single trial
— Meta-analysis (with low heterogeneity, I²<50%) > individual trials
• Forest plots and meta-analytic interpretation:
— Pooled estimate with CI; diamond width = precision
— Heterogeneity (I², Cochran Q) assesses consistency across trials
— Random-effects model when heterogeneity present
Board pearl: When a question lists "p<0.05" alongside CI, effect size, and NNT, the CI and NNT are typically the correct answers for clinical decision-making. The p-value is a screening filter; the CI and NNT are the diagnostic and management tools.
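The likelihood-ratio update mentioned above is a three-step conversion: probability to odds, multiply by the LR, odds back to probability. The sketch below uses a hypothetical function name and made-up inputs.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert pre-test probability to post-test probability via odds and a likelihood ratio."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Made-up example: 10% pre-test probability, a test with LR+ of 10 and LR- of 0.1.
after_positive = post_test_probability(0.10, 10)
after_negative = post_test_probability(0.10, 0.1)
print(f"positive test: {after_positive:.1%}; negative test: {after_negative:.1%}")
```

Starting from 10%, a positive result with LR+ = 10 raises the probability to roughly 53%, and a negative result with LR− = 0.1 lowers it to roughly 1%; none of this involves a hypothesis-test p-value.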

— α = pre-specified threshold (decision rule)
— p = observed result (data summary)
— Reject H₀ when p < α
— Power is prospective (study design); p is retrospective (study result)
— Low power → high false-negative risk; doesn't change p interpretation but changes inference from non-significance
— Type I (α): false positive — reject true H₀
— Type II (β): false negative — fail to reject false H₀
— p relates to type I error rate only if H₀ is true
— Sensitivity/specificity are properties of a test; PPV/NPV depend on prevalence
— Analogously: a "significant" p depends on prior plausibility — low-prior, low-power studies yield high false-discovery rates even at p<0.05 (Ioannidis)
— RR/OR/HR are effect sizes; p tests whether they differ from null
— A trial can report RR=2.0 with p=0.3 (small n, imprecise estimate) or RR=1.05 with p<0.001 (huge n, trivial effect)
— r measures strength and direction (−1 to +1)
— p tests whether r differs from 0
— High n yields significant p for trivially small r
— A significant p does not establish causation
— Causation requires design (randomization), Bradford Hill criteria, or counterfactual frameworks
Key distinction: The p-value answers a narrow question — "how surprising is this data under the null?" It is not interchangeable with effect size, power, predictive value, or causation. Step 3 stems test whether you can identify the right statistical quantity for the question being asked.
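The analogy between PPV and a "significant" result can be put into numbers. The sketch below treats a significant finding like a positive diagnostic test, with power playing the role of sensitivity and α the false-positive rate; the priors and power values are illustrative assumptions, not estimates for any real field.

```python
def prob_true_given_significant(prior, power=0.80, alpha=0.05):
    """Probability that an effect is real given p < alpha, for a given prior and power."""
    true_positives = power * prior           # real effects that reach significance
    false_positives = alpha * (1 - prior)    # null effects that reach significance by chance
    return true_positives / (true_positives + false_positives)

for prior, power in [(0.5, 0.80), (0.1, 0.80), (0.1, 0.20)]:
    ppv = prob_true_given_significant(prior, power)
    print(f"prior={prior}, power={power}: P(real effect | p<0.05) = {ppv:.2f}")
```

Under these assumptions a significant result is very likely real when the hypothesis was plausible and the study well powered (about 0.94), but closer to a coin flip or worse when the prior is low and the study underpowered (about 0.31).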

— 1. Always read the CI before the p-value
— 2. Demand pre-specification
— 3. Translate to absolute terms
— 4. Check the trial population
— 5. Seek replication and meta-analysis
— 6. Account for harms with same rigor as benefits
— CONSORT for RCTs
— STROBE for observational studies
— PRISMA for systematic reviews/meta-analyses
— GRADE for evidence quality ratings (high → very low)
— What was the effect size?
— How wide is the CI?
— Was this the primary endpoint?
— How many comparisons were made?
— Has it been replicated?
— Is the effect clinically meaningful (MCID, NNT)?
— Do harms outweigh benefits?
— Use absolute risks and natural frequencies ("1 in 50" rather than "RR 0.85")
— Decision aids improve shared decision-making
Step 3 management: Long-term practice "prevention" against statistical misinterpretation involves institutional habits — journal clubs, EBM rounds, pre-registration culture, and biostatistics partnerships. The exam rewards the physician who treats every p-value as one data point among many, not a verdict.

— Critical appraisal of trials in your specialty
— Familiarity with key landmark trials and their effect sizes (not just "positive/negative")
— Recognition of common statistical fallacies
— Comfort with absolute vs relative risk communication
— Journal club participation — verbalize interpretations
— Use structured appraisal tools (CASP, JAMA Users' Guides)
— Cross-check secondary sources (Cochrane, UpToDate, DynaMed) against primary trials
— Specialty guideline updates: every 1–5 years
— Major trial readouts: track via AHA/ACC, ASCO, ADA, NEJM, JAMA, Lancet
— Meta-analyses and Cochrane reviews: revisit annually for high-volume conditions
— Disclose absolute benefits and harms with NNT/NNH
— Acknowledge uncertainty when CIs are wide
— Discuss alternative options including no treatment
— SPC charts to detect process change without p-value abuse
— Small-sample QI initiatives often misuse t-tests; prefer run-chart rules
— Force yourself to state the CI and NNT before commenting on p
— Reject the "significant = important" conflation in your own speech
— Model proper statistical reasoning to residents and students
— Correct gently when "p<0.05 = it works" appears in case presentations
Board pearl: Step 3 expects practicing physicians to maintain lifelong evidence-based practice habits, not just memorize trial results. Questions about journal club, MOC, and CME often hinge on whether you can identify a flawed inference and propose a structured appraisal approach.

— Patients have the right to understand absolute risks and benefits, not just relative
— Quoting "50% reduction" without absolute numbers (e.g., 2% → 1%) is potentially misleading
— Consent for research participation must include realistic disclosure of likely benefit, equipoise, and uncertainty
— Selective outcome reporting (publishing only significant findings) violates scientific integrity
— P-hacking and HARKing are research misconduct when intentional
— Trial pre-registration in a public registry (e.g., ClinicalTrials.gov) is required by ICMJE member journals as a condition of publication
— DSMBs uphold patient safety by stopping trials early for harm or overwhelming efficacy
— Industry-funded trials are not inherently invalid but require transparency
— Disclosure required in publications and at point of care
— Marketing materials emphasizing relative risk reductions without absolute context warrant skepticism
— A physician handing off care must communicate uncertainty in evidence, not present therapies as definitively proven when based on a single underpowered trial
— Quality dashboards reporting "significant improvement" in readmissions or infection rates can mislead if multiple comparisons or small n are involved — verify with SPC methodology
— Adverse event signals (e.g., post-marketing pharmacovigilance) often rely on observational data with wide CIs; act on signals while acknowledging uncertainty
— FDA MedWatch reporting of serious unexpected events is appropriate regardless of "statistical significance" (voluntary for clinicians, mandatory for manufacturers)
— Trials excluding women, elderly, minorities yield p-values that may not generalize
— Ethical practice requires explicit acknowledgment of evidence gaps for underrepresented groups
Step 3 management: When a drug rep, QI report, or trial summary presents a "significant" result, the ethically appropriate response is to (1) request absolute numbers, (2) ask about pre-specification and replication, and (3) communicate uncertainty transparently to patients during informed consent.

• Definitional anchors:
— p-value = P(data ≥ observed | H₀ true)
— α = pre-specified type I error threshold (typically 0.05)
— β = type II error; power = 1 − β (target ≥0.80)
• CI–p shortcuts:
— 95% CI excludes null → p < 0.05
— 95% CI for difference: null = 0
— 95% CI for ratio (RR/OR/HR): null = 1
• Sample size effects:
— Doubling n shrinks SE by √2 → narrower CI, smaller p
— Huge n can make trivial effects "significant"
— Small n can miss large effects ("non-significant")
• Multiple comparisons:
— Bonferroni: α/k
— 20 tests at α=0.05 → ~1 false positive expected
• Test selection cheat sheet:
— 2 means, normal: t-test
— ≥3 means: ANOVA
— Non-normal: Wilcoxon/Mann–Whitney/Kruskal–Wallis
— Categorical: chi-square (Fisher if small)
— Paired categorical: McNemar
— Survival: log-rank, Cox
• Common pitfalls:
— p>0.05 ≠ no effect
— p<0.05 ≠ important effect
— Significant subgroup in negative trial = chance until replicated
— "Trend toward significance" = not significant
• Vocabulary:
— Type I error: false positive (α)
— Type II error: false negative (β)
— Power: detect true effect (1 − β)
— Effect size: magnitude (independent of n)
— MCID: smallest patient-meaningful change
— NNT = 1/ARR; NNH = 1/ARI
• Frameworks:
— CONSORT (RCT), STROBE (observational), PRISMA (meta-analysis), GRADE (quality)
Board pearl: If forced to choose one number to report from a trial, choose the 95% CI for the effect size, not the p-value. The CI contains the p-value's information plus magnitude and precision, and Step 3 answer keys reliably reward this preference.

• Pattern 1: "Statistically significant but trivial"
— Stem: 50,000-patient trial shows 0.3 mmHg SBP reduction, p<0.001
— Trap: "Adopt new therapy"
— Correct: Effect too small to be clinically meaningful; do not change practice
• Pattern 2: "Underpowered negative trial"
— Stem: 40-patient pilot RCT, 25% relative mortality reduction (RR 0.75, 95% CI 0.50–1.15), p=0.18
— Trap: "Therapy is ineffective"
— Correct: Inconclusive; possibly effective; requires larger trial
• Pattern 3: "Subgroup analysis trap"
— Stem: Overall trial neutral; one of 15 subgroups (e.g., diabetics) shows p=0.03 benefit
— Trap: "Recommend therapy for diabetics"
— Correct: Hypothesis-generating only; chance likely; needs confirmatory trial
• Pattern 4: "Multiple comparisons inflation"
— Stem: Genome-wide study tests 10,000 SNPs, finds one with p=0.001
— Trap: "Strong association"
— Correct: Expected ~10 such findings by chance; apply Bonferroni or FDR
• Pattern 5: "Wrong test"
— Stem: Investigators use chi-square on continuous outcome data
— Correct: Should use t-test or non-parametric equivalent
• Pattern 6: "Misinterpreting p as P(H₀)"
— Trap option: "p=0.03 means a 3% chance the null is true"
— Correct: p = P(data | H₀); does NOT give P(H₀ | data)
• Pattern 7: "Surrogate endpoint hype"
— Stem: Drug lowers LDL with p<0.001, but mortality unchanged
— Correct: Surrogate benefit ≠ clinical benefit
• Pattern 8: "Early stopping inflates effect"
— Stem: Trial stopped early for efficacy at interim
— Correct: Effect size likely overestimated; await replication
• Pattern 9: "Drug rep / shared decision"
— Correct: Translate to absolute risk, NNT/NNH; discuss uncertainty
Step 3 management: When stems pair a p-value with a CI, effect size, or NNT, prioritize the clinically translated answer choice over the one that simply restates statistical significance. The exam consistently rewards integration over reflex.

A p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis is true, and nothing more. It must therefore always be interpreted alongside effect size, confidence interval, pre-specification, sample size, and clinical importance before it can inform any patient care decision.
• Core recap bullets:
— What p IS: P(data | H₀); a measure of compatibility with the null
— What p IS NOT: probability the null is true, probability the treatment works, measure of effect size, or guarantee of replication
• Decision framework:
— Always read the 95% CI before the p-value; it captures magnitude, direction, and precision
— Translate findings into absolute risk reduction, NNT, and NNH for patient-centered decisions
— Confirm the result is from a pre-specified primary endpoint, not a post-hoc or subgroup analysis
— Apply multiple-comparison corrections when many tests are performed
• Pitfalls to reject on exam day:
— "p<0.05 means clinically important": FALSE
— "p>0.05 means no effect": FALSE; may be underpowered
— "Significant subgroup in a negative trial changes practice": FALSE; hypothesis-generating
— "Smaller p means larger effect": FALSE; p conflates effect size, variability, and n
• Best practice habits:
— Demand effect size + CI + p as a trio
— Require pre-registration and replication before adopting novel therapies
— Communicate uncertainty honestly during informed consent and shared decision-making
Board pearl: The single most reliable Step 3 instinct in any biostatistics vignette is to distrust isolated p-values and demand the confidence interval, effect size, and clinical context, every time, without exception.

