Biostatistics & Population Health

P-values: interpretation and limitations

Clinical Overview and When to Suspect Misinterpretation of P-values

• Definition: The p-value is the probability of observing data as extreme or more extreme than what was observed, assuming the null hypothesis (H₀) is true.
— It is a conditional probability: P(data | H₀), not P(H₀ | data)
— It does not tell you the probability the null hypothesis is true or false
— It does not measure effect size or clinical importance
• When p-value misinterpretation drives wrong answers on Step 3:
— A "statistically significant" result (p<0.05) is assumed to be clinically meaningful — it may not be
— A "non-significant" result (p>0.05) is interpreted as "no effect" — absence of evidence ≠ evidence of absence
— Multiple comparisons inflate false-positive rates without correction
— A small p-value from a huge sample may reflect a trivial effect size
— A large p-value from an underpowered study may mask a real effect
• Conceptual anchors:
— α (alpha): pre-specified threshold for type I error, conventionally 0.05
— Type I error: rejecting a true H₀ (false positive)
— Type II error (β): failing to reject a false H₀ (false negative)
— Power = 1 − β: probability of detecting a true effect; conventionally ≥0.80
• Clinical scenario triggers on the exam:
— A trial reports p=0.04 for a 0.2 mmHg BP reduction in 50,000 patients → significant but clinically irrelevant
— A small RCT shows 30% mortality reduction with p=0.18 → may be a true effect, underpowered
— Subgroup analyses show one "significant" finding among 20 tested → likely chance
Board pearl: The p-value answers "how surprising is this data if the null is true?" — it never answers "is the treatment worth using?" That second question requires effect size, confidence intervals, NNT, and clinical context. Step 3 stems frequently pair a small p-value with a clinically trivial effect to test whether you conflate statistical with clinical significance.
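To make the definition concrete, here is a minimal sketch of the tail-probability calculation for a test statistic that is approximately standard normal under H₀. The function name `two_sided_p` and the normal approximation are assumptions made for illustration; a real analysis uses the reference distribution that matches the chosen test.

```python
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic: P(|Z| >= |z|) assuming H0 is true."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Larger |z| means the data are more surprising under H0, so p shrinks.
    for z in (0.5, 1.96, 3.0):
        print(f"z = {z:>4}: p = {two_sided_p(z):.4f}")
```

Note that the function takes only the test statistic as input: nothing in it says how large or how clinically useful the effect is.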

Presentation Patterns and Key History — How P-values Appear in Exam Stems

• Classic vignette frames:
— Journal club scenario: resident summarizes a new RCT; attending asks what the p-value means
— Drug rep scenario: pharmaceutical representative claims "statistically significant benefit" — you must interpret critically
— Quality improvement: hospital reports a "significant" reduction in readmissions after intervention
— Screening or diagnostic test studies reporting p-values for sensitivity/specificity comparisons
• Key "history" elements to extract from a study description:
— Study design: RCT, cohort, case-control, cross-sectional — affects which test and which p-value matters
— Sample size (n): huge n inflates statistical significance; tiny n underpowers detection
— Pre-specified vs post-hoc analyses: post-hoc and subgroup p-values are hypothesis-generating, not confirmatory
— Number of comparisons: multiple testing without correction multiplies false positives
— Primary vs secondary endpoint: only the pre-specified primary endpoint carries the trial's stated α
— Effect size and confidence interval: always evaluate alongside p
• Red-flag phrases in stems:
— "Trend toward significance" (p=0.06–0.10) — not statistically significant by convention
— "Significant in a subgroup analysis" — likely chance unless pre-specified and adjusted
— "p<0.05 in at least one of 20 outcomes" — expect ~1 false positive by chance alone
— "Borderline significant" — not a real statistical category
• What the stem expects you to ask:
— Was α pre-specified?
— Was the study powered for this outcome?
— Is the confidence interval narrow or wide?
— Does the magnitude of effect matter clinically?
Key distinction: A primary endpoint p-value tests the trial's main hypothesis at the stated α; secondary endpoint p-values are exploratory unless a hierarchical testing strategy was pre-specified. Step 3 will reward candidates who refuse to treat a "significant" secondary or subgroup finding as practice-changing without replication.
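The question "was the study powered for this outcome?" can be made concrete with the usual normal-approximation formulas for a two-arm comparison of means. This is a sketch under stated assumptions: equal arm sizes, a known common SD, and a two-sided test. The function names `n_per_group` and `approx_power` are hypothetical helpers, not part of any library.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients per arm needed to detect a mean difference `delta`."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def approx_power(delta: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power with `n` patients per arm (ignores the negligible far tail)."""
    se = sd * sqrt(2 / n)
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(delta) / se - z_alpha)


if __name__ == "__main__":
    # Hypothetical example: detect a half-SD difference at alpha 0.05 and power 0.80.
    print("n per arm:", n_per_group(delta=0.5, sd=1.0))
    print("power with 20 per arm:", round(approx_power(0.5, 1.0, 20), 2))
```

A small trial that reports p>0.05 for an effect of this size had well under 80% power to detect it, which is why the stem's "non-significant" result cannot be read as "no effect."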

Physical Exam Findings — The "Exam" of a P-value (Anatomy of a Statistical Result)

• What to "inspect" when a p-value is reported:
— The point estimate: the observed effect (relative risk, odds ratio, mean difference, hazard ratio)
— The confidence interval (CI): range of plausible values; 95% CI is the standard companion to p<0.05
— The test used: t-test, chi-square, Fisher exact, log-rank, ANOVA, regression coefficient
— One- vs two-tailed test: two-tailed is standard; one-tailed halves the p-value when the effect lies in the hypothesized direction and is rarely justified
• The CI–p relationship (must memorize):
— If the 95% CI for a difference excludes 0, then p < 0.05
— If the 95% CI for a ratio (RR, OR, HR) excludes 1, then p < 0.05
— If the CI crosses the null value (0 or 1), p ≥ 0.05
— A narrow CI = precise estimate (often large n); wide CI = imprecise (often small n or rare event)
• "Hemodynamic" assessment — is the result stable?
— Effect size large + narrow CI + small p = robust, clinically meaningful
— Effect size large + wide CI + small p = real but imprecise (need replication)
— Effect size tiny + narrow CI + tiny p = statistically significant but clinically trivial (massive n)
— Effect size large + wide CI + p=0.08 = possibly real but underpowered — do not dismiss
• Common exam "findings":
— RR 1.02, 95% CI 1.01–1.03, p<0.001 → significant but trivial; n is huge
— RR 0.55, 95% CI 0.28–1.08, p=0.07 → suggestive but CI crosses 1
— HR 0.70, 95% CI 0.55–0.89, p=0.003 → meaningful and significant
Board pearl: Always read the confidence interval before the p-value. The CI conveys magnitude, direction, and precision in one glance; the p-value alone discards all three. Step 3 stems frequently provide both — choose the answer that integrates CI width and clinical relevance, not the one that simply parrots "p<0.05 means it works."
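The CI–p relationship for ratio measures can be checked numerically. The sketch below back-calculates an approximate two-sided p-value from a reported ratio and its 95% CI by working on the log scale. It assumes the interval is a Wald-type interval that is symmetric on the log scale; `p_from_ratio_ci` is a hypothetical helper name.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_ci(estimate: float, lower: float, upper: float, level: float = 0.95) -> float:
    """Approximate two-sided p-value implied by a ratio (RR, OR, HR) and its CI."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)      # about 1.96 for a 95% CI
    se_log = (log(upper) - log(lower)) / (2 * z_crit)   # SE of the log ratio
    z = log(estimate) / se_log                          # null value is ln(1) = 0
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # The three example results listed in the notes above.
    for est, lo, hi in [(1.02, 1.01, 1.03), (0.55, 0.28, 1.08), (0.70, 0.55, 0.89)]:
        print(f"{est} ({lo}-{hi}): p ~ {p_from_ratio_ci(est, lo, hi):.4f}")
```

The back-calculated values agree with the quoted p-values only approximately, because published confidence limits are rounded. The qualitative rule holds in each case: an interval that excludes 1 gives p below 0.05, and one that crosses 1 does not.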

Diagnostic Workup — Calculating and Setting Up a P-value

• Step-by-step "workup" of a hypothesis test:
— 1. State H₀ and H₁: H₀ typically = no difference / no association; H₁ = the alternative
— 2. Set α: conventionally 0.05 (two-tailed)
— 3. Choose the appropriate test based on data type and design
— 4. Calculate the test statistic (t, z, χ², F, log-rank)
— 5. Convert to p-value using the test distribution
— 6. Compare p to α: if p < α, reject H₀
• Choosing the right test (high-yield):
— Two means, continuous, normal: Student t-test (independent or paired)
— >2 means, continuous: ANOVA
— Non-normal continuous or ordinal: Wilcoxon rank-sum (Mann–Whitney U) for two groups; Kruskal–Wallis for ≥3 groups
— Categorical, large expected counts: chi-square (χ²)
— Categorical, small expected counts (<5): Fisher exact test
— Time-to-event: log-rank test (compares Kaplan–Meier curves)
— Correlation: Pearson (parametric) or Spearman (non-parametric)
• What determines p-value magnitude:
— Effect size (larger difference → smaller p)
— Sample size (larger n → smaller p for same effect)
— Variability (smaller SD → smaller p)
— Therefore p conflates these three — you cannot reverse-engineer effect size from p alone
• Sample size and power:
— Power = 1 − β, target ≥0.80
— Underpowered study + true effect → high false-negative rate
— Studies should report a pre-trial sample size calculation
Step 3 management: When evaluating a published trial or QI project result, do not stop at p<0.05. Confirm the test matched the data type, the analysis was pre-specified, and the trial was powered for the stated outcome. An "underpowered negative trial" is not evidence of no effect — it is evidence of inadequate evidence.
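The three drivers of p (effect size, sample size, variability) can be seen by holding the effect and SD fixed and varying only n. The sketch below uses a two-sample z approximation; the 0.5 mmHg difference and the SD of 15 mmHg are hypothetical values chosen for illustration, and `two_sample_p` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p-value for a difference in means (z approximation, equal arms)."""
    se = sd * sqrt(2 / n_per_arm)   # standard error of the difference
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Same clinically trivial effect every time; only the sample size changes.
    for n in (50, 500, 5_000, 50_000):
        print(f"n per arm = {n:>6}: p = {two_sample_p(0.5, 15.0, n):.2g}")
```

Because the identical 0.5 mmHg difference moves from clearly non-significant to highly significant as n grows, the p-value alone cannot tell you how large the effect is.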

Diagnostic Workup — Advanced Concepts: Multiple Comparisons, Bayes, and Replication

• Multiple comparisons problem:
— Testing 20 independent hypotheses at α=0.05 → expected 1 false positive by chance
— Family-wise error rate (FWER) grows: 1 − (0.95)ᵏ for k tests
— Bonferroni correction: new α = 0.05/k (conservative)
— Holm-Bonferroni, Benjamini–Hochberg (false discovery rate): less conservative alternatives
— Applies to: subgroup analyses, multiple endpoints, interim analyses, genome-wide studies
• Pre-specification matters:
— Pre-specified primary endpoint with pre-stated α = confirmatory
— Post-hoc, exploratory, or data-dredged findings = hypothesis-generating only
— "p-hacking" = trying multiple tests/cutoffs until one reaches p<0.05
• Bayesian reframing (conceptual):
— Frequentist p answers P(data | H₀)
— Bayesian posterior answers P(H₀ | data), requires a prior
— A small p with low prior probability of effect (rare disease, weak biology) still implies modest posterior probability — explains why many "significant" findings fail to replicate
• Replication and the "reproducibility crisis":
— Single p<0.05 result is weak evidence; replication strengthens inference
— Pre-registration and reporting all outcomes reduce publication bias
— Meta-analyses pool effect sizes and provide narrower CIs
• Interim analyses and stopping rules:
— Sequential testing inflates type I error unless α is "spent" using O'Brien–Fleming or Pocock boundaries
— DSMB (Data Safety Monitoring Board) oversees early stopping for efficacy, futility, or harm
Board pearl: A "significant" subgroup finding from an otherwise negative trial is almost always chance. Step 3 will test whether you recommend changing practice based on it — the correct answer is "interpret cautiously, awaiting confirmatory pre-specified replication," not "adopt the new therapy in that subgroup."
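The multiple-comparisons arithmetic above can be sketched in a few lines. The family-wise error rate formula assumes independent tests, as in the notes; the Benjamini–Hochberg function is a plain step-up implementation written for illustration, and the p-values in the demo are hypothetical.

```python
def family_wise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / k


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return True for each hypothesis rejected while controlling the FDR at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank          # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print("FWER for 20 tests:", round(family_wise_error_rate(20), 3))
    print("Bonferroni threshold for 20 tests:", bonferroni_threshold(20))
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))
```

With 20 uncorrected tests the chance of at least one false positive is well over one half, which is why a single "significant" subgroup among many is weak evidence.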

Risk Stratification — Statistical vs Clinical Significance Decision Framework

• The four-quadrant framework for interpreting any trial result:
— Quadrant 1: Statistically significant + clinically meaningful
• Large effect, narrow CI, small p → adopt if benefits outweigh harms
• Example: HR 0.75 for mortality, 95% CI 0.65–0.86, p<0.001
— Quadrant 2: Statistically significant + clinically trivial
• Tiny effect made "significant" by huge n → do NOT change practice
• Example: 0.5 mmHg SBP reduction, p=0.001, n=80,000
— Quadrant 3: Not significant + possibly meaningful
• Large point estimate, wide CI, p=0.08 → underpowered; needs larger trial
• Do not conclude "no effect"
— Quadrant 4: Not significant + trivial effect
• Small effect, narrow CI overlapping null → consistent with no meaningful effect
• Minimum clinically important difference (MCID):
— Smallest change a patient perceives as beneficial
— Examples: ~10 mm on 100-mm pain VAS; ~5-point change on many QoL scales
— If the entire 95% CI lies below the MCID, the effect is clinically unimportant even when "significant"
• Translating to absolute numbers patients understand:
— NNT (number needed to treat) = 1 / ARR
— NNH (number needed to harm) = 1 / ARI
— Relative risk reductions sound impressive; absolute risk reductions and NNT ground the decision
• Equivalence and non-inferiority trials:
— A non-significant p in a superiority trial ≠ equivalence
— Equivalence/non-inferiority requires pre-specified margins and CI-based analysis
Step 3 management: When asked whether to adopt a new therapy based on a trial result, integrate (1) effect size, (2) CI width, (3) p-value, (4) NNT vs NNH, and (5) patient-centered MCID. The exam-correct answer recommends therapy when all align, not when p<0.05 alone.
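The NNT arithmetic can be sketched as follows. The event rates are hypothetical, `risk_summary` is an illustrative helper name, and rounding NNT up to the next whole patient is a common convention rather than a rule taken from these notes.

```python
from math import ceil


def risk_summary(control_rate: float, treated_rate: float) -> dict:
    """Absolute and relative risk reduction plus NNT from two event rates."""
    arr = control_rate - treated_rate            # absolute risk reduction
    rrr = arr / control_rate                     # relative risk reduction
    # round() guards against floating-point noise before rounding up.
    nnt = ceil(round(1 / arr, 9)) if arr > 0 else None
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}


if __name__ == "__main__":
    # The same 50% relative reduction at two different baseline risks.
    print(risk_summary(0.02, 0.01))
    print(risk_summary(0.20, 0.10))
```

Both scenarios share the same relative risk reduction, yet the number of patients who must be treated to prevent one event differs tenfold. That gap is the reason absolute numbers anchor the decision.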

Pharmacotherapy — "First-Line" Statistical Reporting Standards

• What a well-reported result must contain (and what to demand on the exam):
— Point estimate (mean difference, RR, OR, HR)
— 95% confidence interval
— Exact p-value (not just "p<0.05" or "NS")
— Sample size and event counts
— Pre-specified analysis plan
• Common reporting conventions:
— Continuous outcomes: mean ± SD or median (IQR), with mean difference and 95% CI
— Binary outcomes: event rates per group, RR or OR with 95% CI
— Survival outcomes: HR with 95% CI, Kaplan–Meier curves, log-rank p
— Diagnostic studies: sensitivity, specificity, LR+/LR−, with CIs
• Effect measures and their nulls:
— Mean difference: null = 0
— RR, OR, HR: null = 1
— Correlation r: null = 0
— If the 95% CI crosses the null → p ≥ 0.05
• One-sided vs two-sided tests:
— Default: two-sided (effect could go either direction)
— One-sided is appropriate only when the opposite direction is implausible or irrelevant — rarely justified in clinical trials
— Reporting a one-sided p without justification is a red flag for p-hacking
• Interaction terms and effect modification:
— A significant interaction p-value suggests the effect differs across subgroups
— Without a significant interaction, subgroup-specific point estimates should not be over-interpreted
— "Test for interaction" is the right tool, not separate subgroup p-values
• Reporting harms:
— Adverse event tables often lack p-values or have low power
— A non-significant safety signal is not reassurance — it may be underpowered
Board pearl: Demand the trio: effect size, CI, and p-value. If a stem gives you only a p-value, the correct interpretive answer almost always involves acknowledging that the p alone is insufficient — pick the option that asks for the confidence interval or effect magnitude.
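A binary-outcome result reported to this standard (event counts, RR, 95% CI, exact p) can be produced from the raw 2×2 counts. The sketch below uses the standard large-sample log-scale (Wald) interval for a relative risk; the counts are hypothetical and `relative_risk_ci` is an illustrative helper name. It is not appropriate for sparse tables or zero cells.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(events_trt: int, n_trt: int, events_ctl: int, n_ctl: int,
                     level: float = 0.95) -> tuple[float, float, float, float]:
    """Return (RR, lower, upper, two-sided p) using the log-scale Wald method."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se_log_rr = sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(rr) - z_crit * se_log_rr)
    upper = exp(log(rr) + z_crit * se_log_rr)
    p = 2 * (1 - NormalDist().cdf(abs(log(rr)) / se_log_rr))
    return rr, lower, upper, p


if __name__ == "__main__":
    # Hypothetical trial: 15/100 events on treatment vs 30/100 on control.
    rr, lo, hi, p = relative_risk_ci(15, 100, 30, 100)
    print(f"RR {rr:.2f}, 95% CI {lo:.2f}-{hi:.2f}, p = {p:.3f}")
```

Printing the estimate, interval, and exact p together mirrors the reporting trio; the null value for a ratio is 1, so an upper limit below 1 corresponds to p below 0.05.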

Procedures — Common Statistical Tests Decoded (Expanded Reference)

• t-test (Student's t-test):
— Compares means of two groups, continuous outcome, approximately normal distribution
— Independent (two separate groups) vs paired (same subjects, two time points)
— Assumptions: normality, equal variance (or use Welch's correction)
• ANOVA (analysis of variance):
— Compares means across ≥3 groups
— Significant overall F-test → follow with post-hoc pairwise tests (Tukey, Bonferroni)
• Chi-square (χ²) test:
— Categorical data, comparing observed vs expected frequencies
— Requires expected cell counts ≥5; use Fisher exact if smaller
• Fisher exact test:
— Categorical data with small samples or sparse cells
— Provides exact p-value rather than approximation
• Wilcoxon / Mann–Whitney U / Kruskal–Wallis:
— Non-parametric alternatives for non-normal or ordinal data
— Compare medians/distributions rather than means
• Log-rank test:
— Compares survival distributions between groups
— Paired with Kaplan–Meier curves and Cox proportional hazards regression (yields HR)
• Regression:
— Linear regression: continuous outcome; coefficient with p
— Logistic regression: binary outcome; OR with p
— Cox regression: time-to-event; HR with p
— Adjusts for confounders; coefficients are interpreted "holding others constant"
• McNemar test:
— Paired categorical data (e.g., before/after, matched case-control)
• Pearson vs Spearman correlation:
— Pearson: linear, parametric, continuous normal
— Spearman: rank-based, non-parametric, ordinal or non-normal
Key distinction: Choice of test depends on (1) outcome data type, (2) number of groups, (3) paired vs independent, (4) distributional assumptions. A "wrong test" answer choice on Step 3 often features a t-test applied to categorical data or a chi-square applied to continuous data — eliminate these first.
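The four selection criteria can be written as a small decision function. This is a study aid only: `choose_test` and its argument names are hypothetical, the mapping covers just the comparison tests listed in these notes, and real test selection involves more judgment than a lookup. The Wilcoxon signed-rank test is included as the usual paired non-parametric counterpart.

```python
def choose_test(outcome: str, groups: int = 2, paired: bool = False,
                normal: bool = True, small_counts: bool = False) -> str:
    """Suggest a test from outcome type, number of groups, pairing, and distribution."""
    if outcome == "time_to_event":
        return "log-rank test"
    if outcome == "categorical":
        if paired:
            return "McNemar test"
        return "Fisher exact test" if small_counts else "chi-square test"
    if outcome in ("continuous", "ordinal"):
        parametric = outcome == "continuous" and normal
        if groups == 2:
            if parametric:
                return "paired t-test" if paired else "independent t-test"
            return "Wilcoxon signed-rank test" if paired else "Mann-Whitney U test"
        if paired:
            raise ValueError("repeated measures with 3+ groups are outside this sketch")
        return "ANOVA" if parametric else "Kruskal-Wallis test"
    raise ValueError(f"unknown outcome type: {outcome!r}")


if __name__ == "__main__":
    print(choose_test("continuous", groups=3))
    print(choose_test("categorical", small_counts=True))
    print(choose_test("ordinal", groups=2))
```

Reading the branches top to bottom reproduces the elimination strategy: settle the outcome data type first, then pairing, then the number of groups and distributional assumptions.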

Special Populations — Small Samples, Rare Events, and Skewed Data

• Small sample size considerations:
— Parametric tests (t-test, ANOVA) lose validity when assumptions fail with small n
— Use non-parametric tests (Wilcoxon, Mann–Whitney) for small or non-normal samples
— Use Fisher exact instead of chi-square when expected cell counts <5
— Small samples → wide CIs → high type II error risk
• Rare events:
— Standard chi-square/logistic regression unstable when events <10 per variable
— Consider exact methods, Firth's penalized logistic regression, or Poisson regression with offset for person-time
— Zero events in one arm → cannot compute OR/RR directly; use continuity correction or exact CI
• Skewed continuous data:
— Income, length of stay, biomarker concentrations often right-skewed
— Options: log-transform then t-test, or non-parametric test on raw data
— Report median (IQR) rather than mean ± SD
• Clustered or correlated data:
— Repeated measures within patients, patients within clinics
— Standard tests assume independence — violations inflate type I error
— Use mixed-effects models, GEE (generalized estimating equations), or paired tests
• Censored data (survival analysis):
— Patients lost to follow-up or event-free at study end → censored
— Use Kaplan–Meier and Cox models, not simple proportions
— Informative censoring (loss related to outcome) biases results
• Renal/hepatic analogy — biostatistical "dose adjustment":
— Small samples = "reduced clearance" of statistical power; adjust by choosing exact or non-parametric tests
— Skewed data = "altered metabolism"; transform or use rank-based methods
Step 3 management: When a stem describes a study with 20 patients, rare outcomes, or markedly skewed labs, the correct analytic choice is almost always a non-parametric or exact test. A standard t-test or chi-square in these settings is a distractor.
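For small 2×2 tables, the Fisher exact test mentioned above can be computed directly from the hypergeometric distribution. The sketch below is a from-scratch illustration using only the standard library. It uses one common two-sided definition (sum the probabilities of all tables no more likely than the observed one), and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is no more probable than the observed one.
    return sum(
        p for p in (table_prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical 8-patient study: 3/4 respond on drug vs 1/4 on placebo.
    print(f"p = {fisher_exact_two_sided(3, 1, 1, 3):.4f}")
```

With only eight patients, even a 75% versus 25% response rate yields a p-value far above 0.05, illustrating how small samples carry a high type II error risk.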

Special Populations — Pediatrics, Pragmatic Trials, and Subgroup Analyses

• Pediatric trial statistics:
— Smaller eligible populations → often underpowered; use Bayesian designs or extrapolation from adult data
— Composite endpoints common to maintain power; interpret each component
— Age-stratified analyses pre-specified to detect effect modification
• Pragmatic vs explanatory trials:
— Explanatory (efficacy): ideal conditions, strict inclusion, internal validity — answers "can it work?"
— Pragmatic (effectiveness): real-world conditions, broad inclusion, external validity — answers "does it work in practice?"
— Pragmatic trials often show smaller effect sizes; p-values must be interpreted with absolute risk reduction and NNT
• Subgroup analyses — high-yield rules:
— Pre-specified, biologically plausible, limited in number, with formal interaction tests = trustworthy
— Post-hoc, numerous, no interaction test = likely chance
— A "positive" subgroup in an overall-negative trial is not practice-changing
• Composite endpoints:
— Increase event rates and power but can mislead if driven by softer components (e.g., revascularization rather than mortality)
— Always examine individual components
• Surrogate vs hard endpoints:
— Surrogate (LDL, HbA1c, BP) may not translate to clinical benefit
— Significant p on surrogate ≠ significant p on mortality (cf. niacin, ezetimibe pre-IMPROVE-IT debates)
• Equity and generalizability:
— If trial enrolled mostly one demographic, p-value applies to that population
— Step 3 emphasizes assessing whether trial population matches your patient
Board pearl: A trial showing a "significant" benefit in a subgroup (e.g., women, diabetics) when the overall trial was neutral should prompt the answer "hypothesis-generating, requires confirmatory trial" — not adoption. This is one of the most reliably tested principles in Step 3 biostatistics vignettes.
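The point that subgroups should be compared with an interaction test, not by lining up separate p-values, can be illustrated with a simple z-test on the difference between two log hazard ratios. This sketch assumes two independent subgroups and Wald-type 95% intervals; the subgroup results are hypothetical and the function names are illustrative.

```python
from math import log, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def log_ratio_and_se(ratio: float, lower: float, upper: float) -> tuple[float, float]:
    """Log ratio and its standard error recovered from a 95% CI."""
    return log(ratio), (log(upper) - log(lower)) / (2 * Z95)


def interaction_p(subgroup_a: tuple, subgroup_b: tuple) -> float:
    """Two-sided p-value for the difference between two independent subgroup ratios."""
    est_a, se_a = log_ratio_and_se(*subgroup_a)
    est_b, se_b = log_ratio_and_se(*subgroup_b)
    z = (est_a - est_b) / sqrt(se_a ** 2 + se_b ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical: one subgroup looks "significant", the other does not.
    diabetics = (0.70, 0.50, 0.98)       # CI excludes 1
    non_diabetics = (0.95, 0.70, 1.29)   # CI crosses 1
    print(f"interaction p = {interaction_p(diabetics, non_diabetics):.2f}")
```

In this hypothetical pair, one subgroup's interval excludes 1 while the other's does not, yet the interaction test does not show that the effect differs between them. That is the pattern the board pearl describes as hypothesis-generating.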

Complications — Common Errors and Misuses of P-values

• Top misinterpretations to recognize and reject:
— "p-value is the probability the null is true": WRONG. It is P(data | H₀), not P(H₀ | data).
— "p>0.05 means no effect": WRONG. Absence of evidence ≠ evidence of absence; may be underpowered.
— "p<0.05 means clinically important": WRONG. Statistical ≠ clinical significance.
— "Smaller p = larger effect": WRONG. p reflects effect size, variability, AND sample size combined.
— "p<0.05 means the result will replicate": WRONG. Replication probability depends on power and prior plausibility.
• P-hacking and HARKing:
— P-hacking: trying many analyses until one yields p<0.05
— HARKing: Hypothesizing After Results are Known — recasting exploratory finding as primary
— Both inflate false-positive rates dramatically
• Publication bias:
— "Positive" trials (p<0.05) more likely published
— Meta-analyses must search for unpublished data; funnel plots and Egger test detect asymmetry
• Garden of forking paths:
— Multiple defensible analytic choices (covariate selection, outcome definition, cutoffs) inflate type I error even without explicit p-hacking
• Misuse of "trend toward significance":
— p=0.06 and p=0.04 are nearly identical evidence — the 0.05 threshold is arbitrary
— Avoid dichotomizing; report the actual p and CI
• Confounding the p-value with clinical decision:
— Even a robust p<0.001 does not override patient preferences, comorbidities, costs, or harms
Key distinction: The p-value is a decision aid about the null hypothesis, not a measure of truth, importance, or replicability. Step 3 distractors often phrase p-values as if they answered questions they do not — always reject the option that states "the p-value is the probability the treatment works."

When to Escalate — Statistical Consultation and Study Design Help

• When a biostatistician should be involved (CCS-style "consult"):
— Study design phase: sample size calculation, randomization scheme, primary endpoint selection
— Complex data structures: longitudinal, clustered, missing-not-at-random
— Survival analyses, competing risks
— Adaptive or Bayesian trial designs
— Multiple comparisons strategies
— Interim analyses with formal stopping rules
• When to pause and not interpret a single p-value as decisive:
— Single-center small trial with surprising "significant" finding
— Subgroup-only significance in an overall-negative trial
— Industry-sponsored trial with multiple endpoints and one "winner"
— Observational study with unmeasured confounding
— Surrogate endpoint result without hard-outcome confirmation
• When to demand replication before practice change:
— Novel mechanism, no prior supporting evidence
— Effect size implausibly large given biology
— Single trial despite multiple prior negative trials
— Post-hoc or exploratory analyses
• IRB / regulatory escalation:
— DSMB triggers: futility, harm, overwhelming efficacy at interim
— Early stopping inflates effect size estimates ("regression to the mean" upon replication)
• Quality improvement context:
— Run charts and statistical process control (SPC) often preferred over p-values
— Special-cause variation detected by control rules, not t-tests
Step 3 management: In a journal club or QI scenario, the correct "escalation" is often (1) consult biostatistics for proper analytic plan, (2) require pre-specification, and (3) await replication. The wrong answer is "implement now because p<0.05." Recognize that biostatistics consultation, like ID or cardiology consult, is a legitimate management step in evidence-based practice questions.

Key Differentials — P-value vs Other Inferential Statistics

• Same-category "differentials" — what else describes evidence?
— Confidence interval (95% CI):
• Range of plausible values for the true parameter
• Conveys precision and magnitude; preferred over p alone
• CI excludes null ↔ p<0.05
— Effect size measures:
• Cohen's d for continuous outcomes (0.2 small, 0.5 medium, 0.8 large)
• RR, OR, HR for ratios; ARR for absolute
• Independent of sample size — unlike p
— NNT / NNH:
• Absolute, patient-centered translations of effect size
• Smaller NNT = more efficient therapy
— Likelihood ratio (LR+ / LR−):
• Diagnostic test performance
• Updates pre-test to post-test probability via Bayes' theorem
• Unrelated to hypothesis-test p-values
— Bayes factor:
• Ratio of evidence favoring H₁ vs H₀
• Directly addresses what p-values cannot: relative support for hypotheses
— Posterior probability:
• Bayesian P(H | data), the quantity clinicians intuitively want
• Hierarchy of evidence integration:
— Effect size + CI > p-value alone
— Pre-specified primary endpoint > secondary/subgroup
— Replicated finding > single trial
— Meta-analysis (with low heterogeneity, I²<50%) > individual trials
• Forest plots and meta-analytic interpretation:
— Pooled estimate with CI; diamond width = precision
— Heterogeneity (I², Cochran Q) assesses consistency across trials
— Random-effects model when heterogeneity present
Board pearl: When a question lists "p<0.05" alongside CI, effect size, and NNT, the CI and NNT are typically the correct answers for clinical decision-making. The p-value is a screening filter; the CI and NNT are the diagnostic and management tools.
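The meta-analytic quantities above (pooled estimate, its CI, Cochran Q, I²) can be sketched with inverse-variance fixed-effect pooling on the log scale. The three trial results are hypothetical, `pool_fixed_effect` is an illustrative helper name, and a fixed-effect model is used only for simplicity; as noted above, a random-effects model is preferred when heterogeneity is present.

```python
from math import exp, log, sqrt


def pool_fixed_effect(trials: list[tuple[float, float, float]]) -> dict:
    """Pool (ratio, lower95, upper95) results by inverse-variance weighting."""
    log_ratios, weights = [], []
    for ratio, lower, upper in trials:
        se = (log(upper) - log(lower)) / (2 * 1.96)   # SE recovered from the 95% CI
        log_ratios.append(log(ratio))
        weights.append(1 / se ** 2)                   # precise trials weigh more
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_ratios)) / total_weight
    se_pooled = sqrt(1 / total_weight)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ratios))  # Cochran Q
    df = len(trials) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "ratio": exp(pooled),
        "lower": exp(pooled - 1.96 * se_pooled),
        "upper": exp(pooled + 1.96 * se_pooled),
        "Q": q,
        "I2_percent": i_squared,
    }


if __name__ == "__main__":
    # Three hypothetical trials, each individually "non-significant".
    trials = [(0.80, 0.60, 1.07), (0.75, 0.55, 1.02), (0.85, 0.70, 1.03)]
    for key, value in pool_fixed_effect(trials).items():
        print(f"{key}: {value:.3f}")
```

The pooled interval is always narrower than that of any single contributing trial, which is the sense in which meta-analysis supplies precision that no individual p-value can.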

Key Differentials — P-value vs Other Statistical Concepts (Cross-Category)

• P-value vs alpha (α):
— α = pre-specified threshold (decision rule)
— p = observed result (data summary)
— Reject H₀ when p < α
• P-value vs power (1 − β):
— Power is prospective (study design); p is retrospective (study result)
— Low power → high false-negative risk; doesn't change p interpretation but changes inference from non-significance
• P-value vs type I and type II error:
— Type I (α): false positive — reject true H₀
— Type II (β): false negative — fail to reject false H₀
— p relates to type I error rate only if H₀ is true
• P-value vs prevalence and predictive values:
— Sensitivity/specificity are properties of a test; PPV/NPV depend on prevalence
— Analogously: a "significant" p depends on prior plausibility — low-prior, low-power studies yield high false-discovery rates even at p<0.05 (Ioannidis)
• P-value vs incidence/relative risk:
— RR/OR/HR are effect sizes; p tests whether they differ from null
— A trial can report RR=2.0 with p=0.3 (small trial, imprecise estimate) or RR=1.05 with p<0.001 (huge trial, trivial effect)
• P-value vs correlation coefficient (r):
— r measures strength and direction (−1 to +1)
— p tests whether r differs from 0
— High n yields significant p for trivially small r
• Statistical vs causal inference:
— A significant p does not establish causation
— Causation requires design (randomization), Bradford Hill criteria, or counterfactual frameworks
Key distinction: The p-value answers a narrow question — "how surprising is this data under the null?" It is not interchangeable with effect size, power, predictive value, or causation. Step 3 stems test whether you can identify the right statistical quantity for the question being asked.
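The analogy between predictive values and "significant" findings can be made numeric. Treating prior plausibility like prevalence, power like sensitivity, and α like the false-positive rate gives the probability that a significant result reflects a true effect. The sketch below ignores bias and multiple testing; the priors and power values are hypothetical, and `ppv_of_significant_result` is an illustrative helper name.

```python
def ppv_of_significant_result(prior: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a 'significant' finding reflects a true effect."""
    true_positives = power * prior           # real effects that reach significance
    false_positives = alpha * (1 - prior)    # null effects that reach significance
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    scenarios = [
        ("plausible hypothesis, well powered", 0.50, 0.80),
        ("long-shot hypothesis, well powered", 0.10, 0.80),
        ("long-shot hypothesis, underpowered", 0.10, 0.20),
    ]
    for label, prior, power in scenarios:
        print(f"{label}: PPV = {ppv_of_significant_result(prior, power):.2f}")
```

Every scenario uses the same α, so each would report "p<0.05", yet the chance that the finding is real falls sharply as prior plausibility and power drop.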

Secondary Prevention — Building Lifelong Habits of Critical Appraisal

• Long-term plan for evidence-based practice (analogous to "discharge medications"):
— 1. Always read the CI before the p-value
• Magnitude and precision come first
— 2. Demand pre-specification
• Primary endpoint, analytic plan, subgroups defined a priori
— 3. Translate to absolute terms
• ARR, NNT, NNH per patient encounter
— 4. Check the trial population
• Does your patient resemble enrolled participants?
— 5. Seek replication and meta-analysis
• Single-trial results are tentative
— 6. Account for harms with same rigor as benefits
• Underpowered safety analyses are not reassurance
• Critical appraisal frameworks:
— CONSORT for RCTs
— STROBE for observational studies
— PRISMA for systematic reviews/meta-analyses
— GRADE for evidence quality ratings (high → very low)
• Routine "follow-up" questions to ask of any p<0.05:
— What was the effect size?
— How wide is the CI?
— Was this the primary endpoint?
— How many comparisons were made?
— Has it been replicated?
— Is the effect clinically meaningful (MCID, NNT)?
— Do harms outweigh benefits?
• Communicating uncertainty to patients:
— Use absolute risks and natural frequencies ("1 in 50" rather than "RR 0.85")
— Decision aids improve shared decision-making
Step 3 management: Long-term practice "prevention" against statistical misinterpretation involves institutional habits — journal clubs, EBM rounds, pre-registration culture, and biostatistics partnerships. The exam rewards the physician who treats every p-value as one data point among many, not a verdict.

Follow-Up, Monitoring, and EBM Skill Maintenance

• Ongoing competencies (ABIM/MOC-relevant):
— Critical appraisal of trials in your specialty
— Familiarity with key landmark trials and their effect sizes (not just "positive/negative")
— Recognition of common statistical fallacies
— Comfort with absolute vs relative risk communication
• Monitoring your interpretive skills:
— Journal club participation — verbalize interpretations
— Use structured appraisal tools (CASP, JAMA Users' Guides)
— Cross-check secondary sources (Cochrane, UpToDate, DynaMed) against primary trials
• Cadence of evidence review:
— Specialty guideline updates: every 1–5 years
— Major trial readouts: track via AHA/ACC, ASCO, ADA, NEJM, JAMA, Lancet
— Meta-analyses and Cochrane reviews: revisit annually for high-volume conditions
• Patient counseling parameters tied to statistics:
— Disclose absolute benefits and harms with NNT/NNH
— Acknowledge uncertainty when CIs are wide
— Discuss alternative options including no treatment
• Quality improvement monitoring:
— SPC charts to detect process change without p-value abuse
— Small-sample QI initiatives often misuse t-tests; prefer run-chart rules
• Self-rehab from "p-value reflex":
— Force yourself to state the CI and NNT before commenting on p
— Reject the "significant = important" conflation in your own speech
• Teaching and trainee oversight:
— Model proper statistical reasoning to residents and students
— Correct gently when "p<0.05 = it works" appears in case presentations
Board pearl: Step 3 expects practicing physicians to maintain lifelong evidence-based practice habits, not just memorize trial results. Questions about journal club, MOC, and CME often hinge on whether you can identify a flawed inference and propose a structured appraisal approach.

Ethical, Legal, and Patient Safety Considerations

• Informed consent and statistical communication:
— Patients have the right to understand absolute risks and benefits, not just relative
— Quoting "50% reduction" without absolute numbers (e.g., 2% → 1%) is potentially misleading
— Consent for research participation must include realistic disclosure of likely benefit, equipoise, and uncertainty
• Research ethics tied to p-values:
— Selective outcome reporting (publishing only significant findings) violates scientific integrity
— P-hacking and HARKing are research misconduct when intentional
— Trial pre-registration in a public registry (e.g., ClinicalTrials.gov) is required by ICMJE for publication
— DSMBs uphold patient safety by stopping trials early for harm or overwhelming efficacy
• Conflicts of interest:
— Industry-funded trials are not inherently invalid but require transparency
— Disclosure required in publications and at point of care
— Marketing materials emphasizing relative risk reductions without absolute context warrant skepticism
• Patient safety — transition of care and statistical literacy:
— A physician handing off care must communicate uncertainty in evidence, not present therapies as definitively proven when based on a single underpowered trial
— Quality dashboards reporting "significant improvement" in readmissions or infection rates can mislead if multiple comparisons or small n are involved — verify with SPC methodology
• Mandatory reporting and surveillance:
— Adverse event signals (e.g., post-marketing pharmacovigilance) often rely on observational data with wide CIs; act on signals while acknowledging uncertainty
— FDA MedWatch reporting is voluntary for clinicians and mandatory for manufacturers; report serious unexpected events regardless of "statistical significance"
• Equity considerations:
— Trials excluding women, elderly, minorities yield p-values that may not generalize
— Ethical practice requires explicit acknowledgment of evidence gaps for underrepresented groups
Step 3 management: When a drug rep, QI report, or trial summary presents a "significant" result, the ethically appropriate response is to (1) request absolute numbers, (2) ask about pre-specification and replication, and (3) communicate uncertainty transparently to patients during informed consent.

High-Yield Associations and Rapid-Fire Clinical Facts

• Definitional anchors:
— p-value = P(data ≥ observed | H₀ true)
— α = pre-specified type I error threshold (typically 0.05)
— β = type II error; power = 1 − β (target ≥0.80)
• CI–p shortcuts:
— 95% CI excludes null → p < 0.05
— 95% CI for difference: null = 0
— 95% CI for ratio (RR/OR/HR): null = 1
• Sample size effects:
— Doubling n shrinks SE by √2 → narrower CI, smaller p
— Huge n can make trivial effects "significant"
— Small n can miss large effects ("non-significant")
• Multiple comparisons:
— Bonferroni: α/k
— 20 tests at α=0.05 → ~1 false positive expected
• Test selection cheat sheet:
— 2 means, normal: t-test
— ≥3 means: ANOVA
— Non-normal: Wilcoxon/Mann–Whitney/Kruskal–Wallis
— Categorical: chi-square (Fisher exact test if expected cell counts are small, conventionally <5)
— Paired categorical: McNemar
— Survival: log-rank, Cox
• Common pitfalls:
— p>0.05 ≠ no effect
— p<0.05 ≠ important effect
— Significant subgroup in negative trial = chance until replicated
— "Trend toward significance" = not significant
• Vocabulary:
— Type I error: false positive (α)
— Type II error: false negative (β)
— Power: detect true effect (1 − β)
— Effect size: magnitude (independent of n)
— MCID: smallest patient-meaningful change
— NNT = 1/ARR; NNH = 1/ARI
• Frameworks:
— CONSORT (RCT), STROBE (observational), PRISMA (systematic review/meta-analysis), GRADE (certainty of evidence)
Board pearl: If forced to choose one number to report from a trial, choose the 95% CI for the effect size, not the p-value. The CI shows whether the result is significant at α = 0.05 (does it exclude the null?) plus magnitude and precision, and Step 3 correct answers reliably reward this preference.
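The CI–p shortcut and the sample-size effects listed above can be checked numerically. The sketch below is illustrative only: it assumes a two-arm comparison of means with a known, equal SD and a normal approximation, and the function name `z_test_difference` and the input values (0.3 mmHg difference, SD 10 mmHg) are hypothetical choices, not data from any trial.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()           # standard normal distribution
Z95 = STD_NORMAL.inv_cdf(0.975)     # critical value for a two-sided 95% CI (about 1.96)

def z_test_difference(diff, sd, n_per_arm):
    """Return (SE, two-sided p, 95% CI) for a difference in means.

    Assumption: equal arm sizes and a known, equal SD in both arms,
    so SE = sd * sqrt(2 / n_per_arm).
    """
    se = sd * sqrt(2 / n_per_arm)
    z = diff / se
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    ci = (diff - Z95 * se, diff + Z95 * se)
    return se, p, ci

# Same trivial 0.3 mmHg effect, two very different sample sizes (hypothetical).
se_big, p_big, ci_big = z_test_difference(0.3, 10, 25_000)    # 50,000 patients total
se_small, p_small, ci_small = z_test_difference(0.3, 10, 250)  # 500 patients total

# Doubling n shrinks SE by a factor of sqrt(2).
se_doubled, _, _ = z_test_difference(0.3, 10, 50_000)

print(f"large trial: p={p_big:.4f}, CI=({ci_big[0]:.2f}, {ci_big[1]:.2f})")
print(f"small trial: p={p_small:.2f}, CI=({ci_small[0]:.2f}, {ci_small[1]:.2f})")
print(f"SE ratio after doubling n: {se_big / se_doubled:.3f}")
```

In the large trial the CI excludes 0 and p falls below 0.05 together, yet the whole interval sits well under 1 mmHg, which is the "significant but trivial" picture. In the small trial the identical effect is "non-significant" with a wide CI.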

Board Question Stem Patterns

• Pattern 1: "Statistically significant but trivial"
— Stem: 50,000-patient trial shows 0.3 mmHg SBP reduction, p<0.001
— Trap: "Adopt new therapy"
— Correct: Effect too small to be clinically meaningful; do not change practice
• Pattern 2: "Underpowered negative trial"
— Stem: 40-patient pilot RCT, 25% relative mortality reduction (RR 0.75), p=0.18, 95% CI 0.50–1.15
— Trap: "Therapy is ineffective"
— Correct: Inconclusive; possibly effective; requires larger trial
• Pattern 3: "Subgroup analysis trap"
— Stem: Overall trial neutral; one of 15 subgroups (e.g., diabetics) shows p=0.03 benefit
— Trap: "Recommend therapy for diabetics"
— Correct: Hypothesis-generating only; chance likely; needs confirmatory trial
• Pattern 4: "Multiple comparisons inflation"
— Stem: Genome-wide study tests 10,000 SNPs, finds one with p=0.001
— Trap: "Strong association"
— Correct: Expected ~10 such findings by chance (10,000 × 0.001); apply Bonferroni or false discovery rate (FDR) correction
• Pattern 5: "Wrong test"
— Stem: Investigators use chi-square on continuous outcome data
— Correct: Should use t-test or non-parametric equivalent
• Pattern 6: "Misinterpreting p as P(H₀)"
— Trap option: "p=0.03 means a 3% chance the null is true"
— Correct: p = P(data | H₀); does NOT give P(H₀ | data)
• Pattern 7: "Surrogate endpoint hype"
— Stem: Drug lowers LDL with p<0.001, but mortality unchanged
— Correct: Surrogate benefit ≠ clinical benefit
• Pattern 8: "Early stopping inflates effect"
— Stem: Trial stopped early for efficacy at interim
— Correct: Effect size likely overestimated; interpret cautiously and await replication
• Pattern 9: "Drug rep / shared decision"
— Correct: Translate to absolute risk, NNT/NNH; discuss uncertainty
Step 3 management: When stems pair a p-value with a CI, effect size, or NNT, prioritize the clinically translated answer choice over the one that simply restates statistical significance. The exam consistently rewards integration over reflex.
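The arithmetic behind Patterns 2, 3, and 4 can be sketched as follows. This is a rough illustration, not a substitute for a formal analysis: it assumes independent tests with all null hypotheses true, a normal approximation on the log scale for ratio measures, and that "25% mortality reduction" in Pattern 2 corresponds to RR = 0.75. All function names are hypothetical.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z95 = STD_NORMAL.inv_cdf(0.975)  # about 1.96

def p_from_ratio_ci(point, lower, upper):
    """Approximate two-sided p-value for a ratio (RR/OR/HR) from its 95% CI.

    Works on the log scale, where the CI is roughly symmetric
    and the null value of 1 maps to 0.
    """
    se = (log(upper) - log(lower)) / (2 * Z95)
    z = log(point) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def familywise_error(k, alpha=0.05):
    """P(at least one false positive) across k independent tests, all nulls true."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(k, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction (alpha / k)."""
    return alpha / k

# Pattern 2: RR 0.75 (assumed), 95% CI 0.50-1.15 -> p of about 0.18
p_pattern2 = p_from_ratio_ci(0.75, 0.50, 1.15)

# Pattern 3: 15 subgroups tested at alpha = 0.05 -> about 0.54
fwer_15 = familywise_error(15)

# Pattern 4: 10,000 SNPs, each with a 0.001 chance of p <= 0.001 under the null
expected_hits = 10_000 * 0.001
snp_threshold = bonferroni_threshold(10_000)

print(f"Pattern 2 p-value: {p_pattern2:.2f}")
print(f"Pattern 3 chance of >=1 false positive: {fwer_15:.2f}")
print(f"Pattern 4 expected chance hits: {expected_hits:.0f}, "
      f"Bonferroni threshold: {snp_threshold:.0e}")
```

The Pattern 2 result shows that the stem's CI and p-value carry the same significance information. The Pattern 3 figure shows why a lone "significant" subgroup among 15 is closer to a coin flip than to evidence.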

One-Line Recap

A p-value is the probability of observing data as extreme as or more extreme than ours, assuming the null hypothesis is true, and nothing more; it must always be interpreted alongside effect size, confidence interval, pre-specification, sample size, and clinical importance before it can inform any patient care decision.
• Core recap bullets:
— What p IS: P(data | H₀); a measure of compatibility with the null
— What p IS NOT: probability the null is true, probability the treatment works, measure of effect size, or guarantee of replication
• Decision framework:
— Always read the 95% CI before the p-value — it captures magnitude, direction, and precision
— Translate findings into absolute risk reduction, NNT, and NNH for patient-centered decisions
— Confirm the result is from a pre-specified primary endpoint, not a post-hoc or subgroup analysis
— Apply multiple-comparison corrections when many tests are performed
• Pitfalls to reject on exam day:
— "p<0.05 means clinically important" — FALSE
— "p>0.05 means no effect" — FALSE; may be underpowered
— "Significant subgroup in a negative trial changes practice" — FALSE; hypothesis-generating
— "Smaller p means larger effect" — FALSE; p conflates effect size, variability, and n
• Best practice habits:
— Demand effect size + CI + p as a trio
— Require pre-registration and replication before adopting novel therapies
— Communicate uncertainty honestly during informed consent and shared decision-making
Board pearl: The single most reliable Step 3 instinct in any biostatistics vignette is to distrust isolated p-values and demand the confidence interval, effect size, and clinical context — every time, without exception.
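The "translate into absolute risk reduction and NNT" step of the decision framework can be sketched as below. The event counts are hypothetical, the function name `risk_summary` is illustrative, and rounding NNT up to the next whole patient is a common convention assumed here.

```python
from math import ceil

def risk_summary(control_events, control_n, treated_events, treated_n):
    """Translate raw trial counts into absolute and relative measures.

    Returns control/treated event rates, ARR, RRR, and NNT (1/ARR, rounded up).
    NNT is None when the treatment shows no absolute benefit.
    """
    cer = control_events / control_n   # control event rate
    eer = treated_events / treated_n   # experimental (treated) event rate
    arr = cer - eer                    # absolute risk reduction
    rrr = arr / cer                    # relative risk reduction
    # The small tolerance guards against floating-point noise before rounding up.
    nnt = ceil(1 / arr - 1e-9) if arr > 0 else None
    return {"CER": cer, "EER": eer, "ARR": arr, "RRR": rrr, "NNT": nnt}

# Hypothetical trial: 20/1000 events on control vs 10/1000 on treatment.
summary = risk_summary(20, 1000, 10, 1000)

print(f"RRR = {summary['RRR']:.0%}, ARR = {summary['ARR']:.1%}, NNT = {summary['NNT']}")
```

With these made-up counts, a "50% relative risk reduction" corresponds to a 1% absolute reduction and an NNT of 100, which is the gap between a marketing headline and the absolute numbers a patient needs.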