Eduovisual

Biostatistics & Population Health

P-values: interpretation and limitations

Clinical Overview and When to Suspect Misinterpretation of P-values
Definition: The p-value is the probability of observing data as extreme or more extreme than what was observed, assuming the null hypothesis (H₀) is true.
— It is a conditional probability: P(data | H₀), not P(H₀ | data)
— It does not tell you the probability the null hypothesis is true or false
— It does not measure effect size or clinical importance
When p-value misinterpretation drives wrong answers on Step 3:
— A "statistically significant" result (p<0.05) is assumed to be clinically meaningful — it may not be
— A "non-significant" result (p>0.05) is interpreted as "no effect" — absence of evidence ≠ evidence of absence
— Multiple comparisons inflate false-positive rates without correction
— A small p-value from a huge sample may reflect a trivial effect size
— A large p-value from an underpowered study may mask a real effect
Conceptual anchors:
α (alpha): pre-specified threshold for type I error, conventionally 0.05
Type I error: rejecting a true H₀ (false positive)
Type II error (β): failing to reject a false H₀ (false negative)
Power = 1 − β: probability of detecting a true effect; conventionally ≥0.80
Clinical scenario triggers on the exam:
— A trial reports p=0.04 for a 0.2 mmHg BP reduction in 50,000 patients → significant but clinically irrelevant
— A small RCT shows 30% mortality reduction with p=0.18 → may be a true effect, underpowered
— Subgroup analyses show one "significant" finding among 20 tested → likely chance
Board pearl: The p-value answers "how surprising is this data if the null is true?" — it never answers "is the treatment worth using?" That second question requires effect size, confidence intervals, NNT, and clinical context. Step 3 stems frequently pair a small p-value with a clinically trivial effect to test whether you conflate statistical with clinical significance.
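The dependence of p on sample size can be sketched numerically. The snippet below is a minimal illustration, not a full analysis: it uses a normal approximation for a difference in means, and the 10 mmHg standard deviation, the arm sizes, and the function name `two_sided_p` are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_arm):
    """Two-sided p-value for a difference in means (normal approximation).

    Assumes two equal-sized arms sharing the same standard deviation.
    """
    se = sd * sqrt(2 / n_per_arm)   # standard error of the difference
    z = mean_diff / se              # test statistic
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Trivial 0.2 mmHg effect, hypothetical SD of 10 mmHg, 25,000 patients per arm.
print("trivial effect, huge n:", two_sided_p(0.2, 10, 25_000))

# Large 5 mmHg effect, same hypothetical SD, only 10 patients per arm.
print("large effect, tiny n:  ", two_sided_p(5.0, 10, 10))
```

Under these assumed inputs the trivial effect falls below 0.05 and the large effect does not, which is the pattern the exam stems above are built on.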
Presentation Patterns and Key History — How P-values Appear in Exam Stems

Classic vignette frames:
— Journal club scenario: resident summarizes a new RCT; attending asks what the p-value means
— Drug rep scenario: pharmaceutical representative claims "statistically significant benefit" — you must interpret critically
— Quality improvement: hospital reports a "significant" reduction in readmissions after intervention
— Screening or diagnostic test studies reporting p-values for sensitivity/specificity comparisons
Key "history" elements to extract from a study description:
Study design: RCT, cohort, case-control, cross-sectional — affects which test and which p-value matters
Sample size (n): huge n inflates statistical significance; tiny n underpowers detection
Pre-specified vs post-hoc analyses: post-hoc and subgroup p-values are hypothesis-generating, not confirmatory
Number of comparisons: multiple testing without correction multiplies false positives
Primary vs secondary endpoint: only the pre-specified primary endpoint carries the trial's stated α
Effect size and confidence interval: always evaluate alongside p
Red-flag phrases in stems:
— "Trend toward significance" (p=0.06–0.10) — not statistically significant by convention
— "Significant in a subgroup analysis" — likely chance unless pre-specified and adjusted
— "p<0.05 in at least one of 20 outcomes" — expect ~1 false positive by chance alone
— "Borderline significant" — not a real statistical category
What the stem expects you to ask:
— Was α pre-specified?
— Was the study powered for this outcome?
— Is the confidence interval narrow or wide?
— Does the magnitude of effect matter clinically?
Key distinction: A primary endpoint p-value tests the trial's main hypothesis at the stated α; secondary endpoint p-values are exploratory unless a hierarchical testing strategy was pre-specified. Step 3 will reward candidates who refuse to treat a "significant" secondary or subgroup finding as practice-changing without replication.
Physical Exam Findings — The "Exam" of a P-value (Anatomy of a Statistical Result)

What to "inspect" when a p-value is reported:
The point estimate: the observed effect (relative risk, odds ratio, mean difference, hazard ratio)
The confidence interval (CI): range of plausible values; 95% CI is the standard companion to p<0.05
The test used: t-test, chi-square, Fisher exact, log-rank, ANOVA, regression coefficient
One- vs two-tailed test: two-tailed is standard; one-tailed halves the p-value and is rarely justified
The CI–p relationship (must memorize):
— If the 95% CI for a difference excludes 0, then p < 0.05
— If the 95% CI for a ratio (RR, OR, HR) excludes 1, then p < 0.05
— If the CI crosses the null value (0 or 1), p ≥ 0.05
— A narrow CI = precise estimate (often large n); wide CI = imprecise (often small n or rare event)
"Hemodynamic" assessment — is the result stable?
— Effect size large + narrow CI + small p = robust, clinically meaningful
— Effect size large + wide CI + small p = real but imprecise (need replication)
— Effect size tiny + narrow CI + tiny p = statistically significant but clinically trivial (massive n)
— Effect size large + wide CI + p=0.08 = possibly real but underpowered — do not dismiss
Common exam "findings":
— RR 1.02, 95% CI 1.01–1.03, p<0.001 → significant but trivial; n is huge
— RR 0.55, 95% CI 0.28–1.08, p=0.07 → suggestive but CI crosses 1
— HR 0.70, 95% CI 0.55–0.89, p=0.003 → meaningful and significant
Board pearl: Always read the confidence interval before the p-value. The CI conveys magnitude, direction, and precision in one glance; the p-value alone discards all three. Step 3 stems frequently provide both — choose the answer that integrates CI width and clinical relevance, not the one that simply parrots "p<0.05 means it works."
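The link between a ratio's confidence interval and its p-value can be made concrete. The sketch below back-calculates an approximate two-sided p-value from a reported ratio and its 95% CI, assuming the interval is a Wald-type interval that is symmetric on the log scale. That assumption is mine, not something a given paper guarantees, so recovered p-values will only roughly match published ones. The function name `p_from_ratio_ci` is hypothetical.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_ci(estimate, lower, upper, level=0.95):
    """Approximate two-sided p-value for H0: ratio = 1, from a ratio and its CI.

    Assumes a Wald interval computed on the log scale.
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    se = (log(upper) - log(lower)) / (2 * z_crit)        # SE of log(ratio)
    z = log(estimate) / se                               # distance from null (log 1 = 0)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The three stems listed above: the CI excludes 1 exactly when p < 0.05.
for est, lo, hi in [(1.02, 1.01, 1.03), (0.55, 0.28, 1.08), (0.70, 0.55, 0.89)]:
    p = p_from_ratio_ci(est, lo, hi)
    print(f"ratio {est}, CI {lo} to {hi}: p is about {p:.4f}, CI excludes 1: {not lo <= 1 <= hi}")
```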
Diagnostic Workup — Calculating and Setting Up a P-value

Step-by-step "workup" of a hypothesis test:
1. State H₀ and H₁: H₀ typically = no difference / no association; H₁ = the alternative
2. Set α: conventionally 0.05 (two-tailed)
3. Choose the appropriate test based on data type and design
4. Calculate the test statistic (t, z, χ², F, log-rank)
5. Convert to p-value using the test distribution
6. Compare p to α: if p < α, reject H₀
Choosing the right test (high-yield):
Two means, continuous, normal: Student t-test (independent or paired)
>2 means, continuous: ANOVA
Non-normal continuous or ordinal: Wilcoxon rank-sum, Mann–Whitney, Kruskal–Wallis
Categorical, large expected counts: chi-square (χ²)
Categorical, small expected counts (<5): Fisher exact test
Time-to-event: log-rank test (compares Kaplan–Meier curves)
Correlation: Pearson (parametric) or Spearman (non-parametric)
What determines p-value magnitude:
Effect size (larger difference → smaller p)
Sample size (larger n → smaller p for same effect)
Variability (smaller SD → smaller p)
— Therefore p conflates these three — you cannot reverse-engineer effect size from p alone
Sample size and power:
— Power = 1 − β, target ≥0.80
— Underpowered study + true effect → high false-negative rate
— Studies should report a pre-trial sample size calculation
Step 3 management: When evaluating a published trial or QI project result, do not stop at p<0.05. Confirm the test matched the data type, the analysis was pre-specified, and the trial was powered for the stated outcome. An "underpowered negative trial" is not evidence of no effect — it is evidence of inadequate evidence.
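The six-step workup can be traced in code. The sketch below applies it to a comparison of two event proportions with a pooled two-proportion z-test, which for a 2×2 table is equivalent to the chi-square test without continuity correction. The event counts and the function name `two_proportion_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(events_a, n_a, events_b, n_b, alpha=0.05):
    """Two-sided z-test of H0: the two event proportions are equal."""
    # Steps 1-2: H0 = no difference; alpha is fixed before looking at the data.
    p_a, p_b = events_a / n_a, events_b / n_b
    # Step 3: large counts and a binary outcome, so a z-test is reasonable.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    # Step 4: test statistic.
    z = (p_a - p_b) / se
    # Step 5: convert to a p-value using the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 6: compare p with alpha.
    return z, p_value, p_value < alpha


# Hypothetical trial: 30/200 events on treatment vs 50/200 on control.
z, p, reject = two_proportion_z_test(30, 200, 50, 200)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

Rejecting H₀ here says nothing about whether the 10-percentage-point difference matters clinically; that still requires the effect size and its CI.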
Diagnostic Workup — Advanced Concepts: Multiple Comparisons, Bayes, and Replication
Multiple comparisons problem:
— Testing 20 independent hypotheses at α=0.05 → expected 1 false positive by chance
— Family-wise error rate (FWER) grows: 1 − (0.95)ᵏ for k tests
Bonferroni correction: new α = 0.05/k (conservative)
Holm-Bonferroni, Benjamini–Hochberg (false discovery rate): less conservative alternatives
— Applies to: subgroup analyses, multiple endpoints, interim analyses, genome-wide studies
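The corrections named above can be sketched in a few lines. This is a minimal plain-Python illustration of the family-wise error rate formula and of the Bonferroni, Holm, and Benjamini–Hochberg procedures; the five p-values and all function names are hypothetical, and a real analysis would normally rely on a vetted statistics package.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / k."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values with alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first failure
    return reject


def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: controls the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank whose p-value is <= rank / m * q
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


p_vals = [0.001, 0.012, 0.02, 0.035, 0.30]   # hypothetical subgroup p-values
print("FWER for 20 tests:", round(family_wise_error_rate(20), 3))
print("Bonferroni:", bonferroni(p_vals))
print("Holm:      ", holm(p_vals))
print("BH (FDR):  ", benjamini_hochberg(p_vals))
```

On this made-up list the three procedures reject progressively more hypotheses, matching the ordering from most to least conservative described above.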
Pre-specification matters:
— Pre-specified primary endpoint with pre-stated α = confirmatory
— Post-hoc, exploratory, or data-dredged findings = hypothesis-generating only
— "p-hacking" = trying multiple tests/cutoffs until one reaches p<0.05
Bayesian reframing (conceptual):
— Frequentist p answers P(data | H₀)
— Bayesian posterior answers P(H₀ | data), requires a prior
— A small p with low prior probability of effect (rare disease, weak biology) still implies modest posterior probability — explains why many "significant" findings fail to replicate
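A rough calculation shows why prior plausibility matters. The sketch below treats "significant" as a yes/no test result and applies Bayes' theorem in the same way as for a diagnostic test, with power playing the role of sensitivity and α the false-positive rate. This is a simplification (it conditions on p<α rather than on the exact p-value), and the prior values and the function name `prob_effect_is_real` are illustrative assumptions.

```python
def prob_effect_is_real(prior, power=0.80, alpha=0.05):
    """P(true effect | result is significant), treating p < alpha as a positive test."""
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance
    return true_positives / (true_positives + false_positives)


# Hypothetical priors: a well-motivated hypothesis vs a long shot.
for prior in (0.50, 0.05):
    print(f"prior {prior:.2f} -> P(real | significant) = {prob_effect_is_real(prior):.2f}")

# Same long-shot hypothesis, but in an underpowered study.
print("prior 0.05, power 0.20:", round(prob_effect_is_real(0.05, power=0.20), 2))
```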
Replication and the "reproducibility crisis":
— Single p<0.05 result is weak evidence; replication strengthens inference
— Pre-registration and reporting all outcomes reduce publication bias
— Meta-analyses pool effect sizes and provide narrower CIs
Interim analyses and stopping rules:
— Sequential testing inflates type I error unless α is "spent" using O'Brien–Fleming or Pocock boundaries
— DSMB (Data Safety Monitoring Board) oversees early stopping for efficacy, futility, or harm
Board pearl: A "significant" subgroup finding from an otherwise negative trial is almost always chance. Step 3 will test whether you recommend changing practice based on it — the correct answer is "interpret cautiously, awaiting confirmatory pre-specified replication," not "adopt the new therapy in that subgroup."
Risk Stratification — Statistical vs Clinical Significance Decision Framework

The four-quadrant framework for interpreting any trial result:
Quadrant 1: Statistically significant + clinically meaningful
— Large effect, narrow CI, small p → adopt if benefits outweigh harms
— Example: HR 0.75 for mortality, 95% CI 0.65–0.86, p<0.001
Quadrant 2: Statistically significant + clinically trivial
— Tiny effect made "significant" by huge n → do NOT change practice
— Example: 0.5 mmHg SBP reduction, p=0.001, n=80,000
Quadrant 3: Not significant + possibly meaningful
— Large point estimate, wide CI, p=0.08 → underpowered; needs larger trial
— Do not conclude "no effect"
Quadrant 4: Not significant + trivial effect
— Small effect, narrow CI overlapping null → consistent with no meaningful effect
Minimum clinically important difference (MCID):
— Smallest change a patient perceives as beneficial
— Examples: ~10 mm on 100-mm pain VAS; ~5-point change on many QoL scales
— If the entire 95% CI lies below the MCID, the effect is clinically unimportant even when "significant"
Translating to absolute numbers patients understand:
NNT (number needed to treat) = 1 / ARR
NNH (number needed to harm) = 1 / ARI
— Relative risk reductions sound impressive; absolute risk reductions and NNT ground the decision
Equivalence and non-inferiority trials:
— A non-significant p in a superiority trial ≠ equivalence
— Equivalence/non-inferiority requires pre-specified margins and CI-based analysis
Step 3 management: When asked whether to adopt a new therapy based on a trial result, integrate (1) effect size, (2) CI width, (3) p-value, (4) NNT vs NNH, and (5) patient-centered MCID. The exam-correct answer recommends therapy when all align, not when p<0.05 alone.
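The gap between relative and absolute measures is easy to show with arithmetic. The sketch below computes ARR, RRR, and NNT from two event risks; the risks are hypothetical, the function name `risk_summary` is illustrative, and NNT is rounded up to a whole patient by convention.

```python
from math import ceil


def risk_summary(control_risk, treated_risk):
    """Return (ARR, RRR, NNT) for a treatment that lowers event risk."""
    arr = control_risk - treated_risk      # absolute risk reduction
    rrr = arr / control_risk               # relative risk reduction
    # Round first to absorb floating-point noise, then round up to whole patients.
    nnt = ceil(round(1 / arr, 6)) if arr > 0 else None
    return arr, rrr, nnt


# Both hypothetical therapies halve the risk, yet the NNT differs 20-fold.
for control, treated in [(0.02, 0.01), (0.40, 0.20)]:
    arr, rrr, nnt = risk_summary(control, treated)
    print(f"{control:.0%} -> {treated:.0%}: RRR {rrr:.0%}, ARR {arr:.1%}, NNT {nnt}")
```

NNH follows from the same calculation applied to the absolute risk increase of a harm.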
Pharmacotherapy — "First-Line" Statistical Reporting Standards

What a well-reported result must contain (and what to demand on the exam):
Point estimate (mean difference, RR, OR, HR)
95% confidence interval
Exact p-value (not just "p<0.05" or "NS")
Sample size and event counts
Pre-specified analysis plan
Common reporting conventions:
— Continuous outcomes: mean ± SD or median (IQR), with mean difference and 95% CI
— Binary outcomes: event rates per group, RR or OR with 95% CI
— Survival outcomes: HR with 95% CI, Kaplan–Meier curves, log-rank p
— Diagnostic studies: sensitivity, specificity, LR+/LR−, with CIs
Effect measures and their nulls:
— Mean difference: null = 0
— RR, OR, HR: null = 1
— Correlation r: null = 0
— If the 95% CI crosses the null → p ≥ 0.05
One-sided vs two-sided tests:
— Default: two-sided (effect could go either direction)
— One-sided is appropriate only when the opposite direction is implausible or irrelevant — rarely justified in clinical trials
— Reporting a one-sided p without justification is a red flag for p-hacking
Interaction terms and effect modification:
— A significant interaction p-value suggests the effect differs across subgroups
— Without a significant interaction, subgroup-specific point estimates should not be over-interpreted
— "Test for interaction" is the right tool, not separate subgroup p-values
Reporting harms:
— Adverse event tables often lack p-values or have low power
— A non-significant safety signal is not reassurance — it may be underpowered
Board pearl: Demand the trio: effect size, CI, and p-value. If a stem gives you only a p-value, the correct interpretive answer almost always involves acknowledging that the p alone is insufficient — pick the option that asks for the confidence interval or effect magnitude.
Procedures — Common Statistical Tests Decoded (Expanded Reference)

t-test (Student's t-test):
— Compares means of two groups, continuous outcome, approximately normal distribution
— Independent (two separate groups) vs paired (same subjects, two time points)
— Assumptions: normality, equal variance (or use Welch's correction)
ANOVA (analysis of variance):
— Compares means across ≥3 groups
— Significant overall F-test → follow with post-hoc pairwise tests (Tukey, Bonferroni)
Chi-square (χ²) test:
— Categorical data, comparing observed vs expected frequencies
— Requires expected cell counts ≥5; use Fisher exact if smaller
Fisher exact test:
— Categorical data with small samples or sparse cells
— Provides exact p-value rather than approximation
Wilcoxon / Mann–Whitney U / Kruskal–Wallis:
— Non-parametric alternatives for non-normal or ordinal data
— Compare medians/distributions rather than means
Log-rank test:
— Compares survival distributions between groups
— Paired with Kaplan–Meier curves and Cox proportional hazards regression (yields HR)
Regression:
— Linear regression: continuous outcome; coefficient with p
— Logistic regression: binary outcome; OR with p
— Cox regression: time-to-event; HR with p
— Adjusts for confounders; coefficients are interpreted "holding others constant"
McNemar test:
— Paired categorical data (e.g., before/after, matched case-control)
Pearson vs Spearman correlation:
— Pearson: linear, parametric, continuous normal
— Spearman: rank-based, non-parametric, ordinal or non-normal
Key distinction: Choice of test depends on (1) outcome data type, (2) number of groups, (3) paired vs independent, (4) distributional assumptions. A "wrong test" answer choice on Step 3 often features a t-test applied to categorical data or a chi-square applied to continuous data — eliminate these first.
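The four deciding factors can be written as a small lookup. The function below is a study aid under simplifying assumptions, not a statistical tool: the name `suggest_test` and its argument values are hypothetical, it ignores regression and correlation, and it does not handle repeated measures across three or more groups.

```python
def suggest_test(outcome, groups=2, paired=False, normal=True, small_cells=False):
    """Map outcome type, group count, pairing, and distribution to a test name."""
    if outcome == "time-to-event":
        return "log-rank test"
    if outcome == "categorical":
        if paired:
            return "McNemar test"
        return "Fisher exact test" if small_cells else "chi-square test"
    if outcome in ("continuous", "ordinal"):
        # Ordinal data are treated as non-parametric regardless of `normal`.
        parametric = outcome == "continuous" and normal
        if groups > 2:
            return "ANOVA" if parametric else "Kruskal-Wallis test"
        if paired:
            return "paired t-test" if parametric else "Wilcoxon signed-rank test"
        return "independent t-test" if parametric else "Mann-Whitney U test"
    raise ValueError(f"unknown outcome type: {outcome!r}")


print(suggest_test("continuous", groups=3))              # three means, normal
print(suggest_test("categorical", small_cells=True))     # sparse 2x2 table
print(suggest_test("continuous", normal=False))          # skewed, two groups
print(suggest_test("categorical", paired=True))          # before/after binary
```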
Special Populations — Small Samples, Rare Events, and Skewed Data

Small sample size considerations:
— Parametric tests (t-test, ANOVA) lose validity when assumptions fail with small n
— Use non-parametric tests (Wilcoxon, Mann–Whitney) for small or non-normal samples
— Use Fisher exact instead of chi-square when expected cell counts <5
— Small samples → wide CIs → high type II error risk
Rare events:
— Standard chi-square/logistic regression unstable when events <10 per variable
— Consider exact methods, Firth's penalized logistic regression, or Poisson regression with offset for person-time
— Zero events in one arm → cannot compute OR/RR directly; use continuity correction or exact CI
Skewed continuous data:
— Income, length of stay, biomarker concentrations often right-skewed
— Options: log-transform then t-test, or non-parametric test on raw data
— Report median (IQR) rather than mean ± SD
Clustered or correlated data:
— Repeated measures within patients, patients within clinics
— Standard tests assume independence — violations inflate type I error
— Use mixed-effects models, GEE (generalized estimating equations), or paired tests
Censored data (survival analysis):
— Patients lost to follow-up or event-free at study end → censored
— Use Kaplan–Meier and Cox models, not simple proportions
— Informative censoring (loss related to outcome) biases results
Renal/hepatic analogy — biostatistical "dose adjustment":
— Small samples = "reduced clearance" of statistical power; adjust by choosing exact or non-parametric tests
— Skewed data = "altered metabolism"; transform or use rank-based methods
Step 3 management: When a stem describes a study with 20 patients, rare outcomes, or markedly skewed labs, the correct analytic choice is almost always a non-parametric or exact test. A standard t-test or chi-square in these settings is a distractor.
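For small tables the exact calculation is short enough to write out. The sketch below computes a two-sided Fisher exact p-value for a 2×2 table directly from the hypergeometric distribution, summing the probabilities of every table that is no more likely than the observed one. That is one common definition of "two-sided" for this test, and others exist. The counts and the function name `fisher_exact_two_sided` are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical 16-patient pilot: 8/10 responders on drug vs 1/6 on placebo.
print(round(fisher_exact_two_sided(8, 2, 1, 5), 4))
```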
Special Populations — Pediatrics, Pragmatic Trials, and Subgroup Analyses

Pediatric trial statistics:
— Smaller eligible populations → often underpowered; use Bayesian designs or extrapolation from adult data
— Composite endpoints common to maintain power; interpret each component
— Age-stratified analyses pre-specified to detect effect modification
Pragmatic vs explanatory trials:
Explanatory (efficacy): ideal conditions, strict inclusion, internal validity — answers "can it work?"
Pragmatic (effectiveness): real-world conditions, broad inclusion, external validity — answers "does it work in practice?"
— Pragmatic trials often show smaller effect sizes; p-values must be interpreted with absolute risk reduction and NNT
Subgroup analyses — high-yield rules:
— Pre-specified, biologically plausible, limited in number, with formal interaction tests = trustworthy
— Post-hoc, numerous, no interaction test = likely chance
— A "positive" subgroup in an overall-negative trial is not practice-changing
Composite endpoints:
— Increase event rates and power but can mislead if driven by softer components (e.g., revascularization rather than mortality)
— Always examine individual components
Surrogate vs hard endpoints:
— Surrogate (LDL, HbA1c, BP) may not translate to clinical benefit
— Significant p on surrogate ≠ significant p on mortality (cf. niacin, ezetimibe pre-IMPROVE-IT debates)
Equity and generalizability:
— If trial enrolled mostly one demographic, p-value applies to that population
— Step 3 emphasizes assessing whether trial population matches your patient
Board pearl: A trial showing a "significant" benefit in a subgroup (e.g., women, diabetics) when the overall trial was neutral should prompt the answer "hypothesis-generating, requires confirmatory trial" — not adoption. This is one of the most reliably tested principles in Step 3 biostatistics vignettes.
Complications — Common Errors and Misuses of P-values
Top misinterpretations to recognize and reject:
"p-value is the probability the null is true": WRONG. It is P(data | H₀), not P(H₀ | data).
"p>0.05 means no effect": WRONG. Absence of evidence ≠ evidence of absence; may be underpowered.
"p<0.05 means clinically important": WRONG. Statistical ≠ clinical significance.
"Smaller p = larger effect": WRONG. p reflects effect size, variability, AND sample size combined.
"p<0.05 means the result will replicate": WRONG. Replication probability depends on power and prior plausibility.
P-hacking and HARKing:
P-hacking: trying many analyses until one yields p<0.05
HARKing: Hypothesizing After Results are Known — recasting exploratory finding as primary
— Both inflate false-positive rates dramatically
Publication bias:
— "Positive" trials (p<0.05) more likely published
— Meta-analyses must search for unpublished data; funnel plots and Egger test detect asymmetry
Garden of forking paths:
— Multiple defensible analytic choices (covariate selection, outcome definition, cutoffs) inflate type I error even without explicit p-hacking
Misuse of "trend toward significance":
— p=0.06 and p=0.04 are nearly identical evidence — the 0.05 threshold is arbitrary
— Avoid dichotomizing; report the actual p and CI
Confounding the p-value with clinical decision:
— Even a robust p<0.001 does not override patient preferences, comorbidities, costs, or harms
Key distinction: The p-value is a decision aid about the null hypothesis, not a measure of truth, importance, or replicability. Step 3 distractors often phrase p-values as if they answered questions they do not — always reject the option that states "the p-value is the probability the treatment works."
When to Escalate — Statistical Consultation and Study Design Help

When a biostatistician should be involved (CCS-style "consult"):
— Study design phase: sample size calculation, randomization scheme, primary endpoint selection
— Complex data structures: longitudinal, clustered, missing-not-at-random
— Survival analyses, competing risks
— Adaptive or Bayesian trial designs
— Multiple comparisons strategies
— Interim analyses with formal stopping rules
When to pause and not interpret a single p-value as decisive:
— Single-center small trial with surprising "significant" finding
— Subgroup-only significance in an overall-negative trial
— Industry-sponsored trial with multiple endpoints and one "winner"
— Observational study with unmeasured confounding
— Surrogate endpoint result without hard-outcome confirmation
When to demand replication before practice change:
— Novel mechanism, no prior supporting evidence
— Effect size implausibly large given biology
— Single trial despite multiple prior negative trials
— Post-hoc or exploratory analyses
IRB / regulatory escalation:
— DSMB triggers: futility, harm, overwhelming efficacy at interim
— Early stopping inflates effect size estimates ("regression to the mean" upon replication)
Quality improvement context:
— Run charts and statistical process control (SPC) often preferred over p-values
— Special-cause variation detected by control rules, not t-tests
Step 3 management: In a journal club or QI scenario, the correct "escalation" is often (1) consult biostatistics for proper analytic plan, (2) require pre-specification, and (3) await replication. The wrong answer is "implement now because p<0.05." Recognize that biostatistics consultation, like ID or cardiology consult, is a legitimate management step in evidence-based practice questions.
Key Differentials — P-value vs Other Inferential Statistics
Same-category "differentials" — what else describes evidence?
Confidence interval (95% CI):
• Range of plausible values for the true parameter
• Conveys precision and magnitude; preferred over p alone
• CI excludes null ↔ p<0.05
Effect size measures:
• Cohen's d for continuous outcomes (0.2 small, 0.5 medium, 0.8 large)
• RR, OR, HR for ratios; ARR for absolute
• Independent of sample size — unlike p
NNT / NNH:
• Absolute, patient-centered translations of effect size
• Smaller NNT = more efficient therapy
Likelihood ratio (LR+ / LR−):
• Diagnostic test performance
• Updates pre-test to post-test probability via Bayes' theorem
• Unrelated to hypothesis-test p-values
Bayes factor:
• Ratio of evidence favoring H₁ vs H₀
• Directly addresses what p-values cannot: relative support for hypotheses
Posterior probability:
• Bayesian P(H | data), the quantity clinicians intuitively want
Hierarchy of evidence integration:
— Effect size + CI > p-value alone
— Pre-specified primary endpoint > secondary/subgroup
— Replicated finding > single trial
— Meta-analysis (with low heterogeneity, I²<50%) > individual trials
Forest plots and meta-analytic interpretation:
— Pooled estimate with CI; diamond width = precision
— Heterogeneity (I², Cochran Q) assesses consistency across trials
— Random-effects model when heterogeneity present
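The pooling behind a forest plot diamond can be sketched briefly. The code below performs fixed-effect inverse-variance pooling of ratio estimates on the log scale and reports Cochran Q and I². The three trials are invented, the function name `pool_log_ratios` is hypothetical, and a random-effects model (not shown) would be the usual choice once heterogeneity is substantial.

```python
from math import exp, log, sqrt


def pool_log_ratios(estimates, std_errors):
    """Fixed-effect inverse-variance pooling of ratios (RR, OR, HR).

    `std_errors` are standard errors of the log ratios.
    Returns (pooled ratio, 95% CI, Cochran Q, I-squared in percent).
    """
    weights = [1 / se ** 2 for se in std_errors]      # precise trials weigh more
    logs = [log(e) for e in estimates]
    pooled_log = sum(w * y for w, y in zip(weights, logs)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))                # shrinks as trials are added
    q = sum(w * (y - pooled_log) ** 2 for w, y in zip(weights, logs))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    ci = (exp(pooled_log - 1.96 * pooled_se), exp(pooled_log + 1.96 * pooled_se))
    return exp(pooled_log), ci, q, i_squared


# Three hypothetical trials, each individually imprecise.
pooled, ci, q, i2 = pool_log_ratios([0.75, 0.85, 0.80], [0.20, 0.25, 0.15])
print(f"pooled ratio {pooled:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, I2 {i2:.0f}%")
```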
Board pearl: When a question lists "p<0.05" alongside CI, effect size, and NNT, the CI and NNT are typically the correct answers for clinical decision-making. The p-value is a screening filter; the CI and NNT are the diagnostic and management tools.
Key Differentials — P-value vs Other Statistical Concepts (Cross-Category)

P-value vs alpha (α):
— α = pre-specified threshold (decision rule)
— p = observed result (data summary)
— Reject H₀ when p < α
P-value vs power (1 − β):
— Power is prospective (study design); p is retrospective (study result)
— Low power → high false-negative risk; doesn't change p interpretation but changes inference from non-significance
P-value vs type I and type II error:
— Type I (α): false positive — reject true H₀
— Type II (β): false negative — fail to reject false H₀
— p relates to type I error rate only if H₀ is true
P-value vs prevalence and predictive values:
— Sensitivity/specificity are properties of a test; PPV/NPV depend on prevalence
— Analogously: a "significant" p depends on prior plausibility — low-prior, low-power studies yield high false-discovery rates even at p<0.05 (Ioannidis)
P-value vs incidence/relative risk:
— RR/OR/HR are effect sizes; p tests whether they differ from null
— A trial can report RR=2.0 with p=0.3 (small study, imprecise estimate) or RR=1.05 with p<0.001 (huge study, trivial effect)
P-value vs correlation coefficient (r):
— r measures strength and direction (−1 to +1)
— p tests whether r differs from 0
— High n yields significant p for trivially small r
Statistical vs causal inference:
— A significant p does not establish causation
— Causation requires design (randomization), Bradford Hill criteria, or counterfactual frameworks
Key distinction: The p-value answers a narrow question — "how surprising is this data under the null?" It is not interchangeable with effect size, power, predictive value, or causation. Step 3 stems test whether you can identify the right statistical quantity for the question being asked.
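Because power is set at the design stage, it is usually expressed as a sample size calculation. The sketch below uses the standard normal-approximation formula for comparing two means, n per arm = 2(z₁₋α/₂ + z₁₋β)²σ²/δ². The 5-unit difference and SD of 10 are hypothetical inputs, the function name `n_per_arm` is illustrative, and t-based or software-derived answers will be slightly larger.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Patients per arm to detect a mean difference `delta` (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical design: detect a 5-unit difference when the SD is 10.
print(n_per_arm(5, 10))
# Halving the detectable difference roughly quadruples the required n.
print(n_per_arm(2.5, 10))
# Asking for 90% power instead of 80% also raises n.
print(n_per_arm(5, 10, power=0.90))
```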
Secondary Prevention — Building Lifelong Habits of Critical Appraisal

Long-term plan for evidence-based practice (analogous to "discharge medications"):
1. Always read the CI before the p-value: magnitude and precision come first
2. Demand pre-specification: primary endpoint, analytic plan, subgroups defined a priori
3. Translate to absolute terms: ARR, NNT, NNH per patient encounter
4. Check the trial population: does your patient resemble enrolled participants?
5. Seek replication and meta-analysis: single-trial results are tentative
6. Account for harms with same rigor as benefits: underpowered safety analyses are not reassurance
Critical appraisal frameworks:
CONSORT for RCTs
STROBE for observational studies
PRISMA for systematic reviews/meta-analyses
GRADE for evidence quality ratings (high → very low)
Routine "follow-up" questions to ask of any p<0.05:
— What was the effect size?
— How wide is the CI?
— Was this the primary endpoint?
— How many comparisons were made?
— Has it been replicated?
— Is the effect clinically meaningful (MCID, NNT)?
— Do harms outweigh benefits?
Communicating uncertainty to patients:
— Use absolute risks and natural frequencies ("1 in 50" rather than "RR 0.85")
— Decision aids improve shared decision-making
Step 3 management: Long-term practice "prevention" against statistical misinterpretation involves institutional habits — journal clubs, EBM rounds, pre-registration culture, and biostatistics partnerships. The exam rewards the physician who treats every p-value as one data point among many, not a verdict.
Follow-Up, Monitoring, and EBM Skill Maintenance

Ongoing competencies (ABIM/MOC-relevant):
— Critical appraisal of trials in your specialty
— Familiarity with key landmark trials and their effect sizes (not just "positive/negative")
— Recognition of common statistical fallacies
— Comfort with absolute vs relative risk communication
Monitoring your interpretive skills:
— Journal club participation — verbalize interpretations
— Use structured appraisal tools (CASP, JAMA Users' Guides)
— Cross-check secondary sources (Cochrane, UpToDate, DynaMed) against primary trials
Cadence of evidence review:
— Specialty guideline updates: every 1–5 years
— Major trial readouts: track via AHA/ACC, ASCO, ADA, NEJM, JAMA, Lancet
— Meta-analyses and Cochrane reviews: revisit annually for high-volume conditions
Patient counseling parameters tied to statistics:
— Disclose absolute benefits and harms with NNT/NNH
— Acknowledge uncertainty when CIs are wide
— Discuss alternative options including no treatment
Quality improvement monitoring:
— SPC charts to detect process change without p-value abuse
— Small-sample QI initiatives often misuse t-tests; prefer run-chart rules
Self-rehab from "p-value reflex":
— Force yourself to state the CI and NNT before commenting on p
— Reject the "significant = important" conflation in your own speech
Teaching and trainee oversight:
— Model proper statistical reasoning to residents and students
— Correct gently when "p<0.05 = it works" appears in case presentations
Board pearl: Step 3 expects practicing physicians to maintain lifelong evidence-based practice habits, not just memorize trial results. Questions about journal club, MOC, and CME often hinge on whether you can identify a flawed inference and propose a structured appraisal approach.
Ethical, Legal, and Patient Safety Considerations

Informed consent and statistical communication:
— Patients have the right to understand absolute risks and benefits, not just relative
— Quoting "50% reduction" without absolute numbers (e.g., 2% → 1%) is potentially misleading
— Consent for research participation must include realistic disclosure of likely benefit, equipoise, and uncertainty
Research ethics tied to p-values:
Selective outcome reporting (publishing only significant findings) violates scientific integrity
P-hacking and HARKing are research misconduct when intentional
— Trial pre-registration (ClinicalTrials.gov) is required by ICMJE for publication
— DSMBs uphold patient safety by stopping trials early for harm or overwhelming efficacy
Conflicts of interest:
— Industry-funded trials are not inherently invalid but require transparency
— Disclosure required in publications and at point of care
— Marketing materials emphasizing relative risk reductions without absolute context warrant skepticism
Patient safety — transition of care and statistical literacy:
— A physician handing off care must communicate uncertainty in evidence, not present therapies as definitively proven when based on a single underpowered trial
— Quality dashboards reporting "significant improvement" in readmissions or infection rates can mislead if multiple comparisons or small n are involved — verify with SPC methodology
Mandatory reporting and surveillance:
— Adverse event signals (e.g., post-marketing pharmacovigilance) often rely on observational data with wide CIs; act on signals while acknowledging uncertainty
— Serious unexpected events should be reported through FDA MedWatch regardless of "statistical significance" (reporting is mandatory for manufacturers and voluntary for clinicians)
Equity considerations:
— Trials excluding women, elderly, minorities yield p-values that may not generalize
— Ethical practice requires explicit acknowledgment of evidence gaps for underrepresented groups
Step 3 management: When a drug rep, QI report, or trial summary presents a "significant" result, the ethically appropriate response is to (1) request absolute numbers, (2) ask about pre-specification and replication, and (3) communicate uncertainty transparently to patients during informed consent.
High-Yield Associations and Rapid-Fire Clinical Facts
Definitional anchors:
— p-value = P(data at least as extreme as observed | H₀ true)
— α = pre-specified type I error threshold (typically 0.05)
— β = type II error; power = 1 − β (target ≥0.80)
CI–p shortcuts:
— 95% CI excludes null → p < 0.05
— 95% CI for difference: null = 0
— 95% CI for ratio (RR/OR/HR): null = 1
Sample size effects:
— Doubling n shrinks SE by √2 → narrower CI, smaller p
— Huge n can make trivial effects "significant"
— Small n can miss large effects ("non-significant")
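The sample size effects above can be sketched numerically. This is a minimal illustration using a normal approximation for a two-arm comparison of means; the 0.3 mmHg difference, the 10 mmHg standard deviation, and the arm sizes are hypothetical inputs chosen for illustration, not data from any trial.

```python
import math


def standard_error(sd: float, n_per_arm: int) -> float:
    """Standard error of a difference in means between two equal-sized arms."""
    return sd * math.sqrt(2 / n_per_arm)


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_for_mean_difference(diff: float, sd: float, n_per_arm: int) -> float:
    """Normal-approximation p-value for an observed difference in means."""
    return two_sided_p(diff / standard_error(sd, n_per_arm))


# Hypothetical inputs: the same trivial 0.3 mmHg effect at three sample sizes.
for n in (250, 2_500, 25_000):
    p = p_for_mean_difference(diff=0.3, sd=10.0, n_per_arm=n)
    print(f"n per arm = {n:>6}: SE = {standard_error(10.0, n):.3f}, p = {p:.4f}")
```

The effect is identical in every row; only n changes, which is why p alone cannot be read as a measure of importance.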
Multiple comparisons:
— Bonferroni: α/k
— 20 tests at α=0.05 → ~1 false positive expected
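The arithmetic behind these two bullets can be checked with a short sketch. It assumes every null hypothesis is true and, for the family-wise error rate, that the tests are independent; the function names are illustrative.

```python
def expected_false_positives(k: int, alpha: float = 0.05) -> float:
    """Expected number of false positives among k tests when every null is true."""
    return k * alpha


def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / k


k = 20
print(f"expected false positives: {expected_false_positives(k):.2f}")
print(f"family-wise error rate:   {familywise_error_rate(k):.3f}")
print(f"Bonferroni threshold:     {bonferroni_threshold(k):.4f}")
```

With 20 tests the expected count is about 1, but the chance of at least one false positive is roughly 64%, which is the more useful number when a stem reports a single "significant" finding among many.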
Test selection cheat sheet:
— 2 means, normal: t-test
— ≥3 means: ANOVA
— Non-normal: Wilcoxon/Mann–Whitney/Kruskal–Wallis
— Categorical: chi-square (Fisher if small)
— Paired categorical: McNemar
— Survival: log-rank, Cox
Common pitfalls:
— p>0.05 ≠ no effect
— p<0.05 ≠ important effect
— Significant subgroup in negative trial = chance until replicated
— "Trend toward significance" = not significant
Vocabulary:
— Type I error: false positive (α)
— Type II error: false negative (β)
— Power: detect true effect (1 − β)
— Effect size: magnitude (independent of n)
— MCID: smallest patient-meaningful change
— NNT = 1/ARR; NNH = 1/ARI
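The NNT and NNH formulas can be sketched as follows. The event rates are hypothetical, and rounding up to the next whole patient is a common convention assumed here rather than something the notes specify.

```python
import math


def absolute_risk_reduction(control_rate: float, treated_rate: float) -> float:
    """ARR: control event rate minus treated event rate."""
    return control_rate - treated_rate


def relative_risk_reduction(control_rate: float, treated_rate: float) -> float:
    """RRR: ARR expressed as a fraction of the control event rate."""
    return absolute_risk_reduction(control_rate, treated_rate) / control_rate


def number_needed(absolute_difference: float) -> int:
    """1 / absolute risk difference, rounded up to a whole patient.

    Works for NNT (pass the ARR) and NNH (pass the ARI).
    Rounding to 9 decimals first guards against floating-point noise.
    """
    if absolute_difference <= 0:
        raise ValueError("absolute risk difference must be positive")
    return math.ceil(round(1 / absolute_difference, 9))


# Hypothetical benefit: events fall from 2% to 1%.
arr = absolute_risk_reduction(0.02, 0.01)
print(f"RRR = {relative_risk_reduction(0.02, 0.01):.0%}, ARR = {arr:.1%}, NNT = {number_needed(arr)}")

# Hypothetical harm: adverse events rise from 1% to 3% (ARI = 2 percentage points).
print(f"NNH = {number_needed(0.03 - 0.01)}")
```

A "50% relative reduction" and "treat 100 patients to prevent one event" describe the same hypothetical result, which is why absolute numbers matter for patient-centered decisions.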
Frameworks:
— CONSORT (RCT), STROBE (observational), PRISMA (systematic review/meta-analysis), GRADE (certainty of evidence)
Board pearl: If forced to choose one number to report from a trial, choose the 95% CI for the effect size, not the p-value. The CI shows whether the result is significant at α=0.05 (does it exclude the null?) and adds magnitude and precision. Step 3 answer choices reliably reward this preference.
Board Question Stem Patterns
Pattern 1: "Statistically significant but trivial"
— Stem: 50,000-patient trial shows 0.3 mmHg SBP reduction, p<0.001
— Trap: "Adopt new therapy"
— Correct: Effect too small to be clinically meaningful; do not change practice
Pattern 2: "Underpowered negative trial"
— Stem: 40-patient pilot RCT, 25% relative mortality reduction (RR 0.75), p=0.18, 95% CI for RR 0.50–1.15
— Trap: "Therapy is ineffective"
— Correct: Inconclusive; possibly effective; requires larger trial
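The numbers in the Pattern 2 stem are internally consistent, which a short sketch can confirm. It assumes the interval 0.50–1.15 is a 95% CI for a risk ratio of 0.75 and that the interval is symmetric on the log scale (a Wald-type interval); both are assumptions made for illustration.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def p_from_ratio_ci(estimate: float, lower: float, upper: float) -> float:
    """Approximate two-sided p-value from a ratio estimate and its 95% CI.

    The CI width on the log scale gives the standard error; the null value
    for a ratio is 1, whose log is 0.
    """
    se_log = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(estimate) / se_log
    return math.erfc(abs(z) / math.sqrt(2))


def ci_excludes_null(lower: float, upper: float, null: float = 1.0) -> bool:
    """True when the interval lies entirely on one side of the null value."""
    return not (lower <= null <= upper)


p = p_from_ratio_ci(0.75, 0.50, 1.15)
print(f"p is approximately {p:.2f}; CI excludes 1: {ci_excludes_null(0.50, 1.15)}")
```

The interval spans everything from a 50% reduction to a 15% increase in mortality, so the data are compatible with both a large benefit and mild harm; "inconclusive" is the only defensible reading.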
Pattern 3: "Subgroup analysis trap"
— Stem: Overall trial neutral; one of 15 subgroups (e.g., diabetics) shows p=0.03 benefit
— Trap: "Recommend therapy for diabetics"
— Correct: Hypothesis-generating only; chance likely; needs confirmatory trial
Pattern 4: "Multiple comparisons inflation"
— Stem: Genome-wide study tests 10,000 SNPs, finds one with p=0.001
— Trap: "Strong association"
— Correct: With 10,000 tests, ~10 results at p≤0.001 are expected by chance alone (10,000 × 0.001); apply Bonferroni or FDR
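To make the two corrections named in Pattern 4 concrete, here is a minimal sketch of the Bonferroni threshold for the stem's 10,000 tests and of the Benjamini–Hochberg step-up procedure, one common way to control the FDR. The ten p-values in the example are invented for illustration.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Stem arithmetic: 10,000 SNPs, one hit at p = 0.001.
print("expected hits at p <= 0.001 by chance:", 10_000 * 0.001)
print("Bonferroni threshold:", bonferroni_threshold(10_000))
print("p = 0.001 survives Bonferroni:", 0.001 <= bonferroni_threshold(10_000))

# Hypothetical p-values from a small family of 10 tests.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example))
```

Bonferroni is the stricter of the two: in the small example it would keep only the first p-value (threshold 0.005), while Benjamini–Hochberg keeps the first two.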
Pattern 5: "Wrong test"
— Stem: Investigators use chi-square on continuous outcome data
— Correct: Should use t-test or non-parametric equivalent
Pattern 6: "Misinterpreting p as P(H₀)"
— Trap option: "p=0.03 means a 3% chance the null is true"
— Correct: p = P(data | H₀); does NOT give P(H₀ | data)
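Why the two conditional probabilities differ can be shown with Bayes' rule. The sketch below computes the probability that the null is true given a significant result, P(H₀ | p < α), which is related to but not the same as conditioning on an exact p-value. The prior probability of the null and the power are hypothetical inputs that a p-value alone never supplies.

```python
def prob_null_given_significant(prior_null: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """P(H0 true | result significant) by Bayes' rule.

    Numerator: significant results that arise when the null is true.
    Denominator: all significant results, from true nulls and true effects.
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)


# Hypothetical priors: how plausible was the null before the trial?
for prior in (0.5, 0.9, 0.99):
    post = prob_null_given_significant(prior)
    print(f"prior P(H0) = {prior:.2f} -> P(H0 | significant) = {post:.2f}")
```

Under these assumed inputs the same significance threshold leaves very different probabilities that the null is true, which is why "p=0.03 means a 3% chance the null is true" is a trap.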
Pattern 7: "Surrogate endpoint hype"
— Stem: Drug lowers LDL with p<0.001, but mortality unchanged
— Correct: Surrogate benefit ≠ clinical benefit
Pattern 8: "Early stopping inflates effect"
— Stem: Trial stopped early for efficacy at interim
— Correct: Effect size likely overestimated; await replication
Pattern 9: "Drug rep / shared decision"
— Correct: Translate to absolute risk, NNT/NNH; discuss uncertainty
Step 3 management: When stems pair a p-value with a CI, effect size, or NNT, prioritize the clinically translated answer choice over the one that simply restates statistical significance. The exam consistently rewards integration over reflex.
One-Line Recap
A p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis is true, and nothing more; it must always be interpreted alongside effect size, confidence interval, pre-specification, sample size, and clinical importance before it can inform any patient care decision.
Core recap bullets:
What p IS: P(data | H₀); a measure of compatibility with the null
What p IS NOT: probability the null is true, probability the treatment works, measure of effect size, or guarantee of replication
Decision framework:
— Always read the 95% CI before the p-value — it captures magnitude, direction, and precision
— Translate findings into absolute risk reduction, NNT, and NNH for patient-centered decisions
— Confirm the result is from a pre-specified primary endpoint, not a post-hoc or subgroup analysis
— Apply multiple-comparison corrections when many tests are performed
Pitfalls to reject on exam day:
— "p<0.05 means clinically important" — FALSE
— "p>0.05 means no effect" — FALSE; may be underpowered
— "Significant subgroup in a negative trial changes practice" — FALSE; hypothesis-generating
— "Smaller p means larger effect" — FALSE; p conflates effect size, variability, and n
Best practice habits:
— Demand effect size + CI + p as a trio
— Require pre-registration and replication before adopting novel therapies
— Communicate uncertainty honestly during informed consent and shared decision-making
Board pearl: The single most reliable Step 3 instinct in any biostatistics vignette is to distrust isolated p-values and demand the confidence interval, effect size, and clinical context — every time, without exception.